How to crawl a quarter billion webpages in 40 hours (michaelnielsen.org)
in Technology

More precisely, I crawled 250,113,669 pages for just under 580 dollars in 39 hours and 25 minutes, using 20 Amazon EC2 machine instances. I carried out this project because (among several other reasons) I wanted to understand what resources are required to crawl a small but non-trivial fraction of the web.
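
To put those figures in perspective, here is a rough back-of-the-envelope calculation in Python. It uses only the numbers quoted in the summary. The function name `crawl_stats` is made up for illustration, the cost is treated as exactly 580 dollars (the summary says "just under", so the per-page cost is an upper bound), and the load is assumed to be spread evenly across the 20 instances.

```python
def crawl_stats(pages, cost_usd, hours, minutes, instances):
    """Derive average throughput and cost figures from crawl totals.

    Hypothetical helper: all inputs are the totals quoted in the summary.
    """
    seconds = hours * 3600 + minutes * 60
    pages_per_second = pages / seconds
    return {
        "seconds": seconds,
        "pages_per_second": pages_per_second,
        # Assumes the work was spread evenly over all instances.
        "pages_per_second_per_instance": pages_per_second / instances,
        # Upper bound, since the quoted cost is "just under" 580 dollars.
        "usd_per_million_pages": cost_usd / (pages / 1_000_000),
    }


stats = crawl_stats(
    pages=250_113_669, cost_usd=580, hours=39, minutes=25, instances=20
)
for name, value in stats.items():
    print(f"{name}: {value:,.2f}")
```

Under these assumptions the crawl averages roughly 1,760 pages per second overall, or about 88 pages per second per instance, at a cost of at most about 2.32 dollars per million pages.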

15 minute read

12 upvotes

6 comments
