
How to crawl a quarter billion webpages in 40 hours (michaelnielsen.org)

in Technology

More precisely, I crawled 250,113,669 pages for just under 580 dollars in 39 hours and 25 minutes, using 20 Amazon EC2 machine instances. I carried out this project because (among several other reasons) I wanted to understand what resources are required to crawl a small but non-trivial fraction of the web.
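To put those figures in perspective, here is a rough back-of-the-envelope calculation of the throughput and unit cost they imply. It is only a sketch: the variable names are mine, the cost is treated as exactly 580 dollars (the text says "just under", so the cost figures are upper bounds), and the work is assumed to be spread evenly across the 20 instances.

```python
# Figures quoted in the text above.
PAGES = 250_113_669
COST_DOLLARS = 580            # "just under 580 dollars", so an upper bound
ELAPSED_SECONDS = 39 * 3600 + 25 * 60
INSTANCES = 20

# Aggregate crawl rate across all machines.
pages_per_second = PAGES / ELAPSED_SECONDS

# Per-machine rate, assuming the load was spread evenly (an assumption).
pages_per_second_per_instance = pages_per_second / INSTANCES

# Unit cost, expressed per million pages crawled.
dollars_per_million_pages = COST_DOLLARS / (PAGES / 1_000_000)

print(f"{pages_per_second:,.0f} pages/s overall")
print(f"{pages_per_second_per_instance:.1f} pages/s per instance")
print(f"at most ${dollars_per_million_pages:.2f} per million pages")
```

Under these assumptions the crawl averages well over a thousand pages per second in aggregate, at a cost of a couple of dollars per million pages.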

