I didn't come up with a magic algorithm when I wrote my web crawler. We used some heuristics that seemed to do a reasonably good job, although certainly not perfect.

First, we looked at the site's robots.txt file. If it had a Crawl-delay entry, we honored it and never requested more often than it allowed.

For other servers, we kept a running average of the time required for the last n requests (I think we settled on a value of 5), and we made sure we never sent requests more frequently than that average. We measured time from when we made the request until we'd finished processing the response.

If a server timed out, the time for that request still went into the running average.

If we got a 50x from the server, we'd delay a fairly long time (five minutes or more) before making another request to that server. Repeated 50x responses would cause us to stop making requests entirely until somebody could go see what the problem was.

We also kept track of the 40x responses. Lots of "not found" or "access denied" would cause the crawler to stop processing a domain and raise a flag so somebody could look at it.

We had a distributed crawler. No individual crawler would make concurrent requests to the same domain, and we had some cross-server communication that made it unusual for multiple servers to make concurrent requests to the same domain.

I'm sure this didn't *maximize* throughput on any particular server, but it did keep the larger sites very busy. More importantly for us, it prevented us (mostly, anyway) from being blocked by many sites.

We also had special-case handling for many sites with APIs. Some would state their request limits, and we'd adjust our settings for those sites so we rode right at the line. But we only had a few dozen of those. Manually configuring request frequency for 9,000 servers (and then keeping up with changes) would not be realistic. However, you might be able to manually configure a dozen or two.
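To make the pacing concrete, here is a minimal sketch of the running-average throttle and Crawl-delay handling described above. This is not the original crawler's code: the class name `DomainThrottle`, the use of Python's `urllib.robotparser`, and the bookkeeping details are my own assumptions; the window of 5 recent requests comes from the answer.

```python
import time
from collections import defaultdict, deque
from urllib.robotparser import RobotFileParser


class DomainThrottle:
    """Per-domain pacing: never send requests more often than the running
    average of the last N request durations, and never faster than the
    site's robots.txt Crawl-delay (if it has one)."""

    def __init__(self, window=5, user_agent="my-crawler"):
        self.window = window                            # the answer used n = 5
        self.user_agent = user_agent
        self.durations = defaultdict(lambda: deque(maxlen=window))
        self.last_request = defaultdict(float)          # domain -> timestamp
        self.crawl_delays = {}                          # domain -> seconds or None

    def crawl_delay(self, domain):
        # Fetch and cache the robots.txt Crawl-delay for this domain.
        if domain not in self.crawl_delays:
            rp = RobotFileParser(f"https://{domain}/robots.txt")
            try:
                rp.read()
                self.crawl_delays[domain] = rp.crawl_delay(self.user_agent)
            except OSError:
                self.crawl_delays[domain] = None
        return self.crawl_delays[domain] or 0.0

    def wait(self, domain):
        # The minimum gap is the larger of the Crawl-delay and the running
        # average of recent request durations for this domain.
        durations = self.durations[domain]
        avg = sum(durations) / len(durations) if durations else 0.0
        gap = max(self.crawl_delay(domain), avg)
        sleep_for = self.last_request[domain] + gap - time.monotonic()
        if sleep_for > 0:
            time.sleep(sleep_for)

    def record(self, domain, started, finished):
        # Duration runs from issuing the request until the response has been
        # fully processed; timed-out requests are recorded the same way.
        self.durations[domain].append(finished - started)
        self.last_request[domain] = finished
```

A caller would invoke `throttle.wait(domain)` before each request, note the start time, and call `throttle.record(domain, started, time.monotonic())` once the response (or timeout) has been fully processed, so timeouts feed into the average just like normal responses.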
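The error handling can be sketched the same way. The five-minute back-off after a 50x comes from the answer; the specific counts standing in for "repeated" 50x responses and "lots of" 40x responses are invented thresholds, and `DomainHealth` is a hypothetical name.

```python
# Illustrative thresholds only; the answer gives "five minutes or more"
# for the 50x back-off but does not specify exact counts.
SERVER_ERROR_BACKOFF = 300      # seconds to pause a domain after a 50x
MAX_SERVER_ERRORS = 3           # repeated 50x -> stop and flag for a human
MAX_CLIENT_ERRORS = 50          # lots of 404/403 -> stop and flag


class DomainHealth:
    """Tracks per-domain error responses and decides whether the crawler
    should pause, continue, or stop and raise a flag."""

    def __init__(self):
        self.server_errors = 0
        self.client_errors = 0
        self.paused_until = 0.0
        self.flagged = False

    def on_response(self, status, now):
        if 500 <= status < 600:
            self.server_errors += 1
            if self.server_errors >= MAX_SERVER_ERRORS:
                self.flagged = True          # stop until someone investigates
            else:
                self.paused_until = now + SERVER_ERROR_BACKOFF
        elif status in (403, 404):
            self.client_errors += 1
            if self.client_errors >= MAX_CLIENT_ERRORS:
                self.flagged = True          # stop processing, raise a flag
        else:
            self.server_errors = 0           # a success clears the 50x streak

    def may_request(self, now):
        return not self.flagged and now >= self.paused_until
```

The crawler would call `on_response` after every fetch and check `may_request` before sending the next request to that domain.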