<p>Ask yourself: is there any advantage (to you) in being able to <em>access</em> your web crawler via web requests? If not, there is no reason to put it in a web container.</p>
<hr>
<blockquote> <p>... but I want to be constantly cycling through those sites (24 hours) to make sure that I have the latest content.</p> </blockquote>
<p>I hope you have the consent / permission of the site owners to do this. Otherwise, they are likely to take technical or legal measures to stop you doing this.</p>
<p>As Danny Thomas says, your crawler <em>should</em> implement a "robots.txt" handler, and respect what these files say when crawling.</p>
<hr>
<p><strong>FOLLOWUP</strong></p>
<blockquote> <p>I may not visit the same page again for at least another 10-15 hours because of the number of sites I need visit. Is that still generally considered too much crawling?</p> </blockquote>
<p>That's not the right question to ask. The right question to ask is whether the specific site owners would consider that to be too much crawling.</p>
<ul>
<li><p>How much is it costing them? Do they need to do extra work to deal with the load caused by your crawling? Do they need to add capacity? Does it increase their running costs? (Network charges, electricity?)</p></li>
<li><p>Are you doing something with their content that could reduce their income; e.g. reduce the number of real hits on their site, or the number of advert click-throughs?</p></li>
<li><p>What benefit do they gain from your crawling?</p></li>
<li><p>Is what you are doing for the public good? (Or is it just a way for you to make a buck out of their content?)</p></li>
</ul>
<p>The only way to really know is to <strong>ask them</strong>.</p>
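<p>For illustration, the "robots.txt" check mentioned above might look roughly like the following. This is only a minimal sketch in Python using the standard library's <code>urllib.robotparser</code>; the sample rules, the <code>ExampleCrawler</code> user agent, the <code>example.com</code> URLs and the helper names are all assumptions made up for the example, not anything from the question. A real crawler would fetch each site's own robots.txt rather than parse a hard-coded string.</p>

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed offline so the sketch needs no network.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

# Hypothetical user-agent name for the crawler.
USER_AGENT = "ExampleCrawler"


def build_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, url: str) -> bool:
    """Return True if the rules allow USER_AGENT to fetch the URL."""
    return parser.can_fetch(USER_AGENT, url)


parser = build_parser(SAMPLE_ROBOTS_TXT)

# Check each candidate URL before requesting it.
for url in ("https://example.com/index.html",
            "https://example.com/private/data.html"):
    print(url, "allowed" if may_fetch(parser, url) else "disallowed")

# Crawl-delay is a non-standard but widely used directive; None if absent.
print("crawl delay (seconds):", parser.crawl_delay(USER_AGENT))
```

<p>Passing this check only means the site has not explicitly excluded your crawler. It does not tell you whether the owners are happy with the load or with what you do with their content.</p>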