<p>I am not sure I understood the question correctly, so here are a few questions and suggestions:</p>
<p>Are you planning to catch the Ctrl+C interrupt and handle the deque at that point? What happens if the crawler crashes for some other reason, such as an unhandled exception? Do you lose the queue status and start over again? From the documentation:</p>
<blockquote> <p>Note</p> <p>The exit function is not called when the program is killed by a signal, when a Python fatal internal error is detected, or when os._exit() is called.</p> </blockquote>
<p>What happens when you visit the same URI again? Are you maintaining a visited list or something similar?</p>
<p>I think you should maintain visit and session information (status) for each URI you crawl. The visit information lets you decide whether to crawl a URI when you encounter it again. The session information, that is, the state of the page during the last session with that URI, lets you pick up only incremental changes: if the page has not changed, there is no need to fetch and store it again, which saves database I/O and avoids duplicates.</p>
<p>That way you won't have to worry about Ctrl+C or a crash. If the crawler goes down for any reason, say after crawling 60K posts with 40K still left, the next run can refill the queue. The queue may be huge, but for each URI the crawler can check whether it has already visited it and what the state of the page was at that time. As an optimization, it can then re-fetch a page only if it has changed.</p>
<p>I hope that is of some help.</p>
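<p>The visit and session bookkeeping described above could be sketched roughly as follows. This is only an illustration under assumptions: the <code>VisitStore</code> class, the <code>visits</code> table, and the use of a SHA-256 content hash as the "page state" are all hypothetical choices, not something taken from the question. SQLite is used here because each commit is written to disk, so a crash or Ctrl+C loses at most the page being processed.</p>

```python
import hashlib
import sqlite3


class VisitStore:
    """Hypothetical persistent record of crawled URIs and their last-seen content."""

    def __init__(self, path=":memory:"):
        # Pass a file path (e.g. "crawl_state.db") so the state survives restarts.
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS visits ("
            "uri TEXT PRIMARY KEY, "
            "content_hash TEXT NOT NULL, "
            "visited_at TEXT NOT NULL)"
        )
        self.conn.commit()

    def seen(self, uri):
        """Return True if this URI was recorded in any earlier visit."""
        row = self.conn.execute(
            "SELECT 1 FROM visits WHERE uri = ?", (uri,)
        ).fetchone()
        return row is not None

    def record(self, uri, body):
        """Store the visit; return True if the page is new or its content changed."""
        new_hash = hashlib.sha256(body).hexdigest()
        row = self.conn.execute(
            "SELECT content_hash FROM visits WHERE uri = ?", (uri,)
        ).fetchone()
        changed = row is None or row[0] != new_hash
        self.conn.execute(
            "INSERT OR REPLACE INTO visits (uri, content_hash, visited_at) "
            "VALUES (?, ?, datetime('now'))",
            (uri, new_hash),
        )
        # Commit per page so an abrupt exit does not lose earlier progress.
        self.conn.commit()
        return changed
```

<p>In a crawler loop, one might call <code>record(uri, body)</code> after each fetch and only write the post to the main database when it returns <code>True</code>; <code>seen(uri)</code> can be used when refilling the queue to skip or deprioritize URIs that were already crawled.</p>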
 
