<p>First of all, break the job into two processes: one to determine valid ids, and the other to retrieve data.</p>
<p>The process that determines valid ids only needs to issue HTTP HEAD requests, so it can work faster than the one that retrieves pages.</p>
<p>For checking pages, after you check the id increments in <code>diff</code>, add 18 to the id that caused you to start using <code>diff</code>. You can even record the ranges that were only partially checked by using <code>diff</code>, and come back at the end of the process to check all of them as well.</p>
<p>If you can't skip any ids, then keep a cache of the last n ids that were checked, where n is equal to <code>len(diff)</code>. Use a ring buffer, something like this:</p>
<pre><code>nextelem = 0
...
# check before retrieving
if id not in ringbuff:
    # retrieve an id
    ringbuff[nextelem] = id
    nextelem += 1
    if nextelem &gt;= len(ringbuff):
        nextelem = 0
</code></pre>
<p>...</p>
<p>On the surface of it, a simple loop like this should check all ids:</p>
<pre><code>for id in range(1000000):
    checkpage(id)
</code></pre>
<p>This would check every possible page. But you want to read ahead when you get a hit, and also backtrack partially, if I understand correctly. In any case, what you are fundamentally doing is changing the range of ids from the simple sequence returned by <code>range</code> (<code>xrange</code> in Python 2), so I think you need to write a generator and do this instead:</p>
<pre><code>for id in myrange(1000000):
    checkpage(id)
</code></pre>
<p>You might still want to use a ring buffer, depending on what you do within that range of 18 possible additional hits.
If you need to check all of the possibilities in <code>diff</code> and then go back to something less than the maximum element in <code>diff</code>, then the ring buffer would be useful in <code>checkpage()</code>.</p>
<p>But the trick is to write <code>myrange()</code>.</p>
<pre><code>def myrange(maxnum):
    # hitfound, nextnum and diff are module-level variables shared with checkpage()
    global hitfound
    global nextnum
    global diff
    curnum = 0
    while curnum &lt; maxnum:
        yield curnum
        if hitfound:
            nextnum = curnum + 1  # default resume point; checkpage() may change it
            hitnum = curnum
            for e in diff:
                yield hitnum + e
            hitfound = False      # clear the flag, including hits seen while probing diff
            curnum = nextnum - 1
        curnum += 1
</code></pre>
<p>The three global variables let you influence the range of ids. If you set <code>hitfound = True</code> inside <code>checkpage()</code> whenever you get a good page, then you influence <code>myrange()</code> to start applying the increments in <code>diff</code>. The generator clears the flag again once it has gone through <code>diff</code>, so it does not repeat the increments for every following id. You can also set <code>nextnum</code> to influence where it resumes incrementing after the <code>diff</code> increments have been applied. For instance, you might decide to set it to 1 greater than the first (or last) hit that you find while checking <code>diff</code> increments. Or you could leave it alone, in which case scanning resumes at the id after the hit, and use the ring buffer to ensure that you don't request any of the <code>diff</code> increment pages again.</p>
<p>I suggest that you extract the id incrementing logic and test it separately, like my code above. Tweak the generator <code>myrange()</code> until it produces the right sequence, and then pop it into your web scraping program.</p>
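<p>As a rough illustration of testing the id logic on its own, here is a minimal sketch that needs no network access. It is not the code above: it replaces the globals with a <code>hits</code> set that the caller fills in, and it swaps the ring buffer for a plain set of already-probed ids. The names <code>id_sequence</code>, <code>fake_checkpage</code> and <code>checked_order</code>, and the offsets used in the example, are all made up for illustration. Hits found while probing the offsets do not start a new sweep in this version.</p>

```python
def id_sequence(max_id, diff, hits):
    """Yield ids to check: 0, 1, 2, ... plus look-ahead probes after a hit.

    `hits` is a set that the caller adds an id to when its page is valid;
    it stands in for the `hitfound` global (an assumption of this sketch).
    """
    probed = set()  # ids already yielded out of order as look-ahead probes
    cur = 0
    while cur < max_id:
        if cur in probed:
            # Already checked as a probe; skip it and forget it.
            probed.discard(cur)
        else:
            yield cur
            if cur in hits:  # the caller recorded a hit for this id
                for offset in diff:
                    probe = cur + offset
                    # Only forward probes inside the range, and only once each.
                    if cur < probe < max_id and probe not in probed:
                        probed.add(probe)
                        yield probe
        cur += 1


def fake_checkpage(page_id, existing, hits):
    """Stand-in for the real HTTP check: a page 'exists' if its id is in `existing`."""
    if page_id in existing:
        hits.add(page_id)


def checked_order(max_id, diff, existing):
    """Drive the generator with the fake checker and return the order of ids checked."""
    hits = set()
    order = []
    for page_id in id_sequence(max_id, diff, hits):
        order.append(page_id)
        fake_checkpage(page_id, existing, hits)
    return order
```

<p>With hypothetical offsets <code>[2, 4]</code> and a single valid page at id 3, <code>checked_order(10, [2, 4], {3})</code> returns <code>[0, 1, 2, 3, 5, 7, 4, 6, 8, 9]</code>: ids 5 and 7 are probed right after the hit and skipped later. Once the sequence looks right, the fake checker can be replaced by the real one.</p>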
 
