Note that there are some explanatory texts on larger screens.

plurals
  1. POPython Crawler - need help with my algorithm
    primarykey
    data
    text
    <p>** <strong>Added a summary of the problem at the end of the post</strong> **</p> <p>I've written a crawler that fetches and parses URLs.</p> <p>in the first version, in order to get the next valid page I was increasing the URL ID and saved non-valid IDs to a file, the valid URLs were moved to a parser that parsed the content I need, after a while I noticed that most valid IDs has a returning subtrahend.</p> <p>I made some statistics and got to that list of subtrahends - [8,18,7,17,6,16,5,15], ordered by the most repeating to the least.</p> <p>so I <a href="https://stackoverflow.com/questions/6809402/python-maximum-recursion-depth-exceeded-while-calling-a-python-object">changed my code</a> to - </p> <pre><code>def checkNextID(ID): numOfRuns = 0 global curRes, lastResult while ID &lt; lastResult: try: numOfRuns += 1 if numOfRuns % 10 == 0: time.sleep(7) # sleep every 10 iterations numOfRuns = 0 if isValid(ID + 8): parseHTML(curRes, ID) ID = ID + 8 elif isValid(ID + 18): parseHTML(curRes, ID) ID = ID + 18 elif isValid(ID + 7): parseHTML(curRes, ID) ID = ID + 7 elif isValid(ID + 17): parseHTML(curRes, ID) ID = ID + 17 elif isValid(ID+6): parseHTML(curRes, ID) ID = ID + 6 elif isValid(ID + 16): parseHTML(curRes, ID) ID = ID + 16 elif isValid(ID + 5): parseHTML(curRes, ID) ID = ID + 5 elif isValid(ID + 15): parseHTML(curRes, ID) ID = ID + 15 else: if isValid(ID + 1): parseHTML(curRes, ID) ID = ID + 1 except Exception, e: print "something went wrong: " + str(e) </code></pre> <p><strong>isValid()</strong> is a function that gets an ID + one of the subtrahends and returns True if the url contains what I need and saves a soup object of the url to a global variable named - 'curRes', it returns False if the url doesn't contain the data I need and saves the id to 'badIdFile'.</p> <p><strong>parseHTML</strong> is a function that gets the soup object (curRes), parses the data I need and then saves the data to a csv, then returns True.</p> <p>in a perfect world that piece of code would be everything I need in order to run on all valid IDs (there are about 400K in a range of 5M), it gave me better results in less time (x50 faster).</p> <p>BUT, when getting to ranges that doesn't contain any valid URL, my code is very inefficient, it turns that I'm crawling mostly the same URLs over and over in each iteration, this is because I'm increasing the ID by one in order to keep progressing until I'll find the next valid URL and then I'm checking the ID + 8, then 18, 17 etc' which sometimes gives me the same URLs I checked in the previous iteration.</p> <p>so I went and changed the code so it will keep a a set of non-valid URLs which I'll avoid checking again and I can't get it to work, I'm breaking my head for a few hours now, it's not working like it should.</p> <p>this is my new function - </p> <pre><code>def checkNextID(ID): runs = 0 badTries = set() diff = [8,18,7,17,6,16,5,15,1] global curRes, lastResult while ID &lt; lastResult: if len(badTries) == 100: #every 100 non-valid IDs I can reset badTries(I think), if I've increased the ID 100 times it means I won't be checking previous IDs and I can start saving the non-valid IDs again. badTries = set() try: for i in diff: tryID = ID + i #tryID will be the ID + one of the subtrahends. if tryID not in badTries: #if we haven't already tried that ID if runs % 10 == 0: time.sleep(6) #sleep every 10 requests if isValid(tryID): runs += 1 parseHTML(curRes, ID) ID = tryID badTries = set() #I can reset the badTries now, I'm moving forward. break #will move to the next id. else: badTries.add(ID) #save the non-valid ID to the set if i == 1: #it's the last iteration of the for loop, if the subtrahend is 1 and it's not valid I must increase the ID by 1 in order to go forward. ID += 1 else: ID += 1 #if the ID is not a valid ID and I've already checked it I must increase the ID by one or I'll get into an infinite loop. except Exception, e: print "something went wrong: " + str(e) + ' ID - ' + str(ID) </code></pre> <p>I'm saving each non-valid ID to a set, and before every call to isValid() I'm checking if I've already tried that ID, if I didn't, I'm calling isValid(), else, ID is increased by one. </p> <p>that's how the bad ID file looks like - </p> <pre><code>513025328 513025317 513025327 513025316 513025326 513025312 513025320 513025330 513025319 513025329 513025318 513025328 513025317 513025327 513025313 513025321 513025331 513025320 513025330 513025319 513025329 513025318 513025328 513025314 513025322 513025332 513025321 513025331 513025320 513025330 513025319 513025329 513025315 513025323 513025333 513025322 513025332 513025321 513025331 513025320 513025330 513025316 513025324 513025334 513025323 513025333 513025322 513025332 513025321 513025331 513025317 513025325 513025335 513025324 513025334 513025323 513025333 513025322 513025332 513025318 513025326 513025336 513025325 513025335 513025324 513025334 513025323 513025333 513025319 513025327 513025337 513025326 513025336 513025325 513025335 513025324 513025334 513025320 513025328 513025338 513025327 513025337 513025326 513025336 513025325 513025335 513025321 513025329 513025339 513025328 513025338 513025327 513025337 513025326 513025336 513025322 513025330 513025340 513025329 513025339 513025328 513025338 513025327 513025337 513025323 513025331 513025341 513025330 513025340 513025329 513025339 513025328 513025338 513025324 513025332 513025342 513025331 513025341 513025330 513025340 513025329 513025339 513025325 513025333 513025343 513025332 513025342 513025331 513025341 513025330 513025340 513025326 513025334 513025344 513025333 513025343 513025332 513025342 513025331 513025341 513025327 513025335 513025345 513025334 513025344 513025333 513025343 513025332 513025342 513025328 513025336 513025346 513025335 513025345 513025334 513025344 513025333 513025343 513025329 513025337 513025347 513025336 513025346 513025335 513025345 513025334 513025344 513025330 513025338 513025348 513025337 513025347 513025336 513025346 513025335 513025345 513025331 513025339 513025349 513025338 513025348 513025337 513025347 513025336 513025346 513025332 513025340 513025350 513025339 513025349 513025338 513025348 513025337 513025347 513025333 513025341 513025351 513025340 513025350 513025339 513025349 513025338 513025348 513025334 513025342 513025352 513025341 513025351 513025340 513025350 513025339 513025349 513025335 513025343 513025353 513025342 513025352 513025341 513025351 513025340 513025350 513025336 513025344 513025354 513025343 513025353 513025342 513025352 513025341 513025351 513025337 513025345 513025355 513025344 513025354 513025343 513025353 513025342 513025352 513025338 513025346 513025356 513025345 513025355 513025344 513025354 513025343 513025353 513025339 513025347 513025357 513025346 513025356 513025345 513025355 513025344 513025354 513025340 513025348 513025358 513025347 513025357 513025346 513025356 513025345 513025355 513025341 513025349 513025359 513025348 513025358 513025347 513025357 513025346 513025356 513025342 513025350 513025360 513025349 513025359 513025348 513025358 513025347 513025357 513025343 513025351 513025361 513025350 513025360 513025349 513025359 513025348 513025358 513025344 513025352 513025362 513025351 513025361 513025350 513025360 513025349 513025359 513025345 513025353 513025363 513025352 513025362 513025351 513025361 513025350 513025360 513025346 513025354 513025364 513025353 513025363 513025352 513025362 513025351 513025361 513025347 513025355 513025365 513025354 513025364 513025353 513025363 513025352 513025362 513025348 513025356 513025366 513025355 513025365 513025354 513025364 513025353 513025363 513025349 513025357 513025367 513025356 513025366 513025355 513025365 513025354 513025364 513025350 513025358 513025368 513025357 513025367 513025356 513025366 513025355 513025365 513025351 513025359 513025369 513025358 513025368 513025357 513025367 513025356 513025366 513025352 513025360 513025370 513025359 513025369 513025358 513025368 513025357 513025367 513025353 513025361 513025371 513025360 513025370 513025359 513025369 513025358 513025368 513025354 513025362 513025372 513025361 513025371 513025360 513025370 513025359 513025369 513025355 513025363 513025373 513025362 513025372 513025361 513025371 513025360 513025370 513025356 513025364 513025374 513025363 513025373 513025362 513025372 513025361 513025371 513025357 513025365 513025375 513025364 513025374 513025363 513025373 513025362 513025372 513025358 513025366 513025376 513025365 513025375 513025364 513025374 513025363 513025373 513025359 513025367 513025377 513025366 513025376 513025365 513025375 513025364 513025374 513025360 513025368 513025378 513025367 513025377 513025366 513025376 513025365 513025375 513025361 513025369 513025379 513025368 513025378 513025367 513025377 513025366 513025376 513025362 513025370 513025380 513025369 513025379 513025368 513025378 513025367 513025377 513025363 513025371 513025381 513025370 513025380 513025369 513025379 513025368 513025378 513025364 513025372 513025382 513025371 513025381 513025370 513025380 513025369 513025379 513025365 513025373 513025383 513025372 513025382 513025371 513025381 513025370 513025380 513025366 513025374 513025384 513025373 513025383 513025372 513025382 513025371 513025381 513025367 513025375 513025385 513025374 513025384 513025373 513025383 513025372 513025382 513025368 513025376 513025386 513025375 513025385 513025374 513025384 513025373 513025383 513025369 513025377 513025387 513025376 513025386 513025375 513025385 513025374 513025384 513025370 513025378 513025388 513025377 513025387 513025376 513025386 513025375 513025385 513025371 513025379 513025389 513025378 513025388 513025377 513025387 513025376 513025386 513025372 513025380 513025390 513025379 513025389 513025378 513025388 513025377 513025387 513025373 513025381 513025391 513025380 513025390 513025379 </code></pre> <p>as you can see, it's not working, I know I have a flaw in the whole design but I can't find it, I would really appreciate your help.</p> <p><strong>a summary of the problem</strong> - </p> <p>I have a diff list [8,18,7,17,6,16,5,15] the program starts with getting an id, each time I need to check the next id which is - id + diff[i] <em>(i=0)</em> if (id + diff[i]) isn't a valid id I'm checking the next id which is (id + diff[i+1]).</p> <p>if there isn't a valid id on that iteration (id + diff[i..n]) I'm <strong>increasing id by 1</strong>, and check if id+1 is valid id, if not I'm checking again with id + diff[i..n], until I'll find the next valid id.</p> <p>in each iteration I'm checking ids I've already checked in the previous iteration (which costs a lot of time and is inefficient), I need to avoid checking the IDs I've already checked and keep increasing the ID until I'll find the next valid ID.</p> <p>as for now, if the id = 1 (and it's a valid id) and diff = [8,18,7,17,6,16,5,15].</p> <p>first iteration will look like (I'm marking with bold the id's which I could avoid checking) - first - id = 1</p> <p>9, 19, 8, 18, 7, 17, 6, 16, 2</p> <p>second - id = 2</p> <p>10, 20, <strong>9</strong>, <strong>19</strong>, <strong>8</strong>, <strong>18</strong>, <strong>7</strong>, <strong>17</strong>, 3</p> <p>third - id = 3</p> <p>11, 21, <strong>10</strong>, <strong>20</strong>, <strong>9</strong>, <strong>19</strong>, <strong>8</strong>, <strong>18</strong>, 4</p> <p>fourth - id = 4</p> <p>12, <em>22</em> - <strong>BINGO, next valid ID is 22!</strong></p> <p>that cost me 29 requests, instead of - 17, and that's a small example, I have ranges that are 300-600 ids from the last valid id.</p> <p>I can't get my code to avoid checking previously checked ids with a smart and efficient way.</p> <p>Thanks! </p>
    singulars
    1. This table or related slice is empty.
    plurals
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload