<p>Declaring an identifier <code>global</code> is only necessary when a function has to modify (rebind) a name that is defined outside that function.</p> <p>Hence there is no reason to declare <strong>lastResult</strong> and <strong>curRes</strong> <code>global</code>:</p> <ul> <li><p>the first, <strong>lastResult</strong>, because it is a constant throughout your code. The best option is to give the function <strong>checkNextID()</strong> a parameter <strong>lastResult</strong> whose default argument is the value of <strong>lastResult</strong>.</p></li> <li><p>the second, <strong>curRes</strong>, because <strong>checkNextID()</strong> never modifies this identifier.</p></li> </ul> <p>Declaring <strong>curRes</strong> <code>global</code> in the function <strong>isValid()</strong> is also poor practice, although there it at least has an effect: 1) a new value for <strong>curRes</strong> is sent from inside <strong>isValid()</strong> to the outside; 2) the program then has to go outside the function <strong>checkNextID()</strong> to look up the value of <strong>curRes</strong>. That is a strange and needless detour. In <strong>checkNextID()</strong> you could simply let <strong>curRes</strong> be a <em>free variable</em> (see the <a href="http://docs.python.org/reference/executionmodel.html" rel="nofollow">doc</a>): Python automatically goes outside the function to resolve the name and obtain its value.</p> <p>Personally, I prefer to reorganise the overall algorithm. In the code that follows, <strong>curRes</strong> is a local object that takes its value directly from the return value of <strong>isValid()</strong>. That requires redefining <strong>isValid()</strong>: in my code, <strong>isValid()</strong> returns either the object <strong>soup</strong> or <strong>False</strong>.</p> <p>I hope I have understood what you need. 
Please tell me if anything is wrong with my approach.</p> <pre><code>import time

# lastResult, urlPath, getPAGE, parseHTML, saveToCsv and BeautifulSoup
# are assumed to be defined as in the code of the question.

def checkNextID(ID, lastResult = lastResult, diff = [0,1,5,6,7,8,15,16,17,18]):
    runs = 0
    maxdiff = max(diff)
    # try the listed offsets first, then every remaining offset below maxdiff
    # (a new list is built so that the default argument is not mutated)
    diff = diff + [x for x in xrange(maxdiff) if x not in diff]
    while True:
        for i in diff:
            if ID+i == lastResult:
                # last ID reached: leave the function
                # (a bare break would only leave the for loop)
                return
            runs += 1
            if runs % 10 == 0:
                time.sleep(6)
            curRes = isValid(ID+i)  # local name: a soup object or False
            if curRes:
                parseHTML(curRes, ID+i)
                ID = ID + i
                break
        else:
            # no offset gave a valid page: jump past the whole window
            runs += 1
            ID += maxdiff + 1
        if ID == lastResult:
            break

def isValid(ID, urlhead = urlPath):
    # this function returns either False OR a BeautifulSoup instance
    address = urlhead + str(ID)
    try:
        page = getPAGE(address)
        if page == False:
            return False
    except Exception, e:
        print "An error occurred in the first Exception block of isValid : " + str(e) + ' address: ' + address
        return False
    else:
        try:
            soup = BeautifulSoup(page)
        except TypeError, e:
            print "An error occurred in the second Exception block of isValid : " + str(e) + ' address: ' + address
            return False
        try:
            companyID = soup.find('span',id='lblCompanyNumber').string
            if companyID is None:
                # if lblCompanyNumber is None we can assume that we don't have
                # the content we want: record the ID in the bad log file
                saveToCsv(ID, isEmpty = True)
                return False
            else:
                # we have the data we need: hand the soup object back to the caller
                return soup
        except Exception, e:
            print "Error while parsing this page, third exception block: " + str(e) + ' id: ' + address
            return False
</code></pre> <p>Also, to speed up your program:</p> <ul> <li><p>you should use a <em>regular expression</em> (module <strong>re</strong>) instead of BeautifulSoup, which is roughly 10 times slower than a regex</p></li> <li><p>you shouldn't define and call all these functions from <strong>checkNextID</strong> (saveToCsv, parseHTML, isValid): each function call takes additional time compared with inline code</p></li> </ul> <h2>Final Edit</h2> <p>To conclude this long study of your problem, I ran a benchmark. 
The code and the results follow. They bear out my intuition: my code #2 takes at least 20 % less time to run than your code #1. Your code #1:</p> <pre><code>from time import clock

lastResult = 200

def checkNextID(ID, lastResult = lastResult, diff = [8,18,7,17,6,16,5,15]):
    SEEN = set()  # IDs already tested through an offset
    li = []       # valid IDs found so far
    while True:
        if ID&gt;lastResult:
            break
        if ID in SEEN:
            ID += 1
        else:
            curRes = isValid(ID)
            if curRes:
                li.append(ID)
                while True:
                    for i in diff:
                        curRes = isValid(ID+i)
                        if i==diff[0]:
                            SEEN = set([ID+i])
                        else:
                            SEEN.add(ID+i)
                        if curRes:
                            li.append(ID+i)
                            ID += i
                            break
                    else:
                        ID += 1
                        break
            else:
                ID += 1
    return li

def isValid(ID, valid_ones = (1,9,17,25,30,50,52,60,83,97,98,114,129,137,154,166,175,180,184,200)):
    # stand-in for the real page check
    return ID in valid_ones

te = clock()
for i in xrange(10000):
    checkNextID(0)
print clock()-te,'seconds'
print checkNextID(0)
</code></pre> <p>My code #2:</p> <pre><code>from time import clock

lastResult = 200

def checkNextID(ID, lastResult = lastResult, diff = [8,18,7,17,6,16,5,15]):
    maxdiff = max(diff)
    # offsets below maxdiff that are not in diff, tried only as a fallback
    others = [x for x in xrange(1,maxdiff) if x not in diff]
    lastothers = others[-1]
    li = []       # valid IDs found so far
    while True:
        if ID&gt;lastResult:
            break
        else:
            curRes = isValid(ID)
            if curRes:
                li.append(ID)
                while True:
                    for i in diff:
                        curRes = isValid(ID+i)
                        if curRes:
                            li.append(ID+i)
                            ID += i
                            break
                    else:
                        for j in others:
                            if ID+j&gt;lastResult:
                                ID += j
                                break
                            curRes = isValid(ID+j)
                            if curRes:
                                li.append(ID+j)
                                ID += j
                                break
                        if j==lastothers:
                            # last fallback offset reached: jump past the window
                            ID += maxdiff + 1
                            break
                        elif ID&gt;lastResult:
                            break
            else:
                ID += 1
    return li

def isValid(ID, valid_ones = (1,9,17,25,30,50,52,60,83,97,98,114,129,137,154,166,175,180,184,200)):
    # stand-in for the real page check
    return ID in valid_ones

te = clock()
for i in xrange(10000):
    checkNextID(0)
print clock()-te,'seconds'
print checkNextID(0)
</code></pre> <p>Results:</p> <pre><code>your code
0.398804596674 seconds
[1, 9, 17, 25, 30, 50, 52, 60, 83, 98, 114, 129, 137, 154, 166, 184, 200]

my code
0.268061164198 seconds
[1, 9, 17, 25, 30, 50, 52, 60, 83, 98, 114, 129, 137, 154, 166, 184, 200]
</code></pre> <p>0.268061164198 / 0.398804596674 = 67.2 %, that is, my code needs about two thirds of the time your code needs.</p> <p>I also tried with lastResult = 100 and got 72 %.<br> With lastResult = 480, I got 80 %.</p>
 
