Python urllib2 and [errno 10054] An existing connection was forcibly closed by the remote host and a few urllib2 problems
<p>I've written a crawler that uses urllib2 to fetch URLs.</p>
<p>Every few requests I get some weird behavior. I've tried analyzing it with Wireshark and couldn't understand the problem.</p>
<p>getPAGE() is responsible for fetching the URL. It returns the content of the URL (response.read()) if it successfully fetches the URL; otherwise it returns None.</p>
<pre><code>def getPAGE(FetchAddress):
    attempts = 0
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) Gecko/20100101 Firefox/5.0'}
    while attempts &lt; 2:
        req = Request(FetchAddress, None ,headers)
        try:
            response = urlopen(req) #fetching the url
        except HTTPError, e:
            print 'The server didn\'t do the request.'
            print 'Error code: ', str(e.code) + " address: " + FetchAddress
            time.sleep(4)
            attempts += 1
        except URLError, e:
            print 'Failed to reach the server.'
            print 'Reason: ', str(e.reason) + " address: " + FetchAddress
            time.sleep(4)
            attempts += 1
        except Exception, e:
            print 'Something bad happened in gatPAGE.'
            print 'Reason: ', str(e.reason) + " address: " + FetchAddress
            time.sleep(4)
            attempts += 1
        else:
            return response.read()
    return None
</code></pre>
<p>This is the function that calls getPAGE() and checks whether the page I've fetched is valid (checking with <em>companyID = soup.find('span',id='lblCompanyNumber').string</em>; if companyID is None, the page is not valid). If the page is valid, it saves the soup object to a global variable named 'curRes'.</p>
<pre><code>def isValid(ID):
    global curRes
    try:
        address = urlPath+str(ID)
        page = getPAGE(address)
        if page == None:
            saveToCsv(ID, badRequest = True)
            return False
    except Exception, e:
        print "An error occured in the first Exception block of parseHTML : " + str(e) +' address: ' + address
    else:
        try:
            soup = BeautifulSoup(page)
        except TypeError, e:
            print "An error occured in the second Exception block of parseHTML : " + str(e) +' address: ' + address
            return False
        try:
            companyID = soup.find('span',id='lblCompanyNumber').string
            if (companyID == None): #if lblCompanyNumber is None we can assume that we don't have the content we want, save in the bad log file
                saveToCsv(ID, isEmpty = True)
                return False
            else:
                curRes = soup #we have the data we need, save the soup obj to a global variable
                return True
        except Exception, e:
            print "Error while parsing this page, third exception block: " + str(e) + ' id: ' + address
            return False
</code></pre>
<p>The strange behaviors are:</p>
<ol>
<li>Sometimes urllib2 executes a GET request and, without waiting for the reply, sends the next GET request (ignoring the previous one).</li>
<li>Sometimes I get "<em>[errno 10054] An existing connection was forcibly closed by the remote host"</em> after the code has been stuck for about 20 minutes waiting for a response from the server. While it is stuck, I copy the URL and try to fetch it manually, and I get a response in less than 1 second (?).</li>
<li>getPAGE() will return None to isValid() if it failed to fetch the URL, yet sometimes I get this error:</li>
</ol>
<blockquote> <p>Error while parsing this page, third exception block: 'NoneType' object has no attribute 'string' id:....</p> </blockquote>
<p>That's weird, because I only create the soup object if I got a valid result from getPAGE(), and it seems that the soup function is returning None, which raises an exception whenever I try to run</p>
<blockquote> <p>companyID = soup.find('span',id='lblCompanyNumber').string</p> </blockquote>
<p>The soup object should never be None; it <em>should</em> get the HTML from getPAGE() if it reaches that part of the code.</p>
<p>I've checked and saw that this problem is somehow connected to the first one (sending a GET and not waiting for the reply). In Wireshark I saw that each time I got that exception, it was for a URL where urllib2 sent a GET request but didn't wait for the response and moved on. getPAGE() should have returned None for that URL, but if it had returned None, isValid(ID) wouldn't have passed the "if page == None:" condition. I can't find out why this is happening, and it's impossible to replicate the issue.</p>
<p>I've read that time.sleep() can cause <a href="http://homepage.mac.com/s_lott/iblog/architecture/C551260341/E20081031204203/index.html" rel="nofollow">issues with urllib2 threading</a>, so maybe I should avoid using it?</p>
<p>Why doesn't urllib2 always wait for the response (it only rarely fails to wait)?</p>
<p>What can I do about the <em>"[errno 10054] An existing connection was forcibly closed by the remote host"</em> error? BTW, the exception isn't caught by getPAGE()'s try: except: block; it is caught by the first try: except: block in isValid(), which is also weird because getPAGE() is supposed to catch all the exceptions it throws.</p>
<pre><code>try:
    address = urlPath+str(ID)
    page = getPAGE(address)
    if page == None:
        saveToCsv(ID, badRequest = True)
        return False
except Exception, e:
    print "An error occured in the first Exception block of parseHTML : " + str(e) +' address: ' + address
</code></pre>
<p>Thanks!</p>
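A minimal sketch of one way getPAGE() could be restructured, offered as an illustration under stated assumptions rather than a confirmed fix. In the code above, response.read() sits in the else: branch, outside the try, so an error raised while the body is being read (such as errno 10054) is not caught inside getPAGE(). urlopen() is also called without a timeout, which may explain the long stalls. The sketch uses Python 3's urllib.request (the successor of urllib2) so that it runs on a current interpreter. The names get_page, timeout, delay, opener and sleep are hypothetical, and the 10-second timeout is an arbitrary choice.

```python
import time
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) "
                  "Gecko/20100101 Firefox/5.0"
}


def get_page(url, attempts=2, timeout=10, delay=4,
             opener=urlopen, sleep=time.sleep):
    """Return the body of `url` as bytes, or None after `attempts` failures.

    `opener` and `sleep` are injectable (hypothetical parameters) so the
    function can be exercised without touching the network.
    """
    for _ in range(attempts):
        req = Request(url, None, HEADERS)
        try:
            # The timeout bounds each blocking socket operation, so a
            # stalled server cannot hang the crawler indefinitely.
            response = opener(req, timeout=timeout)
            # read() is inside the try: a connection reset while the body
            # is being received is raised here, not by the open call.
            return response.read()
        except HTTPError as e:
            print("The server couldn't fulfil the request:", e.code, url)
        except URLError as e:
            print("Failed to reach the server:", e.reason, url)
        except OSError as e:
            # Covers socket timeouts and resets such as error 10054. These
            # exceptions have no .reason attribute, so print e itself.
            print("Socket error:", e, url)
        sleep(delay)
    return None
```

Calling get_page(urlPath + str(ID)) in place of getPAGE(address) would leave isValid() unchanged, since both return None on failure. Separately, soup.find() returns None when no matching tag exists, so the "'NoneType' object has no attribute 'string'" error may simply mean the page was fetched but does not contain the lblCompanyNumber span.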