StackOverflow2013

plurals

POPython urllib2 and [errno 10054] An existing connection was forcibly closed by the remote host and a few urllib2 problems
primarykey
Id
6821109
data
AcceptedAnswerId
0
AnswerCount
0
ClosedDate
CommentCount
6
CommunityOwnedDate
CreationDate
2011-07-25T19:15:02.060
FavoriteCount
1
LastActivityDate
2011-07-25T19:15:02.060
LastEditDate
LastEditorUserId
0
OwnerUserId
825597
ParentId
0
PostTypeId
1
Score
4
ViewCount
7791
LastEditorDisplayName
text
Body
I've written a crawler that uses urllib2 to fetch URLs. Every few requests I get some weird behavior. I've tried analyzing it with Wireshark and couldn't understand the problem. getPAGE() is responsible for fetching the URL. It returns the content of the URL (response.read()) if it successfully fetches the URL; otherwise it returns None.
<pre><code>def getPAGE(FetchAddress):
    attempts = 0
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) Gecko/20100101 Firefox/5.0'}
    while attempts < 2:
        req = Request(FetchAddress, None ,headers)
        try:
            response = urlopen(req) #fetching the url
        except HTTPError, e:
            print 'The server didn\'t do the request.'
            print 'Error code: ', str(e.code) + " address: " + FetchAddress
            time.sleep(4)
            attempts += 1
        except URLError, e:
            print 'Failed to reach the server.'
            print 'Reason: ', str(e.reason) + " address: " + FetchAddress
            time.sleep(4)
            attempts += 1
        except Exception, e:
            print 'Something bad happened in gatPAGE.'
            print 'Reason: ', str(e.reason) + " address: " + FetchAddress
            time.sleep(4)
            attempts += 1
        else:
            return response.read()
    return None
</code></pre>
This is the function that calls getPAGE() and checks whether the page I've fetched is valid (checking with companyID = soup.find('span',id='lblCompanyNumber').string; if companyID is None, the page is not valid). If the page is valid, it saves the soup object to a global variable named 'curRes'.
<pre><code>def isValid(ID):
    global curRes
    try:
        address = urlPath+str(ID)
        page = getPAGE(address)
        if page == None:
            saveToCsv(ID, badRequest = True)
            return False
    except Exception, e:
        print "An error occured in the first Exception block of parseHTML : " + str(e) +' address: ' + address
    else:
        try:
            soup = BeautifulSoup(page)
        except TypeError, e:
            print "An error occured in the second Exception block of parseHTML : " + str(e) +' address: ' + address
            return False
        try:
            companyID = soup.find('span',id='lblCompanyNumber').string
            if (companyID == None): #if lblCompanyNumber is None we can assume that we don't have the content we want, save in the bad log file
                saveToCsv(ID, isEmpty = True)
                return False
            else:
                curRes = soup #we have the data we need, save the soup obj to a global variable
                return True
        except Exception, e:
            print "Error while parsing this page, third exception block: " + str(e) + ' id: ' + address
            return False
</code></pre>
The strange behaviors are:
<ol>
<li>Sometimes urllib2 executes a GET request and, without waiting for the reply, sends the next GET request (ignoring the previous one).</li>
<li>Sometimes I get "[errno 10054] An existing connection was forcibly closed by the remote host" after the code has been stuck for about 20 minutes waiting for a response from the server. While it is stuck, I copy the URL and fetch it manually, and I get a response in less than 1 second (?).</li>
<li>getPAGE() returns None to isValid() if it fails to fetch the URL. Sometimes I get this error:</li>
</ol>
<blockquote> Error while parsing this page, third exception block: 'NoneType' object has no attribute 'string' id:.... 
</blockquote>
That's weird, because I create the soup object only if I got a valid result from getPAGE(), and it seems that the soup function is returning None, which raises an exception whenever I try to run
<blockquote> companyID = soup.find('span',id='lblCompanyNumber').string </blockquote>
The soup object should never be None; it should get the HTML from getPAGE() if it reaches that part of the code. I've checked and saw that the problem is somehow connected to the first problem (sending a GET and not waiting for the reply): I saw in Wireshark that each time I got that exception, it was for a URL for which urllib2 sent a GET request but didn't wait for the response and moved on. getPAGE() should have returned None for that URL, but if it had returned None, isValid(ID) wouldn't get past the "if page == None:" condition. I can't find out why this is happening, and it's impossible to replicate the issue.
I've read that time.sleep() can cause <a href="http://homepage.mac.com/s_lott/iblog/architecture/C551260341/E20081031204203/index.html" rel="nofollow">issues with urllib2 threading</a>, so maybe I should avoid using it? Why doesn't urllib2 always wait for the response (it only rarely fails to wait)? What can I do about the "[errno 10054] An existing connection was forcibly closed by the remote host" error?
BTW, the exception isn't caught by the try/except block in getPAGE(); it is caught by the first try/except block in isValid(), which is also weird because getPAGE() is supposed to catch all the exceptions it throws.
<pre><code>try:
    address = urlPath+str(ID)
    page = getPAGE(address)
    if page == None:
        saveToCsv(ID, badRequest = True)
        return False
except Exception, e:
    print "An error occured in the first Exception block of parseHTML : " + str(e) +' address: ' + address
</code></pre>
Thanks!
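A minimal sketch of how getPAGE() might be restructured. It is an illustration under stated assumptions, not a confirmed fix for the behavior described above. It is written for Python 3, where urllib.request and urllib.error take the place of urllib2 (an assumption, since the code above is Python 2). It relies on four general facts. Exceptions raised in the else clause of a try statement are not handled by the preceding except clauses, so response.read() is moved inside the try. A generic exception has no .reason attribute, so the handler prints the exception itself. With no timeout argument and no global default socket timeout, a stalled socket operation can block indefinitely. And BeautifulSoup's find() returns None when no tag matches. The names get_page and company_id, the opener parameter, and the 30-second timeout are hypothetical choices for this sketch and do not appear in the original code.

```python
import time
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) "
                  "Gecko/20100101 Firefox/5.0"
}


def get_page(fetch_address, attempts=2, timeout=30, delay=4, opener=urlopen):
    """Return the response body as bytes, or None after `attempts` failures.

    `opener` is a hypothetical hook so the function can be exercised
    without network access; by default it is urllib.request.urlopen.
    """
    for _ in range(attempts):
        req = Request(fetch_address, None, HEADERS)
        try:
            # An explicit timeout keeps a stalled connection from
            # blocking the crawler indefinitely.
            response = opener(req, timeout=timeout)
            # read() sits inside the try so that a connection reset
            # during the body transfer is handled by the clauses below.
            return response.read()
        except HTTPError as e:
            print("HTTP error code:", e.code, "address:", fetch_address)
        except URLError as e:
            print("Failed to reach the server:", e.reason,
                  "address:", fetch_address)
        except OSError as e:
            # Covers socket timeouts and connection resets. Printing the
            # exception itself avoids relying on a .reason attribute.
            print("Socket-level error:", e, "address:", fetch_address)
        time.sleep(delay)
    return None


def company_id(soup):
    """Return the text of the lblCompanyNumber span, or None if absent.

    `soup` is expected to behave like a BeautifulSoup object; find()
    returns None when the tag is missing, so it is checked before
    .string is accessed.
    """
    tag = soup.find("span", id="lblCompanyNumber")
    return None if tag is None else tag.string
```

With this shape, a missing span is reported as None by company_id() instead of surfacing as "'NoneType' object has no attribute 'string'", and a reset during read() results in a retry rather than an exception escaping to the caller.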
Tags
<python><exception-handling><urllib2><web-crawler><errno>
Title
Python urllib2 and [errno 10054] An existing connection was forcibly closed by the remote host and a few urllib2 problems
singulars
PostAcceptedAnswerId
1. This table or related slice is empty.
PostParentId
1. This table or related slice is empty.
PostTypePostTypeId
1. PTQuestion
UserLastEditorUserId
1. This table or related slice is empty.
UserOwnerUserId
1. USYSY
plurals
PostLinksPostIdRelatedPostId
1. This table or related slice is empty.
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. This table or related slice is empty.
PostsParentIdCreationDate
1. This table or related slice is empty.
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 POPython urllib2 and [errno 10054] An existing connection was forcibly closed by the remote host and a few urllib2 problems
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
2. VO
 singulars
 PostPostId
 POPython urllib2 and [errno 10054] An existing connection was forcibly closed by the remote host and a few urllib2 problems
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
CommentsPostId
