
ImportError with gevent and requests async module
<p>I'm writing a simple script that:</p>
<ol>
<li>Loads a big list of URLs</li>
<li>Gets the content of each URL, making concurrent HTTP requests with <a href="http://docs.python-requests.org/en/latest/user/advanced/#asynchronous-requests" rel="nofollow">requests' async</a> module</li>
<li>Parses the content of each page with lxml to check whether a given link is present</li>
<li>If the link is present on the page, saves some info about the page in a ZODB database</li>
</ol>
<p>When I test the script with 4 or 5 URLs it works well; the only oddity is this message when the script ends:</p>
<pre><code>Exception KeyError: KeyError(45989520,) in &lt;module 'threading' from '/usr/lib/python2.7/threading.pyc'&gt; ignored
</code></pre>
<p>But when I try to check about 24000 URLs, it fails towards the end of the list (with about 400 URLs left to check) with the following error:</p>
<pre><code>Traceback (most recent call last):
  File "check.py", line 95, in &lt;module&gt;
  File "/home/alex/code/.virtualenvs/linka/local/lib/python2.7/site-packages/requests/async.py", line 83, in map
  File "/home/alex/code/.virtualenvs/linka/local/lib/python2.7/site-packages/gevent-1.0b2-py2.7-linux-x86_64.egg/gevent/greenlet.py", line 405, in joinall
ImportError: No module named queue
Exception KeyError: KeyError(45989520,) in &lt;module 'threading' from '/usr/lib/python2.7/threading.pyc'&gt; ignored
</code></pre>
<p>I tried both the version of gevent available on <a href="http://pypi.python.org/pypi/gevent" rel="nofollow">PyPI</a> and the latest version (1.0b2) downloaded from the <a href="http://code.google.com/p/gevent/downloads/list" rel="nofollow">gevent repository</a>.</p>
<p>I cannot understand why this happens, and why it happens only when I check a large batch of URLs. Any suggestions? 
</p>
<p>Here is the entire script:</p>
<pre><code>from requests import async, defaults
from lxml import html
from urlparse import urlsplit
from gevent import monkey
from BeautifulSoup import UnicodeDammit
from ZODB.FileStorage import FileStorage
from ZODB.DB import DB
import transaction
import persistent
import random

storage = FileStorage('Data.fs')
db = DB(storage)
connection = db.open()
root = connection.root()
monkey.patch_all()

defaults.defaults['base_headers']['User-Agent'] = "Mozilla/5.0 (Windows NT 5.1; rv:11.0) Gecko/20100101 Firefox/11.0"
defaults.defaults['max_retries'] = 10


def save_data(source, target, anchor):
    root[source] = persistent.mapping.PersistentMapping(dict(target=target, anchor=anchor))
    transaction.commit()


def decode_html(html_string):
    converted = UnicodeDammit(html_string, isHTML=True)
    if not converted.unicode:
        raise UnicodeDecodeError(
            "Failed to detect encoding, tried [%s]",
            ', '.join(converted.triedEncodings))
    # print converted.originalEncoding
    return converted.unicode


def find_link(html_doc, url):
    decoded = decode_html(html_doc)
    doc = html.document_fromstring(decoded.encode('utf-8'))
    for element, attribute, link, pos in doc.iterlinks():
        if attribute == "href" and link.startswith('http'):
            netloc = urlsplit(link).netloc
            if "example.org" in netloc:
                return (url, link, element.text_content().strip())
    else:
        return False


def check(response):
    if response.status_code == 200:
        html_doc = response.content
        result = find_link(html_doc, response.url)
        if result:
            source, target, anchor = result
            # print "Source: %s" % source
            # print "Target: %s" % target
            # print "Anchor: %s" % anchor
            # print
            save_data(source, target, anchor)
    global todo
    todo = todo - 1
    print todo


def load_urls(fname):
    with open(fname) as fh:
        urls = set([url.strip() for url in fh.readlines()])
    urls = list(urls)
    random.shuffle(urls)
    return urls


if __name__ == "__main__":
    urls = load_urls('urls.txt')
    rs = []
    todo = len(urls)
    print "Ready to analyze %s pages" % len(urls)
    for url in urls:
        rs.append(async.get(url, hooks=dict(response=check), timeout=10.0))
    responses = async.map(rs, size=100)
    print "DONE."
</code></pre>
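For reference, the link check at the heart of <code>find_link()</code> above can be sketched with the standard library alone. This is a Python 3 illustration, not the original code: <code>LinkFinder</code>, the sample <code>page</code> string, and the <code>example.org</code> target domain are stand-ins, and it uses <code>html.parser</code> where the script uses lxml's <code>iterlinks()</code>.

```python
# Stdlib-only sketch (Python 3): scan a page's <a href="..."> links and
# report the first one whose host matches a target domain, together with
# its anchor text -- roughly what find_link() does with lxml.
from html.parser import HTMLParser
from urllib.parse import urlsplit


class LinkFinder(HTMLParser):
    def __init__(self, domain):
        super().__init__()
        self.domain = domain
        self.hits = []         # (href, anchor_text) pairs that matched
        self._current = None   # [href, text] of the <a> we are inside, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            # Mirror the original checks: absolute http(s) link whose
            # network location contains the target domain.
            if href.startswith("http") and self.domain in urlsplit(href).netloc:
                self._current = [href, ""]

    def handle_data(self, data):
        if self._current is not None:
            self._current[1] += data   # accumulate the anchor text

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            href, text = self._current
            self.hits.append((href, text.strip()))
            self._current = None


def find_link(html_doc, domain="example.org"):
    parser = LinkFinder(domain)
    parser.feed(html_doc)
    return parser.hits[0] if parser.hits else False


page = ('<p>See <a href="http://example.org/x">our site</a> '
        'and <a href="http://other.net/">this</a>.</p>')
print(find_link(page))   # ('http://example.org/x', 'our site')
```

One behavioural difference worth noting: this sketch keeps scanning until it finds a matching link, whereas an <code>else: return False</code> attached to the wrong level in the original could bail out on the first non-matching link.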
 
