I used httplib2 for this kind of thing because there are supposed to be memory leaks in the Python standard library HTTP routines. Also, httplib2 can be configured to keep a cache, which can be useful if you have to restart and redo some pages.

I only ran through 1.7 million pages, plus about 200,000 from another server, so I can't comment on the volume you expect.

But I drove it all using AMQP with a topic exchange and persistent message queues (delivery_mode=2). This fed my ids into the worker that used httplib2 and made sure that every id was retrieved. I tracked them using a memcache that was persisted with a Tokyo Tyrant hash table on disk. I was able to shut down and restart the workers and move them between machines without missing any ids. I've had a worker running for up to three weeks at a time before I killed it to tinker with it.

Also, I used lxml for parsing responses because it is fast.

Oh, and after a page was retrieved and processed successfully, I posted the id as a message to a completed queue. Later I manually copied the messages off of that queue and compared them to the input list to make sure that the whole process was reliable.

For AMQP I used amqplib with RabbitMQ as the broker. Nowadays I would recommend taking a look at haigha for AMQP. Although its documentation is sparse, its model closely follows the AMQP 0.9.1 spec documents, so you can use those to figure out options etc.

@YSY: I can't cut and paste the code because I did it at work; however, it was nothing special. Just a loop with try/except wrapped around the HTTP request. Something like this:

```python
import time
import logging

import httplib2

log = logging.getLogger(__name__)
h = httplib2.Http()

retries = 5
while retries > 0:
    requestSucceeded = True  # assume the best
    try:
        resp, content = h.request("http://www.example.com/db/1234567")
        if resp is None:
            requestSucceeded = False
            log.warn("1234567: no http response")
        elif resp.status != 200:
            requestSucceeded = False
            log.warn("1234567: replied with {0:d}".format(resp.status))
    except Exception as e:
        requestSucceeded = False
        log.warn("1234567: exception - " + str(e))
    if not requestSucceeded:
        time.sleep(30)
        retries -= 1
    else:
        retries = 0

if requestSucceeded:
    process_request()
    ack_message()
```

The loop deals with two types of failures: one where the HTTP server talks to us but does not return a reply, and one where there is an exception, maybe a network error or anything else. You could be more sophisticated and handle different failure conditions in different ways, but this generally works. Tweak the sleep time and retries until you get over a 90% success rate, then handle the rest later. I believe I'm using half-hour sleeps and 3 retries right now, or maybe it is 15-minute sleeps. Not important really.

After a full run through, I process the results (the log and the list of completed messages) to make sure that they agree, and any documents that failed to retrieve I try again another day before giving up. Of course, I scan through the logs looking for similar problems and tweak my code to deal with them if I can think of a way.

Or you could google "scrapy". That might work for you. Personally, I like using AMQP to control the whole process.
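
To illustrate the caching point above: a minimal sketch of turning on httplib2's on-disk cache. The cache directory name and URL here are just examples, not anything from my actual setup.

```python
import httplib2

# Passing a directory name makes httplib2 keep an on-disk cache, so a
# restarted run can reuse or revalidate pages it has already fetched.
h = httplib2.Http(".page_cache")

resp, content = h.request("http://www.example.com/db/1234567")
print(resp.status, len(content))
```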
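In case the AMQP wiring is unfamiliar, feeding ids through a durable topic exchange with persistent messages looks roughly like this with amqplib. The exchange, queue, and routing-key names are made up for the example.

```python
from amqplib import client_0_8 as amqp

conn = amqp.Connection(host="localhost:5672", userid="guest",
                       password="guest", virtual_host="/")
chan = conn.channel()

# Durable exchange and queue, so the broker keeps them across restarts.
chan.exchange_declare(exchange="pages", type="topic", durable=True,
                      auto_delete=False)
chan.queue_declare(queue="page_ids", durable=True, auto_delete=False)
chan.queue_bind(queue="page_ids", exchange="pages", routing_key="db.#")

# delivery_mode=2 marks the message itself as persistent; this is what
# lets workers be stopped, restarted, and moved without losing ids.
msg = amqp.Message("1234567", delivery_mode=2)
chan.basic_publish(msg, exchange="pages", routing_key="db.page")

chan.close()
conn.close()
```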
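The lxml step was nothing fancier than something like the following; the element being extracted is obviously specific to whatever your pages look like, so the `<title>` lookup here is just a stand-in.

```python
from lxml import html

# Parse the body returned by httplib2 and pull out whatever fields
# the job needs.
doc = html.fromstring(content)
title = doc.findtext(".//title")
```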
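And the reconciliation at the end is just set arithmetic over the two id lists. A sketch, assuming one id per line in each file (the filenames are hypothetical):

```python
# ids we asked for vs. ids that reached the completed queue
requested = set(line.strip() for line in open("input_ids.txt"))
completed = set(line.strip() for line in open("completed_ids.txt"))

# anything left over gets retried another day
for page_id in sorted(requested - completed):
    print(page_id)
```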