
<p>Take a look at the very capable <a href="https://github.com/dbalatero/typhoeus" rel="nofollow">Typhoeus and Hydra</a> combo. The two make it very easy to process multiple URLs concurrently.</p>

<p>The "<a href="https://github.com/dbalatero/typhoeus/blob/master/examples/times.rb" rel="nofollow">Times</a>" example should get you up and running quickly. In the <code>on_complete</code> block, put the code that writes your statuses to the DB. You could use a thread to build and maintain the queued requests at a healthy level, or queue a set number, let them all run to completion, then loop for another group. It's up to you.</p>

<p>Paul Dix, the original author, <a href="http://www.pauldix.net/2009/05/breath-fire-over-http-in-ruby-with-typhoeus.html" rel="nofollow">talked about his design goals</a> on his blog.</p>

<p>This is some sample code I wrote to download archived mail lists so I could do local searches. I deliberately removed the URL to keep from subjecting the site to DOS attacks if people start running the code:</p>

<pre><code>#!/usr/bin/env ruby

require 'nokogiri'
require 'addressable/uri'
require 'typhoeus'

BASE_URL = ''

url = Addressable::URI.parse(BASE_URL)
resp = Typhoeus::Request.get(url.to_s)
doc = Nokogiri::HTML(resp.body)

# Queue a download request for every gzipped archive linked from the index page.
hydra = Typhoeus::Hydra.new(:max_concurrency => 10)
doc.css('a').map{ |n| n['href'] }.select{ |href| href[/\.gz$/] }.each do |gzip|
  gzip_url = url.join(gzip)
  request = Typhoeus::Request.new(gzip_url.to_s)
  request.on_complete do |resp|
    gzip_filename = resp.request.url.split('/').last
    puts "writing #{gzip_filename}"
    File.open("gz/#{gzip_filename}", 'w') do |fo|
      fo.write resp.body
    end
  end
  puts "queuing #{ gzip }"
  hydra.queue(request)
end

hydra.run
</code></pre>

<p>Running the code on my several-year-old MacBook Pro pulled in 76 files totaling 11MB in just under 20 seconds, over wireless to DSL. If you're only doing <code>HEAD</code> requests your throughput will be better. You'll want to experiment with the concurrency setting, because there is a point where having more concurrent sessions only slows you down and needlessly uses resources.</p>

<p>I give it an 8 out of 10; it's got a great beat and I can dance to it.</p>

<hr>

<p>EDIT:</p>

<p>When checking the remote URLs you can use a <code>HEAD</code> request, or a <code>GET</code> with the <a href="http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.25" rel="nofollow"><code>If-Modified-Since</code></a> header. Either can give you responses you can use to determine the freshness of your URLs.</p>
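<p>If it helps, here's a rough, untested sketch of how those pieces could fit together for status checking. The <code>URLS</code> list, the <code>LAST_CHECKED</code> timestamp and the <code>save_status</code> method are placeholders; swap in your own URL source and DB writer:</p>

<pre><code>#!/usr/bin/env ruby

require 'typhoeus'
require 'time'

# Placeholder URL list and timestamp -- replace with your own data.
URLS = %w[
  http://example.com/page1
  http://example.com/page2
]
LAST_CHECKED = Time.now - 86_400   # e.g. the last time you verified the URLs

# Placeholder DB writer -- replace the puts with your actual database code.
def save_status(url, code)
  puts "#{code} #{url}"
end

hydra = Typhoeus::Hydra.new(:max_concurrency => 10)

URLS.each do |url|
  request = Typhoeus::Request.new(
    url,
    :method  => :head,
    :headers => { 'If-Modified-Since' => LAST_CHECKED.httpdate }
  )
  request.on_complete do |response|
    # 200 means the page changed, 304 means it hasn't, and 0 usually
    # means the connection itself failed.
    save_status(url, response.code)
  end
  hydra.queue(request)
end

hydra.run
</code></pre>

<p>A <code>304</code> tells you the page hasn't changed since the timestamp you sent, a <code>200</code> means it has, and a failed connection typically shows up as a code of <code>0</code>, which is enough to flag a dead URL.</p>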
 
