
Library or tool to download multiple files in parallel
I'm looking for a Python library or a command line tool for downloading multiple files in parallel. My current solution is to download the files sequentially, which is slow. I know you can easily write a half-assed threaded solution in Python, but I always run into annoying problems when using threading. It is for polling a large number of XML feeds from websites.

My requirements for the solution are:

1. It should be interruptible. Ctrl+C should immediately terminate all downloads.
2. There should be no leftover processes that you have to kill manually using kill, even if the main program crashes or an exception is thrown.
3. It should work on both Linux and Windows.
4. It should retry downloads, be resilient against network errors and time out properly.
5. It should be smart about not hammering the same server with 100+ simultaneous downloads, but queue them in a sane way.
6. It should handle important HTTP status codes like 301, 302 and 304. That means that for each file, it should take the Last-Modified value as input and only download it if it has changed since last time.
7. Preferably it should have a progress bar, or it should be easy to write a progress bar for it to monitor the download progress of all files.
8. Preferably it should take advantage of HTTP keep-alive to maximize the transfer speed.

Please don't suggest how I may go about implementing the above requirements (the sketch at the end of this question is only there to clarify what I mean, not a starting point). I'm looking for a ready-made, battle-tested solution.

I guess I should describe what I want it for too... I have about 300 different data feeds as XML-formatted files served from 50 data providers. Each file is between 100 kB and 5 MB in size. I need to poll them frequently (as in once every few minutes) to determine whether any of them has new data I need to process. So it is important that the downloader uses HTTP caching to minimize the amount of data to fetch. It should also use gzip compression, obviously.

Then the big problem is how to use the bandwidth as efficiently as possible without overstepping any boundaries. For example, one data provider may consider it abuse if you open 20 simultaneous connections to their data feeds. Instead it may be better to use one or two connections that are reused for multiple files. Or your own connection may be limited in strange ways: my ISP limits the number of DNS lookups you can do, so some kind of DNS caching would be nice.
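To make requirements 4-6 a bit more concrete, this is roughly the behaviour I have in mind. It's a minimal sketch using the third-party requests library and a thread pool; the per-host limit, retry count, timeout and URLs are made-up numbers and placeholders, and this is exactly the kind of half-solution I don't want to write and maintain myself:

```python
# Sketch only: conditional GETs with If-Modified-Since, timeouts/retries,
# and a cap on simultaneous connections to the same host.
# The limit of 2 per host, 3 retries and 30 s timeout are arbitrary.
import threading
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse

import requests
from requests.adapters import HTTPAdapter

session = requests.Session()            # connection reuse (keep-alive) and gzip by default
session.mount("http://", HTTPAdapter(max_retries=3))
session.mount("https://", HTTPAdapter(max_retries=3))

last_modified = {}                      # url -> Last-Modified value from the previous poll
host_limits = {}                        # host -> Semaphore capping concurrent requests
host_lock = threading.Lock()

def fetch(url):
    host = urlparse(url).netloc
    with host_lock:
        sem = host_limits.setdefault(host, threading.Semaphore(2))
    headers = {}
    if url in last_modified:
        headers["If-Modified-Since"] = last_modified[url]
    with sem:                           # at most 2 requests to this host at once
        resp = session.get(url, headers=headers, timeout=30)
    if resp.status_code == 304:
        return None                     # unchanged since the last poll
    resp.raise_for_status()
    if "Last-Modified" in resp.headers:
        last_modified[url] = resp.headers["Last-Modified"]
    return resp.content

urls = ["http://example.com/feed1.xml", "http://example.com/feed2.xml"]
with ThreadPoolExecutor(max_workers=10) as pool:
    for url, body in zip(urls, pool.map(fetch, urls)):
        if body is not None:
            print(url, len(body), "bytes, changed since last poll")
```

Even this toy version does nothing about the Ctrl+C and leftover-thread problems from requirements 1 and 2, which is exactly why I'm after something battle-tested instead of my own code.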
 
