<p>Going by your posted example:</p>
<pre><code>import urllib2
from bs4 import BeautifulSoup

url = "http://www.example.com/downlaod"
site = urllib2.urlopen(url)
html = site.read()
soup = BeautifulSoup(html)
list_urls = soup.find_all('a')
print list_urls[6]
</code></pre>
<p>So, the URL you want to fetch next is presumably <code>list_urls[6]['href']</code>.</p>
<p>The first trick is that this might be a relative URL rather than an absolute one, so resolve it against the URL of the page you just fetched:</p>
<pre><code>import urlparse

newurl = list_urls[6]['href']
# Resolve relative links against the page they came from
absurl = urlparse.urljoin(site.url, newurl)
</code></pre>
<p>Also, you only want to fetch the file if it has the right extension (<code>extensions</code> must be a single suffix string or a tuple of suffixes, since that is what <code>str.endswith</code> accepts), so:</p>
<pre><code>if not absurl.endswith(extensions):
    return # or break or whatever
</code></pre>
<p>But once you've decided what URL you want to download, it's no harder than your initial fetch:</p>
<pre><code>import os

page = urllib2.urlopen(absurl)
html = page.read()
# Use the last component of the URL path as the local filename
path = urlparse.urlparse(absurl).path
name = os.path.basename(path)
with open(name, 'wb') as f:
    f.write(html)
</code></pre>
<p>That's mostly it.</p>
<p>There are a few things you might want to add, but if so, you have to add them all manually. For example:</p>
<ul>
<li>Look for a Content-Disposition header with a suggested filename to use in place of the URL's basename.</li>
<li>Use <code>shutil.copyfileobj</code> to copy from <code>page</code> to <code>f</code> instead of <code>read</code>ing the whole thing into memory and then <code>write</code>ing it out.</li>
<li>Deal with existing files with the same name.</li>
<li>…</li>
</ul>
<p>But that's the basics.</p>
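<p>As a rough illustration of the optional extras listed above, here is a minimal sketch. It is written for Python 3, where <code>urllib2</code> and <code>urlparse</code> became <code>urllib.request</code> and <code>urllib.parse</code>. The helper names (<code>suggested_filename</code>, <code>unique_path</code>, <code>save_stream</code>, <code>download</code>), the numeric-suffix naming scheme, and the <code>"download"</code> fallback name are all assumptions made for this example, not anything from your code.</p>

```python
import os
import shutil
from email.message import Message
from urllib.parse import urlparse
from urllib.request import urlopen


def suggested_filename(content_disposition, url):
    """Prefer the Content-Disposition filename; fall back to the URL's basename."""
    if content_disposition:
        # Let the stdlib header parser handle quoting of the filename parameter.
        msg = Message()
        msg["Content-Disposition"] = content_disposition
        name = msg.get_filename()
        if name:
            # Drop any directory part so the header cannot pick the target directory.
            return os.path.basename(name)
    # "download" is a hypothetical fallback for URLs whose path ends in "/".
    return os.path.basename(urlparse(url).path) or "download"


def unique_path(path):
    """Return path unchanged, or with a numeric suffix if it already exists."""
    root, ext = os.path.splitext(path)
    candidate, n = path, 1
    while os.path.exists(candidate):
        candidate = "%s.%d%s" % (root, n, ext)
        n += 1
    return candidate


def save_stream(src, path):
    """Copy a file-like object to disk in chunks rather than reading it whole."""
    target = unique_path(path)
    with open(target, "wb") as f:
        shutil.copyfileobj(src, f)
    return target


def download(absurl, directory="."):
    """Fetch absurl and save it under directory, returning the path written."""
    with urlopen(absurl) as page:
        name = suggested_filename(page.headers.get("Content-Disposition"), absurl)
        return save_stream(page, os.path.join(directory, name))
```

<p>Calling <code>download(absurl)</code> after the extension check would then replace the plain <code>read</code>/<code>write</code> pair. Note that the existence check in <code>unique_path</code> is not atomic, so this sketch is only suitable when nothing else is writing to the same directory.</p>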