
Scrape internal links with Beautiful Soup
I have written Python code that fetches the web page corresponding to a given URL and parses all the links on that page into a repository of links. Next, it fetches the contents of any URL from the repository just created, parses the links from this new content into the repository, and continues this process for all links in the repository until stopped or until a given number of links has been fetched.

Here is the code:

```python
import BeautifulSoup
import urllib2
import itertools
import random


class Crawler(object):
    """docstring for Crawler"""

    def __init__(self):
        self.soup = None                              # Beautiful Soup object
        self.current_page = "http://www.python.org/"  # Current page's address
        self.links = set()                            # Queue with every link fetched
        self.visited_links = set()
        self.counter = 0                              # Simple counter for debug purposes

    def open(self):
        # Open url
        print self.counter, ":", self.current_page
        res = urllib2.urlopen(self.current_page)
        html_code = res.read()
        self.visited_links.add(self.current_page)

        # Fetch every link
        self.soup = BeautifulSoup.BeautifulSoup(html_code)

        page_links = []
        try:
            page_links = itertools.ifilter(  # Only deal with absolute links
                lambda href: 'http://' in href,
                (a.get('href') for a in self.soup.findAll('a')))
        except Exception:  # Magnificent exception handling
            pass

        # Update links
        self.links = self.links.union(set(page_links))

        # Choose a random url from the non-visited set
        self.current_page = random.sample(
            self.links.difference(self.visited_links), 1)[0]
        self.counter += 1

    def run(self):
        # Crawl 3 webpages (or stop if every url has been fetched)
        while len(self.visited_links) < 3 or (self.visited_links == self.links):
            self.open()

        for link in self.links:
            print link


if __name__ == '__main__':
    C = Crawler()
    C.run()
```

This code only fetches absolutely formed hyperlinks; it does not fetch internal links.

**How can I fetch internal links that start with '/', '#', or '.'?**
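For context, a common way to handle such links is to resolve every href against the page it was found on, so relative forms like '/about', '#section', or './docs' become absolute URLs before being queued. The sketch below is a minimal illustration of that idea using the standard-library urlparse module (Python 2, to match the code above); it is not the asker's code, and the `absolutize` helper name is hypothetical.

```python
# Minimal sketch (assumption, not the asker's code): resolve relative hrefs
# against the current page with urlparse.urljoin so that links starting with
# '/', '#', or '.' become absolute URLs. The helper name is hypothetical.
import urlparse


def absolutize(current_page, hrefs):
    """Yield every href as an absolute URL, resolved against current_page."""
    for href in hrefs:
        if not href:
            continue  # skip <a> tags with no href attribute
        # urljoin leaves already-absolute URLs untouched and resolves
        # relative ones ('/about', '#section', './docs') against current_page.
        yield urlparse.urljoin(current_page, href)


# Example usage inside Crawler.open(), in place of the ifilter call:
#   page_links = absolutize(
#       self.current_page,
#       (a.get('href') for a in self.soup.findAll('a')))
```

With the links absolutized, the existing "'http://' in href" filter could be kept to restrict crawling to a given scheme or host, or dropped entirely.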
 
