Well, your code kind of already tells you what's going on. In your lambda you are only grabbing absolute links that start with `http://` (which means you are not grabbing https links, FWIW). You should grab all of the links and check whether they start with `http` or not. If they don't, they are relative links, and since you know what the `current_page` is, you can use it to build an absolute link.

Here's a modification to your code. Excuse my Python as it's a little rusty, but I ran it and it worked in Python 2.7 for me. You'll want to clean it up and add some edge-case/error handling, but you get the gist:

```python
#!/usr/bin/python
from bs4 import BeautifulSoup
import urllib2
import random
import urlparse


class Crawler(object):
    """Randomly crawls pages, collecting every link it finds."""

    def __init__(self):
        self.soup = None                              # Beautiful Soup object
        self.current_page = "http://www.python.org/"  # Current page's address
        self.links = set()                            # Queue with every link fetched
        self.visited_links = set()
        self.counter = 0                              # Simple counter for debug purposes

    def open(self):
        # Open url
        print self.counter, ":", self.current_page
        res = urllib2.urlopen(self.current_page)
        html_code = res.read()
        self.visited_links.add(self.current_page)

        # Fetch every link on the page
        self.soup = BeautifulSoup(html_code)

        page_links = []
        try:
            for link in [h.get('href') for h in self.soup.find_all('a')]:
                print "Found link: '" + link + "'"
                if link.startswith('http'):
                    # Already absolute (covers both http and https)
                    page_links.append(link)
                    print "Adding link " + link + "\n"
                elif link.startswith('/'):
                    # Root-relative: prepend the current page's scheme and host
                    parts = urlparse.urlparse(self.current_page)
                    absolute = parts.scheme + '://' + parts.netloc + link
                    page_links.append(absolute)
                    print "Adding link " + absolute + "\n"
                else:
                    # Relative: resolve against the current page
                    page_links.append(self.current_page + link)
                    print "Adding link " + self.current_page + link + "\n"
        except Exception, ex:  # Magnificent exception handling
            print ex

        # Update links
        self.links = self.links.union(set(page_links))

        # Choose a random url from the non-visited set
        self.current_page = random.sample(self.links.difference(self.visited_links), 1)[0]
        self.counter += 1

    def run(self):
        # Crawl 3 webpages (or stop early if every url has been fetched)
        while len(self.visited_links) < 3 and self.visited_links != self.links:
            self.open()

        for link in self.links:
            print link


if __name__ == '__main__':
    C = Crawler()
    C.run()
```
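As an aside (this is not in the code above), the standard library can do the relative-to-absolute conversion for you: `urlparse.urljoin` resolves absolute, root-relative, and relative hrefs in one call, so the three branches could collapse into a single line. A minimal sketch, assuming the same Python 2.7 environment as the answer:

```python
import urlparse

base = "http://www.python.org/"  # e.g. self.current_page

# urljoin resolves each href against the base url, whatever form it takes
for href in ("http://docs.python.org/", "/about/", "community/"):
    print urlparse.urljoin(base, href)

# Prints:
#   http://docs.python.org/
#   http://www.python.org/about/
#   http://www.python.org/community/
```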