
Improving BeautifulSoup Perf
    <p>So I have the following code for parsing Delicious information. It prints data from a Delicious page in the following format:</p>
    <p>Bookmark | Number of People</p>
    <p>Bookmark | Number of People etc...</p>
    <p>I used to use the following method to find this info.</p>
    <pre><code># Imports assumed from usage: BeautifulSoup 3 and mechanize (Python 2)
from BeautifulSoup import BeautifulSoup
from mechanize import Browser

def extract(soup):
    # Print every bookmark link first...
    links = soup.findAll('a', rel='nofollow')
    for link in links:
        print &gt;&gt; outfile, link['href']

    # ...then every "number of people" count
    hits = soup.findAll('span', attrs={'class': 'delNavCount'})
    for hit in hits:
        print &gt;&gt; outfile, hit.contents

# File to export data to
outfile = open("output.txt", "w")

# Browser agent
br = Browser()
br.set_handle_robots(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

url = "http://www.delicious.com/asd"
page = br.open(url)
html = page.read()
soup = BeautifulSoup(html)
extract(soup)
</code></pre>
    <p>But the problem was that some bookmarks didn't have a number of people, so I decided to parse the page differently: get each bookmark and its count together and print them side by side.</p>
    <p>EDIT: I got it down from 15 seconds to 5 seconds with this updated version. Any more suggestions?</p>
    <pre><code># Imports assumed from usage: BeautifulSoup 3 and mechanize (Python 2)
from BeautifulSoup import BeautifulSoup
from mechanize import Browser

def extract(soup):
    # Each bookmark sits in its own div with class "data"
    bookmarkset = soup.findAll('div', 'data')
    for bookmark in bookmarkset:
        link = bookmark.find('a')
        vote = bookmark.find('span', 'delNavCount')
        try:
            print &gt;&gt; outfile, link['href'], " | ", vote.contents
        except:
            # vote is None when the bookmark has no count
            print &gt;&gt; outfile, "[u'0']"
    #print bookmarkset

# File to export data to
outfile = open("output.txt", "w")

# Browser agent
br = Browser()
br.set_handle_robots(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

url = "http://www.delicious.com/asd"
page = br.open(url)
html = page.read()
soup = BeautifulSoup(html)
extract(soup)
</code></pre>
    <p>The performance on this is terrible, though: it takes 17 seconds to parse the first page and around 15 seconds for each page thereafter on a pretty decent machine. It degraded significantly when going from the first bit of code to the second. Is there anything I can do to improve performance here?</p>
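For comparison, the per-bookmark pairing described in the question (one link plus an optional count, with a placeholder when the count is missing) can be sketched with only the standard library. This is a minimal sketch in Python 3 that swaps BeautifulSoup for `html.parser`. The class name `BookmarkParser`, the helper `extract`, and the sample markup are hypothetical. The assumption that each bookmark is a `div` with class `data` holding an `a` and optionally a `span` with class `delNavCount` is taken from the code in the question, not checked against a real Delicious page. Whether this is faster than BeautifulSoup on a real page is untested.

```python
from html.parser import HTMLParser


class BookmarkParser(HTMLParser):
    """Collect (href, count) pairs, one per <div class="data"> block."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished (href, count) pairs
        self._depth = 0         # div nesting depth inside the current bookmark
        self._href = None       # first link seen in the current bookmark
        self._count = None      # text of the delNavCount span, if any
        self._in_count = False  # True while inside the delNavCount span

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "div":
            if self._depth:
                # Nested div inside a bookmark: track it so we find the real end.
                self._depth += 1
            elif "data" in classes:
                # Start of a new bookmark block.
                self._depth = 1
                self._href, self._count = None, None
        elif self._depth:
            if tag == "a" and self._href is None:
                self._href = attrs.get("href")
            elif tag == "span" and "delNavCount" in classes:
                self._in_count = True

    def handle_data(self, data):
        if self._in_count:
            self._count = (self._count or "") + data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_count = False
        elif tag == "div" and self._depth:
            self._depth -= 1
            if self._depth == 0 and self._href is not None:
                # A bookmark without a count is reported as "0".
                self.rows.append((self._href, self._count or "0"))


def extract(html):
    """Return a list of (href, count) pairs found in the HTML string."""
    parser = BookmarkParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical markup mirroring the structure assumed in the question.
SAMPLE = """
<div class="data"><a href="http://example.com/a">A</a>
  <span class="delNavCount">12</span></div>
<div class="data"><a href="http://example.com/b">B</a></div>
"""

for href, count in extract(SAMPLE):
    print(href, "|", count)
```

Because the link and the count are collected in a single pass over the document, a bookmark with no count still produces exactly one output line, which is the behaviour the second version of `extract` in the question is aiming for.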