Python, GTK, Webkit & scraping, big memory problem

I'm trying to mirror a website for content, but unfortunately large chunks of it are JavaScript based, including the code that generates hrefs. That rules out most standard web scraping tools (like httrack), since their attempts at processing JavaScript, if they attempt it at all, are highly unreliable.

So I decided to write my own in Python and have the WebKit engine process the HTML. The procedural logic seems pretty straightforward: build a dict with each URL found as the key and a value of 0 or 1 depending on whether it has already been processed. I managed to get the base logic working reasonably well with PyQt4, but it kept segfaulting at random times, often enough to make me distrust it, and then I found this: http://blog.motane.lu/2009/06/18/pywebkitgtk-execute-javascript-from-python/

Neat script, and it works, but I've never dealt with GTK in Python before. Bolting my logic around it was fairly straightforward, however it's proving to be a bit of a memory hog. Profiling it with meliae shows nothing taking up that much memory, even as the Python process approaches 2 GB. The site has a fair few pages, and the script eventually hits the 32-bit memory limit and segfaults. My assumption is that the code keeps spawning more and more WebKit windows, and I'm at a loss as to how to actually close or destroy them. I've tried destroy, and there is a main_quit in there, but nothing seems to close them.

Here are what should be the relevant parts (I hope), with the destination URL changed. I was using dicts for urls and foundurls but switched to anydbm in case they were, for some bizarre reason, the memory hog. I'll probably switch back to dicts at some point:

```python
#!/usr/bin/env python
import sys, thread
import gtk
import webkit
import warnings
from time import sleep
from BeautifulSoup import BeautifulSoup
import re
import os
import anydbm
import copy
from meliae import scanner

warnings.filterwarnings('ignore')


class WebView(webkit.WebView):
    def get_html(self):
        self.execute_script('oldtitle=document.title;document.title=document.documentElement.innerHTML;')
        html = self.get_main_frame().get_title()
        self.execute_script('document.title=oldtitle;')
        self.destroy
        return html


class Crawler(gtk.Window):
    def __init__(self, url, file):
        gtk.gdk.threads_init()  # suggested by Nicholas Herriot for Ubuntu Koala
        gtk.Window.__init__(self)
        self._url = url
        self._file = file
        self.connect("destroy", gtk.main_quit)

    def crawl(self):
        view = WebView()
        view.open(self._url)
        view.connect('load-finished', self._finished_loading)
        self.add(view)
        gtk.main()
        return view.get_html()

    def _finished_loading(self, view, frame):
        with open(self._file, 'w') as f:
            f.write(view.get_html())
        gtk.main_quit()
```

...various subroutines that just handle the BeautifulSoup end of things: processing the pages, pulling out links, tidying them up, etc...

```python
def main():
    urls = anydbm.open('./urls', 'n')
    domain = "stackoverflow.com"
    baseUrl = 'http://' + domain
    urls['/'] = '0'

    while (check_done(urls) == 0):
        count = 0
        foundurls = anydbm.open('./foundurls', 'n')
        for url, done in urls.iteritems():
            if done == 1:
                continue
            print "Processing", url
            urls[str(url)] = '1'
            if (re.search(".*\/$", url)):
                outfile = domain + url + "index.html"
            elif (os.path.isdir(os.path.dirname(os.path.abspath(outfile)))):
                outfile = domain + url + "index.html"
            else:
                outfile = domain + url
            if not os.path.exists(os.path.dirname(os.path.abspath(outfile))):
                os.makedirs(os.path.dirname(os.path.abspath(outfile)))
            crawler = Crawler(baseUrl + url, outfile)
            html = crawler.crawl()
            soup = BeautifulSoup(html.__str__())
            for link in hrefs(soup, baseUrl):
                if not foundurls.has_key(str(link)):
                    foundurls[str(link)] = '0'
            del(html)  # this is an attempt to get the object to vanish, tried del(Crawler) to no avail
            if count == 5:
                scanner.dump_all_objects('filename')
                count = 0
            else:
                count = count + 1
        for url, done in foundurls.iteritems():
            if not urls.has_key(str(url)):
                urls[str(url)] = '0'
        foundurls.close()
        os.remove('./foundurls')
    urls.close()
    os.remove('./urls')

if __name__ == '__main__':
    main()
```
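For reference, the bookkeeping described above ("a dict with urls found as the key and the value being 0 or 1") looks roughly like the sketch below when done with a plain dict. This is only an illustration of the intended logic, not code from the question; `mirror`, `crawl_page` and `extract_links` are hypothetical names standing in for the Crawler/BeautifulSoup machinery.

```python
# Minimal sketch of the URL bookkeeping described in the question:
# keys are URLs, values are 0 (not yet processed) or 1 (done).
def mirror(base_url, crawl_page, extract_links):
    urls = {'/': 0}                       # seed with the site root
    while any(done == 0 for done in urls.values()):
        for url in list(urls):            # snapshot the keys; the dict grows as links are found
            if urls[url]:
                continue
            urls[url] = 1                 # mark as processed before fetching
            html = crawl_page(base_url + url)
            for link in extract_links(html):
                urls.setdefault(link, 0)  # newly discovered URLs start as unprocessed
    return urls
```

The anydbm version in `main()` is the same idea, just with string values '0'/'1' kept in an on-disk table instead of an in-memory dict.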
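On the window question itself, one thing that is often tried in plain PyGTK is to destroy the `Crawler` window explicitly after each page and let GTK flush its pending events, so each `gtk.Window`/`WebView` pair is released before the next one is created. This is only a hedged sketch against the names used in `main()`; whether it actually cures the growth here is unverified.

```python
# Sketch only: tear the window down after each page, on the assumption that
# the growth comes from Crawler windows (and their WebView children) that are
# never destroyed. destroy(), events_pending() and main_iteration() are
# standard PyGTK calls; this is not a confirmed fix for the script above.
crawler = Crawler(baseUrl + url, outfile)
try:
    html = crawler.crawl()
finally:
    crawler.destroy()                # destroys the window together with its WebView child
    while gtk.events_pending():      # let GTK actually process the destroy before the next page
        gtk.main_iteration()
```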