Python, GTK, Webkit & scraping, big memory problem

I'm trying to mirror a website for content, but unfortunately large chunks of it are JavaScript based, including the code that generates hrefs. That rules out most standard web scraping tools (like httrack), since their attempts at processing JavaScript, if they attempt it at all, are highly unreliable.

So I decided to write my own in Python and have the WebKit engine process the HTML. The procedural logic seems pretty straightforward: build a dict with each URL found as the key and a value of 0 or 1 depending on whether it has already been processed. I managed to get the base logic working reasonably well with PyQt4, but it kept segfaulting at random times, often enough to make me distrust it, and then I found this: http://blog.motane.lu/2009/06/18/pywebkitgtk-execute-javascript-from-python/

Neat script, and it works, but I've never dealt with GTK in Python before. Bolting my logic around it was fairly straightforward, however it's proving to be a bit of a memory hog. Profiling it with meliae shows nothing taking up that much memory, even as the Python process approaches 2 GB. The site has a fair few pages, and the script eventually hits the 32-bit memory limit and segfaults. My assumption is that the code keeps spawning more and more WebKit windows, and I'm at a loss as to how to actually close or destroy them. I've tried destroy, and there is a main_quit in there, but nothing seems to close them.

Here are what should be the relevant parts (I hope), with the destination URL changed. I was using dicts for urls and foundurls but switched to anydbm in case they were, for some bizarre reason, the memory hog. I'll probably switch back to dicts at some point:

```python
#!/usr/bin/env python
import sys, thread
import gtk
import webkit
import warnings
from time import sleep
from BeautifulSoup import BeautifulSoup
import re
import os
import anydbm
import copy
from meliae import scanner

warnings.filterwarnings('ignore')


class WebView(webkit.WebView):
    def get_html(self):
        self.execute_script('oldtitle=document.title;document.title=document.documentElement.innerHTML;')
        html = self.get_main_frame().get_title()
        self.execute_script('document.title=oldtitle;')
        self.destroy
        return html


class Crawler(gtk.Window):
    def __init__(self, url, file):
        gtk.gdk.threads_init()  # suggested by Nicholas Herriot for Ubuntu Koala
        gtk.Window.__init__(self)
        self._url = url
        self._file = file
        self.connect("destroy", gtk.main_quit)

    def crawl(self):
        view = WebView()
        view.open(self._url)
        view.connect('load-finished', self._finished_loading)
        self.add(view)
        gtk.main()
        return view.get_html()

    def _finished_loading(self, view, frame):
        with open(self._file, 'w') as f:
            f.write(view.get_html())
        gtk.main_quit()
```

...various subroutines that just handle the BeautifulSoup end of things: processing the pages, pulling out links, tidying them up, etc...

```python
def main():
    urls = anydbm.open('./urls', 'n')
    domain = "stackoverflow.com"
    baseUrl = 'http://' + domain
    urls['/'] = '0'

    while (check_done(urls) == 0):
        count = 0
        foundurls = anydbm.open('./foundurls', 'n')
        for url, done in urls.iteritems():
            if done == 1:
                continue
            print "Processing", url
            urls[str(url)] = '1'
            if (re.search(".*\/$", url)):
                outfile = domain + url + "index.html"
            elif (os.path.isdir(os.path.dirname(os.path.abspath(outfile)))):
                outfile = domain + url + "index.html"
            else:
                outfile = domain + url
            if not os.path.exists(os.path.dirname(os.path.abspath(outfile))):
                os.makedirs(os.path.dirname(os.path.abspath(outfile)))
            crawler = Crawler(baseUrl + url, outfile)
            html = crawler.crawl()
            soup = BeautifulSoup(html.__str__())
            for link in hrefs(soup, baseUrl):
                if not foundurls.has_key(str(link)):
                    foundurls[str(link)] = '0'
            del(html)  # this is an attempt to get the object to vanish, tried del(Crawler) to no avail
            if count == 5:
                scanner.dump_all_objects('filename')
                count = 0
            else:
                count = count + 1
        for url, done in foundurls.iteritems():
            if not urls.has_key(str(url)):
                urls[str(url)] = '0'
        foundurls.close()
        os.remove('./foundurls')
    urls.close()
    os.remove('./urls')

if __name__ == '__main__':
    main()
```
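For reference, the bookkeeping described above ("a dict with urls found as the key and the value being 0 or 1") looks roughly like the sketch below when done with a plain dict. This is only an illustration of the intended logic, not code from the question; `mirror`, `crawl_page` and `extract_links` are hypothetical names standing in for the Crawler/BeautifulSoup machinery.

```python
# Minimal sketch of the URL bookkeeping described in the question:
# keys are URLs, values are 0 (not yet processed) or 1 (done).
def mirror(base_url, crawl_page, extract_links):
    urls = {'/': 0}                       # seed with the site root
    while any(done == 0 for done in urls.values()):
        for url in list(urls):            # snapshot the keys; the dict grows as links are found
            if urls[url]:
                continue
            urls[url] = 1                 # mark as processed before fetching
            html = crawl_page(base_url + url)
            for link in extract_links(html):
                urls.setdefault(link, 0)  # newly discovered URLs start as unprocessed
    return urls
```

The anydbm version in `main()` is the same idea, just with string values '0'/'1' kept in an on-disk table instead of an in-memory dict.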
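On the window question itself, one thing that is often tried in plain PyGTK is to destroy the `Crawler` window explicitly after each page and let GTK flush its pending events, so each `gtk.Window`/`WebView` pair is released before the next one is created. This is only a hedged sketch against the names used in `main()`; whether it actually cures the growth here is unverified.

```python
# Sketch only: tear the window down after each page, on the assumption that
# the growth comes from Crawler windows (and their WebView children) that are
# never destroyed. destroy(), events_pending() and main_iteration() are
# standard PyGTK calls; this is not a confirmed fix for the script above.
crawler = Crawler(baseUrl + url, outfile)
try:
    html = crawler.crawl()
finally:
    crawler.destroy()                # destroys the window together with its WebView child
    while gtk.events_pending():      # let GTK actually process the destroy before the next page
        gtk.main_iteration()
```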