  1. Help (or advice) me get started with lxml
    <p>I am trying to learn python, and I actually feel that "learn python the hard way", "a byte of python", and "head first python" are really great books. However - now that I want to start a "real" project, lxml makes me feel like a complete git.</p> <p><strong>This is what I would like to do (objectives)</strong></p> <p>I am trying to parse a newspaper site's articles about politics.</p> <p>The url is <a href="http://politiken.dk/politik/" rel="nofollow">http://politiken.dk/politik/</a></p> <p>The final project should</p> <ul> <li>1) each day (maybe each hour) visit the above URL</li> <li>2) for each relevant article, save the url to a database. The relevant articles are in a <code>&lt;div class="w460 section_forside sec-forside"&gt;</code>. Some of the elements have images, some don't.</li> </ul> <p>I would like to save the following:</p> <ul> <li>a - the headline (<code>&lt;h1 class="top-art-header fs-26"&gt;</code>)</li> <li>b - the subheader (<code>&lt;p class="subheader-art"&gt;</code>)</li> <li><p>c - if the element has a corresponding img, then the "alt" or "title" attribute</p></li> <li><p>3) visit each relevant URL, scrape the article's body and save it to the database.</p></li> <li>4) if a relevant URL is already in the database, skip that URL (the relevant articles as defined above are always the latest 10 published)</li> </ul> <p>The desired result should be a database table with fields:</p> <ul> <li>art.i) ID</li> <li>art.ii) URL</li> <li>art.iii) headline</li> <li>art.iv) subheader</li> <li>art.v) img alt</li> <li>art.vi) article body</li> <li>art.vii) date and time (a string located in <code>&lt;span class="date tr-upper m-top-2"&gt;</code>) </li> </ul> <p>The above is what I would like help to accomplish. Since screen-scraping is not always benevolent, I would like to explain <strong>why I want to do this</strong>.</p> <p>Basically I want to mine the data for occurrences of members of parliament or political parties. 
I will not republish the articles, sell the data or some such thing (I have not checked the legality of my approach, but hope and think it should be legal).</p> <p>I imagine I have a table of politicians and a table of political parties.</p> <p>For each politician I will have:</p> <ul> <li>pol.i) ID</li> <li>pol.ii) first_name</li> <li>pol.iii) sur_name</li> <li>pol.iv) party</li> </ul> <p>For each political party I will have:</p> <ul> <li>party.i) ID</li> <li>party.ii) correct-name</li> <li>party.iii) calling-name</li> <li>party.iv) abbreviation</li> </ul> <p>I want to do this for several Danish newspaper sites, and then analyse whether one newspaper gives preference to some politicians / parties - simply based on number of mentions.</p> <p>This I will also need help to do - but one step at a time :-)</p> <p>Later I would like to explore NLTK and the possibilities for sentiment mining.</p> <p>I want to see if this could turn into a ph.d. project in political science/journalism.</p> <p><strong>This is basically what I have (i.e. nothing)</strong></p> <p>I really have a hard time wrapping my head around lxml, the concept of elements, the different parsers etc. I have of course read the tutorials but I am still very much stuck.</p> <pre><code>import lxml.html

url = "http://politiken.dk/politik/"
root = lxml.html.parse(url).getroot()

# this should return all the relevant elements
# does not work:
#relevant = root.cssselect("divi.w460 section_forside sec-forside")
# the class has spaces in the name - but I can't seem to escape them?

# this will return the headlines of all the linked articles
artikler = root.cssselect("h1.top-art-header")

# narrowing down, we use the same call to get just the URLs of the articles that we have already retrieved
# these urls we will later mine, and subsequently skip
retrived_urls = []
for a in root.cssselect("h1.top-art-header a"):
    # store the link target (href) rather than the element itself
    retrived_urls.append(a.get("href"))
# this works.
</code></pre> <p><strong>What I hope to get from the answers</strong></p> <p>First off - as long as you don't call me (very bad) names - I would continue to be happy.</p> <ul> <li>But what I really hope for is a simple-to-understand explanation of how lxml works. If I know what tools to use for the above tasks it would be so much easier for me to really "dive into lxml". Maybe because of my short attention span, I currently get disillusioned when reading stuff way above my level of understanding, when I am not even sure that I am looking in the right place.</li> <li>If you could provide any example code that fits some of the tasks, that would be really great. I hope to turn this project into a ph.d. but I am sure this sort of thing must have been done a thousand times already? If so, it is my experience that learning from others is a great way to get smarter.</li> <li>If you feel strongly that I should forget about lxml and use e.g. scrapy or html5lib then please say so :-) I started to look into html5lib because Drew Conway suggests it in a blog post about python tools for the political scientist, but I couldn't find any introductory-level material. Also, lxml is what the good people at scraperwiki recommend. As for scrapy, this might be the best solution, but I am afraid that scrapy is too much of a framework - as such really good if you know what you are doing, and want to do it fast, but maybe not the best way to learn python magic.</li> <li>I plan on using a relational database, but if you think e.g. mongo would be an advantage, I will change my plans.</li> <li>Since I can't install lxml for python 3.1 I am using 2.6. If this is wrong - please say so also.</li> </ul> <p><strong>Timeframe</strong></p> <p>I have asked a bunch of beginner questions on stackoverflow. Too many to be proud of. But with more than a full-time job I never seem to be able to bury myself in code and just absorb the skillz I so long for. 
I hope this will be a question/answer that I can come back to regularly to update what I have learned, and relearn what I have forgotten. This also means that this question will most likely remain active for quite some time. But I will comment on every answer that I might be lucky enough to receive, and I will continuously update the "what I have" section.</p> <p>Currently I feel that I might have bitten off more than I can chew - so now it's back to "head first python" and "learn python the hard way".</p> <p><strong>Final words</strong></p> <p>If you have gotten this far - you are amazing - even if you don't answer the question. You have now read a lot of simple, confused, and stupid questions (I am proud of asking those questions, so don't argue). You should grab a coffee and a filterless smoke and congratulate yourself :-)</p> <p>Happy holidays (in Denmark we celebrate Easter and currently the sun is shining like Samuel Jackson's wallet in Pulp Fiction)</p> <p><strong>Edits</strong></p> <p>It seems BeautifulSoup is a good choice. As per the developer, however, BeautifulSoup is not a good choice if I want to use python3. But as per <a href="http://www.mail-archive.com/python-list@python.org/msg307978.html" rel="nofollow">this</a> I would prefer python3 (not strongly though).</p> <p>I have also discovered that there is an lxml chapter in "dive into python 3". Will look into that as well.</p>
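
The article table (art.i to art.vii) and the "skip a URL that is already in the database" rule from objective 4 could be sketched roughly as below. This is only an illustration under assumptions that are not in the post: it uses Python's built-in sqlite3 module as the relational database, and the table name `articles`, the column names, and the helpers `open_db`, `is_known` and `save_article` are all hypothetical.

```python
import sqlite3

# Hypothetical schema mirroring the fields listed in the question (art.i to art.vii).
# The UNIQUE constraint on url is what lets the database reject repeats.
SCHEMA = """
CREATE TABLE IF NOT EXISTS articles (
    id        INTEGER PRIMARY KEY,
    url       TEXT NOT NULL UNIQUE,
    headline  TEXT,
    subheader TEXT,
    img_alt   TEXT,
    body      TEXT,
    published TEXT
)
"""


def open_db(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def is_known(conn, url):
    """Return True if this URL has already been stored."""
    row = conn.execute("SELECT 1 FROM articles WHERE url = ?", (url,)).fetchone()
    return row is not None


def save_article(conn, url, headline, subheader=None, img_alt=None,
                 body=None, published=None):
    """Insert the article unless its URL is already stored.

    Returns True if a new row was inserted, False if the URL was skipped.
    """
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO articles "
            "(url, headline, subheader, img_alt, body, published) "
            "VALUES (?, ?, ?, ?, ?, ?)",
            (url, headline, subheader, img_alt, body, published),
        )
    return cur.rowcount == 1
```

With something like this, the scraper could call `is_known(conn, url)` before fetching an article page, so that already stored articles are never downloaded twice, and `save_article(...)` once the body has been scraped.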