Python iterate through section using lxml
    <p>I have a webpage that I am currently parsing with BeautifulSoup, but it is quite slow, so I have decided to try lxml, as I have read that it is very fast.</p>
    <p>I am struggling to get my code to iterate over only the section I want. I am not sure how to use lxml, and I can't find clear documentation on it.</p>
    <p>Here is my code:</p>
    <pre><code>import urllib, urllib2
from lxml import etree

def wgetUrl(target):
    try:
        req = urllib2.Request(target)
        req.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.3 Gecko/2008092417 Firefox/3.0.3')
        response = urllib2.urlopen(req)
        outtxt = response.read()
        response.close()
    except:
        return ''
    return outtxt

newUrl = 'http://www.tv3.ie/3player'

data = wgetUrl(newUrl)
parser = etree.HTMLParser()
tree = etree.fromstring(data, parser)

for elem in tree.iter("div"):
    print elem.tag, elem.attrib, elem.text
</code></pre>
    <p>This returns all the divs, but how do I iterate only through the div with id='slider1'?</p>
    <pre><code>div {'style': 'position: relative;', 'id': 'slider1'} None
</code></pre>
    <p>This does not work:</p>
    <pre><code>for elem in tree.iter("slider1"):
</code></pre>
    <p>I know this is probably a dumb question, but I can't figure it out.</p>
    <p>Thanks!</p>
    <p><strong>EDIT</strong></p>
    <p>With your help, after adding this code I now get the output below:</p>
    <pre><code>for elem in tree.xpath("//div[@id='slider1']//div[@id='gridshow']"):
    print elem[0].tag, elem[0].attrib, elem[0].text
    print elem[1].tag, elem[1].attrib, elem[1].text
    print elem[2].tag, elem[2].attrib, elem[2].text
    print elem[3].tag, elem[3].attrib, elem[3].text
    print elem[4].tag, elem[4].attrib, elem[4].text
</code></pre>
    <p>Output:</p>
    <pre><code>a {'href': '/3player/show/392/57922/1/Tallafornia', 'title': '3player | Tallafornia, 11/01/2013. The Tallafornia crew are back, living in a beachside villa in Santa Ponsa, Majorca. As the crew settle in, the egos grow bigger than ever and cause tension'} None
h3 {} None
span {'id': 'gridcaption'} The Tallafornia crew are back, living in a beachside vill...
span {'id': 'griddate'} 11/01/2013
span {'id': 'gridduration'} 00:27:52
</code></pre>
    <p>That is all brilliant, but I am missing part of the a tag above. Could the parser be handling the markup incorrectly?</p>
    <p>I'm not getting the following:</p>
    <pre><code>&lt;img alt="3player | Tallafornia, 11/01/2013. The Tallafornia crew are back, living in a beachside villa in Santa Ponsa, Majorca. As the crew settle in, the egos grow bigger than ever and cause tension" src='http://content.tv3.ie/content/videos/0378/tallaforniaep2_fri11jan2013_3player_1_57922_180x102.jpg' class='shadow smallroundcorner'&gt;&lt;/img&gt;
</code></pre>
    <p>Any ideas why it doesn't pull this?</p>
    <p>Thanks again, very helpful posts.</p>
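    For reference, here is a minimal sketch of one way to limit the iteration to a single div and reach the nested image. `iter()` filters by tag name, not by id, so selecting on `id` needs a path predicate such as `[@id='slider1']`. The sketch assumes the `img` sits inside the `a` element, which would explain why it does not appear among the direct children of the `gridshow` div. The `SAMPLE` markup, the `parse` helper, and the `extract_shows` function are hypothetical stand-ins, not the real page or any library API. The code is written for Python 3 and falls back to the standard library's ElementTree, which shares the `find`/`findall` API, when lxml is not installed.

```python
try:
    from lxml import etree  # preferred: the library the question uses

    def parse(markup):
        # Lenient HTML parsing, as in the question.
        return etree.fromstring(markup, etree.HTMLParser())
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback with the same find/findall API

    def parse(markup):
        # Strict XML parsing; acceptable here only because SAMPLE is well-formed.
        return etree.fromstring(markup)

# Hypothetical, simplified stand-in for the page markup (an assumption).
SAMPLE = """
<html><body>
<div id="header"><div id="logo">ignored</div></div>
<div id="slider1" style="position: relative;">
  <div id="gridshow">
    <a href="/3player/show/392/57922/1/Tallafornia" title="3player | Tallafornia">
      <img alt="3player | Tallafornia" src="tallafornia.jpg" class="shadow smallroundcorner"/>
    </a>
    <h3>Tallafornia</h3>
    <span id="griddate">11/01/2013</span>
    <span id="gridduration">00:27:52</span>
  </div>
</div>
</body></html>
"""

def extract_shows(markup):
    """Collect link, image, and date details for each show inside div#slider1."""
    tree = parse(markup)
    shows = []
    # Select by id with a predicate; with lxml, tree.xpath("//div[@id='slider1']...")
    # as used in the question would work equally well.
    for show in tree.findall(".//div[@id='slider1']//div[@id='gridshow']"):
        link = show.find("a")  # direct child of the gridshow div
        # The img is a child of the link, not of the gridshow div itself.
        img = link.find("img") if link is not None else None
        shows.append({
            "href": link.get("href") if link is not None else None,
            "img_src": img.get("src") if img is not None else None,
            "date": show.findtext("span[@id='griddate']"),
            "duration": show.findtext("span[@id='gridduration']"),
        })
    return shows

for entry in extract_shows(SAMPLE):
    print(entry)
```

    Looking children up by tag name rather than by position (`elem[0]`, `elem[1]`, ...) keeps the loop working if the page adds or reorders elements.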