Python web scraping involving HTML tags with attributes
<p>I'm trying to write a web scraper that parses a web page of publications and extracts the authors. The skeleton of the page is the following:</p>
<pre><code>&lt;html&gt;
&lt;body&gt;
  &lt;div id="container"&gt;
    &lt;div id="contents"&gt;
      &lt;table&gt;
        &lt;tbody&gt;
          &lt;tr&gt;
            &lt;td class="author"&gt;####I want whatever is located here ###&lt;/td&gt;
          &lt;/tr&gt;
        &lt;/tbody&gt;
      &lt;/table&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/body&gt;
&lt;/html&gt;
</code></pre>
<p>I've been using BeautifulSoup and lxml so far, but I'm not sure how to handle the two div tags and the td tag, because they have attributes. I'm also not sure whether I should rely more on BeautifulSoup, on lxml, or on a combination of both. What should I do?</p>
<p>At the moment, my code looks like this:</p>
<pre><code>import re
import urllib2,sys
import lxml
from lxml import etree
from lxml.html.soupparser import fromstring
from lxml.etree import tostring
from lxml.cssselect import CSSSelector
from BeautifulSoup import BeautifulSoup, NavigableString

address='http://www.example.com/'
html = urllib2.urlopen(address).read()
soup = BeautifulSoup(html)
html=soup.prettify()
html=html.replace('&amp;nbsp', '&amp;#160')
html=html.replace('&amp;iacute','&amp;#237')
root=fromstring(html)
</code></pre>
<p>I realize that many of the import statements may be redundant, but I just copied whatever I currently had in my source file.</p>
<p>EDIT: I suppose I didn't make this quite clear, but there are multiple tags in the page that I want to scrape.</p>
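<p>One possible approach, sketched here under assumptions: the attributes on the enclosing div tags do not need to be handled at all if the td cells are matched directly by their class. The example below uses only Python 3's standard-library <code>html.parser</code>, not the BeautifulSoup or lxml code from the question. The names <code>AuthorCellParser</code>, <code>extract_authors</code>, and the sample author names are hypothetical and exist only for illustration. It also assumes author cells are not nested inside one another. With the current BeautifulSoup package (bs4), the corresponding lookup would be <code>soup.find_all("td", class_="author")</code>.</p>

```python
from html.parser import HTMLParser


class AuthorCellParser(HTMLParser):
    """Hypothetical helper: collect the text of every <td class="author"> cell."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &iacute; into characters
        super().__init__(convert_charrefs=True)
        self.authors = []    # finished cell texts, in document order
        self._parts = None   # text fragments of the open author cell, else None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get("class") or "").split()
        if tag == "td" and "author" in classes:
            self._parts = []

    def handle_data(self, data):
        # Only keep text while we are inside an author cell
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._parts is not None:
            self.authors.append("".join(self._parts).strip())
            self._parts = None


def extract_authors(html):
    """Return the stripped text of all author cells found in the HTML string."""
    parser = AuthorCellParser()
    parser.feed(html)
    parser.close()
    return parser.authors


# Made-up sample page that mirrors the skeleton from the question
sample = """
<html><body><div id="container"><div id="contents"><table><tbody>
<tr><td class="author">Ada Lovelace</td></tr>
<tr><td class="author">Jos&eacute; Garc&iacute;a</td></tr>
</tbody></table></div></div></body></html>
"""

print(extract_authors(sample))
```

<p>Because the match is on the td class alone, the same call returns every author cell in the page, which covers the case of multiple tags mentioned in the edit. Decoding the entities in the parser also makes the manual <code>replace</code> calls for <code>&amp;nbsp</code> and <code>&amp;iacute</code> unnecessary in this sketch.</p>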
 
