What is the best way to handle a bad link given to BeautifulSoup?

I'm working on something that pulls in URLs from Delicious and then uses those URLs to discover associated feeds.

However, some of the bookmarks in Delicious are not HTML links and cause BeautifulSoup to barf. Basically, I want to throw away a link if BeautifulSoup fetches it and it does not look like HTML.

Right now, this is what I'm getting:

```
trillian:Documents jauderho$ ./d2o.py "green data center"
processing http://www.greenm3.com/
processing http://www.eweek.com/c/a/Green-IT/How-to-Create-an-EnergyEfficient-Green-Data-Center/?kc=rss
Traceback (most recent call last):
  File "./d2o.py", line 53, in <module>
    get_feed_links(d_links)
  File "./d2o.py", line 43, in get_feed_links
    soup = BeautifulSoup(html)
  File "/Library/Python/2.5/site-packages/BeautifulSoup.py", line 1499, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "/Library/Python/2.5/site-packages/BeautifulSoup.py", line 1230, in __init__
    self._feed(isHTML=isHTML)
  File "/Library/Python/2.5/site-packages/BeautifulSoup.py", line 1263, in _feed
    self.builder.feed(markup)
  File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/HTMLParser.py", line 150, in goahead
    k = self.parse_endtag(i)
  File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/HTMLParser.py", line 314, in parse_endtag
    self.error("bad end tag: %r" % (rawdata[i:j],))
  File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: bad end tag: u'</b />', at line 739, column 1
```

**Update:**

Jehiah's answer does the trick. For reference, here's some code to get the content type (Python 2):

```python
import urllib

def check_for_html(link):
    # Fetch the URL and return the value of its Content-Type header.
    out = urllib.urlopen(link)
    return out.info().getheader('Content-Type')
```
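Once you have the header, you still need to decide what counts as HTML: servers usually append parameters like `; charset=utf-8`, and a missing header should be treated as "not HTML". As a minimal sketch of that check (the helper `is_probably_html` is illustrative, not from the original code, and needs no network access):

```python
def is_probably_html(content_type):
    """Return True if a Content-Type header value looks like HTML."""
    if content_type is None:
        # No header at all: safest to assume it is not HTML.
        return False
    # Strip parameters such as "; charset=utf-8" before comparing the MIME type.
    mime = content_type.split(';', 1)[0].strip().lower()
    return mime in ('text/html', 'application/xhtml+xml')
```

A caller could then skip any link where `is_probably_html(check_for_html(link))` is false before ever handing the body to BeautifulSoup.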
 
