Handling Indian Languages in BeautifulSoup
<p>I'm trying to scrape the <a href="http://en.wikipedia.org/wiki/NDTV" rel="nofollow">NDTV</a> website for news titles. <a href="http://archives.ndtv.com/articles/2012-01.html" rel="nofollow">This</a> is the page I'm using as an HTML source. I'm using BeautifulSoup (bs4) to handle the HTML, and everything works except that my code breaks when it encounters the Hindi titles on the page I linked to.</p>
<p>My code so far is:</p>
<pre><code>import urllib2
from bs4 import BeautifulSoup

htmlUrl = "http://archives.ndtv.com/articles/2012-01.html"
FileName = "NDTV_2012_01.txt"

fptr = open(FileName, "w")
fptr.seek(0)

page = urllib2.urlopen(htmlUrl)
soup = BeautifulSoup(page, from_encoding="UTF-8")

li = soup.findAll( 'li')
for link_tag in li:
    hypref = link_tag.find('a').contents[0]
    strhyp = str(hypref)
    fptr.write(strhyp)
    fptr.write("\n")
</code></pre>
<p>The error I get is:</p>
<pre><code>Traceback (most recent call last):
  File "./ScrapeTemplate.py", line 30, in &lt;module&gt;
    strhyp = str(hypref)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128)
</code></pre>
<p>I got the same error even when I didn't include the <code>from_encoding</code> parameter. I initially wrote it as <code>fromEncoding</code>, but Python warned me that this usage was deprecated.</p>
<p>How do I fix this? From what I've read, I need to either skip the Hindi titles or explicitly encode them as non-ASCII text, but I don't know how to do that. Any help would be greatly appreciated!</p>
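One possible direction, offered as a sketch rather than a definitive answer: the traceback points at `str(hypref)`, which on Python 2 tries to encode the Unicode title with the ASCII codec. Keeping the title as Unicode text and letting the file object encode it as UTF-8 on write avoids that step. The example below does not touch the network or BeautifulSoup; `write_titles`, `sample_titles`, and `out_path` are hypothetical names, and the sample strings merely stand in for the values that `link_tag.find('a').contents[0]` would yield.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def write_titles(titles, path):
    """Write each title on its own line, encoded as UTF-8."""
    # io.open accepts Unicode text on both Python 2 and Python 3 and
    # encodes it with the given codec, so no str() conversion is needed.
    with io.open(path, "w", encoding="utf-8") as fptr:
        for title in titles:
            fptr.write(title + u"\n")


# Hypothetical sample data: one ASCII title and one Devanagari title.
sample_titles = [u"Top stories of the day", u"\u0928\u092e\u0938\u094d\u0924\u0947"]
out_path = os.path.join(tempfile.gettempdir(), "titles_demo.txt")
write_titles(sample_titles, out_path)

# Read the file back with the same codec to confirm the round trip.
with io.open(out_path, encoding="utf-8") as f:
    read_back = f.read().splitlines()
```

In the scraping loop itself, the equivalent change on Python 2 would presumably be to open the output file with `io.open(FileName, "w", encoding="utf-8")` and to write `unicode(hypref)` instead of `str(hypref)`.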