Python - BeautifulSoup HTML parsing handles GBK encoding poorly - Chinese web scraping problem
<p>I have been tinkering with the following script:</p>
<pre><code># -*- coding: utf8 -*-
import codecs
from BeautifulSoup import BeautifulSoup, NavigableString, UnicodeDammit
import urllib2,sys
import time
try:
    import timeoutsocket # http://www.timo-tasi.org/python/timeoutsocket.py
    timeoutsocket.setDefaultSocketTimeout(10)
except ImportError:
    pass

h=u'\u3000\u3000\u4fe1\u606f\u901a\u4fe1\u6280\u672f'

address=urllib2.urlopen('http://stock.eastmoney.com/news/1408,20101022101395594.html').read()
soup=BeautifulSoup(address)

p=soup.findAll('p')
t=p[2].string[:10]
</code></pre>
<p>with the following output:</p>
<pre><code>&gt;&gt;&gt; print t
¡¡¡¡ÐÅϢͨ
&gt;&gt;&gt; print h
  信息通
&gt;&gt;&gt; t
u'\xa1\xa1\xa1\xa1\xd0\xc5\xcf\xa2\xcd\xa8'
&gt;&gt;&gt; h
u'\u3000\u3000\u4fe1\u606f\u901a'
&gt;&gt;&gt; h.encode('gbk')
'\xa1\xa1\xa1\xa1\xd0\xc5\xcf\xa2\xcd\xa8'
</code></pre>
<p>Simply put: when I pass this HTML through BeautifulSoup, it takes the GBK-encoded text and treats it as Unicode, not recognizing that it needs to be decoded first. "h" and "t" should be the same, however, since h is just the text from the HTML file, converted manually.</p>
<p>How do I solve this problem?</p>
<p>Best,</p>
<p>wheaton</p>
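The session above suggests that each GBK byte ended up as its own code point: `t` holds the same values as `h.encode('gbk')`, only as characters rather than bytes. The following is a minimal offline sketch of that reading, not a confirmed diagnosis of what BeautifulSoup did internally. It assumes a Latin-1-style misdecode, uses no network access or BeautifulSoup import, and the helper name `repair_mojibake` is hypothetical.

```python
# -*- coding: utf-8 -*-
# Sketch only: reproduces the symptom offline, without the network and
# without BeautifulSoup. `repair_mojibake` is a hypothetical helper name.

# The expected text, as shown for `h` in the question's session.
h = u'\u3000\u3000\u4fe1\u606f\u901a'

# What the server sends: the same text as GBK bytes.
raw = h.encode('gbk')

# Assumption: the parser mapped each byte to one code point (Latin-1 style),
# which yields the garbled string `t` from the question.
t = raw.decode('latin-1')


def repair_mojibake(text, real_encoding='gbk'):
    """Undo a byte-per-character misdecode by re-encoding, then decoding properly."""
    return text.encode('latin-1').decode(real_encoding)


fixed = repair_mojibake(t)
print(fixed == h)

# Usually preferable to repairing afterwards: decode before parsing, e.g.
#   html = raw_bytes.decode('gbk', 'replace')
# or name the encoding when constructing the soup
# (BeautifulSoup 3: fromEncoding='gbk'; bs4: from_encoding='gbk').
```

Decoding the downloaded bytes up front, or passing the encoding to the parser, avoids relying on encoding detection at all. The round trip through Latin-1 is only a fallback for text that has already been misdecoded.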
 
