Python - BeautifulSoup HTML parsing handles GBK encoding poorly - Chinese web scraping problem
<p>I have been tinkering with the following script:</p>
<pre><code># -*- coding: utf8 -*-
import codecs
from BeautifulSoup import BeautifulSoup, NavigableString, UnicodeDammit
import urllib2,sys
import time
try:
    import timeoutsocket # http://www.timo-tasi.org/python/timeoutsocket.py
    timeoutsocket.setDefaultSocketTimeout(10)
except ImportError:
    pass

h=u'\u3000\u3000\u4fe1\u606f\u901a\u4fe1\u6280\u672f'

address=urllib2.urlopen('http://stock.eastmoney.com/news/1408,20101022101395594.html').read()
soup=BeautifulSoup(address)

p=soup.findAll('p')
t=p[2].string[:10]
</code></pre>
<p>with the following output:</p>
<pre><code>&gt;&gt;&gt; print t
¡¡¡¡ÐÅϢͨ
&gt;&gt;&gt; print h
  信息通
&gt;&gt;&gt; t
u'\xa1\xa1\xa1\xa1\xd0\xc5\xcf\xa2\xcd\xa8'
&gt;&gt;&gt; h
u'\u3000\u3000\u4fe1\u606f\u901a'
&gt;&gt;&gt; h.encode('gbk')
'\xa1\xa1\xa1\xa1\xd0\xc5\xcf\xa2\xcd\xa8'
</code></pre>
<p>Simply put: when I pass this HTML through BeautifulSoup, it treats the GBK-encoded bytes as if they were already Unicode code points and does not recognize that they need to be decoded first. "h" and "t" should be the same, since "h" is just the text from the HTML file, which I converted manually.</p>
<p>How do I solve this problem?</p>
<p>Best,</p>
<p>wheaton</p>
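The mismatch described above can be reproduced without the network. The following is a minimal sketch, not a confirmed diagnosis of what BeautifulSoup did internally: it assumes the GBK bytes were decoded with a single-byte encoding such as Latin-1, which is consistent with the values of "t" and "h" shown in the session. The helper name `fix_mojibake` is hypothetical, and the snippet is written so that it also runs on Python 3.

```python
# -*- coding: utf-8 -*-
# Values copied from the interpreter session in the question.
h = u'\u3000\u3000\u4fe1\u606f\u901a'                   # the intended text
t = u'\xa1\xa1\xa1\xa1\xd0\xc5\xcf\xa2\xcd\xa8'         # what the parser returned


def fix_mojibake(text, encoding='gbk'):
    """Hypothetical helper: undo a wrong single-byte decode.

    Assumes `text` came from bytes that were decoded as Latin-1,
    so every character maps back to exactly one original byte.
    """
    raw = text.encode('latin-1')   # recover the original byte string
    return raw.decode(encoding)    # decode it with the real encoding


# Decoding the GBK bytes as Latin-1 reproduces the garbled string...
assert h.encode('gbk').decode('latin-1') == t
# ...and reversing that step recovers the intended text.
assert fix_mojibake(t) == h
```

Repairing the string after the fact is a workaround. It is usually cleaner to give the parser correctly decoded input, for example by calling `address.decode('gbk')` before parsing, or by passing the encoding to the constructor (`fromEncoding='gbk'` in BeautifulSoup 3, `from_encoding='gbk'` in bs4). Whether the page really is GBK is an assumption taken from the question.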