<p>It's often easier to load the data as a string and sort out its encoding first, and then call fromstring on it, rather than rely on the lxml.etree.parse() function and its difficult-to-manage encoding options.</p>
<p>This particular RSS file begins with an encoding declaration, so everything should just work:</p>
<pre><code>&lt;?xml version="1.0" encoding="utf-8"?&gt;
</code></pre>
<p>The following code shows some of the variations you can apply to make etree parse different encodings. You can also ask it to write out a different encoding, which then appears in the declaration at the top of the output.</p>
<pre><code># Python 2 (urllib2, print statement)
import lxml.etree
import urllib2

request = urllib2.Request('http://wiadomosci.onet.pl/kraj/rss.xml')
response = urllib2.urlopen(request).read()
print [response]
# ['&lt;?xml version="1.0" encoding="utf-8"?&gt;\n&lt;feed xmlns=... &lt;title&gt;Wiadomo\xc5\x9bci...']

uresponse = response.decode("utf8")
print [uresponse]
# [u'&lt;?xml version="1.0" encoding="utf-8"?&gt;\n&lt;feed xmlns=... &lt;title&gt;Wiadomo\u015bci...']

# parsing the raw byte string works; the declaration tells lxml the encoding
tree = lxml.etree.fromstring(response)
res = lxml.etree.tostring(tree)
print [res]
# ['&lt;feed xmlns="http://www.w3.org/2005/Atom"&gt;\n&lt;title&gt;Wiadomo&amp;#347;ci...']

lres = lxml.etree.tostring(tree, encoding="latin1")
print [lres]
# ["&lt;?xml version='1.0' encoding='latin1'?&gt;\n&lt;feed xmlns=...&lt;title&gt;Wiadomo&amp;#347;ci..."]

# works because the 38 character encoding declaration is sliced off
print lxml.etree.fromstring(uresponse[38:])

# throws ValueError(u'Unicode strings with encoding declaration are not supported.',)
print lxml.etree.fromstring(uresponse)
</code></pre>
<p>Code can be tried here: <a href="http://scraperwiki.com/scrapers/lxml_and_encoding_declarations/edit/#" rel="nofollow">http://scraperwiki.com/scrapers/lxml_and_encoding_declarations/edit/#</a></p>
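<p>Slicing off exactly 38 characters only works when the declaration has exactly that length. As a minimal sketch (not part of lxml; the names <code>strip_xml_declaration</code>, <code>_XML_DECL</code> and <code>sample</code> are made up for illustration), a regular expression can remove whatever declaration is present before the decoded text is handed to <code>lxml.etree.fromstring</code>. It assumes the declaration, if any, sits at the very start of the string.</p>

```python
import re

# Hypothetical helper, not an lxml API: matches a leading XML declaration
# such as <?xml version="1.0" encoding="utf-8"?> plus trailing whitespace.
_XML_DECL = re.compile(r'^\s*<\?xml\s[^>]*\?>\s*')


def strip_xml_declaration(text):
    """Return text without its leading XML declaration, if it has one."""
    return _XML_DECL.sub('', text, count=1)


# Made-up sample standing in for the decoded feed (uresponse above).
sample = u'<?xml version="1.0" encoding="utf-8"?>\n<feed><title>Wiadomo\u015bci</title></feed>'

# The declaration is gone, so the decoded string no longer carries the
# encoding declaration that triggers the ValueError shown above.
body = strip_xml_declaration(sample)
```

<p>A string without a declaration passes through unchanged, so the helper can be applied unconditionally. Re-encoding the text to bytes and parsing the bytes is another option when the declared encoding is known to match.</p>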
 
