
# Handling variable special characters in Python from pulled HTML pages
When I get HTML data using Mechanize, I store it in a variable, let's call it `HTML_RESPONSE`. Once this is done, I parse it and extract three things: title, short description and long description.

The problem I am facing is that the short or long description may contain characters such as &, £, $ and so forth.

The problem arises when I try to put these into an XML document and save it, since Python fails when I try to decode them.

For example, here is a short description from the page:

```
S_DESC = "Senior VP of Treasury and Corporate Finance & ERM, RTL Group, has been invited to the above conference to present a Case Study on Integrating Strategy and Risk into Enterprise Risk Management"
```

The way I am decoding:

```
#!/usr/bin/python
# -*- coding: ISO-8859-1 -*-

print S_DESC.decode('UTF-8').encode('ascii', 'xmlcharrefreplace')
```

This works fine on ampersands. If I then get an `S_DESC` with a pound sterling sign, my script breaks with this output:

**UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3'**

Below is the portion of my script where this code fails (the above exception is thrown in the very last line, every time I get a pound sterling sign). I would like to know whether there is a universal way of telling Python to just handle these characters on its own.
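(A side note on the traceback for later readers: a minimal sketch, in Python 3 syntax, of why the strict `'ascii'` codec chokes on `u'\xa3'` while the `'xmlcharrefreplace'` error handler does not. In Python 2, calling `.decode('UTF-8')` on a value that is *already* `unicode` triggers exactly such an implicit strict-ASCII encode first, which is one likely source of this error.)

```python
# Minimal reproduction of the failure mode: '\xa3' is the pound sign.
s_desc = u"\u00a316 Billion Spent on new reforms backfire."

try:
    s_desc.encode('ascii')  # strict ASCII: raises on the pound sign
except UnicodeEncodeError as exc:
    print(type(exc).__name__)  # UnicodeEncodeError

# The 'xmlcharrefreplace' error handler instead replaces *every*
# unencodable code point with a numeric character reference:
print(s_desc.encode('ascii', 'xmlcharrefreplace'))
```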
Making 100 functions, one for each possible incompatible character, is not an option; in the same way, I am not prepared to sift through the whole website (2k+ articles) in order to identify all the special characters that might throw my code off.

```
XML = """
<MAIN>
    <ITEM>
        <Author>{0}</Author>
        <Author_UN>{1}</Author_UN>
        <Date_Modified>{2}</Date_Modified>
        <Date_Published>{3}</Date_Published>
        <Default_Group_Rights>
            {4}
        </Default_Group_Rights>
        <attachment>
            <file_name>{5}</file_name>
            <file_extension>{6}</file_extension>
            <file_stored_local>{7}</file_stored_local>
        </attachment>
        <title>{8}</title>
        <sm_desc>{9}</sm_desc>
        <lg_desc>
            <![CDATA[ {10} ]]>
        </lg_desc>
    </ITEM>
</MAIN>""".format(author_soup, username, date_modified, published_date,
                  xrights, attachment_text, file_extension, localstore,
                  item_title.decode('UTF-8').encode('ascii', 'xmlcharrefreplace'),
                  short_description.decode('UTF-8').encode('ascii', 'xmlcharrefreplace'),
                  long_description.decode('UTF-8').encode('ascii', 'xmlcharrefreplace'))
```

# [EDIT]

Here is a sample I created which reproduces the error, in case anyone wants to have a swing at it:

```
# TESTING GROUND
# -*- coding: UTF-8 -*-

author_soup = "John Smith"
username = "jsmith"
date_modified = "25 December 2012, 15:42 PM"
published_date = "25 December 2012, 15:42 PM"
xrights = "r-w-x-x"
attachment_text = "Random Attachment"
file_extension = "txt"
localstore = "../Local"
item_title = "The NEw Financial Reforms of 2012"
short_description = " £16 Billion Spent on new reforms backfire."
long_description = '[<p>fullstory</p>, <p><a class="external-link" href="http://business.timesonline.co.uk/tol/business/industry_sectors/banking_and_finance/article4526065.ece">http://business.timesonline.co.uk/tol/business/industry_sectors/banking_and_finance/article4526065.ece</a></p>]'

XML = """
<MAIN>
    <ITEM>
        <Author>{0}</Author>
        <Author_UN>{1}</Author_UN>
        <Date_Modified>{2}</Date_Modified>
        <Date_Published>{3}</Date_Published>
        <Default_Group_Rights>
            {4}
        </Default_Group_Rights>
        <attachment>
            <file_name>{5}</file_name>
            <file_extension>{6}</file_extension>
            <file_stored_local>{7}</file_stored_local>
        </attachment>
        <title>{8}</title>
        <sm_desc>{9}</sm_desc>
        <lg_desc>
            <![CDATA[ {10} ]]>
        </lg_desc>
    </ITEM>
</MAIN>""".format(author_soup, username, date_modified, published_date,
                  xrights, attachment_text, file_extension, localstore,
                  item_title.decode('UTF-8'),
                  short_description.decode('UTF-8'),
                  long_description.decode('UTF-8'))
```
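Not part of the question, but a sketch of one way to get the "universal" behaviour being asked for: escape the XML metacharacters first, then let the `xmlcharrefreplace` error handler turn every remaining non-ASCII code point into a numeric character reference. The `xml_safe` helper name is my own invention, and the snippet uses Python 3 syntax even though the question's code is Python 2:

```python
from xml.sax.saxutils import escape

def xml_safe(value):
    # Escape the XML metacharacters (&, <, >), then replace every
    # remaining non-ASCII code point with a numeric character
    # reference such as &#163; for the pound sign.
    return escape(value).encode('ascii', 'xmlcharrefreplace').decode('ascii')

short_description = u"\u00a316 Billion Spent on new reforms backfire."
print(xml_safe(short_description))  # &#163;16 Billion Spent on new reforms backfire.
```

In Python 2 the same helper works as long as `value` is already `unicode`; the key point is to decode once at the input boundary and escape once when serialising, rather than handling characters one by one.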
 
