<p>You have byte data. You need Unicode data. Isn’t the library supposed to decode it for you? It has to, because you don’t have the HTTP headers, and so have no way of knowing the encoding.</p>

<h1>EDIT</h1>

<p>Bizarre though this sounds, it appears that Python does not support content decoding in its web library. If you run this program:</p>

<pre><code>#!/usr/bin/env python
import re
import urllib.request
import io
import sys

for s in ("stdin", "stdout", "stderr"):
    setattr(sys, s, io.TextIOWrapper(getattr(sys, s).detach(), encoding="utf8"))

print("Seeking r\xe9sum\xe9s")
response = urllib.request.urlopen('http://nytimes.com/')
content = response.read()

match = re.search(".*r\xe9sum\xe9.*", content, re.I | re.U)
if match:
    print("success: " + match.group(0))
else:
    print("failure")
</code></pre>

<p>you get the following result:</p>

<pre><code>Seeking résumés
Traceback (most recent call last):
  File "ur.py", line 16, in &lt;module&gt;
    match = re.search(".*r\xe9sum\xe9.*", content, re.I | re.U)
  File "/usr/local/lib/python3.2/re.py", line 158, in search
    return _compile(pattern, flags).search(string)
TypeError: can't use a string pattern on a bytes-like object
</code></pre>

<p>That means <code>.read()</code> is returning raw bytes, not a real string. Maybe you can see something in the <a href="http://docs.python.org/release/3.1.3/library/urllib.request.html" rel="nofollow">doc for the <code>urllib.request</code> class</a> that I can’t see. I can’t believe they actually expect you to root around in the <code>.info()</code> return and the <code>&lt;meta&gt;</code> tags, figure out the stupid encoding on your own, and then decode the content so you have a real string. That would be utterly lame! I hope I’m wrong, but I spent a good while looking and couldn’t find anything useful here.</p>

<p>Compare how easy doing the equivalent is in Perl:</p>

<pre><code>#!/usr/bin/env perl
use strict;
use warnings;
use LWP::UserAgent;

binmode(STDOUT, "utf8");
print("Seeking r\xe9sum\xe9s\n");

my $agent = LWP::UserAgent-&gt;new();
my $response = $agent-&gt;get("http://nytimes.com/");

if ($response-&gt;is_success) {
    my $content = $response-&gt;decoded_content;
    if ($content =~ /.*r\xe9sum\xe9.*/i) {
        print("search success: $&amp;\n");
    } else {
        print("search failure\n");
    }
} else {
    print "request failed: ", $response-&gt;status_line, "\n";
}
</code></pre>

<p>Which, when run, dutifully produces:</p>

<pre><code>Seeking résumés
search success: &lt;li&gt;&lt;a href="http://hiring.nytimes.monster.com/products/resumeproducts.aspx"&gt;Search Résumés&lt;/a&gt;&lt;/li&gt;
</code></pre>

<p>Are you sure you have to do this in Python? Check out how much richer and more user-friendly the Perl <a href="http://search.cpan.org/perldoc?LWP%3a%3aUserAgent" rel="nofollow"><code>LWP::UserAgent</code></a> and <a href="http://search.cpan.org/perldoc?HTTP%3a%3aResponse" rel="nofollow"><code>HTTP::Response</code></a> classes are than the equivalent Python classes. Check them out and see what I mean.</p>

<p>Plus with Perl you get better Unicode support all around, such as full grapheme support, something which Python currently lacks. Given that you were trying to strip out diacritics, this seems like it would be another plus:</p>

<pre><code>use Unicode::Normalize;
($unaccented = NFD($original)) =~ s/\pM//g;
</code></pre>

<p>Just a thought.</p>
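<p>If you do end up having to stay in Python, a minimal sketch of that manual approach (pulling the charset out of the response headers yourself and decoding by hand, plus a Python counterpart to the <code>Unicode::Normalize</code> snippet) might look like this. The helper names are mine, not part of <code>urllib</code>, and the UTF-8 fallback is an assumption, not anything the standard library guarantees:</p>

```python
import unicodedata
from email.message import Message

# Hypothetical helper, not part of urllib: this is the "root around in
# .info() yourself" approach. With a live response you would pass
# response.read() and response.headers.
def decode_body(raw: bytes, headers: Message) -> str:
    # get_content_charset() reads the charset parameter of Content-Type;
    # falling back to UTF-8 is an assumption when none is declared.
    charset = headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")

# Python counterpart of the Unicode::Normalize one-liner: decompose to
# NFD, then drop the combining marks.
def strip_accents(text: str) -> str:
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# A hand-built Message stands in for real response headers here,
# so no network access is needed to try it.
headers = Message()
headers["Content-Type"] = "text/html; charset=ISO-8859-1"
print(decode_body(b"r\xe9sum\xe9", headers))  # résumé
print(strip_accents("r\xe9sum\xe9"))          # resume
```

<p>Still nothing like <code>decoded_content</code>, but it shows how much of the work the caller is left to do.</p>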
 
