Python Unicode Regular Expression
<p>I am using Python 2.4 and I am having some problems with Unicode regular expressions. I have tried to put together a clear and concise example of my problem. It looks as though there is either a problem with how Python recognizes the different character encodings, or a problem with my understanding. Thank you very much for taking a look!</p>
<pre><code>#!/usr/bin/python
#
# This is a simple Python program designed to show my problems with regular expressions and character encoding in Python
# Written by Brian J. Stinar
# Thanks for the help!

import urllib   # To get files off the Internet
import chardet  # To identify character encodings
import re       # Python Regular Expressions
#import ponyguruma # Python Oniguruma Regular Expressions - this can be uncommented if you feel like messing with it, but I have the same issue no matter which RE's I'm using

rawdata = urllib.urlopen('http://www.cs.unm.edu/~brian.stinar/legal.html').read()
print (chardet.detect(rawdata))
#print (rawdata)

ISO_8859_2_encoded = rawdata.decode('ISO-8859-2')   # Let's grab this as text
UTF_8_encoded = ISO_8859_2_encoded.encode('utf-8')  # and encode the text as UTF-8
print(chardet.detect(UTF_8_encoded))                # Looks good

# This totally doesn't work, even though you can see UNSUBSCRIBE in the HTML
# Eventually, I want to recognize the entire physical address and UNSUBSCRIBE above it
re_UNSUB_amsterdam = re.compile(".*UNSUBSCRIBE.*", re.UNICODE)
print (str(re_UNSUB_amsterdam.match(UTF_8_encoded)) + "\t\t\t\t\t--- RE for UNSUBSCRIBE on UTF-8")
print (str(re_UNSUB_amsterdam.match(rawdata)) + "\t\t\t\t\t--- RE for UNSUBSCRIBE on raw data")

re_amsterdam = re.compile(".*Adobe.*", re.UNICODE)
print (str(re_amsterdam.match(rawdata)) + "\t--- RE for 'Adobe' on raw data")  # However, this works?!?
print (str(re_amsterdam.match(UTF_8_encoded)) + "\t--- RE for 'Adobe' on UTF-8")

'''
# In addition, I tried this regular expression library, with much the same unsatisfactory result
new_re = ponyguruma.Regexp(".*UNSUBSCRIBE.*")
if new_re.match(UTF_8_encoded) != None:
    print("Ponyguruma RE matched! \t\t\t--- RE for UNSUBSCRIBE on UTF-8")
else:
    print("Ponyguruma RE did not match\t\t--- RE for UNSUBSCRIBE on UTF-8")

if new_re.match(rawdata) != None:
    print("Ponyguruma RE matched! \t\t\t--- RE for UNSUBSCRIBE on raw data")
else:
    print("Ponyguruma RE did not match\t\t--- RE for UNSUBSCRIBE on raw data")

new_re = ponyguruma.Regexp(".*Adobe.*")
if new_re.match(UTF_8_encoded) != None:
    print("Ponyguruma RE matched! \t\t\t--- RE for Adobe on UTF-8")
else:
    print("Ponyguruma RE did not match\t\t\t--- RE for Adobe on UTF-8")

new_re = ponyguruma.Regexp(".*Adobe.*")
if new_re.match(rawdata) != None:
    print("Ponyguruma RE matched! \t\t\t--- RE for Adobe on raw data")
else:
    print("Ponyguruma RE did not match\t\t\t--- RE for Adobe on raw data")
'''
</code></pre>
<p>I am working on a substitution project, and I am having a difficult time with the non-ASCII encoded files. This problem is part of a bigger project: eventually I would like to substitute the text with other text. (I got this working in ASCII, but I can't identify occurrences in other encodings yet.) Thanks again.</p>
<p><a href="http://brian-stinar.blogspot.com" rel="nofollow noreferrer">http://brian-stinar.blogspot.com</a></p>
<p>-Brian J. Stinar-</p>
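<p>One possible explanation, offered as a sketch rather than a confirmed diagnosis of the page above: <code>match()</code> only tries the pattern at the start of the string, and <code>.</code> does not cross a newline by default, so <code>.*UNSUBSCRIBE.*</code> can only succeed when the word is on the first line. That would be consistent with "Adobe" matching if it happens to sit on the first line, which is an assumption here. The sample text and the name <code>raw_bytes</code> below are hypothetical stand-ins, not the real <code>legal.html</code>.</p>

```python
import re

# Hypothetical stand-in for the downloaded page: ISO-8859-2 bytes with the
# two target words on different lines (not the real legal.html).
sample = u"Adobe Systems\n\u0141\u00f3d\u017a office\nUNSUBSCRIBE here\n"
raw_bytes = sample.encode("iso-8859-2")

# Decode once to text so the pattern works on characters, not bytes.
text = raw_bytes.decode("iso-8859-2")

pattern = re.compile(u".*UNSUBSCRIBE.*", re.UNICODE)

# match() is anchored at position 0 and "." stops at a newline, so this
# fails because the word is not on the first line.
anchored = pattern.match(text)

# search() scans the whole string and finds the word on any line.
found = pattern.search(text)

# With DOTALL, "." also matches newlines, so match() succeeds as well.
dotall = re.compile(u".*UNSUBSCRIBE.*", re.UNICODE | re.DOTALL).match(text)

print(anchored)
print(found.group(0))
```

<p>If this is the cause, the encoding steps in the original script may not be the problem at all; swapping <code>match()</code> for <code>search()</code> would be the smaller change, while <code>re.DOTALL</code> makes the match span the whole document.</p>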
 
