<p>To use <code>\w+</code> to match alphanumeric <em>unicode</em> characters you should pass <em>both</em> a <code>unicode</code> pattern and <code>unicode</code> text to <code>re.findall</code>.</p>
<ul>
<li><p>In Python2:</p>
<p>Assuming that you are reading bytes (not text) from the file, you should decode the bytes to obtain a <code>unicode</code>:</p>
<pre><code>uni = 'Привет, как дела?'.decode('utf-8')
</code></pre>
<p><code>ur'(?u)\w+'</code> is a <a href="https://stackoverflow.com/q/2081640/190597">raw unicode literal</a>. Even though it is not necessary here, using raw unicode/string literals for regex patterns is generally good practice: it avoids the need for double backslashes before escapes such as <code>\b</code>, which a non-raw literal would interpret as a backspace character.</p>
<p>The regex pattern <code>ur'(?u)\w+'</code> <a href="https://docs.python.org/2/library/re.html#regular-expression-syntax" rel="nofollow noreferrer">bakes in the Unicode flag</a>, which tells <code>re.findall</code> to make <code>\w</code> dependent on the Unicode character properties database.</p>
<pre><code># -*- coding: utf-8 -*-
import re
uni = 'Привет, как дела?'.decode('utf-8')
print(re.findall(ur'(?u)\w+', uni))
</code></pre>
<p>yields a list containing the 3 unicode "words":</p>
<pre><code>[u'\u041f\u0440\u0438\u0432\u0435\u0442', u'\u043a\u0430\u043a', u'\u0434\u0435\u043b\u0430']
</code></pre></li>
<li><p>In Python3:</p>
<p>The general principle is the same, except that <a href="https://docs.python.org/3.0/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit" rel="nofollow noreferrer">what were <code>unicode</code>s in Python2 are now <code>str</code>s in Python3</a>, and there is no longer any attempt at automatic conversion between <code>bytes</code> and <code>str</code>. So, again assuming that you are reading bytes (not text) from the file, you should decode the bytes to obtain a <code>str</code>, and use a <code>str</code> regex pattern. The <code>(?u)</code> flag is redundant for <code>str</code> patterns in Python3, where Unicode matching is the default, but it is harmless:</p>
<pre><code>import re
uni = b'\xd0\x9f\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82, \xd0\xba\xd0\xb0\xd0\xba \xd0\xb4\xd0\xb5\xd0\xbb\xd0\xb0?'.decode('utf-8')
print(re.findall(r'(?u)\w+', uni))
</code></pre>
<p>yields</p>
<pre><code>['Привет', 'как', 'дела']
</code></pre></li>
</ul>
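<p>As a rough illustration of why both the pattern and the text need to be text rather than bytes, here is a minimal Python3 sketch. The names <code>raw</code> and <code>unicode_words</code> are hypothetical and not part of the answer above; <code>raw</code> simply stands in for bytes read from a file opened in binary mode, and UTF-8 is an assumed encoding.</p>

```python
import re

# Hypothetical stand-in for bytes read from a file opened in binary mode.
raw = 'Привет, как дела?'.encode('utf-8')


def unicode_words(data, encoding='utf-8'):
    """Decode bytes (if needed) and return the runs of Unicode word characters."""
    text = data.decode(encoding) if isinstance(data, bytes) else data
    # For a str pattern in Python 3, \w is Unicode-aware by default.
    return re.findall(r'\w+', text)


words = unicode_words(raw)

# A bytes pattern only treats ASCII letters, digits and underscore as \w,
# so it finds nothing in the UTF-8 encoded Cyrillic text.
bytes_words = re.findall(rb'\w+', raw)

print(words)
print(bytes_words)
```

<p>With the decoded text the three words are found, whereas matching the undecoded bytes with a bytes pattern returns an empty list, which is the symptom that decoding first is meant to avoid.</p>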
 
