Python regex: split on any \W+ with some exceptions
<p>It is easy to split text with a regex at non-word (non-alphanumeric) characters:</p>
<pre><code>import re
tokens = re.split(r'(?u)\W+', text)  # split at any run of non-word Unicode characters</code></pre>
<p>and <a href="https://stackoverflow.com/questions/12683201/python-re-split-to-split-by-spaces-commas-and-periods-but-not-in-cases-like">this answer</a> provides a way to split at certain characters. However, what I need is:</p>
<ol>
<li>split at any Unicode non-word character</li>
<li><p>while giving the regex the following exceptions:</p>
<ul>
<li>underscores "_"</li>
<li>the slash "/"</li>
<li>the ampersand "&amp;" and the at sign "@"</li>
<li>full stops surrounded by digits (\d+ on both sides)</li>
<li>full stops preceded by certain arbitrary strings such as "Mr.", "Dr.", etc.</li>
</ul></li>
</ol>
<p>I can easily detect any of these with a regex, but the question is how to tell the regex to treat them as exceptions when splitting at non-word characters.</p>
<hr>
<p>EDIT: Here is an example text I am trying to split:</p>
<pre><code>text="Mr. Jones email jones@gmail.com 12.455 12,254.25 says This is@a&amp;test example_cool man+right more/fun 43.35. And so we stopped. And then we started again. وبعدها رجعنا إلى المنزل، وقابلنا أصدقاءنا؛ وشربنا الشاي."</code></pre>
<p>and here is the same text as a unicode literal (notice the Arabic non-word characters u'\u060c' and u'\u061b'):</p>
<pre><code>unicode_text=u'Mr. Jones email jones@gmail.com 12.455 12,254.25 says This is@a&amp;test example_cool man+right more/fun 43.35. And so we stopped. And then we started again. \u0648\u0628\u0639\u062f\u0647\u0627 \u0631\u062c\u0639\u0646\u0627 \u0625\u0644\u0649 \u0627\u0644\u0645\u0646\u0632\u0644\u060c \u0648\u0642\u0627\u0628\u0644\u0646\u0627 \u0623\u0635\u062f\u0642\u0627\u0621\u0646\u0627\u061b \u0648\u0634\u0631\u0628\u0646\u0627 \u0627\u0644\u0634\u0627\u064a.'</code></pre>
<p>Here is the result of the regex from the answer provided:</p>
<pre><code>re.split(r'(?u)(?![\+&amp;\/@\d+\.\d+Mr\.])\W+',unicode_text)</code></pre>
<blockquote>
<p>[u'Mr.', u'Jones', u'email', u'jones@gmail.com', u'12.455', u'12', u'254.25', u'says', u'This', u'is@a&amp;test', u'example_cool', u'man+right', u'more/fun', u'43.35.', u'And', u'so', u'we', u'stopped.', u'And', u'then', u'we', u'started', u'again.', u'\u0648\u0628\u0639\u062f\u0647\u0627', u'\u0631\u062c\u0639\u0646\u0627', u'\u0625\u0644\u0649', u'\u0627\u0644\u0645\u0646\u0632\u0644', u'\u0648\u0642\u0627\u0628\u0644\u0646\u0627', u'\u0623\u0635\u062f\u0642\u0627\u0621\u0646\u0627', u'\u0648\u0634\u0631\u0628\u0646\u0627', u'\u0627\u0644\u0634\u0627\u064a.']</p>
</blockquote>
<p>Notice that this regex does not split off full stops at the end of words (for example u'43.35.', u'stopped.' and u'again.'), so it would be good to have something that handles this as well.</p>
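One possible way to express the listed exceptions, offered only as a rough sketch and not as the accepted solution: treat a separator as a run of characters that are either non-word characters other than "/", "&", "@" and ".", or a full stop that is neither between two digits nor directly after a known abbreviation. The sketch assumes Python 3 (where `\w` is Unicode-aware for `str` patterns, so `(?u)` is unnecessary), and the names `build_separator`, `tokenize` and the default abbreviation list are hypothetical. The underscore needs no special handling because `\w` already includes it.

```python
import re

def build_separator(abbreviations=("Mr", "Dr")):
    """Compile a separator pattern; the abbreviation list is an assumption."""
    # One fixed-width negative lookbehind per abbreviation, because the
    # standard `re` module rejects variable-width lookbehinds.
    guards = "".join(r"(?<!\b{})".format(re.escape(a)) for a in abbreviations)
    # A separator character is either:
    #   * any non-word character except / & @ and the full stop, or
    #   * a full stop that does not follow an abbreviation and is not
    #     both preceded and followed by a digit.
    pattern = r"(?:[^\w/&@.]|{}\.(?!(?<=\d\.)\d))+".format(guards)
    return re.compile(pattern)

def tokenize(text, separator=build_separator()):
    """Split text at separator runs and drop empty leading/trailing pieces."""
    return [token for token in separator.split(text) if token]

if __name__ == "__main__":
    sample = "Mr. Jones says is@a&test example_cool more/fun 43.35. And so we stopped."
    print(tokenize(sample))
```

Because this follows the listed rules literally, "jones@gmail.com" would be split at its full stop (it is not surrounded by digits), "12,254.25" would be split at the comma, and "man+right" would be split at the plus sign. If those should stay whole, the character class and the full-stop condition would need to be widened accordingly.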