    <p>I may be able to offer some insight, but it's hard to tell whether my answer will be "helpful". First, I only speak and read English, so I do not speak or read Chinese. I do happen to be the author of <a href="http://regexkit.sourceforge.net/RegexKitLite/index.html" rel="nofollow">RegexKitLite</a>, which is an Objective-C wrapper around the ICU regex engine. This is obviously not <code>perl</code> :).</p>
    <p>Despite this, the ICU regex engine has a feature that sounds remarkably like what you're trying to do. Specifically, it provides the <code>UREGEX_UWORD</code> modifier option, which can be turned on dynamically via the normal <code>(?w:...)</code> syntax. This modifier performs the following action:</p>
    <blockquote> <p>Controls the behavior of \b in a pattern. If set, word boundaries are found according to the definitions of word found in Unicode UAX 29, Text Boundaries. By default, word boundaries are identified by means of a simple classification of characters as either “word” or “non-word”, which approximates traditional regular expression behavior. The results obtained with the two options can be quite different in runs of spaces and other non-word characters.</p> </blockquote>
    <p>You can use this in a regex like <code>(?w:\b(.*?)\b)</code> to "extract" words from a string. The ICU regex engine has a fairly powerful "word breaking engine" that is specifically designed to find word breaks in written languages that, unlike English, do not put an explicit space character between words. Again, since I neither read nor write these languages, my understanding is only that the text looks roughly like "itisroughlysomethinglikethis". The ICU word breaking engine uses heuristics, and occasionally dictionaries, to find the word breaks. It is my understanding that Thai is a particularly difficult case. In fact, I use <code>ฉันกินข้าว</code> (Thai for "I eat rice", or so I was told) with a regex of <code>(?w)\b\s*</code> to perform a <code>split</code> operation on the string and extract the words. Without <code>(?w)</code> you cannot split on word breaks. With <code>(?w)</code> the split yields the words <code>ฉัน</code>, <code>กิน</code>, and <code>ข้าว</code>.</p>
    <p>If the above "sounds like the problem you're having", then this could be the reason. In that case, I am not aware of any way to accomplish this in <code>perl</code>, but I wouldn't consider this opinion an authoritative answer, since I use the ICU regex engine more often than the <code>perl</code> one and am clearly not properly motivated to find a working <code>perl</code> solution when I've already got one :). Hope this helps.</p>
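<p>For readers who want to try the idea without RegexKitLite, here is a minimal sketch in TypeScript. It does not use the ICU regex <code>(?w)</code> option described above; it swaps in the standard <code>Intl.Segmenter</code> API, which performs locale-aware word segmentation and, in V8-based runtimes such as Node.js, is backed by ICU. The helper name <code>splitWords</code> is hypothetical, and how the Thai sample is split depends on the ICU data shipped with the runtime, so no particular Thai output is claimed here.</p>

```typescript
// Sketch only: uses the standard Intl.Segmenter API instead of the ICU regex
// engine's (?w) option discussed in the answer.

// Cast to `any` so the snippet still compiles when the TypeScript `lib`
// setting does not include the Intl.Segmenter typings.
const SegmenterCtor = (Intl as any).Segmenter;

// Hypothetical helper: returns the word-like segments of `text` for `locale`.
function splitWords(text: string, locale: string): string[] {
  const segmenter = new SegmenterCtor(locale, { granularity: "word" });
  const words: string[] = [];
  for (const part of segmenter.segment(text)) {
    // isWordLike is false for whitespace and punctuation segments.
    if (part.isWordLike) {
      words.push(part.segment);
    }
  }
  return words;
}

console.log(splitWords("I eat rice", "en"));
console.log(splitWords("ฉันกินข้าว", "th"));
```

<p>Filtering on <code>isWordLike</code> plays roughly the role of the <code>\s*</code> in the split pattern above: it drops the whitespace between words so that only the words themselves remain.</p>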