StackOverflow2013

How to correctly parse a mixed Latin/ideographic full-text query with regex?
<p>I'm trying to sanitize and format input with regex for a mixed Latin/ideographic (Chinese/Japanese/Korean) full-text search.</p>
<p>I found an old example of someone's attempt at sanitizing a mixed Latin/Asian-language string on a forum that I cannot find again (full credit to the original author of this code).</p>
<p>I am having trouble fully understanding the regex portion of the function, in particular why it treats the digits 0, 2, and 3 differently from the other digits 1 and 4-9 (it handles 1 and 4-9 properly, but 0, 2, and 3 in the query are treated as if they were Asian characters).</p>
<p>For example, I am trying to sanitize the following string:<br /> "hello 1234567890 蓄積した abc123def"</p>
<p>and it turns into:<br /> "hello 1 456789 abc1 def 2 3 0 蓄積した 2 3"</p>
<p>The correct output for this sanitized string should be:<br /> "hello 1234567890 蓄積した abc123def"</p>
<p>As you can see, it properly spaces out the Asian characters, but the digits 0, 2, and 3 are treated differently from all the other digits. Any help on why the regex treats 0, 2, and 3 differently would be greatly appreciated (or, if you know of a better way of achieving a similar result, that would help too). Thank you.</p>
<p>I have included the function below:<br /><br /></p>
<pre>
function prepareString($str) {
    $str = mb_strtolower(trim(preg_replace('#[^\p{L}\p{Nd}\.]+#u', ' ', $str)));
    return trim(preg_replace('#\s\s+#u', ' ', preg_replace('#([^\12544-\65519])#u', ' ', $str) . ' ' . implode(' ', preg_split('#([\12544-\65519\s])?#u', $str, -1, PREG_SPLIT_NO_EMPTY))));
}
</pre>
<p><strong>UPDATE: Providing context for clarity</strong></p>
<p>I am authoring a website that will be launched in China. This website will have a search function, and I am trying to write a parser for the search query input.</p>
<p>Unlike English, which uses a space (" ") as the delimiter between words in a sentence, Chinese does not put spaces between words. Because of this, I have to reformat a search query by breaking apart each Chinese character and searching for each character individually within the database. Chinese users will also use Latin/English characters for things such as brand names, which they can mix together with their Chinese characters (e.g. Ivy牛仔舖).</p>
<p>What I would like to do is separate all of the English words from the Chinese characters, and separate each Chinese character with a space.</p>
<p>A search query could look like this: Ivy牛仔舖</p>
<p>And I would want to parse it so that it looks like this: Ivy 牛仔舖</p>
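<p>A possible explanation for the behaviour described above, given as a sketch rather than a confirmed diagnosis: PCRE does not read <code>\12544</code> as a decimal code point. Inside a character class, a backslash followed by digits is read as an octal escape of at most three digits, so <code>\12544</code> is the letter "U" (octal 125) followed by the literal digits 4 and 4, and <code>\65519</code> is U+01AD (octal 655) followed by 1 and 9. The class therefore covers "4" through U+01AD plus 1 and 9, which leaves 0, 2, and 3 outside it. Assuming the original author meant the decimal range 12544-65519, the same range in hex notation is <code>\x{3100}-\x{FFEF}</code>. The function name <code>prepareStringSketch</code> below is hypothetical and the splitting strategy is an assumption, not the original author's code.</p>

```php
<?php
// Hypothetical helper (not the original function): it only illustrates writing
// the assumed range 12544-65519 as hex code points U+3100-U+FFEF.
function prepareStringSketch(string $str): string
{
    // Keep letters, decimal digits and dots; turn every other run into one space.
    $str = mb_strtolower(trim(preg_replace('#[^\p{L}\p{Nd}\.]+#u', ' ', $str)), 'UTF-8');
    // Put a space on both sides of each character in U+3100..U+FFEF,
    // so that every such character becomes its own token.
    $str = preg_replace('#([\x{3100}-\x{FFEF}])#u', ' $1 ', $str);
    // Collapse the whitespace runs created by the previous step.
    return trim(preg_replace('#\s+#u', ' ', $str));
}

// The class from the question: "2" falls outside it, "5" falls inside it.
var_dump(preg_match('#[\12544-\65519]#u', '2')); // int(0)
var_dump(preg_match('#[\12544-\65519]#u', '5')); // int(1)

echo prepareStringSketch('Ivy牛仔舖'), "\n"; // ivy 牛 仔 舖
```

<p>Note that hiragana and katakana (U+3040 to U+30FF) sit below U+3100, so with this range characters such as し and た are not split off individually. If Japanese kana should be tokenized too, the lower bound would need to be extended.</p>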
