StackOverflow2013

<p>Warning: please note that I'm not a Japanese language expert; I'm still learning it.</p>
<p>As far as I know, a computer algorithm cannot solve this problem reliably in 100% of cases, with or without a dictionary. I suggest giving up, or using dictionaries for an imperfect approach (with a dictionary is better than without).</p>
<p>The most important point is that a kanji is not a Western letter. A kanji is a symbol associated with a concept or idea; concepts combined together point at a certain object, which has a (pronounceable) word associated with it. Kanji do have lists of commonly used readings ("on"/"kun"), but there are exceptions.</p>
<p>Your problem translates into: "given a sequence of wildcards mixed with letters, match every wildcard to a letter sequence in a given string".</p>
<p>Example:</p>
<p>猫に九生あり。-> +に++あり。</p>
<p>Here are the difficulties you'll run into immediately.</p>
<ol>
<li>You cannot unambiguously split a pronunciation into parts associated with individual kanji. For simple things like 会社 (かいしゃ) you could teach your algorithm to recognize valid letter sequences that map to syllables (you can't split しゃ into two different parts), but you'll eventually run into a word that has multiple parts or has a special reading. Example: 木陰 (こかげ). Which letter belongs to which kanji?</li>
<li>Ambiguous splitting of words/phrases, i.e. a particle can be part of a reading: 物の怪 "もののけ". Of those two "no"s (の), which one is the particle and which one is part of a kanji's reading?</li>
<li>Names and stylized foreign words. The first two examples can be solved with a dictionary. Names require a special dictionary (一樹 reads as "kazuki", although 一 does not have "kazu" as its "on" or "kun" reading). Another good example is Yagami Light from Death Note, whose name is the foreign word "Light" (ライト) written as 月 (moon/month, normally read as つき/げつ). Obviously, 月 does not normally read as "Light".
As for foreign words, 火炎球 in "Slayers" is pronounced (by the main character) as "ファイヤーボール" ("fireball" in English), but obviously the kanji do not have those readings.</li>
</ol>
<p>These problems cannot be handled unambiguously by an algorithm, so no algorithm will match the strings correctly in every case. LCS and similar routines will produce incorrect splits because of #2.</p>
<p>To use a dictionary, you can take data from <a href="http://www.csse.monash.edu.au/~jwb/edict.html" rel="nofollow">edict</a> or <a href="http://www.csse.monash.edu.au/~jwb/kanjidic.html" rel="nofollow">kanjidic</a>; however, that won't guarantee a perfect solution. You might also want to look into the "mecab" project.</p>
<p>--EDIT--</p>
<blockquote>
<p>I think LCS is one of the algorithms to get it, but I think it may fail sometimes due to the continual hiragana, so is there any better algorithm to solve this problem?</p>
</blockquote>
<p>You cannot solve it this way. Even if you only want to split a phrase into words, you need to load a dictionary (15+ megabytes of data) and check individual words against it, and you'll also have to handle verb forms to locate each word in the dictionary. Simple wildcard matching will not work here.</p>
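<p>To make the wildcard formulation and difficulties #1 and #2 concrete, here is a minimal Python sketch. It is only an illustration, not a recommended solution: <code>to_pattern</code>, <code>alignments</code>, <code>plausible</code> and <code>READINGS</code> are names made up for this example, the kanji test only covers the basic CJK block, and the readings table is a tiny hand-written sample standing in for real kanjidic data.</p>

```python
import re

# Rough kanji test (assumption): basic CJK Unified Ideographs plus the mark 々.
KANJI = re.compile(r"[\u4e00-\u9fff々]")


def to_pattern(text):
    """Replace every kanji with '+', e.g. 猫に九生あり。 -> +に++あり。"""
    return KANJI.sub("+", text)


def alignments(text, reading):
    """Yield every way to give each kanji in `text` a non-empty chunk of
    `reading`, while kana in `text` must match the reading literally."""
    def walk(i, j, acc):
        if i == len(text):
            if j == len(reading):
                yield tuple(acc)
            return
        ch = text[i]
        if KANJI.match(ch):
            # A kanji behaves like a wildcard for one or more kana.
            for k in range(j + 1, len(reading) + 1):
                yield from walk(i + 1, k, acc + [(ch, reading[j:k])])
        elif j < len(reading) and reading[j] == ch:
            yield from walk(i + 1, j + 1, acc)

    yield from walk(0, 0, [])


# Hand-written sample of per-kanji readings (hypothetical stand-in for kanjidic).
READINGS = {
    "物": {"もの", "ぶつ", "もつ"},
    "怪": {"け", "かい"},
    "木": {"き", "こ", "もく", "ぼく"},
    "陰": {"かげ", "いん"},
}


def plausible(alignment, readings=READINGS):
    """Keep an alignment only if every chunk is a listed reading of its kanji."""
    return all(kana in readings.get(kanji, ()) for kanji, kana in alignment)


if __name__ == "__main__":
    print(to_pattern("猫に九生あり。"))
    for a in alignments("物の怪", "もののけ"):
        print(a, "plausible" if plausible(a) else "rejected")
```

<p>For 物の怪 / もののけ the wildcard match alone produces two candidate splits (物=も, 怪=のけ and 物=もの, 怪=け), and only the readings table tells them apart. The same table rejects every split for 月 read as ライト, which is the names problem from #3: a dictionary narrows the candidates but does not make the approach complete.</p>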
