What's the correct algorithm to determine the number of user-perceived characters?
<p>I have the task of counting the number of perceived characters in an input. The input is a <em>group</em> of ints (we can think of it as an <code>int[]</code>) which represents Unicode code points.</p>
<p>Using <a href="http://docs.oracle.com/javase/7/docs/api/java/text/BreakIterator.html#getCharacterInstance%28%29" rel="nofollow noreferrer">java.text.BreakIterator.getCharacterInstance()</a> is not allowed. (I mean, its algorithm is allowed and is what I want, but wading through its source code and state tables got me nowhere &gt;.&lt;)</p>
<p>What is the correct algorithm for counting the number of grapheme clusters, given some code points?</p>
<p><a href="http://en.wikipedia.org/wiki/Combining_character#Unicode_ranges" rel="nofollow noreferrer">Initially</a>, I thought that all I had to do was combine all occurrences of:</p>
<ol> <li><p><code>U+0300 – U+036F</code> (combining diacritical marks)</p></li> <li><p><code>U+1DC0 – U+1DFF</code> (combining diacritical marks supplement)</p></li> <li><p><code>U+20D0 – U+20FF</code> (combining diacritical marks for symbols)</p></li> <li><p><code>U+FE20 – U+FE2F</code> (combining half marks)</p></li> </ol>
<p>into the preceding non-diacritic character.</p>
<p>However, I've <a href="http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Noncharacters" rel="nofollow noreferrer">realised</a> that before that operation I first have to remove all noncharacters as well.</p>
<p>These include:</p>
<ol> <li><p><code>U+FDD0 – U+FDEF</code></p></li> <li><p>The last two code points of every plane</p></li> </ol>
<p>But there seem to be more things to do. <a href="http://www.unicode.org/reports/tr29/#Table_Sample_Grapheme_Clusters" rel="nofollow noreferrer">Unicode.org</a> states that we need to include <code>U+200C</code> (zero-width non-joiner) and <code>U+200D</code> (zero-width joiner) in the set of continuing characters <a href="http://www.unicode.org/reports/tr29/#Table_Sample_Grapheme_Clusters" rel="nofollow noreferrer">(source)</a>.</p>
<p>Besides that, it talks about a couple more things, but the entire topic is treated in an abstract way. For example, what are the code point ranges for <strong>spacing combining marks</strong> and for <strong>Hangul jamo characters that form Hangul syllables</strong>?</p>
<p>Does anyone know the correct algorithm to count the number of grapheme clusters given an <code>int[]</code> of code points?</p>
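<p>For reference, the partial approach I have so far can be sketched roughly as below. This is only a minimal sketch of the steps listed above (skip noncharacters, then fold the four combining ranges plus <code>U+200C</code> and <code>U+200D</code> into the preceding character); it is not the full UAX #29 algorithm. The class and method names are made up for illustration, and the choice to count a leading combining mark as its own character is an assumption.</p>

```java
// Hypothetical helper: counts perceived characters using ONLY the ranges
// listed in the question. It knows nothing about spacing combining marks,
// Hangul jamo sequences, CR LF pairs, or emoji sequences.
final class PerceivedCharCounter {

    // Noncharacters: U+FDD0..U+FDEF, plus the last two code points
    // (xxFFFE and xxFFFF) of every plane.
    static boolean isNonCharacter(int cp) {
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    // "Continuing" code points attach to the character before them.
    static boolean isContinuing(int cp) {
        return (cp >= 0x0300 && cp <= 0x036F)   // combining diacritical marks
            || (cp >= 0x1DC0 && cp <= 0x1DFF)   // ... supplement
            || (cp >= 0x20D0 && cp <= 0x20FF)   // ... for symbols
            || (cp >= 0xFE20 && cp <= 0xFE2F)   // combining half marks
            || cp == 0x200C                     // zero-width non-joiner
            || cp == 0x200D;                    // zero-width joiner
    }

    static int count(int[] codePoints) {
        int clusters = 0;
        for (int cp : codePoints) {
            if (isNonCharacter(cp)) {
                continue; // removed before combining, as described above
            }
            if (isContinuing(cp) && clusters > 0) {
                continue; // folded into the preceding character
            }
            // Assumption: a continuing code point with nothing before it
            // is counted as a character of its own.
            clusters++;
        }
        return clusters;
    }
}
```

<p>With this sketch, <code>PerceivedCharCounter.count(new int[] {0x65, 0x301})</code> (an "e" followed by a combining acute accent) gives 1. It still miscounts anything the listed ranges do not cover, which is exactly the part I am unsure about.</p>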