StackOverflow2013

plurals

POWhat's the correct algorithm to determine number of user-perceived-characters?
primarykey
Id
9097572
data
AcceptedAnswerId
0
AnswerCount
1
ClosedDate
CommentCount
7
CommunityOwnedDate
CreationDate
2012-02-01T14:33:00.620
FavoriteCount
1
LastActivityDate
2017-09-26T03:18:40.760
LastEditDate
2017-09-26T03:18:40.760
LastEditorUserId
3885376
OwnerUserId
632951
ParentId
0
PostTypeId
1
Score
12
ViewCount
1043
LastEditorDisplayName
text
Body
I have the task of counting the number of user-perceived characters in an input. The input is a sequence of ints (we can think of it as an <code>int[]</code>) that represent Unicode code points. <a href="http://docs.oracle.com/javase/7/docs/api/java/text/BreakIterator.html#getCharacterInstance%28%29" rel="nofollow noreferrer">java.text.BreakIterator.getCharacterInstance()</a> is not allowed. (I mean its algorithm is allowed and is exactly what I want, but wading through its source code and state tables got me nowhere >.<) What is the correct algorithm for counting the number of grapheme clusters in a sequence of code points? <a href="http://en.wikipedia.org/wiki/Combining_character#Unicode_ranges" rel="nofollow noreferrer">Initially</a>, I thought that all I had to do was merge every occurrence of: <ol> <li><code>U+0300 – U+036F</code> (combining diacritical marks)</li> <li><code>U+1DC0 – U+1DFF</code> (combining diacritical marks supplement)</li> <li><code>U+20D0 – U+20FF</code> (combining diacritical marks for symbols)</li> <li><code>U+FE20 – U+FE2F</code> (combining half marks)</li> </ol> into the preceding code point that is not a diacritical mark. However, I've <a href="http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Noncharacters" rel="nofollow noreferrer">realised</a> that before that operation I also have to remove all noncharacters. These are: <ol> <li><code>U+FDD0 – U+FDEF</code></li> <li>The last two code points of every plane</li> </ol> But there seems to be more to do. <a href="http://www.unicode.org/reports/tr29/#Table_Sample_Grapheme_Clusters" rel="nofollow noreferrer">Unicode.org</a> states that <code>U+200C</code> (zero-width non-joiner) and <code>U+200D</code> (zero-width joiner) must be included in the set of continuing characters <a href="http://www.unicode.org/reports/tr29/#Table_Sample_Grapheme_Clusters" rel="nofollow noreferrer">(source)</a>. 
Beyond that, it mentions a few more cases, but it treats the whole topic abstractly. For example, what are the code point ranges for spacing combining marks, or for the Hangul jamo characters that form Hangul syllables? Does anyone know the correct algorithm to count the number of grapheme clusters given an <code>int[]</code> of code points?
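To make the steps listed so far concrete, here is a minimal sketch of only what the question describes: drop noncharacters, then fold the four combining-mark blocks plus <code>U+200C</code>/<code>U+200D</code> into the preceding code point. The class and method names (<code>GraphemeSketch</code>, <code>countClusters</code>) are made up for illustration. This is not the full UTS #29 algorithm: it has no rules for CR LF, Hangul syllable sequences, spacing marks, or extending characters outside the listed blocks, so treat it as a starting point under those stated assumptions rather than a correct grapheme-cluster counter.

```java
// Hypothetical helper class; the names are illustrative, not from any library.
final class GraphemeSketch {

    // The four combining-mark blocks listed in the question (inclusive ranges).
    private static final int[][] COMBINING_RANGES = {
        {0x0300, 0x036F}, // combining diacritical marks
        {0x1DC0, 0x1DFF}, // combining diacritical marks supplement
        {0x20D0, 0x20FF}, // combining diacritical marks for symbols
        {0xFE20, 0xFE2F}, // combining half marks
    };

    // Noncharacters: U+FDD0..U+FDEF and the last two code points of every plane.
    static boolean isNoncharacter(int cp) {
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    // Code points that, in this simplified model, continue the previous cluster.
    static boolean isContinuing(int cp) {
        if (cp == 0x200C || cp == 0x200D) {
            return true; // zero-width non-joiner / zero-width joiner
        }
        for (int[] range : COMBINING_RANGES) {
            if (cp >= range[0] && cp <= range[1]) {
                return true;
            }
        }
        return false;
    }

    // Counts clusters using only the simplified rules described in the question.
    static int countClusters(int[] codePoints) {
        int count = 0;
        boolean clusterOpen = false; // true once any cluster has been started
        for (int cp : codePoints) {
            if (isNoncharacter(cp)) {
                continue; // assumption from the question: noncharacters are removed first
            }
            if (clusterOpen && isContinuing(cp)) {
                continue; // attaches to the cluster already counted
            }
            count++; // a base code point, or a continuing one with nothing before it
            clusterOpen = true;
        }
        return count;
    }

    public static void main(String[] args) {
        // 'a', U+0301 COMBINING ACUTE ACCENT, 'b'
        int[] input = {0x61, 0x0301, 0x62};
        System.out.println(countClusters(input));
    }
}
```

A continuing code point at the very start of the input has nothing to attach to, so this sketch counts it as a cluster of its own. A table-driven version would replace <code>isContinuing</code> with a lookup of each code point's Grapheme_Cluster_Break property and apply the break rules pairwise.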
Tags
<java><language-agnostic><text><unicode><diacritics>
Title
What's the correct algorithm to determine number of user-perceived-characters?
singulars
PostAcceptedAnswerId
1. This table or related slice is empty.
PostParentId
1. This table or related slice is empty.
PostTypePostTypeId
1. PTQuestion
UserLastEditorUserId
1. This table or related slice is empty.
UserOwnerUserId
1. USPacerier
plurals
PostLinksPostIdRelatedPostId
1. PL
 singulars
 LinkTypeLinkTypeId
 LTLinked
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. This table or related slice is empty.
PostsParentIdCreationDate
1. PO
 singulars
 PostTypePostTypeId
 PTAnswer
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 POWhat's the correct algorithm to determine number of user-perceived-characters?
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
2. VO
 singulars
 PostPostId
 POWhat's the correct algorithm to determine number of user-perceived-characters?
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
3. VO
 singulars
 PostPostId
 POWhat's the correct algorithm to determine number of user-perceived-characters?
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
CommentsPostId
