StackOverflow2013

How do I use CharNext in the Windows API properly?
<p>I have a multi-byte string containing a mixture of Japanese and Latin characters. I'm trying to copy parts of this string to a separate memory location. Since it's a multi-byte string, some of the characters use one byte and others use two. When copying parts of the string, I must not copy "half" Japanese characters. To do this properly, I need to be able to determine where characters start and end in the multi-byte string.</p>
<p>As an example, if the string contains 3 characters which require [2 byte][2 byte][1 byte], I must copy either 2, 4 or 5 bytes to the other location and not 3, since copying 3 bytes would copy only half of the second character.</p>
<p>To figure out where characters start and end in the multi-byte string, I'm trying to use the Windows API functions CharNext and CharNextExA, but without luck. When I use these functions, they navigate through my string one byte at a time, rather than one character at a time. According to MSDN, <em>The CharNext function retrieves a pointer to the next character in a string.</em></p>
<p>Here's some code to illustrate this problem:</p>
<pre><code>#include <windows.h>
#include <stdio.h>
#include <wchar.h>
#include <string.h>

/* string consisting of six "asian" characters */
wchar_t wcsString[] = L"\u9580\u961c\u9640\u963f\u963b\u9644";

int main()
{
    // Convert the asian string from wide char to multi-byte.
    // The size passed to WideCharToMultiByte matches the allocation.
    LPSTR mbString = new char[1000];
    WideCharToMultiByte(CP_UTF8, 0, wcsString, -1, mbString, 1000, NULL, NULL);

    // Count the number of characters in the string.
    int characterCount = 0;
    LPSTR currentCharacter = mbString;
    while (*currentCharacter)
    {
        characterCount++;
        currentCharacter = CharNextExA(CP_UTF8, currentCharacter, 0);
    }
}
</code></pre>
<p>(Please ignore the memory leak and the lack of error checking.)</p>
<p>Now, in the example above I would expect characterCount to become 6, since that's the number of characters in the Asian string. But instead, characterCount becomes 18 because mbString contains 18 characters:</p>
<pre><code>é–€é˜œé™€é˜¿é˜»é™„
</code></pre>
<p>I don't understand how it's supposed to work. How is CharNext supposed to know whether "é–€é" in the string is an encoded version of a Japanese character, or in fact the characters é – € and é?</p>
<p>Some notes:</p>
<ul>
<li>I've read Joel's blog post about what every developer needs to know about Unicode. I may have misunderstood something in it though.</li>
<li>If all I wanted to do was to count the characters, I could count the characters in the Asian string directly. Keep in mind that my real goal is copying parts of the multi-byte string to a separate location. The separate location only supports multi-byte, not wide char.</li>
<li>If I convert the content of mbString back to wide char using MultiByteToWideChar, I get the correct string (門阜陀阿阻附), which indicates that there's nothing wrong with mbString.</li>
</ul>
<p>EDIT: Apparently the CharNext functions don't support UTF-8, but Microsoft forgot to document that. I threw/copy-pasted together my own routine, which I won't use and which needs improving. I'm guessing it's easily crashable.</p>
<pre><code>LPSTR CharMoveNext(LPSTR szString)
{
    if (szString == 0 || *szString == 0)
        return 0;

    // The lead byte's high bits give the length of the UTF-8 sequence.
    // Note: the trailing bytes are not checked, so a truncated sequence
    // makes this step past the terminating NUL.
    if ((szString[0] & 0x80) == 0x00)
        return szString + 1;
    else if ((szString[0] & 0xE0) == 0xC0)
        return szString + 2;
    else if ((szString[0] & 0xF0) == 0xE0)
        return szString + 3;
    else if ((szString[0] & 0xF8) == 0xF0)
        return szString + 4;
    else
        return szString + 1;
}
</code></pre>
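<p>For what it's worth, a more defensive take on the CharMoveNext idea might look like the sketch below. It is only a sketch under some assumptions: the string is NUL-terminated UTF-8 (not Shift-JIS or another DBCS code page), malformed or truncated sequences are simply treated as single bytes, and overlong or otherwise invalid encodings are not rejected. The names Utf8SequenceLength, Utf8CharacterCount and Utf8SafePrefixLength are made up for illustration; they are not Windows API functions.</p>

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of the Windows API): length in bytes of the
// UTF-8 sequence starting at s. Returns 0 at the terminating NUL. A stray
// continuation byte, an invalid lead byte, or a truncated sequence counts
// as one byte, so the caller never steps past the terminator.
std::size_t Utf8SequenceLength(const char* s)
{
    assert(s != nullptr);
    const unsigned char lead = static_cast<unsigned char>(s[0]);
    if (lead == 0)
        return 0;

    std::size_t expected;
    if ((lead & 0x80) == 0x00)      expected = 1;
    else if ((lead & 0xE0) == 0xC0) expected = 2;
    else if ((lead & 0xF0) == 0xE0) expected = 3;
    else if ((lead & 0xF8) == 0xF0) expected = 4;
    else return 1; // continuation byte or invalid lead byte

    // Every trailing byte must look like 10xxxxxx. A NUL fails this check,
    // so the loop stops before reading beyond the end of the string.
    for (std::size_t i = 1; i < expected; ++i)
    {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if ((c & 0xC0) != 0x80)
            return 1;
    }
    return expected;
}

// Counts characters (code points) rather than bytes.
std::size_t Utf8CharacterCount(const char* s)
{
    std::size_t count = 0;
    while (*s)
    {
        s += Utf8SequenceLength(s);
        ++count;
    }
    return count;
}

// Largest byte count <= maxBytes that ends on a character boundary,
// i.e. how many bytes can be copied without splitting a character.
std::size_t Utf8SafePrefixLength(const char* s, std::size_t maxBytes)
{
    std::size_t used = 0;
    while (s[used])
    {
        const std::size_t n = Utf8SequenceLength(s + used);
        if (used + n > maxBytes)
            break;
        used += n;
    }
    return used;
}
```

<p>With the [2 byte][2 byte][1 byte] example from above, asking Utf8SafePrefixLength for at most 3 bytes would give 2, so a memcpy of that many bytes never copies half a character. The destination still needs its own NUL terminator appended after the copy.</p>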
