Utf-8 in c++: quick & dirty tricks
<p>I am aware that there have been various questions about utf-8, mainly about libraries to manipulate utf-8 'string' like objects.</p>
<p>However, I am working on an 'internationalized' project (a website, of which I code a c++ backend... don't ask) where even if we deal with utf-8 we don't actually need such libraries. Most of the time the plain std::string methods or STL algorithms are quite sufficient for our needs, and indeed this is the goal of using utf-8 in the first place.</p>
<p>So, what I am looking for here is a compilation of the <em>"Quick &amp; Dirty"</em> tricks that you know of related to utf-8 stored as std::string (no const char*, I don't care about c-style code really, I've got better things to do than constantly worrying about my buffer size).</p>
<p>For example, here is a <em>"Quick &amp; Dirty"</em> trick to obtain the number of characters (which is useful to know if it will fit in your display box):</p>
<pre><code>#include &lt;string&gt;
#include &lt;algorithm&gt;

// Let's remember that in utf-8 encoding, a character may be
// 1 byte:  '0.......'
// 2 bytes: '110.....' '10......'
// 3 bytes: '1110....' '10......' '10......'
// 4 bytes: '11110...' '10......' '10......' '10......'
// Therefore '10......' is not the beginning of a character ;)

const unsigned char mask = 0xC0;          // keeps the two high bits of a byte
const unsigned char notUtf8Begin = 0x80;  // '10......': a continuation byte

struct Utf8Begin
{
  // True for every byte that starts a character
  bool operator()(char c) const { return (c &amp; mask) != notUtf8Begin; }
};

// Let's count
size_t countUtf8Characters(const std::string&amp; s)
{
  return std::count_if(s.begin(), s.end(), Utf8Begin());
}
</code></pre>
<p>In fact I have yet to encounter a use case where I would need anything other than the number of characters and that std::string or the STL algorithms don't offer for free, since:</p>
<ul>
<li>sorting works as expected</li>
<li>no part of a word can be confused as a word or part of another word</li>
</ul>
<p>I would like to know if you have other comparable tricks, both for counting and for other simple tasks.<br>
I repeat, I know about <a href="http://site.icu-project.org/" rel="noreferrer">ICU</a> and <a href="http://utfcpp.sourceforge.net/" rel="noreferrer">Utf8-CPP</a>, but I am not interested in them since I don't need a full-fledged treatment (and in fact I have never needed more than the count of characters).<br>
I also repeat that I am not interested in treating char*'s, they are old-fashioned.</p>
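<p>To make the "will it fit in the display box" case concrete, here is a minimal sketch that reuses the same lead-byte test, both to count characters and to cut a string without splitting a multi-byte sequence. The names <code>isUtf8LeadByte</code> and <code>truncateUtf8</code> are hypothetical, made up for this illustration, and the code assumes the input is already valid utf-8 and that "character" means a code point.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// True for any byte that is not a continuation byte ('10......').
// Hypothetical helper name, not part of any library.
inline bool isUtf8LeadByte(char c) {
    return (static_cast<unsigned char>(c) & 0xC0) != 0x80;
}

// Number of code points, assuming s holds valid utf-8.
inline std::size_t countUtf8Characters(const std::string& s) {
    return static_cast<std::size_t>(
        std::count_if(s.begin(), s.end(), isUtf8LeadByte));
}

// Hypothetical helper: keep at most maxChars code points and never
// cut in the middle of a multi-byte sequence.
inline std::string truncateUtf8(const std::string& s, std::size_t maxChars) {
    std::size_t seen = 0;  // lead bytes encountered so far
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (isUtf8LeadByte(s[i])) {
            // The (maxChars + 1)-th character would start here: stop before it.
            if (seen == maxChars) return s.substr(0, i);
            ++seen;
        }
    }
    return s;  // the string already fits
}
```

<p>With the 6-byte string <code>"h\xC3\xA9llo"</code> ("héllo"), <code>countUtf8Characters</code> gives 5 and <code>truncateUtf8(s, 2)</code> gives the 3 bytes <code>"h\xC3\xA9"</code>, so the accented letter stays whole.</p>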
 
