JavaScript and string manipulation w/ UTF-16 surrogate pairs
<p>I'm working on a Twitter app and just stumbled into the world of UTF-8 and UTF-16. It seems the majority of JavaScript string functions are as blind to surrogate pairs as I was. I've got to recode some stuff to make it wide-character aware.</p>
<p>I've got this function to parse strings into arrays while preserving the surrogate pairs. Then I'll recode several functions to deal with the arrays rather than strings.</p>
<pre><code>function sortSurrogates(str) {
  var cp = [];                        // array to hold code points
  while (str.length) {                // loop till we've done the whole string
    // A high surrogate (D800-DBFF) followed by a low surrogate (DC00-DFFF)
    // forms one pair; anything else is treated as a single code unit.
    if (/^[\uD800-\uDBFF][\uDC00-\uDFFF]/.test(str)) {
      cp.push(str.substr(0, 2));      // push the two onto array
      str = str.substr(2);            // clip the two off the string
    } else {                          // else BMP code point (or lone surrogate)
      cp.push(str.substr(0, 1));      // push one onto array
      str = str.substr(1);            // clip one from string
    }
  }                                   // loop
  return cp;                          // return the array
}
</code></pre>
<p>My question is: is there something simpler I'm missing? I see so many people reiterating that JavaScript deals with UTF-16 natively, yet my testing leads me to believe that, while that may be the data format, the functions don't know it yet. Am I missing something simple?</p>
<p>EDIT: To help illustrate the issue:</p>
<pre><code>var a = "0123456789";            // U+0030 - U+0039, 2 bytes each
var b = "𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡";  // U+1D7D8 - U+1D7E1, 4 bytes each
alert(a.length);                 // JavaScript shows 10
alert(b.length);                 // JavaScript shows 20
</code></pre>
<p>Twitter sees and counts both of those as being 10 characters long.</p>
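One possibly simpler route, sketched here on the assumption that an ES2015 or later engine is available (the post does not say which browsers must be supported): the built-in string iterator walks a string by code point rather than by UTF-16 code unit, so Array.from keeps each surrogate pair together. The helper name splitCodePoints is hypothetical and is not part of the original post.

```javascript
// Hypothetical helper (not from the original post): split a string into an
// array with one element per code point. Assumes an ES2015+ engine.
function splitCodePoints(str) {
  // The string iterator yields whole code points, so a valid surrogate
  // pair ends up in a single array element. A lone surrogate is yielded
  // on its own.
  return Array.from(str);
}

// Rebuild the two test strings from the question without relying on the
// source file's encoding.
var a = "0123456789";                 // U+0030 - U+0039
var b = "";
for (var cp = 0x1D7D8; cp <= 0x1D7E1; cp++) {
  b += String.fromCodePoint(cp);      // U+1D7D8 - U+1D7E1
}

console.log(a.length, splitCodePoints(a).length); // 10 10
console.log(b.length, splitCodePoints(b).length); // 20 10
```

The length property still counts UTF-16 code units, which is why b.length stays at 20. Only the iteration is code-point aware. This works at the code-point level only, so sequences made of several code points (for example a base letter plus a combining mark) would still be split into separate elements.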
 
