How to remove invalid UTF-8 characters from a JavaScript string?
<p>I'd like to remove all invalid UTF-8 characters from a string in JavaScript. I've tried this:</p>
<p><code>strTest = strTest.replace(/([\x00-\x7F]|[\xC0-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF7][\x80-\xBF]{3})|./g, "$1"); </code></p>
<p>The UTF-8 validation regex described here <em>(link removed)</em> seems to be more complete, so I adapted it in the same way:</p>
<p><code>strTest = strTest.replace(/([\x09\x0A\x0D\x20-\x7E]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|\xED[\x80-\x9F][\x80-\xBF]|\xF0[\x90-\xBF][\x80-\xBF]{2}|[\xF1-\xF3][\x80-\xBF]{3}|\xF4[\x80-\x8F][\x80-\xBF]{2})|./g, "$1"); </code></p>
<p>Both snippets let valid UTF-8 through, but they filter out hardly any of the bad UTF-8 sequences in my test data: <a href="http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt" rel="noreferrer">UTF-8 decoder capability and stress test</a>. The bad characters either come through unchanged or lose some of their bytes, which creates a new invalid character.</p>
<p>I'm not very familiar with the UTF-8 standard or with multibyte handling in JavaScript, so I'm not sure whether the regex fails to represent proper UTF-8 or whether I'm applying it improperly in JavaScript.</p>
<p>Edit: added the global flag to my regex per Tomalak's comment; however, this still isn't working for me. I'm abandoning doing this on the client side per bobince's comment.</p>
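One likely contributor to the behaviour described above is that a JavaScript string is a sequence of UTF-16 code units, not bytes, so a class such as `[\xC0-\xDF]` matches the characters U+00C0 to U+00DF rather than UTF-8 lead bytes. The following is a minimal sketch of a different approach, not the method from the question: it assumes the input is a "binary string" holding one byte per character (char codes 0 to 255), decodes those bytes with the standard `TextDecoder`, and drops the U+FFFD replacement characters that the decoder emits for malformed sequences. The function name `stripInvalidUtf8` is hypothetical.

```javascript
// Sketch only. Assumption: `byteString` holds one raw byte per character
// (char codes 0-255). `stripInvalidUtf8` is a hypothetical helper name.
function stripInvalidUtf8(byteString) {
  // Turn each character back into the byte it stands for.
  const bytes = Uint8Array.from(byteString, (ch) => ch.charCodeAt(0) & 0xff);

  // Non-fatal decoding replaces every malformed sequence with U+FFFD.
  // ignoreBOM: true keeps a leading byte order mark instead of dropping it.
  const decoder = new TextDecoder("utf-8", { fatal: false, ignoreBOM: true });
  const decoded = decoder.decode(bytes);

  // Remove the replacement characters. Caveat: a U+FFFD that was validly
  // encoded in the input (EF BF BD) is removed as well.
  return decoded.replace(/\uFFFD/g, "");
}
```

For example, `stripInvalidUtf8("a\xFFb")` drops the stray `0xFF` byte, while `stripInvalidUtf8("\xC3\xA9")` decodes to `é`. If the input is already an ordinary JavaScript string rather than raw bytes, this sketch does not apply as written.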