  1. How do I sanitize user input for proper content-encoding before I save it?
    <p>I've got an application where users input text into forms.</p>
    <p>The data is saved into a MySQL database (collation: <code>utf8_general_ci</code>) and then output as XML (encoding: UTF-8).</p>
    <p>The problem is that people tend to cut and paste their information from other sources, for instance Microsoft Word documents or PDFs.</p>
    <p>This input text often contains characters that are incorrect for the output encoding, such as "smart quotes", which come from a document in <a href="http://en.wikipedia.org/wiki/Windows-1252" rel="noreferrer">Windows-1252 encoding</a>.</p>
    <p>This obviously causes problems when transforming or otherwise working on the XML, because the characters are illegal.</p>
    <p>So, how do I sanitise the input?</p>
    <p>Previously, I've used some fairly brute-force methods, such as the <a href="http://www.fourmilab.ch/webtools/demoroniser/" rel="noreferrer">"de-moronize" script</a>, which consists of a long list of search-and-replace operations.</p>
    <p>Is this still the best way to do it? Is there any other way?</p>
    <p>Can I just set the <a href="http://www.w3.org/TR/html401/interact/forms.html#adef-accept-charset" rel="noreferrer">accept-charset attribute</a> on the form and have the browser do it for me?</p>
    <p>If so, which browsers will do that, and are there likely to be any problems?</p>
    <p>Also, how come my database is accepting these characters, which are reserved/control characters in UTF-8?</p>
    <p>As you can see, I know enough about encodings to know I have a problem, but I'm now a bit out of my depth...</p>
    <p>TIA</p>
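
To make the problem concrete, here is a minimal Python sketch of one possible server-side approach. It rests on assumptions the question does not confirm: that the raw submitted bytes are available, and that anything which is not valid UTF-8 came from a Windows-1252 source. The function names `to_text`, `strip_xml_illegal` and `sanitize` are hypothetical, chosen only for illustration.

```python
import re

# C0 control characters other than tab, LF and CR, plus U+FFFE and U+FFFF.
# XML 1.0 does not allow these in a document.
_XML_ILLEGAL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")


def to_text(raw: bytes) -> str:
    """Decode submitted bytes as UTF-8, falling back to Windows-1252."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: input that is not valid UTF-8 is Windows-1252.
        # Bytes that are undefined in cp1252 become U+FFFD.
        return raw.decode("cp1252", errors="replace")


def strip_xml_illegal(text: str) -> str:
    """Remove the characters matched by _XML_ILLEGAL."""
    return _XML_ILLEGAL.sub("", text)


def sanitize(raw: bytes) -> str:
    """Decode the input, then drop characters XML 1.0 would reject."""
    return strip_xml_illegal(to_text(raw))
```

With this sketch, Windows-1252 smart quotes such as the bytes `0x93` and `0x94` are decoded to the real Unicode quotation marks U+201C and U+201D rather than being replaced by a lookup table, and input that is already valid UTF-8 passes through unchanged. The fallback is a heuristic, not a guarantee: a byte sequence can be valid in both encodings and mean different things.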
 
