<p>There are quite a number of issues here:</p>
<p><strong>About the characters</strong></p>
<p>You indicate that the text has characters U+AE and U+2122 (® and ™ respectively). However, the results imply that the text has U+99 as the character after "Blang": when you set MySQL to output UTF-8, you see "Â™", which is the UTF-8 sequence for U+99 (bytes 0xC2 0x99) displayed on a terminal that is interpreting the byte stream as Windows-1252.</p>
<p>U+99 probably isn't what you wanted: in Unicode, it is a C1 control character with no graphic representation. It just so happens that in Windows-1252, the byte 0x99 is the encoding of the trademark symbol (U+2122).</p>
<p>(Please note that both MySQL and most web browsers share a common, "broken" behavior of using Windows-1252 when you choose Latin1. Sigh.)</p>
<p><strong>What's probably wrong</strong></p>
<ol>
<li><p>Your terminal isn't operating in the right character set. It is clearly operating in Windows-1252.</p></li>
<li><p>Programs should connect to the database in UTF-8. You can do that on the command line, as you've found, or by executing the statement <code>SET NAMES utf8;</code> on your database handle before doing anything else. (<code>SET NAMES</code> takes a character set name such as <code>utf8</code>, not a collation name such as <code>utf8_general_ci</code>.) Some other database APIs may have other ways of doing this, but there is no generic way for all SQL engines. <code>SET NAMES ...</code> is specific to MySQL, but sets all the required character set variables (there are three!) at once.</p></li>
<li><p>The process that is inserting data into the database is taking user input and not correctly converting it from Windows-1252 into UTF-8 before inserting. This is how you got a U+99 into your database. Since I don't know how you are getting that data, I'm not sure what to fix, but here are several possibilities:</p>
<ol>
<li><p>If the data comes from a web page form, be sure the page with the form is served in UTF-8 and is properly marked as such (via the MIME type and the <code>&lt;meta&gt;</code> tag).
Be sure also that the <code>&lt;form&gt;</code> tag is not specifying a different character set (for example, through an <code>accept-charset</code> attribute).</p></li>
<li><p>When converting the data, be sure that you use <strong>iconv</strong> or a similar library to convert from the input character set to UTF-8. Even if you think the input is Latin1, do not try to do this by hand (for example, by zero-expanding every byte to 16 bits and then claiming the result is UTF-16; that won't work for Windows-1252!). Make absolutely certain that you know the character set of the source data. In particular, be sure to know whether it is Latin1 or Windows-1252.</p></li>
<li><p>Instead of converting the user input, you could connect to the database in the character set of the user input, and then just insert the raw byte data you get from the user. However, you must be sure to do only insertions this way: reading data back from the database with the user's character set in effect will lose information if other rows have data that can't be represented in that character set. It is possible to set up a MySQL connection so that you issue statements in one character set and read results back in another, but it isn't for the faint of heart, and future programmers will likely go nuts trying to understand why the code does this.</p></li>
</ol></li>
<li><p>If, when you pull the data out with Python and display it in a web page, you see the string "Â™", that indicates that you are pulling the data out of the database correctly as UTF-8, but then putting it into a web page that is not correctly identified as UTF-8. Probably it is just defaulting to Latin1, which as noted above will really be Windows-1252.</p></li>
<li><p>Nonetheless, even if you fix the display, note that the database has bad data in it, since U+99 isn't really the trademark symbol in a UTF-8 column.
You'll need to clean up your data by reading all of it and replacing any characters in the range U+80 through U+9F with what they were likely to have been, assuming the data was really Windows-1252. If you're not certain what character set the data was in originally, then this data is, alas, just junk.</p></li>
</ol>
<p><strong>About changing character sets of tables</strong></p>
<ol>
<li><p>Converting the character set and collation of the table after inserting data will convert the columns, but, of course, any data already inserted will have already lost whatever characters the original character set couldn't represent.</p></li>
<li><p>Be careful to note the difference between <code>ALTER TABLE foo CONVERT TO CHARACTER SET ...</code> and <code>ALTER TABLE foo CHARACTER SET ...</code>. The latter only changes the default character set for the table, and will not change any columns, even if they were set to the default at creation. (MySQL only uses the defaults at column creation time; it doesn't remember that a given column is "defaulted", nor does it keep it in sync with the table's default.)</p></li>
</ol>
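<p>To make the byte-level story above concrete, here is a minimal Python sketch, not a definitive fix. It assumes the bad rows came from Windows-1252 input that was zero-expanded into code points (equivalent to decoding it as Latin1). The function names are hypothetical, only the standard library is used, and no database is involved.</p>

```python
# Hypothetical helpers illustrating the corruption and one possible cleanup.
# Assumption: the original input really was Windows-1252.


def simulate_bad_insert(text: str) -> str:
    """Mimic the faulty path: Windows-1252 bytes zero-expanded to code points.

    Decoding as Latin1 maps every byte 0xNN straight to U+00NN, which is
    exactly the "by hand" conversion the answer warns against.
    """
    return text.encode("cp1252").decode("latin-1")


def as_seen_in_cp1252(text: str) -> str:
    """Show how the UTF-8 bytes of `text` look when read as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


def repair_c1(text: str) -> str:
    """Replace U+80..U+9F with the Windows-1252 character for that byte value."""
    out = []
    for ch in text:
        if 0x80 <= ord(ch) <= 0x9F:
            try:
                out.append(bytes([ord(ch)]).decode("cp1252"))
            except UnicodeDecodeError:
                # 0x81, 0x8D, 0x8F, 0x90 and 0x9D are unassigned in
                # Windows-1252, so there is nothing sensible to map them to.
                out.append(ch)
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    stored = simulate_bad_insert("Blang\u2122")  # what ended up in the column
    print(repr(stored))
    print(as_seen_in_cp1252(stored))
    print(repair_c1(stored))
```

<p>With the sample string "Blang™", the stored value ends in U+99, it displays as "BlangÂ™" when its UTF-8 bytes are read as Windows-1252, and <code>repair_c1</code> turns it back into "Blang™". If the source data was not actually Windows-1252, this mapping would be a guess and should not be applied.</p>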
 
