    <p>It's not quite the same thing as dead keys, but it's related. As you've worked out, U+00F6 and U+006F followed by U+0308 have the same visual result.</p>
    <p>There are in fact Unicode rules for knowing when to treat them as the same, and these are based on decompositions. There's a decomposition table in the character database that tells us that U+00F6 <em>canonically</em> decomposes to U+006F followed by U+0308.</p>
    <p>As well as canonical decompositions, there are compatibility decompositions. These lose some information; for example, <code>²</code> ends up being decomposed to <code>2</code>. This is clearly a destructive change, but it is useful for searching when you want to be a bit fuzzy (it's how Google knows a search for <code>fiſh</code> should return results about fish).</p>
    <p>If there is more than one combining character after a non-combining character, then we can re-order them <strong>as long as we don't re-order those of the same class</strong>. This becomes clear when we consider that it doesn't matter whether we put a cedilla on something and then an acute accent, or an acute and then a cedilla, but if we put both an acute and an umlaut on a letter it clearly matters which way around they go.</p>
    <p>From this, we have 4 normalisation forms. Put strings into an appropriate normalisation form before doing comparisons, and you don't get tripped up.</p>
    <p>NFD: Break everything apart by <strong>canonically</strong> decomposing it as much as possible. Reorder combining characters in order of their combining class, but keep any with the same class in the same order relative to each other.</p>
    <p>NFC: First put everything into NFD. Then repeatedly look at the combining characters in order: if a combining character has no earlier one of the same class in front of it, and there is an equivalent single character for it together with the preceding base character, then replace them, and re-do the scan looking to compose further.</p>
    <p>NFKD: Like NFD, but using compatibility decomposition (damaging change, but useful for comparisons as explained above).</p>
    <p>NFKC: Do NFKD, then re-combine <strong>canonical</strong> only as per NFC.</p>
    <p>There are also some re-combinations banned from use in NFC so that text that was valid NFC in one version of Unicode doesn't cease to be NFC if Unicode has more characters added to it.</p>
    <p>Of NFD and NFC, NFC is clearly the more concise. It's not the most concise possible, but it is one that is very concise and can be tested for and/or created in a very efficient streaming manner.</p>
    <p>Mac OS X uses NFD for file names. Because they're weirdos. (Okay, there are better arguments than that, they just didn't convince me!)</p>
    <p>The Web Character Model uses NFC.* As such, you should use NFC on web stuff as much as possible. There can, though, be security considerations in blindly converting stuff to NFC. But if it starts from you, it should start in NFC.</p>
    <p>Any programming language that deals with text should have a nice way of normalising text into any of these forms. If yours doesn't, complain (or if yours is open source, contribute!).</p>
    <p>See <a href="http://unicode.org/faq/normalization.html" rel="noreferrer">http://unicode.org/faq/normalization.html</a> for more, or <a href="http://unicode.org/reports/tr15/" rel="noreferrer">http://unicode.org/reports/tr15/</a> for the full gory details.</p>
    <p>*For extra fun, if you inserted something beginning with a combining long solidus overlay (U+0338) at the start of an XML or HTML element's content, it would turn the <code>&gt;</code> of the tag into <code>≯</code>, turning well-formed XML into gibberish. For this reason the web character model insists that each entity must itself be NFC and not start with a combining character.</p>
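    <p>As a rough illustration of the points above, here is a minimal sketch using Python's standard <code>unicodedata</code> module. The helper <code>same_text</code> and the sample strings are made up for this example and are not part of any library; the choice of NFC as the default comparison form is likewise just an assumption for the sketch.</p>

```python
import unicodedata


def same_text(a: str, b: str, form: str = "NFC") -> bool:
    """Hypothetical helper: compare two strings after normalising both to `form`."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


composed = "\u00f6"      # ö as a single code point
decomposed = "o\u0308"   # o followed by COMBINING DIAERESIS

# The raw code point sequences differ, but they match once normalised.
raw_equal = composed == decomposed
normalised_equal = same_text(composed, decomposed)

# Canonical forms round-trip between the two spellings.
to_nfc = unicodedata.normalize("NFC", decomposed)
to_nfd = unicodedata.normalize("NFD", composed)

# Compatibility forms are lossy: superscript two and long s are folded away.
folded_two = unicodedata.normalize("NFKC", "\u00b2")
folded_fish = unicodedata.normalize("NFKD", "fi\u017fh")

# Cedilla and acute have different combining classes, so either order
# normalises to the same sequence.
cedilla_class = unicodedata.combining("\u0327")
acute_class = unicodedata.combining("\u0301")
different_class_equal = same_text("c\u0301\u0327", "c\u0327\u0301", "NFD")

# Acute and diaeresis share a class, so their relative order is preserved
# and the two strings stay distinct.
same_class_equal = same_text("o\u0301\u0308", "o\u0308\u0301", "NFD")

# The footnote's case: ">" followed by U+0338 composes under NFC.
composed_gt = unicodedata.normalize("NFC", ">\u0338")

print(raw_equal, normalised_equal)
print(different_class_equal, same_class_equal)
print(folded_two, folded_fish)
```

    <p>Normalising both sides inside the comparison, rather than trusting the input, keeps the check correct regardless of which form the strings arrived in.</p>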