Replacing Unicode punctuation with ASCII approximations
<p>I am reading some text files in a Java program and would like to replace some Unicode characters with ASCII approximations. These files will eventually be broken into sentences that are fed to OpenNLP. OpenNLP does not recognize Unicode characters and gives improper results on a number of symbols: it tokenizes "girl's" as "girl" and "'s", but if the apostrophe is a Unicode quote, the whole word is treated as a single token.</p>
<p>For example, the source sentence may contain the Unicode directional quotation mark <a href="http://tools.scarfboy.com/unicode.html?s=U%2B2018" rel="noreferrer">U+2018</a> (‘) and I would like to convert that to <a href="http://tools.scarfboy.com/unicode.html?s=U%2b0027" rel="noreferrer">U+0027</a> ('). Eventually I will be stripping the remaining Unicode.</p>
<p>I understand that I am losing information, and I know that I could write regular expressions to convert each of these symbols, but I am asking whether there is existing code I can reuse to convert some of them.</p>
<p>This is what I have so far, but I'm sure I will make mistakes, miss things, etc.:</p>
<pre><code>// double quotation (")
replacements.add(new Replacement(Pattern.compile("[\u201c\u201d\u201e\u201f\u275d\u275e]"), "\""));

// single quotation (')
replacements.add(new Replacement(Pattern.compile("[\u2018\u2019\u201a\u201b\u275b\u275c]"), "'"));
</code></pre>
<p><code>replacements</code> is a list of a custom <code>Replacement</code> class that I later iterate over to apply the replacements.</p>
<pre><code>for (Replacement replacement : replacements) {
    text = replacement.pattern.matcher(text).replaceAll(replacement.replacement);
}
</code></pre>
<p>As you can see, I had to find:</p>
<ul>
<li>LEFT SINGLE QUOTATION MARK</li>
<li>RIGHT SINGLE QUOTATION MARK</li>
<li>SINGLE LOW-9 QUOTATION MARK (what is this/should I replace this?)</li>
<li>SINGLE HIGH-REVERSED-9 QUOTATION MARK (what is this/should I replace this?)</li>
</ul>
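<p>For reference, the two snippets above can be put together into one self-contained sketch. The shape of <code>Replacement</code> (two public fields and a constructor) and the names <code>PunctuationNormalizer</code> and <code>normalize</code> are assumptions made for illustration; only the two character classes come from the code above, and the table is not meant to be complete.</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: pairs a compiled pattern with its ASCII replacement.
class Replacement {
    final Pattern pattern;
    final String replacement;

    Replacement(Pattern pattern, String replacement) {
        this.pattern = pattern;
        this.replacement = replacement;
    }
}

// Hypothetical wrapper that applies every replacement in order.
class PunctuationNormalizer {
    private static final List<Replacement> REPLACEMENTS = new ArrayList<>();

    static {
        // double quotation (")
        REPLACEMENTS.add(new Replacement(
                Pattern.compile("[\u201c\u201d\u201e\u201f\u275d\u275e]"), "\""));
        // single quotation (')
        REPLACEMENTS.add(new Replacement(
                Pattern.compile("[\u2018\u2019\u201a\u201b\u275b\u275c]"), "'"));
    }

    // Runs each pattern over the text and returns the rewritten string.
    static String normalize(String text) {
        for (Replacement r : REPLACEMENTS) {
            text = r.pattern.matcher(text).replaceAll(r.replacement);
        }
        return text;
    }

    public static void main(String[] args) {
        // prints: the girl's "book"
        System.out.println(normalize("the girl\u2019s \u201cbook\u201d"));
    }
}
```

<p>The patterns are compiled once and reused for every input, which should matter when many files are processed. Characters outside the two classes are left untouched.</p>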
 
