<p>This is a common issue with the <code>tm</code> package (<a href="https://stat.ethz.ch/pipermail/r-help/2011-February/268398.html" rel="nofollow">1</a>, <a href="http://www.mail-archive.com/r-help@r-project.org/msg82509.html" rel="nofollow">2</a>, <a href="http://comments.gmane.org/gmane.comp.lang.r.general/229909" rel="nofollow">3</a>).</p>
<p>One non-<code>R</code> way to fix it is to use a text editor to find and replace all the fancy characters (i.e. those with diacritics) in your text before loading it into <code>R</code> (or to use <code>gsub</code> in <code>R</code>). For example, you'd search for and replace all instances of the O-umlaut in Öl-Teppich. <a href="http://comments.gmane.org/gmane.comp.lang.r.general/229909" rel="nofollow">Others</a> have had success with this (I have too), but it is impractical if you have thousands of individual text files.</p>
<p>For an <code>R</code> solution, I found that using <code>VectorSource</code> instead of <code>DirSource</code> seems to solve the problem:</p>
<pre><code># I put your example text in a file and tested it with both ANSI and
# UTF-8 encodings; both enabled me to reproduce your problem
#
tmp &lt;- Corpus(DirSource('C:\\...\\tmp/'))
tmp &lt;- tm_map(tmp, tolower)
# Error in FUN(X[[1L]], ...) :
#   invalid input 'RT @noXforU Erneut riesiger (Alt-)Öl–teppich im Golf von Mexiko (#pics vom Freitag) http://bit.ly/bw1hvU http://bit.ly/9R7JCf #oilspill #bp' in 'utf8towcs'
#
# quite similar to the error you got, with both ANSI and UTF-8 encodings
#
# Now try VectorSource instead of DirSource
tmp &lt;- readLines('C:\\...\\tmp.txt')
tmp
# [1] "RT @noXforU Erneut riesiger (Alt-)Öl–teppich im Golf von Mexiko (#pics vom Freitag) http://bit.ly/bw1hvU http://bit.ly/9R7JCf #oilspill #bp"
#
# looks ok so far
tmp &lt;- Corpus(VectorSource(tmp))
tmp &lt;- tm_map(tmp, tolower)
tmp[[1]]
# rt @noxforu erneut riesiger (alt-)öl–teppich im golf von mexiko (#pics vom freitag) http://bit.ly/bw1hvu http://bit.ly/9r7jcf #oilspill #bp
#
# seems like it's worked just fine. It worked best with ANSI encoding.
# There was no error with UTF-8 encoding, but the Ö was returned
# as ã– which is not good
</code></pre>
<p>But this seems like a bit of a lucky coincidence. There must be a more direct way to go about it. Do let us know what works for you!</p>
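<p>The <code>gsub</code> route mentioned above could look roughly like the sketch below. It assumes German text where transliterating the umlauts (Ö to Oe, and so on) is acceptable, and it also swaps the en dash for a plain hyphen. <code>replace_umlauts</code> is a hypothetical helper written for this illustration, not part of <code>tm</code>, and the character list is an assumption that you would extend for your own data.</p>

```r
# Hypothetical helper (not part of tm): transliterate German umlauts,
# the sharp s and the en dash to plain ASCII before the text reaches tm_map().
replace_umlauts <- function(x) {
  # \u escapes keep this script independent of the file's own encoding
  from <- c("\u00c4", "\u00d6", "\u00dc", "\u00e4", "\u00f6", "\u00fc",
            "\u00df", "\u2013")
  to   <- c("Ae", "Oe", "Ue", "ae", "oe", "ue", "ss", "-")
  for (i in seq_along(from)) {
    # fixed = TRUE treats the pattern as a literal string, not a regex
    x <- gsub(from[i], to[i], x, fixed = TRUE)
  }
  x
}

# Assumed sample input, shortened from the tweet in the question
tweet <- "Erneut riesiger (Alt-)\u00d6l\u2013teppich im Golf von Mexiko"
clean <- replace_umlauts(tweet)

# The cleaned string is plain ASCII, so tolower() has nothing to choke on
print(tolower(clean))
```

<p>Run this on the character vector returned by <code>readLines</code> before wrapping it in <code>Corpus(VectorSource(...))</code>. The trade-off is that the original spelling is lost, so it only suits analyses where Oel and Öl can be treated as the same token.</p>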