
plurals
  1. Removing an "empty" character item from a corpus of documents in R?
    primarykey
    data
    text
    <p>I am using the <code>tm</code> and <code>lda</code> packages in R to topic model a corpus of news articles. However, an empty string (<code>""</code>) keeps showing up as a token, and it is messing up my topics. Here is my workflow:</p>
<pre><code>text &lt;- Corpus(VectorSource(d$text))
newtext &lt;- lapply(text, tolower)
sw &lt;- c(stopwords("english"), "ahram", "online", "egypt", "egypts", "egyptian")
newtext &lt;- lapply(newtext, function(x) removePunctuation(x))
newtext &lt;- lapply(newtext, function(x) removeWords(x, sw))
newtext &lt;- lapply(newtext, function(x) removeNumbers(x))
newtext &lt;- lapply(newtext, function(x) stripWhitespace(x))
d$processed &lt;- unlist(newtext)
corpus &lt;- lexicalize(d$processed)
k &lt;- 40
result &lt;- lda.collapsed.gibbs.sampler(corpus$documents, k, corpus$vocab, 500, .02, .05,
                                      compute.log.likelihood = TRUE, trace = 2L)
</code></pre>
    <p>When I train the LDA model, everything looks great except that the most frequently occurring word is <code>""</code>. I tried to remedy this by removing it from the text as shown below and re-estimating the model exactly as above:</p>
<pre><code>newtext &lt;- lapply(newtext, function(x) removeWords(x, ""))
</code></pre>
    <p>But it is still there, as evidenced by:</p>
<pre><code>str_split(newtext[[1]], " ")
[[1]]
 [1] ""              "body"          "mohamed"       "hassan"
 [5] "cook"          "found"         "turkish"       "search"
 [9] "rescue"        "teams"         "rescued"       "hospital"
[13] "rescue"        "teams"         "continued"     "search"
[17] "missing"       "body"          "cook"          "crew"
[21] "wereegyptians" "sudanese"      "syrians"       "hassan"
[25] "cook"          "cargo"         "ship"          "sea"
[29] "bright"        "crashed"       "thursday"      "port"
[33] "antalya"       "southern"      "turkey"        "vessel"
[37] "collided"      "rocks"         "port"          "thursday"
[41] "night"         "result"        "heavy"         "winds"
[45] "waves"         "crew"          ""
</code></pre>
    <p>Any suggestions on how to remove this? Adding <code>""</code> to my list of stopwords doesn't help either.</p>
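The `str_split()` output above shows the empty strings only at the first and last positions, which suggests (as an assumption, not something confirmed in the post) that each processed document starts and ends with a single space: `stripWhitespace()` collapses runs of whitespace but does not trim the ends, so splitting on `" "` produces a zero-length token on each side. A minimal base-R sketch of two possible workarounds, using a made-up `doc` string as a stand-in for one element of `newtext`:

```r
# Hypothetical stand-in for one processed document: a leading and a
# trailing space remain after stopword removal and whitespace collapsing.
doc <- " body mohamed hassan cook found "

# Option 1: trim both ends before tokenizing (trimws() is base R >= 3.2.0).
trimmed <- trimws(doc)
tokens_trimmed <- strsplit(trimmed, " ", fixed = TRUE)[[1]]

# Option 2: tokenize first, then drop zero-length tokens with nzchar().
tokens_raw <- strsplit(doc, " ", fixed = TRUE)[[1]]
tokens_filtered <- tokens_raw[nzchar(tokens_raw)]

print(tokens_trimmed)
print(tokens_filtered)
```

If the assumption holds, applying the same trim before `lexicalize()`, for example `d$processed <- trimws(unlist(newtext))`, may keep `""` out of the vocabulary; this has not been checked against the data in the question.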
 
