  1. Stopword removal using Python
    <p>All,</p>
    <p>I have some text that I need to clean up, and I have a small algorithm that "mostly" works.</p>
<pre><code>def removeStopwords(self, data):
    # Load the entries to strip, one per line of stopwords.txt
    with open(r'stopwords.txt') as stopwords:
        wordList = []
        for i in stopwords:
            wordList.append(i.strip())
    # Drop every character found in wordList, then normalize whitespace
    charList = list(data)
    cat = ''.join(char for char in charList if char not in wordList).split()
    return ' '.join(cat)
</code></pre>
    <p>Take the first line on this page, <a href="http://en.wikipedia.org/wiki/Paragraph" rel="nofollow">http://en.wikipedia.org/wiki/Paragraph</a>, and remove all the characters that we are not interested in, which in this case are all the non-alphanumeric characters.</p>
    <blockquote> <p>A paragraph (from the Greek paragraphos, "to write beside" or "written beside") is a self-contained unit of a discourse in writing dealing with a particular point or idea. A paragraph consists of one or more sentences.[1][2] The start of a paragraph is indicated by beginning on a new line. Sometimes the first line is indented. At various times, the beginning of a paragraph has been indicated by the pilcrow: ¶.</p> </blockquote>
    <p>The output looks pretty good, except that some of the words are joined together incorrectly and I am unsure how to correct it.</p>
    <blockquote> <p>A paragraph from the Greek paragraphos to write beside or written beside is a selfcontained unit</p> </blockquote>
    <p>Note that the word "selfcontained" was originally "self-contained".</p>
    <p>EDIT: Contents of the stopwords file, which is just a bunch of characters:</p>
    <blockquote> <p>! $ % ^ , &amp; * ( ) { } [ ] &lt; </p> <p>, . / | \ ? ~ ` : ; "</p> </blockquote>
    <p>It turns out I don't need a list of words at all, because I was only really trying to remove characters, which in this case were punctuation marks.</p>
<pre><code>import string

# Python 2: str.translate(None, chars) deletes every character in chars
cat = data.translate(None, string.punctuation).split()
print ' '.join(cat).lower()
</code></pre>
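    <p>For reference, here is a minimal Python 3 sketch of the same punctuation-stripping idea. It is not the code from the post: the function name <code>strip_punctuation</code> and its <code>keep_word_breaks</code> flag are hypothetical, introduced only for illustration. It assumes that mapping each punctuation character to a space, rather than deleting it, is acceptable, so that "self-contained" comes out as two words instead of "selfcontained". Only the ASCII characters in <code>string.punctuation</code> are handled, so a character such as the pilcrow would pass through unchanged.</p>

```python
import string


def strip_punctuation(text, keep_word_breaks=True):
    """Remove ASCII punctuation from text and collapse runs of whitespace.

    Hypothetical helper for illustration only.
    keep_word_breaks=True  -> punctuation becomes a space ("self contained")
    keep_word_breaks=False -> punctuation is deleted      ("selfcontained")
    """
    if keep_word_breaks:
        # Map every punctuation character to a single space.
        table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    else:
        # Third argument of str.maketrans lists characters to delete.
        table = str.maketrans("", "", string.punctuation)
    # split() + join() collapses the extra spaces left behind.
    return " ".join(text.translate(table).split())


sample = 'A paragraph (from the Greek paragraphos, "to write beside") is a self-contained unit.'
cleaned = strip_punctuation(sample)
print(cleaned.lower())
```

    <p>Whether a hyphen should split a word or simply vanish depends on the downstream use of the text, which is why the sketch exposes both behaviours.</p>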