<p>Here is a Google patent from 2005:</p>
<p><a href="http://www.google.com/patents/US20050060312" rel="noreferrer">"Systems and methods for improving the ranking of news articles"</a></p>
<p>And an update from 2012:</p>
<p><a href="http://www.patentstorm.us/applications/20120158711/description.html" rel="noreferrer">SYSTEMS AND METHODS FOR IMPROVING THE RANKING OF NEWS ARTICLES</a></p>
<p>If you want to build a simple system yourself, I would do something like this:</p>
<p>Take a bunch of news stories that are already classified into sports/tech/whatever.</p>
<p>Tokenize them into individual words and n-grams (short sequences of words).</p>
<p>Create a really big table with the unique words and n-grams as the columns and the individual stories as the rows:</p>
<pre><code>StoryId  Class   word1  word2  gram1  gram2  ...
1        sports  0      0.2    0.01   0
2        tech    0.5    0.01   0      0.3
3        sports  0      0.1    0.3    0.01
</code></pre>
<p>The value in each cell is the frequency, binary occurrence, or TF-IDF score of that word or n-gram in that document.</p>
<p>Use a classification algorithm such as Naive Bayes or Support Vector Machines to learn the weights of the columns with respect to the class labels. The learned weights are your model.</p>
<p>When you get a new, unclassified document, tokenize it the same way as before and apply the model you created earlier; it will give you the most likely class label for the document.</p>
<p>Here is my video series, which includes a video on automatic document categorization:</p>
<p><a href="http://vancouverdata.blogspot.ca/2010/11/text-analytics-with-rapidminer-loading.html" rel="noreferrer">http://vancouverdata.blogspot.ca/2010/11/text-analytics-with-rapidminer-loading.html</a></p>
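<p>The steps above can be sketched in a few lines of plain Python. This is only a minimal illustration, not the method from the patents or the video series: it assumes raw token counts as the cell values (rather than TF-IDF), a multinomial Naive Bayes classifier with add-one smoothing, and bigrams as the only n-grams. The function names (tokenize, train, classify) and the four toy stories are made up for the example.</p>

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and return its words followed by its bigrams."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    bigrams = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return words + bigrams


def train(labeled_stories):
    """Count tokens per class from (text, label) pairs; the counts are the model."""
    class_counts = Counter()             # number of stories per class
    token_counts = defaultdict(Counter)  # token frequencies per class
    vocabulary = set()                   # the "columns" of the big table
    for text, label in labeled_stories:
        class_counts[label] += 1
        tokens = tokenize(text)
        token_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, token_counts, vocabulary


def classify(text, model):
    """Return the class with the highest Naive Bayes log-probability."""
    class_counts, token_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        # Log prior: how common the class is in the training data.
        score = math.log(class_counts[label] / total_docs)
        total_tokens = sum(token_counts[label].values())
        for token in tokenize(text):
            if token not in vocabulary:
                continue  # tokens never seen in training carry no information
            # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
            count = token_counts[label][token]
            score += math.log((count + 1) / (total_tokens + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data, invented for this sketch.
stories = [
    ("The team won the match with a late goal", "sports"),
    ("The striker scored twice in the cup final", "sports"),
    ("The new phone ships with a faster processor", "tech"),
    ("The startup released a software update for its app", "tech"),
]
model = train(stories)
print(classify("A late goal decided the final", model))
print(classify("The app got a software update", model))
```

<p>With real data you would use thousands of labeled stories and probably a library implementation, but the shape stays the same: tokenize, build the counts, learn the model, then tokenize new documents the same way before classifying them.</p>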