    <p>There are several conventional techniques by which <em>words</em> are mapped to <em>features</em> (columns in a 2D data matrix whose rows are the individual data vectors) for input to machine learning models, for instance for text <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.17.6513" rel="nofollow">classification</a>:</p>
    <ul>
    <li><p>a <em>Boolean</em> field which encodes the presence or absence of a particular word in a given document;</p></li>
    <li><p>a <em>frequency histogram</em> of a predetermined set of words, often the X most commonly occurring words across all documents in the training data (more about this one in the last paragraph of this Answer);</p></li>
    <li><p>the <em>juxtaposition</em> of two or more words (e.g., 'alternative' and 'lifestyle' in consecutive order have a meaning not related to either component word); this juxtaposition can either be captured in the data model itself, e.g., a Boolean feature that represents the presence or absence of two particular words directly adjacent to one another in a document, or be exploited by the ML technique, as a naive Bayesian classifier would do in this instance;</p></li>
    <li><p>words as <em>raw</em> data from which <em>to extract latent features</em>, e.g., <a href="http://en.wikipedia.org/wiki/Latent_semantic_analysis" rel="nofollow">LSA</a>, or Latent Semantic Analysis (also sometimes called LSI, for Latent Semantic Indexing). LSA is a matrix-decomposition technique that derives latent variables which are not apparent from the words of the text itself.</p></li>
    </ul>
    <p>A common reference data set in machine learning consists of the frequencies of 50 or so of the most common words, aka "stop words" (e.g., <em>a</em>, <em>an</em>, <em>of</em>, <em>and</em>, <em>the</em>, <em>there</em>, <em>if</em>), in published works of Shakespeare, London, Austen, and Milton. A basic multi-layer perceptron with a single hidden layer can separate this data set with 100% accuracy. This data set and variations on it are widely available in ML data repositories, and <a href="http://escholarship.org/uc/item/9w56v25f.pdf" rel="nofollow">academic papers</a> presenting classification results are likewise common.</p>
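<p>To make the first three encodings concrete, here is a minimal Python sketch using only the standard library. The two toy documents, the six-word vocabulary, and the helper names (<code>tokenize</code>, <code>boolean_features</code>, <code>frequency_features</code>, <code>bigram_feature</code>) are assumptions made up for illustration; they are not taken from the reference data set described above. Each document becomes one row of the data matrix.</p>

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def boolean_features(tokens, vocabulary):
    """One column per vocabulary word: 1 if it occurs in the document, else 0."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]


def frequency_features(tokens, vocabulary):
    """One column per vocabulary word: its relative frequency in the document."""
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero on an empty document
    return [counts[word] / total for word in vocabulary]


def bigram_feature(tokens, first, second):
    """1 if `first` is immediately followed by `second` somewhere, else 0."""
    return int((first, second) in set(zip(tokens, tokens[1:])))


# Hypothetical toy inputs, for illustration only.
docs = [
    "The alternative lifestyle of the poet",
    "There is an alternative if the lifestyle of a poet fails",
]
vocabulary = ["a", "an", "of", "the", "there", "if"]

# Build the 2D data matrix: one row per document, one column per feature.
rows = []
for doc in docs:
    tokens = tokenize(doc)
    rows.append(
        boolean_features(tokens, vocabulary)
        + frequency_features(tokens, vocabulary)
        + [bigram_feature(tokens, "alternative", "lifestyle")]
    )

for row in rows:
    print(row)
```

<p>Each row holds six Boolean columns, six frequency columns, and one adjacency column. In practice you would probably pick one encoding rather than concatenate all three; they are combined here only to show them side by side.</p>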