<p><strong>N-gram Language Models</strong></p>
<p>You could try training one <strong><a href="http://en.wikipedia.org/wiki/N-gram" rel="nofollow noreferrer">n-gram language model</a></strong> on the autogenerated spam pages and another on a collection of other, non-spam webpages.</p>
<p>You could then simply score new pages with both language models to see whether the text looks more like the spam webpages or regular web content.</p>
<p><strong>Better Scoring through Bayes Law</strong></p>
<p>When you score a text with the spam language model, you get an estimate of the probability of finding that text on a spam web page, <code>P(Text|Spam)</code>. The notation is read as the probability of <code>Text</code> given <code>Spam</code> (page). The score from the non-spam language model is an estimate of the probability of finding the text on a non-spam web page, <code>P(Text|Non-Spam)</code>.</p>
<p>However, the term you probably really want is <code>P(Spam|Text)</code> or, equivalently, <code>P(Non-Spam|Text)</code>. That is, you want to know <strong>the probability that a page is <code>Spam</code> or <code>Non-Spam</code> given the text that appears on it</strong>.</p>
<p>To get either of these, you'll need to use <a href="http://en.wikipedia.org/wiki/Bayes%27_theorem" rel="nofollow noreferrer"><strong>Bayes Law</strong></a>, which states</p>
<pre><code>         P(B|A)P(A)
P(A|B) = ----------
            P(B)
</code></pre>
<p>Using Bayes Law, we have</p>
<pre><code>P(Spam|Text) = P(Text|Spam)P(Spam)/P(Text)
</code></pre>
<p>and</p>
<pre><code>P(Non-Spam|Text) = P(Text|Non-Spam)P(Non-Spam)/P(Text)
</code></pre>
<p><code>P(Spam)</code> is your <strong>prior belief</strong> that a page selected at random from the web is a spam page. You can estimate this quantity by counting how many spam web pages there are in some sample, or you can even use it as a parameter that you manually <strong>tune to trade off <a href="http://en.wikipedia.org/wiki/Precision_and_recall" rel="nofollow noreferrer">precision and recall</a></strong>. For example, giving this parameter a high value will result in fewer spam pages being mistakenly classified as non-spam, while giving it a low value will result in fewer non-spam pages being accidentally classified as spam.</p>
<p>The term <code>P(Text)</code> is the overall probability of finding <code>Text</code> on any webpage. If we ignore that <code>P(Text|Spam)</code> and <code>P(Text|Non-Spam)</code> were determined using different models, it can be calculated as <code>P(Text) = P(Text|Spam)P(Spam) + P(Text|Non-Spam)P(Non-Spam)</code>. This sums out the binary variable <code>Spam</code>/<code>Non-Spam</code>.</p>
<p><strong>Classification Only</strong></p>
<p>However, if you're not going to use the probabilities for anything else, you don't need to calculate <code>P(Text)</code>. Rather, you can just compare the numerators <code>P(Text|Spam)P(Spam)</code> and <code>P(Text|Non-Spam)P(Non-Spam)</code>. If the first one is bigger, the page is most likely a spam page, while if the second one is bigger the page is most likely non-spam. This works because the equations above for both <code>P(Spam|Text)</code> and <code>P(Non-Spam|Text)</code> are normalized by the <strong>same</strong> <code>P(Text)</code> value.</p>
<p><strong>Tools</strong></p>
<p>In terms of software toolkits you could use for something like this, <a href="http://www-speech.sri.com/projects/srilm/download.html" rel="nofollow noreferrer">SRILM</a> would be a good place to start, and it's free for non-commercial use. If you want to use something commercially and you don't want to pay for a license, you could use <a href="http://sourceforge.net/projects/irstlm/" rel="nofollow noreferrer">IRSTLM</a>, which is distributed under the LGPL.</p>
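To make the two-model idea concrete, here is a minimal toy sketch (not SRILM or IRSTLM) of a word-bigram language model with add-one smoothing. The class name, the tiny corpora, and the tokenization are all illustrative assumptions; a real system would train on crawled pages and use a proper smoothing scheme such as Kneser-Ney.

```python
import math
from collections import Counter

class BigramLM:
    """Toy word-bigram language model with add-one (Laplace) smoothing.
    Illustrative only; real toolkits use far better smoothing."""

    def __init__(self, documents):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for doc in documents:
            # Pad each document with sentence-boundary markers.
            tokens = ["<s>"] + doc.lower().split() + ["</s>"]
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.vocab_size = len(self.unigrams)

    def log_prob(self, text):
        """Return log P(text | this model), in nats."""
        tokens = ["<s>"] + text.lower().split() + ["</s>"]
        logp = 0.0
        for prev, word in zip(tokens, tokens[1:]):
            # Add-one smoothing so unseen bigrams don't get probability zero.
            num = self.bigrams[(prev, word)] + 1
            den = self.unigrams[prev] + self.vocab_size
            logp += math.log(num / den)
        return logp

# Hypothetical miniature corpora standing in for real crawled pages.
spam_docs = ["buy cheap pills online now", "cheap pills buy now cheap"]
ham_docs = ["the weather was nice this weekend",
            "we discussed the results of the study"]

spam_lm = BigramLM(spam_docs)
ham_lm = BigramLM(ham_docs)

page = "buy cheap pills now"
print(spam_lm.log_prob(page))  # log P(Text|Spam)
print(ham_lm.log_prob(page))   # log P(Text|Non-Spam)
```

Working in log probabilities avoids numerical underflow, since the raw probabilities of longer texts become vanishingly small.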
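The classification-only shortcut above can be sketched in a few lines: compare the two Bayes-rule numerators in log space, so `P(Text)` never has to be computed. The function name, the log-likelihood values, and the prior of 0.2 are illustrative assumptions, not outputs of any particular model.

```python
import math

def classify(log_p_text_given_spam, log_p_text_given_ham, p_spam=0.2):
    """Compare the Bayes-rule numerators in log space:
    log P(Text|Spam) + log P(Spam)  vs  log P(Text|Non-Spam) + log P(Non-Spam).
    The shared denominator P(Text) cancels out of the comparison."""
    spam_score = log_p_text_given_spam + math.log(p_spam)
    ham_score = log_p_text_given_ham + math.log(1.0 - p_spam)
    return "spam" if spam_score > ham_score else "non-spam"

# Illustrative scores; in practice these come from the two language models.
print(classify(-7.9, -13.0, p_spam=0.2))   # likelihood strongly favors spam
print(classify(-10.0, -9.5, p_spam=0.05))  # a low prior tips a close call to non-spam
```

The second call shows the precision/recall trade-off described above: lowering `p_spam` makes the classifier more reluctant to call a borderline page spam.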