There are multiple errors and bad assumptions in your program, in both parts of it. Here are several.

1. You hardcode into your program the fact that you have the same number of spam and non-spam emails. I'd recommend not hardcoding this assumption. It is not absolutely essential here, but in the more general case you will need to remove it.
2. You hardcode into your program a number that you treat as the vocabulary size. I'd recommend against this, as the number changes with any modification of the training set; moreover, the value you use is actually incorrect. Calculate it during learning instead.
3. This may not be a mistake, but you seem to build a vocabulary of *all* the words in the training set. That may be suboptimal; the page you refer to recommends taking into account only the top 2500 words across all the emails (the last snippet below shows one way to do the filtering). This is not essential for obtaining correct results, though: even without the filtering, my implementation misclassifies only a few emails.
4. You incorrectly account for words that have been observed in spam only or in non-spam only. The log-probability of such a word being found in the other subset is not the `1` you add, but `log(1/(spamSize+vocabSize))` or `log(1/(nonspamSize+vocabSize))`, depending on its group. This is very important: you need to store this probability with your data for the program to work correctly (the first sketch below illustrates it).
5. You do not ignore words that were never observed in the training set. These may be treated in different ways, but you should take them into account.
6. Due to incorrect indentation in the prediction function, you predict using not the whole message but only its first line. Just a programming bug.

**Update.** You have fixed 6. Also, 1 is not strictly necessary to fix while you're working with this dataset, and 3 is not required either.
Your modification did not correctly fix either 4 or 5. First, if a word has never been observed in some set, the probability of the message under that set should decrease. Ignoring the word is not a good idea; you need to account for it as a highly improbable one.
Second, your current code is asymmetric: a word being absent from spam cancels the check for non-spam (but not the other way around). If you need to do nothing in an exception handler, use `pass`, not `continue`, as the latter immediately jumps to the next `for w in words:` iteration (the second snippet below shows the difference).
Problem 2 is also still in place: the vocabulary size you use does not match the real one. It must be the number of *distinct* words observed in the training set, not the total number of words in all the messages combined.
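To make points 1, 2, 4 and 5 concrete, here is a minimal sketch of the learning and scoring logic. It is not your code: every name in it (`train`, `predict`, the input format of pre-tokenized messages) is an assumption made for the illustration.

```python
import math
from collections import Counter

def train(spam_msgs, ham_msgs):
    # spam_msgs / ham_msgs: lists of already-tokenized messages (lists of
    # words). This input format is an assumption made for the sketch.
    spam_counts = Counter(w for m in spam_msgs for w in m)
    ham_counts = Counter(w for m in ham_msgs for w in m)
    return {
        "spam_counts": spam_counts,
        "ham_counts": ham_counts,
        "spam_total": sum(spam_counts.values()),
        "ham_total": sum(ham_counts.values()),
        # Point 2: the vocabulary is the set of *distinct* words seen in
        # training, computed from the data rather than hardcoded.
        "vocab": set(spam_counts) | set(ham_counts),
        # Point 1: class priors come from the data, not an equal-size assumption.
        "n_spam": len(spam_msgs),
        "n_ham": len(ham_msgs),
    }

def word_log_prob(word, counts, total, vocab_size):
    # Laplace smoothing. A word unseen in this class contributes
    # log(1 / (total + vocab_size)) -- point 4 -- never a raw "+1".
    return math.log((counts.get(word, 0) + 1) / (total + vocab_size))

def predict(words, model):
    v = len(model["vocab"])
    n = model["n_spam"] + model["n_ham"]
    spam = math.log(model["n_spam"] / n)
    ham = math.log(model["n_ham"] / n)
    for w in words:
        if w not in model["vocab"]:
            continue  # point 5: one defensible treatment of never-seen words
        spam += word_log_prob(w, model["spam_counts"], model["spam_total"], v)
        ham += word_log_prob(w, model["ham_counts"], model["ham_total"], v)
    return "spam" if spam > ham else "non-spam"
```

Note that a word which is in the vocabulary but absent from one class still adds `log(1/(total + vocab_size))` to that class's score, which is exactly the "highly improbable" treatment described in the update.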
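To illustrate the `pass` versus `continue` remark in the update: in the handler below (the tables and values are fabricated for the example), `pass` falls through to the non-spam lookup, while `continue` would skip it.

```python
spam_log_probs = {"viagra": -2.3}      # made-up numbers, illustration only
nonspam_log_probs = {"meeting": -1.9}
spam_score = nonspam_score = 0.0

for w in ["viagra", "meeting"]:
    try:
        spam_score += spam_log_probs[w]
    except KeyError:
        pass  # do nothing and fall through to the non-spam lookup below
    try:
        nonspam_score += nonspam_log_probs[w]
    except KeyError:
        pass
# With `continue` in the first handler, "meeting" would never reach the
# non-spam lookup at all -- the asymmetry described in the update.
```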
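Finally, for the top-2500 filtering mentioned in point 3, a small helper over the two word-count tables could look like the following (the function name is made up; the `Counter` inputs match the first sketch):

```python
from collections import Counter

def top_k_vocab(spam_counts: Counter, ham_counts: Counter, k: int = 2500) -> set:
    # Keep only the k most frequent words across both classes.
    combined = spam_counts + ham_counts   # Counter supports elementwise "+"
    return {w for w, _ in combined.most_common(k)}
```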