StackOverflow2013

Note that there are some explanatory texts on larger screens.

plurals

POProblem with Precision floating point operation in C
text
Body
copied!<p>For one of my course project I started implementing "Naive Bayesian classifier" in C. My project is to implement a document classifier application (especially Spam) using huge training data.</p> <p>Now I have problem implementing the algorithm because of the limitations in the C's datatype. </p> <p>( Algorithm I am using is given here, <a href="http://en.wikipedia.org/wiki/Bayesian_spam_filtering" rel="nofollow noreferrer">http://en.wikipedia.org/wiki/Bayesian_spam_filtering</a> )</p> <p>PROBLEM STATEMENT: The algorithm involves taking each word in a document and calculating probability of it being spam word. If p1, p2 p3 .... pn are probabilities of word-1, 2, 3 ... n. The probability of doc being spam or not is calculated using </p> <p><img src="https://upload.wikimedia.org/math/f/1/d/f1d1c65ee72c294f1fc9b4eb156f5768.png" alt="alt text"></p> <p>Here, probability value can be very easily around 0.01. So even if I use datatype "double" my calculation will go for a toss. To confirm this I wrote a sample code given below.</p> <pre><code>#define PROBABILITY_OF_UNLIKELY_SPAM_WORD (0.01) #define PROBABILITY_OF_MOSTLY_SPAM_WORD (0.99) int main() { int index; long double numerator = 1.0; long double denom1 = 1.0, denom2 = 1.0; long double doc_spam_prob; /* Simulating FEW unlikely spam words */ for(index = 0; index < 162; index++) { numerator = numerator*(long double)PROBABILITY_OF_UNLIKELY_SPAM_WORD; denom2 = denom2*(long double)PROBABILITY_OF_UNLIKELY_SPAM_WORD; denom1 = denom1*(long double)(1 - PROBABILITY_OF_UNLIKELY_SPAM_WORD); } /* Simulating lot of mostly definite spam words */ for (index = 0; index < 1000; index++) { numerator = numerator*(long double)PROBABILITY_OF_MOSTLY_SPAM_WORD; denom2 = denom2*(long double)PROBABILITY_OF_MOSTLY_SPAM_WORD; denom1 = denom1*(long double)(1- PROBABILITY_OF_MOSTLY_SPAM_WORD); } doc_spam_prob= (numerator/(denom1+denom2)); return 0; } </code></pre> <p>I tried Float, double and even long double datatypes but still same problem. </p> <p>Hence, say in a 100K words document I am analyzing, if just 162 words are having 1% spam probability and remaining 99838 are conspicuously spam words, then still my app will say it as Not Spam doc because of Precision error (as numerator easily goes to ZERO)!!!.</p> <p>This is the first time I am hitting such issue. So how exactly should this problem be tackled?</p>

Querying!

Guidance

An individual column

Larger individual text columns get their own page to allow for proper reading.

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload