How do I calculate the value of information gain in order to reduce the floating-point approximation errors?
    <p>I have a dataset containing some features that belong to two class labels denoted by <em>1</em> and <em>2</em>. This dataset is processed in order to build a decision tree: during the construction of the tree, I need to calculate the information gain to find the best partitioning of the dataset.</p>
    <p>Let there be <em>N1</em> features associated with label <em>1</em>, and <em>N2</em> features associated with label <em>2</em>; then the <strong>entropy</strong> can be calculated with the following formula:</p>
    <p><code>Entropy = - (N1/N)*log2(N1/N) - (N2/N)*log2(N2/N)</code>, where <em>N = N1 + N2</em></p>
    <p>I need to calculate three values of entropy in order to obtain the information gain:</p>
    <ul>
    <li><code>entropyBefore</code>, that is the entropy before the partitioning of the current dataset;</li>
    <li><code>entropyLeft</code>, that is the entropy of the left split after the partitioning;</li>
    <li><code>entropyRight</code>, that is the entropy of the right split after the partitioning.</li>
    </ul>
    <p>So, the information gain is equal to <code>entropyBefore - (S1/N)*entropyLeft - (S2/N)*entropyRight</code>, where <em>S1</em> is the number of features in split 1 (the left split), <em>S2</em> is the number of features in split 2 (the right split), and <em>N = S1 + S2</em>.</p>
    <p>How do I calculate the value of information gain in order to reduce the floating-point approximation errors? When I apply the above formulas in cases where the information gain must be zero, the calculated value is instead a very small negative number.</p>
    <p><strong>UPDATE</strong> (sample code)</p>
    <pre><code>double N = static_cast&lt;double&gt;(this-&gt;rows()); // row count of the dataset
double entropyBefore = this-&gt;entropy();       // current entropy (before performing the split)
bool firstCheck = true;
double bestSplitIg;

for each possible split {
    // ...
    pair&lt;Dataset,Dataset&gt; splitPair = split(...,...);
    double S1 = splitPair.first.rows();
    double S2 = splitPair.second.rows();
    double entropyLeft = splitPair.first.entropy();
    double entropyRight = splitPair.second.entropy();
    double splitIg = entropyBefore - (S1/N*entropyLeft + S2/N*entropyRight);
    if (firstCheck || splitIg &gt; bestSplitIg) {
        bestSplitIg = splitIg;
        // ...
        firstCheck = false;
    }
}
</code></pre>
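    <p>One possible way to apply the formulas above is sketched here, under stated assumptions: the per-class counts of each split are available as integers, and the function names <code>entropyFromCounts</code> and <code>informationGain</code> are hypothetical (they are not part of the code in the question). Pure or empty splits return exactly zero without calling <code>log2</code>, and the final result is clamped at zero, because information gain cannot be negative in exact arithmetic. Clamping is only one option; it hides rounding noise rather than removing it.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helper: two-class entropy computed from integer class counts.
// Empty and pure sets return exactly 0.0, so log2 is never evaluated at 0.
double entropyFromCounts(std::uint64_t n1, std::uint64_t n2) {
    const std::uint64_t n = n1 + n2;
    if (n == 0 || n1 == 0 || n2 == 0) {
        return 0.0;
    }
    const double p1 = static_cast<double>(n1) / static_cast<double>(n);
    const double p2 = static_cast<double>(n2) / static_cast<double>(n);
    return -(p1 * std::log2(p1) + p2 * std::log2(p2));
}

// Hypothetical helper: information gain of a binary split.
// left1/left2 are the class-1/class-2 counts in the left split,
// right1/right2 are the class-1/class-2 counts in the right split.
double informationGain(std::uint64_t left1, std::uint64_t left2,
                       std::uint64_t right1, std::uint64_t right2) {
    const std::uint64_t s1 = left1 + left2;    // size of the left split
    const std::uint64_t s2 = right1 + right2;  // size of the right split
    const std::uint64_t n = s1 + s2;           // size of the unsplit dataset
    assert(n > 0);

    const double entropyBefore = entropyFromCounts(left1 + right1, left2 + right2);
    const double weightedAfter =
        (static_cast<double>(s1) / static_cast<double>(n)) * entropyFromCounts(left1, left2) +
        (static_cast<double>(s2) / static_cast<double>(n)) * entropyFromCounts(right1, right2);

    // In exact arithmetic the gain is never negative, so any negative value
    // here is rounding noise; clamp it to zero.
    return std::max(0.0, entropyBefore - weightedAfter);
}
```

    <p>With this sketch, a split that leaves the class proportions unchanged, for example <code>informationGain(1, 2, 2, 4)</code>, yields a value that is zero or a tiny positive number, never a negative one. If splits are only compared against each other, comparing with a small tolerance instead of clamping would be an alternative design.</p>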