I assume entropy was mentioned in the context of building [**decision trees**](https://en.wikipedia.org/wiki/Decision_tree_learning).

To illustrate, imagine the task of [learning](https://en.wikipedia.org/wiki/Supervised_learning) to [classify](https://en.wikipedia.org/wiki/Statistical_classification) first-names into male/female groups. That is, given a list of names each labeled with either `m` or `f`, we want to learn a [model](https://en.wikipedia.org/wiki/Predictive_modelling) that fits the data and can be used to predict the gender of a new, unseen first-name.

```
name       gender
-----------------        Now we want to predict the
Ashley        f          gender of "Amro" (my name)
Brian         m
Caroline      f
David         m
```

The first step is [deciding](https://en.wikipedia.org/wiki/Feature_selection) what [**features**](https://en.wikipedia.org/wiki/Feature_%28machine_learning%29) of the data are relevant to the target class we want to predict. Some example features include: first/last letter, length, number of vowels, whether it ends with a vowel, etc. So after feature extraction, our data looks like:

```
# name     ends-vowel  num-vowels  length  gender
# ------------------------------------------------
  Ashley       1           3          6       f
  Brian        0           2          5       m
  Caroline     1           4          8       f
  David        0           2          5       m
```

The goal is to build a [decision tree](https://en.wikipedia.org/wiki/Decision_tree). An example of a [tree](https://en.wikipedia.org/wiki/Tree_%28data_structure%29) would be:

```
length<7
|   num-vowels<3: male
|   num-vowels>=3
|   |   ends-vowel=1: female
|   |   ends-vowel=0: male
length>=7
|   length=5: male
```

Basically, each node represents a test performed on a single attribute, and we go left or right depending on the result of the test. We keep traversing the tree until we reach a leaf node, which contains the class prediction (`m` or `f`).

So if we run the name *Amro* down this tree, we start by testing "*is the length<7?*" and the answer is *yes*, so we go down that branch. Following the branch, the next test "*is the number of vowels<3?*" again evaluates to *true*. This leads to a leaf node labeled `m`, and thus the prediction is *male* (which I happen to be, so the tree predicted the outcome [correctly](https://en.wikipedia.org/wiki/Evaluation_of_binary_classifiers)).
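
To make the traversal concrete, here is a small Python sketch (my own illustration, not part of the original answer; `extract_features` and `predict` are made-up helper names). It computes the three example features and hard-codes the example tree above as nested `if`/`else` tests. Note that `'y'` is counted as a vowel so the feature values match the table above, and the `length>=7` branch is collapsed into a single `male` leaf:

```python
def extract_features(name):
    """Compute the example features: ends-vowel, num-vowels, length."""
    vowels = set("aeiouy")          # counting 'y' so the numbers match the table above
    name = name.lower()
    return {
        "ends-vowel": 1 if name[-1] in vowels else 0,
        "num-vowels": sum(1 for ch in name if ch in vowels),
        "length": len(name),
    }

def predict(features):
    """Hard-coded version of the example decision tree above."""
    if features["length"] < 7:
        if features["num-vowels"] < 3:
            return "m"
        return "f" if features["ends-vowel"] == 1 else "m"
    return "m"                      # the length>=7 branch, simplified to a single leaf

print(predict(extract_features("Amro")))   # -> 'm', matching the walk-through above
```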

The decision tree is [built in a top-down fashion](https://en.wikipedia.org/wiki/ID3_algorithm), but the question is: how do you choose which attribute to split on at each node? The answer is to find the feature that best splits the target class into the purest possible children nodes (i.e. nodes that don't contain a mix of both male and female, but rather pure nodes with only one class).

This measure of *purity* is called the [**information**](https://en.wikipedia.org/wiki/Information_theory). It represents the [expected](https://en.wikipedia.org/wiki/Expected_value) amount of [information](https://en.wikipedia.org/wiki/Self-information) that would be needed to specify whether a new instance (first-name) should be classified male or female, given the examples that reached the node. We calculate it based on the number of male and female classes at the node.

[**Entropy**](https://en.wikipedia.org/wiki/Information_entropy), on the other hand, is a measure of *impurity* (the opposite). It is defined for a [binary class](https://en.wikipedia.org/wiki/Binary_classification) with values `a`/`b` as:

```
Entropy = - p(a)*log(p(a)) - p(b)*log(p(b))
```

This [binary entropy function](https://en.wikipedia.org/wiki/Binary_entropy_function) is depicted in the figure below (the random variable can take one of two values). It reaches its maximum when the probability is `p=1/2`, meaning that `p(X=a)=0.5` or equivalently `p(X=b)=0.5`, i.e. a 50%/50% chance of being either `a` or `b` (uncertainty is at a maximum). The entropy function reaches its minimum of zero when the probability is `p=1` or `p=0` with complete certainty (`p(X=a)=1` or `p(X=a)=0` respectively; the latter implies `p(X=b)=1`).

![binary entropy plot](https://i.stack.imgur.com/OUgcx.png)

Of course the definition of entropy can be generalized for a discrete random variable X with N outcomes (not just two):

![entropy](https://i.stack.imgur.com/vIFD7.png)

*(the `log` in the formula is usually taken as the [logarithm to the base 2](https://en.wikipedia.org/wiki/Binary_logarithm))*
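
As a quick sketch of the formula above (my own illustration, not part of the original answer), the entropy of a class distribution can be computed directly from the class counts:

```python
from math import log2

def entropy(counts):
    """Shannon entropy (in bits) of a class distribution given as raw counts."""
    total = sum(counts)
    # skip zero counts, since the 0*log(0) term is taken as 0
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

print(entropy([1, 1]))   # 1.0 bit  -- the 50/50 maximum of the binary case
print(entropy([1, 0]))   # 0.0 bits -- complete certainty
```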

---

Back to our task of name classification, let's look at an example. Imagine that at some point during the process of constructing the tree, we were considering the following split:

```
     ends-vowel
      [9m,5f]           <--- the [..,..] notation represents the class
    /          \             distribution of instances that reached a node
   =1          =0
 -------     -------
 [3m,4f]     [6m,1f]
```

As you can see, before the split we had 9 males and 5 females, i.e. `P(m)=9/14` and `P(f)=5/14`. According to the definition of entropy:

```
Entropy_before = - (5/14)*log2(5/14) - (9/14)*log2(9/14) = 0.9403
```

Next we compare it with the entropy computed after considering the split, by looking at the two child branches. In the left branch of `ends-vowel=1`, we have:

```
Entropy_left = - (3/7)*log2(3/7) - (4/7)*log2(4/7) = 0.9852
```

and in the right branch of `ends-vowel=0`, we have:

```
Entropy_right = - (6/7)*log2(6/7) - (1/7)*log2(1/7) = 0.5917
```

We combine the left/right entropies using the number of instances down each branch as a [weighting factor](https://en.wikipedia.org/wiki/Weighted_arithmetic_mean) (7 instances went left, and 7 instances went right), and get the final entropy after the split:

```
Entropy_after = 7/14*Entropy_left + 7/14*Entropy_right = 0.7885
```

Now by comparing the entropy before and after the split, we obtain a measure of [**information gain**](https://en.wikipedia.org/wiki/Information_gain_in_decision_trees), or how much information we gained by doing the split using that particular feature:

```
Information_Gain = Entropy_before - Entropy_after = 0.1518
```

*You can interpret the above calculation as follows: by doing the split with the `ends-vowel` feature, we were able to reduce uncertainty in the sub-tree prediction outcome by a small amount of 0.1518 (measured in [bits](https://en.wikipedia.org/wiki/Bit) as [units of information](https://en.wikipedia.org/wiki/Units_of_information)).* (A short sketch that reproduces these numbers is included at the end of this post.)

At each node of the tree, this calculation is performed for every feature, and the feature with the *largest information gain* is chosen for the split in a [greedy](https://en.wikipedia.org/wiki/Greedy_algorithm) manner (thus favoring features that produce *pure* splits with low uncertainty/entropy). This process is applied recursively from the root node down, and stops when a leaf node contains instances all having the same class (there is no need to split it further).

Note that I skipped over some [details](https://en.wikipedia.org/wiki/C4.5_algorithm) which are beyond the scope of this post, including how to handle [numeric features](https://en.wikipedia.org/wiki/Discretization_of_continuous_features), [missing values](https://en.wikipedia.org/wiki/Missing_data), [overfitting](https://en.wikipedia.org/wiki/Overfitting) and [pruning](https://en.wikipedia.org/wiki/Pruning_%28decision_trees%29) trees, etc.
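
Finally, as promised above, here is a short sketch (again my own illustration, re-stating the same kind of `entropy` helper) that reproduces the numbers from the worked `ends-vowel` example:

```python
from math import log2

def entropy(counts):
    """Shannon entropy (in bits) of a class distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

entropy_before = entropy([9, 5])   # parent node [9m,5f]           -> 0.9403
entropy_left   = entropy([3, 4])   # ends-vowel=1 branch [3m,4f]   -> 0.9852
entropy_right  = entropy([6, 1])   # ends-vowel=0 branch [6m,1f]   -> 0.5917

# weight each branch by the fraction of instances that reached it (7 of 14 each)
entropy_after = (7/14) * entropy_left + (7/14) * entropy_right     # -> 0.7885

information_gain = entropy_before - entropy_after                  # -> 0.1518
print(round(information_gain, 4))
```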