Decision trees are classifiers for instances represented as feature vectors. Each internal node tests the value of a feature, with one branch for each possible value of that feature, and each leaf specifies a category.
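To make this structure concrete, here is a minimal sketch in Python of a tree represented as nested dictionaries, together with a function that classifies an instance by walking from the root to a leaf. The attributes and values (Outlook, Humidity, Wind, etc.) are hypothetical, chosen only for illustration.

```python
# A tiny decision tree as nested dicts: each internal node maps an attribute
# name to one subtree per attribute value; leaves are category labels.
# (Hypothetical attributes and values, for illustration only.)
tree = {
    "Outlook": {
        "Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}},
    }
}

def classify(node, instance):
    """Walk the tree: test one feature per node, follow the branch for its value."""
    while isinstance(node, dict):
        attribute, branches = next(iter(node.items()))
        node = branches[instance[attribute]]
    return node  # a leaf: the predicted category

print(classify(tree, {"Outlook": "Sunny", "Humidity": "Normal", "Wind": "Weak"}))  # "Yes"
```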
The central choice in building a decision tree is which attribute to test at each node. We would like to select the attribute that is most useful for classifying the examples. A good quantitative measure of an attribute's worth is information gain, which measures how well a given attribute separates the training examples according to their target classification.
Information gain is defined in terms of entropy, which characterizes the (im)purity of an arbitrary collection of examples. Given a collection S containing positive and negative examples of some target concept, the entropy of S relative to this boolean classification is
$$Entropy(S)\equiv -p_\oplus \log_2 p_\oplus-p_\ominus\log_2 p_\ominus$$
where \(p_\oplus\) is the proportion of positive examples in S and \(p_\ominus\) is the proportion of negative examples in S.
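As a minimal sketch of this definition (assuming the collection is given simply as a list of boolean labels, with True marking a positive example), entropy can be computed as:

```python
import math

def entropy(labels):
    """Entropy of a collection of boolean labels (True = positive example)."""
    if not labels:
        return 0.0
    p_pos = sum(labels) / len(labels)
    p_neg = 1.0 - p_pos
    # By convention, 0 * log2(0) is treated as 0, so skip zero proportions.
    return sum(-p * math.log2(p) for p in (p_pos, p_neg) if p > 0)

# Example: 9 positive and 5 negative examples -> entropy is about 0.940
print(entropy([True] * 9 + [False] * 5))
```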
Information gain is the expected reduction in entropy caused by partitioning the examples according to an attribute. The gain Gain(S, A) of an attribute A, relative to a collection of examples S, is defined as
$$Gain(S, A) \equiv Entropy(S) - \sum_{v\in Values(A)}\frac{|S_v|}{|S|}Entropy(S_v)$$
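Continuing the sketch above (reusing the `entropy` helper and assuming each example is a pair of a feature dictionary and a boolean label; the attribute name "Wind" in the usage example is hypothetical), Gain(S, A) can be computed by partitioning S on the values of A and subtracting the weighted entropies of the subsets:

```python
from collections import defaultdict

def information_gain(examples, attribute):
    """Gain(S, A): entropy(S) minus the weighted sum of entropies of each
    subset S_v, where examples are (feature_dict, boolean_label) pairs."""
    labels = [label for _, label in examples]
    # Partition S into subsets S_v, one per value v of the attribute.
    subsets = defaultdict(list)
    for features, label in examples:
        subsets[features[attribute]].append(label)
    remainder = sum(
        len(subset) / len(examples) * entropy(subset)
        for subset in subsets.values()
    )
    return entropy(labels) - remainder

# Toy usage: 2 positive and 2 negative examples split on a hypothetical attribute.
S = [
    ({"Wind": "Weak"}, True),
    ({"Wind": "Weak"}, True),
    ({"Wind": "Strong"}, False),
    ({"Wind": "Weak"}, False),
]
print(information_gain(S, "Wind"))  # about 0.311
```

The attribute with the highest information gain is the one chosen to test at the current node; the same computation is then repeated on each subset S_v when growing the subtrees.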