Classifying a dataset with a decision tree using the ID3 algorithm

In this tutorial, we will create a decision tree to classify data. But before we dive into how to do that, let us get some of the fundamentals out of the way.

What are graphs and trees?

A graph is written as an ordered pair \(G = (V, E)\), where \(V\) is a set of vertices (or nodes) and \(E\) is a set of edges. Each edge, \(e_i\), in the set of edges, \(E\), connects two vertices from \(V\).

A graph with 6 vertices and 7 edges

A tree is an undirected graph in which any two vertices are connected by exactly one path. In other words, a connected, acyclic undirected graph is a tree.

A tree with 6 vertices and 5 edges

A tree with \(n\) vertices always has \({n-1}\) edges.
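These properties are easy to check in code. The sketch below builds a hypothetical 6-vertex tree as an adjacency list (the vertex labels are made up for illustration) and verifies that it has \(n-1 = 5\) edges and that it is connected –

```python
from collections import deque

# A hypothetical 6-vertex tree, stored as an adjacency list
# (vertex -> list of neighbouring vertices).
tree = {
    1: [2, 3],
    2: [1, 4, 5],
    3: [1],
    4: [2],
    5: [2, 6],
    6: [5],
}

num_vertices = len(tree)
# Each edge appears twice in an adjacency list (once per endpoint).
num_edges = sum(len(neighbours) for neighbours in tree.values()) // 2
assert num_edges == num_vertices - 1  # a tree with n vertices has n-1 edges

# Connectivity check: breadth-first search from an arbitrary vertex.
seen, queue = {1}, deque([1])
while queue:
    for neighbour in tree[queue.popleft()]:
        if neighbour not in seen:
            seen.add(neighbour)
            queue.append(neighbour)
assert len(seen) == num_vertices  # every vertex is reachable
```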

What is a decision tree?

A decision tree is a flowchart-like structure in which each internal node is a decision node, and each leaf is a class. As we traverse the tree from the root, we make a decision based on some parameter in each internal node and based on that decision, we move to the next node. When we arrive at a leaf node, we have successfully traversed the classification path and are ready to classify our query.

For example, consider the following dataset –

Day Outlook Temperature Humidity Wind PlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
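To follow along in code, the table above can be encoded as a plain Python list of dictionaries, one per row (the Day column is dropped, since it is just an identifier) –

```python
# The PlayTennis dataset from the table above.
dataset = [
    {"Outlook": "Sunny",    "Temperature": "Hot",  "Humidity": "High",   "Wind": "Weak",   "PlayTennis": "No"},
    {"Outlook": "Sunny",    "Temperature": "Hot",  "Humidity": "High",   "Wind": "Strong", "PlayTennis": "No"},
    {"Outlook": "Overcast", "Temperature": "Hot",  "Humidity": "High",   "Wind": "Weak",   "PlayTennis": "Yes"},
    {"Outlook": "Rain",     "Temperature": "Mild", "Humidity": "High",   "Wind": "Weak",   "PlayTennis": "Yes"},
    {"Outlook": "Rain",     "Temperature": "Cool", "Humidity": "Normal", "Wind": "Weak",   "PlayTennis": "Yes"},
    {"Outlook": "Rain",     "Temperature": "Cool", "Humidity": "Normal", "Wind": "Strong", "PlayTennis": "No"},
    {"Outlook": "Overcast", "Temperature": "Cool", "Humidity": "Normal", "Wind": "Strong", "PlayTennis": "Yes"},
    {"Outlook": "Sunny",    "Temperature": "Mild", "Humidity": "High",   "Wind": "Weak",   "PlayTennis": "No"},
    {"Outlook": "Sunny",    "Temperature": "Cool", "Humidity": "Normal", "Wind": "Weak",   "PlayTennis": "Yes"},
    {"Outlook": "Rain",     "Temperature": "Mild", "Humidity": "Normal", "Wind": "Weak",   "PlayTennis": "Yes"},
    {"Outlook": "Sunny",    "Temperature": "Mild", "Humidity": "Normal", "Wind": "Strong", "PlayTennis": "Yes"},
    {"Outlook": "Overcast", "Temperature": "Mild", "Humidity": "High",   "Wind": "Strong", "PlayTennis": "Yes"},
    {"Outlook": "Overcast", "Temperature": "Hot",  "Humidity": "Normal", "Wind": "Weak",   "PlayTennis": "Yes"},
    {"Outlook": "Rain",     "Temperature": "Mild", "Humidity": "High",   "Wind": "Strong", "PlayTennis": "No"},
]
```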

Learning a Decision Tree

We start with the set of all data and try to partition them into different subsets based on an attribute. Once we have partitioned the set, we do the same again recursively on each partitioned subset until we have partitioned them into subsets such that either the partitioned subset contains elements of only one class or there is no way to make more partitions. This top-down partitioning method is an example of a greedy algorithm.

Now given any set, the question that we have to answer is – which attribute do we select for the partitioning? We want to select an attribute such that it splits the classes (YES and NO in the above dataset) with the minimum impurity (the amount of mixture of YES and NO classes in our case). This measure of impurity or randomness is given by entropy.

Entropy and Information Gain

Let’s get the formulas out of the way. Don’t be scared – they look scarier than they actually are.

Entropy is the measure of the impurity (chaos) in the data. It is given by –

\(H(S) = \sum\limits_{x {\in} X}p(x)\times\log_2{\frac{1}{p(x)}} \equiv -\sum\limits_{x {\in} X}p(x)\times\log_2{p(x)} \)


  • \(S\) is the set for which the entropy is being calculated
  • \(X\) is the set of all classes belonging to the set \(S\)
  • \(p(x)\) is the proportion of elements of class \(x\) among the total number of elements in \(S\)
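The entropy formula translates directly into a few lines of Python. Here is a minimal sketch (the function name `entropy` is my own choice) –

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(S) = -sum over classes x of p(x) * log2(p(x))."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

# Entropy of the full PlayTennis dataset: 9 YES and 5 NO instances.
print(entropy(["Yes"] * 9 + ["No"] * 5))  # ≈ 0.940286
```

Note that classes with \(p(x) = 0\) contribute nothing; `Counter` only yields classes that actually occur, so the function never evaluates \(\log_2 0\).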

Information Gain is the amount of change in entropy on splitting the dataset based on any attribute. It is the amount of gain in information (or reduction in entropy/impurity/chaos) after splitting the dataset. It is given by –

\(IG(T,a) = H(T)-H(T|a)\)

i.e., the information gain on the set \(T\) on splitting on attribute \(a\) is the entropy after the split on \(a\) subtracted from the entropy before the split. Phew!

Confused? Let’s work out an example

Refer to the dataset at the beginning of the tutorial. Let us work out the example for the attribute Wind. Before the split, there are 14 data instances in total – 9 have output YES and 5 have output NO. So, the entropy before the split is –

\(I_E([9,5]) = -\frac{9}{14}\times\log_2\frac{9}{14}-\frac{5}{14}\times\log_2\frac{5}{14} = 0.940286\)

Observe that the attribute Wind has two values – Strong and Weak. If we take \(Wind=Strong\), then we have 6 of the 14 instances of the original data, out of which 3 are YES and 3 are NO. So, the entropy after splitting on Wind, on the branch given by \(Wind=Strong\), is –

\(I_E([3,3]) = -\frac{3}{6}\times\log_2\frac{3}{6}-\frac{3}{6}\times\log_2\frac{3}{6} = -\frac{1}{2}\times\log_2\frac{1}{2}-\frac{1}{2}\times\log_2\frac{1}{2} = 1\)

Similarly, on \(Wind=Weak\) we have 8 instances, out of which 6 are YES and 2 are NO. So, the entropy is given by –

\(I_E([6,2]) = -\frac{6}{8}\times\log_2\frac{6}{8}-\frac{2}{8}\times\log_2\frac{2}{8} = -\frac{3}{4}\times\log_2\frac{3}{4}-\frac{1}{4}\times\log_2\frac{1}{4} = 0.8112781\)

To find the total entropy after the split, we take the weighted sum of the entropies of the two branches (\(Wind=Strong\) and \(Wind=Weak\)), each weighted by the fraction of instances that fall into it –

\(I_E([3,3],[6,2]) = I_E({split\: on\: Wind}) = \frac{6}{14}\times{1}+\frac{8}{14}\times{0.8112781} = 0.8921589\)

Finally, we are ready to find the information gain if we split the dataset on the attribute Wind. It is the difference between the entropy before and after the split –

\(IG(Wind) = I_E([9,5]) - I_E([3,3],[6,2]) = 0.940286 - 0.8921589 = 0.0481271\)
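The whole calculation above can be reproduced in a few lines of Python. This is a sketch using a small entropy helper; the branch counts are read directly off the dataset –

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(S) = -sum over classes x of p(x) * log2(p(x))."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

# Class labels on the two Wind branches, read off the dataset:
# Wind = Strong -> 3 YES, 3 NO;  Wind = Weak -> 6 YES, 2 NO.
strong = ["Yes"] * 3 + ["No"] * 3
weak = ["Yes"] * 6 + ["No"] * 2
before = ["Yes"] * 9 + ["No"] * 5

# Weighted sum of the branch entropies, weighted by branch size.
total = len(before)
after = (len(strong) / total) * entropy(strong) + (len(weak) / total) * entropy(weak)

gain = entropy(before) - after
print(round(gain, 6))  # ≈ 0.048127
```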

Now what?

We have found the information gain on the attribute Wind. Now we have to find the information gain on the other three attributes – Outlook, Temperature, and Humidity – and then split the dataset into subsets based on the attribute that gives us the highest information gain. We keep splitting the resulting subsets recursively using the same technique until we reach one of the following base cases –

  • Every element in the subset belongs to the same class
  • There are no more attributes to be selected, but the examples still do not belong to the same class
  • There are no examples in the subset


In summary, the ID3 algorithm proceeds as follows –

  1. Calculate the entropy of every attribute using the dataset \(S\)
  2. Split the set \(S\) into subsets using the attribute for which the resulting entropy (after splitting) is minimum (or, equivalently, the information gain is maximum)
  3. Make a decision tree node containing that attribute
  4. Recurse on each subset using the remaining attributes
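Putting everything together, here is a minimal sketch of the ID3 algorithm in Python. The function names (`entropy`, `information_gain`, `id3`) are my own choices, and the tree is represented as nested dictionaries mapping an attribute to its value branches –

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(S) = -sum over classes x of p(x) * log2(p(x))."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """IG(T, a) = H(T) - H(T|a): entropy before minus weighted entropy after the split."""
    before = entropy([row[target] for row in rows])
    after = 0.0
    for value in set(row[attribute] for row in rows):
        subset = [row[target] for row in rows if row[attribute] == value]
        after += (len(subset) / len(rows)) * entropy(subset)
    return before - after

def id3(rows, attributes, target):
    labels = [row[target] for row in rows]
    # Base case: every element in the subset belongs to the same class.
    if len(set(labels)) == 1:
        return labels[0]
    # Base case: no attributes left to split on -> return the majority class.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Greedy step: split on the attribute with the highest information gain.
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    # Branches are created only for values that actually occur in rows, so the
    # "no examples in the subset" base case never arises in this sketch.
    return {best: {
        value: id3([r for r in rows if r[best] == value], remaining, target)
        for value in set(row[best] for row in rows)
    }}

# A small made-up sample (same column names as the PlayTennis dataset).
sample = [
    {"Outlook": "Sunny",    "Wind": "Weak",   "PlayTennis": "No"},
    {"Outlook": "Sunny",    "Wind": "Strong", "PlayTennis": "No"},
    {"Outlook": "Overcast", "Wind": "Weak",   "PlayTennis": "Yes"},
    {"Outlook": "Rain",     "Wind": "Weak",   "PlayTennis": "Yes"},
    {"Outlook": "Rain",     "Wind": "Strong", "PlayTennis": "No"},
]
print(id3(sample, ["Outlook", "Wind"], "PlayTennis"))
# Splits on Outlook first (highest gain), then on Wind under the Rain branch.
```

Run on the full 14-row dataset with all four attributes, the same function produces the classic PlayTennis tree rooted at Outlook.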