In this tutorial, we will create a decision tree to classify data. But before we dive into how to do that, let us get some of the fundamentals out of the way.

#### What are graphs and trees?

A graph is written as an ordered pair \(G = (V, E)\), where \(V\) is a set of vertices (or nodes) and \(E\) is a set of edges. Each edge, \(e_i\), in the set of edges, \(E\), connects two vertices from \(V\).

A tree is an undirected graph in which any two vertices are connected by exactly one path. Equivalently, a tree is an undirected graph that is connected and contains no cycles.

A tree with \(n\) vertices always has \({n-1}\) edges.
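The edge-count property is easy to verify on a small hand-made tree. The vertex names below are invented for illustration; the tree is stored as an adjacency list, a common representation for graphs:

```python
# A small tree with 5 vertices, stored as an adjacency list.
# Edges: A-B, A-C, B-D, B-E (each undirected edge is listed from both ends).
tree = {
    "A": ["B", "C"],
    "B": ["A", "D", "E"],
    "C": ["A"],
    "D": ["B"],
    "E": ["B"],
}

n_vertices = len(tree)
# Each undirected edge appears twice in the adjacency list, so divide by 2.
n_edges = sum(len(neighbours) for neighbours in tree.values()) // 2

assert n_edges == n_vertices - 1  # a tree with n vertices has n-1 edges
```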

#### What is a decision tree?

A decision tree is a flowchart-like structure in which each internal node is a decision node, and each leaf is a class. As we traverse the tree from the root, we make a decision based on some parameter in each internal node and based on that decision, we move to the next node. When we arrive at a leaf node, we have successfully traversed the classification path and are ready to classify our query.

For example, consider the following dataset –

Day | Outlook | Temperature | Humidity | Wind | PlayTennis |
---|---|---|---|---|---|
D1 | Sunny | Hot | High | Weak | No |
D2 | Sunny | Hot | High | Strong | No |
D3 | Overcast | Hot | High | Weak | Yes |
D4 | Rain | Mild | High | Weak | Yes |
D5 | Rain | Cool | Normal | Weak | Yes |
D6 | Rain | Cool | Normal | Strong | No |
D7 | Overcast | Cool | Normal | Strong | Yes |
D8 | Sunny | Mild | High | Weak | No |
D9 | Sunny | Cool | Normal | Weak | Yes |
D10 | Rain | Mild | Normal | Weak | Yes |
D11 | Sunny | Mild | Normal | Strong | Yes |
D12 | Overcast | Mild | High | Strong | Yes |
D13 | Overcast | Hot | Normal | Weak | Yes |
D14 | Rain | Mild | High | Strong | No |
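To experiment with the calculations that follow, the table can be encoded in code. Here is one possible representation, a list of Python dictionaries keyed by the column names from the table (an implementation choice for this tutorial, not a fixed format):

```python
from collections import Counter

# The PlayTennis dataset from the table above.
dataset = [
    {"Day": "D1",  "Outlook": "Sunny",    "Temperature": "Hot",  "Humidity": "High",   "Wind": "Weak",   "PlayTennis": "No"},
    {"Day": "D2",  "Outlook": "Sunny",    "Temperature": "Hot",  "Humidity": "High",   "Wind": "Strong", "PlayTennis": "No"},
    {"Day": "D3",  "Outlook": "Overcast", "Temperature": "Hot",  "Humidity": "High",   "Wind": "Weak",   "PlayTennis": "Yes"},
    {"Day": "D4",  "Outlook": "Rain",     "Temperature": "Mild", "Humidity": "High",   "Wind": "Weak",   "PlayTennis": "Yes"},
    {"Day": "D5",  "Outlook": "Rain",     "Temperature": "Cool", "Humidity": "Normal", "Wind": "Weak",   "PlayTennis": "Yes"},
    {"Day": "D6",  "Outlook": "Rain",     "Temperature": "Cool", "Humidity": "Normal", "Wind": "Strong", "PlayTennis": "No"},
    {"Day": "D7",  "Outlook": "Overcast", "Temperature": "Cool", "Humidity": "Normal", "Wind": "Strong", "PlayTennis": "Yes"},
    {"Day": "D8",  "Outlook": "Sunny",    "Temperature": "Mild", "Humidity": "High",   "Wind": "Weak",   "PlayTennis": "No"},
    {"Day": "D9",  "Outlook": "Sunny",    "Temperature": "Cool", "Humidity": "Normal", "Wind": "Weak",   "PlayTennis": "Yes"},
    {"Day": "D10", "Outlook": "Rain",     "Temperature": "Mild", "Humidity": "Normal", "Wind": "Weak",   "PlayTennis": "Yes"},
    {"Day": "D11", "Outlook": "Sunny",    "Temperature": "Mild", "Humidity": "Normal", "Wind": "Strong", "PlayTennis": "Yes"},
    {"Day": "D12", "Outlook": "Overcast", "Temperature": "Mild", "Humidity": "High",   "Wind": "Strong", "PlayTennis": "Yes"},
    {"Day": "D13", "Outlook": "Overcast", "Temperature": "Hot",  "Humidity": "Normal", "Wind": "Weak",   "PlayTennis": "Yes"},
    {"Day": "D14", "Outlook": "Rain",     "Temperature": "Mild", "Humidity": "High",   "Wind": "Strong", "PlayTennis": "No"},
]

# Tally the class labels: the dataset has 9 Yes and 5 No instances.
counts = Counter(row["PlayTennis"] for row in dataset)
```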

#### Learning a Decision Tree

We start with the set of all data and partition it into subsets based on an attribute. We then recursively repeat the same process on each subset until every subset either contains elements of only one class or cannot be partitioned any further. This top-down partitioning method is an example of a greedy algorithm.

Now, given any set, the question we have to answer is: which attribute do we select for the partitioning? We want to select the attribute that splits the classes (**YES** and **NO** in the above dataset) with the minimum impurity, i.e., the least mixing of **YES** and **NO** instances in the resulting subsets. This measure of impurity, or randomness, is given by entropy.

#### Entropy and Information Gain

Let’s get the formulas out of the way. Don’t be scared – they look scarier than they actually are.

**Entropy** is the measure of the impurity (chaos) in the data. It is given by –

\(H(S) = \sum\limits_{x {\in} X}p(x)\times\log_2{\frac{1}{p(x)}} \equiv -\sum\limits_{x {\in} X}p(x)\times\log_2{p(x)} \)

where,

- \(S\) is the set for which the entropy is being calculated
- \(X\) is the set of all classes belonging to the set \(S\)
- \(p(x)\) is the proportion of elements of class \(x\) among the total number of elements in \(S\)
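The entropy formula translates directly into a few lines of Python. This is a sketch of one possible helper, not a reference implementation; it counts the class frequencies and sums \(-p(x)\log_2 p(x)\) over them:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(S) = -sum over classes x of p(x) * log2(p(x))."""
    total = len(labels)
    return -sum((count / total) * log2(count / total)
                for count in Counter(labels).values())

# The full dataset above has 9 YES and 5 NO labels:
H = entropy(["Yes"] * 9 + ["No"] * 5)   # ≈ 0.940286
```

Note that a pure set (all labels identical) has entropy 0, and a perfectly mixed two-class set has entropy 1, the two extremes of impurity.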

**Information Gain** is the amount of change in entropy on splitting the dataset based on any attribute. It is the amount of gain in information (or reduction in entropy/impurity/chaos) after splitting the dataset. It is given by –

\(IG(T,a) = H(T)-H(T|a)\)

i.e., the information gain on the set \(T\) from splitting on attribute \(a\) is the entropy after the split, \(H(T|a)\), subtracted from the entropy before the split, \(H(T)\). Phew!
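Putting the two formulas together gives a possible sketch of an information-gain helper. It assumes each data instance is a dictionary keyed by column name, as in the table above; that representation is an implementation choice, not something fixed by the formulas. \(H(T|a)\) is computed as the weighted average entropy of the subsets produced by the split:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(S) = -sum over classes x of p(x) * log2(p(x))."""
    total = len(labels)
    return -sum((count / total) * log2(count / total)
                for count in Counter(labels).values())

def information_gain(rows, attribute, target="PlayTennis"):
    """IG(T, a) = H(T) - H(T|a), where H(T|a) is the weighted
    average entropy of the subsets produced by splitting on `attribute`."""
    labels = [row[target] for row in rows]
    # Group the target labels by the value of the splitting attribute.
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attribute], []).append(row[target])
    # Weight each subset's entropy by the fraction of rows it contains.
    after = sum(len(part) / len(rows) * entropy(part)
                for part in partitions.values())
    return entropy(labels) - after
```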

#### Confused? Let’s work out an example

Refer to the dataset table above. Let us work out the example for the attribute **Wind**. Before the split, there are **14** data instances in total: **9** have output **YES** and **5** have output **NO**. So, the entropy before the split is –

\(I_E([9,5]) = -\frac{9}{14}\times\log_2\frac{9}{14}-\frac{5}{14}\times\log_2\frac{5}{14} = 0.940286\)

Observe that the attribute **Wind** has two values – **Strong** and **Weak**. If we take \(Wind=Strong\), then we have **6** of the **14** instances of the original data, of which **3** are **YES** and **3** are **NO**. So, the entropy after splitting on **Wind**, on the branch given by \(Wind=Strong\), is –

\(I_E([3,3]) = -\frac{3}{6}\times\log_2\frac{3}{6}-\frac{3}{6}\times\log_2\frac{3}{6} = -\frac{1}{2}\times\log_2\frac{1}{2}-\frac{1}{2}\times\log_2\frac{1}{2} = 1\)

Similarly, on \(Wind=Weak\) we have **8** instances, of which **6** are **YES** and **2** are **NO**. So, the entropy is given by –

\(I_E([6,2]) = -\frac{6}{8}\times\log_2\frac{6}{8}-\frac{2}{8}\times\log_2\frac{2}{8} = -\frac{3}{4}\times\log_2\frac{3}{4}-\frac{1}{4}\times\log_2\frac{1}{4} = 0.8112781\)

To find the total entropy after the split, we take the weighted sum of the entropies of the branches produced by the split (**two** in this case – \(Wind=Strong\) and \(Wind=Weak\)), where each weight is the fraction of instances that fall into that branch –

\(I_E([3,3],[6,2]) = \frac{6}{14}\times{1}+\frac{8}{14}\times{0.8112781} = 0.8921589\)

Finally, we are ready to find the **information gain** if we split the dataset on the attribute **Wind**. It is the difference between the entropy before and after the split –

\(IG(Wind) = I_E([9,5]) - I_E([3,3],[6,2]) = 0.940286 - 0.8921589 = 0.0481271\)
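The arithmetic above can be replayed in a few lines. The helper `entropy2` below is a hypothetical convenience, invented here, that takes the YES/NO counts directly rather than a list of labels:

```python
from math import log2

def entropy2(yes, no):
    """Entropy of a set with `yes` positive and `no` negative examples."""
    total = yes + no
    h = 0.0
    for count in (yes, no):
        p = count / total
        if p > 0:                  # skip empty classes: 0*log2(0) is taken as 0
            h -= p * log2(p)
    return h

before = entropy2(9, 5)                       # I_E([9,5])  ≈ 0.940286
strong = entropy2(3, 3)                       # I_E([3,3])  = 1.0
weak = entropy2(6, 2)                         # I_E([6,2])  ≈ 0.8112781
after = (6 / 14) * strong + (8 / 14) * weak   # ≈ 0.8921589
gain = before - after                         # ≈ 0.0481271
```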

#### Now what?

We have found the information gain for the attribute **Wind**. Now we have to compute the information gain for the other three attributes – **Outlook**, **Temperature**, and **Humidity** – and then split the dataset on the attribute that gives the highest information gain. We then recursively keep splitting the resulting subsets using the same technique until we reach one of the following base cases –

- Every element in the subset belongs to the same class
- There are no more attributes to be selected, but the examples still do not belong to the same class
- There are no examples in the subset

#### Summary

- Calculate the entropy of every attribute using the dataset \(S\)
- Split the set \(S\) into subsets using the attribute for which the resulting entropy (after splitting) is minimum (or, equivalently, the information gain is maximum)
- Make a decision tree node containing that attribute
- Recurse on each subset using the remaining attributes
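The summary steps above can be sketched as a small recursive ID3-style learner. This is a minimal illustration under the same assumption as before, that instances are dictionaries keyed by attribute name; the empty-subset base case never arises here because we only recurse on attribute values actually present in the data:

```python
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    labels = [row[target] for row in rows]
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attribute], []).append(row[target])
    after = sum(len(p) / len(rows) * entropy(p) for p in partitions.values())
    return entropy(labels) - after

def id3(rows, attributes, target):
    labels = [row[target] for row in rows]
    # Base case: every example in the subset belongs to the same class.
    if len(set(labels)) == 1:
        return labels[0]
    # Base case: no attributes left -> fall back to the majority class.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Greedy step: split on the attribute with the highest information gain.
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in {row[best] for row in rows}:
        subset = [row for row in rows if row[best] == value]
        branches[value] = id3(subset, remaining, target)
    return (best, branches)
```

A learned tree comes back as nested `(attribute, {value: subtree})` tuples with class labels at the leaves; classifying a query means walking it from the root, exactly as described at the start of the tutorial.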