Decision Trees
So far we have briefly covered an Introduction to Python and Machine Learning, as well as Linear and Logistic Regression. In this section, we'll look at another algorithm that is simple to understand and yet very useful: Decision Trees.
What are Decision Trees?
A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm that only contains conditional control statements.
Sounds a bit complicated, doesn't it?
Let's break it down.
A decision tree builds classification or regression models in the form of a tree structure. It breaks a data set down into smaller and smaller subsets while, at the same time, an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. A decision node has two or more branches, while a leaf node represents a classification or decision. The topmost decision node in a tree, which corresponds to the best predictor, is called the root node. Decision trees can handle both categorical and numerical data.
The basic idea is that any form of prediction (classification or regression) can be made by asking a series of simple questions with simple answers, where each answer narrows down the set of possibilities for the next question.
Imagine going to a doctor with chest congestion. Chest congestion can be a symptom of a large number of diseases. The first question the doctor asks is whether or not you are a smoker. Depending on the answer, a number of possibilities are ruled out. The following questions might then concern the type of cough (wet or dry), the frequency of coughing, and other relevant information. With each question, the number of possibilities shrinks, and eventually the doctor is able to narrow it down to one disease and make a prediction. This is how a decision tree works.
Now that we have the intuition, let's look at some terminology used in decision trees before moving on to their properties.
Important Terminology related to Decision Trees
- Root Node: It represents the entire population or sample, which further gets divided into two or more homogeneous sets.
- Splitting: The process of dividing a node into two or more sub-nodes.
- Decision Node: When a sub-node splits into further sub-nodes, it is called a decision node.
- Leaf/Terminal Node: Nodes with no children (no further splits) are called leaf or terminal nodes.
- Pruning: When we reduce the size of a decision tree by removing nodes (the opposite of splitting), the process is called pruning.
- Branch/Sub-Tree: A subsection of a decision tree is called a branch or sub-tree.
- Parent and Child Node: A node that is divided into sub-nodes is called the parent node of those sub-nodes, whereas the sub-nodes are the children of the parent node.
- Depth of a Tree: The length of the longest path between the root node and a leaf node.
There are a couple of algorithms available to build a decision tree, a few of which are:
- CART (Classification and Regression Trees) → uses the Gini Index (classification) as its metric.
- ID3 (Iterative Dichotomiser 3) → uses the Entropy function and Information Gain as its metrics.
So the main question arises: which question do we ask first, or in other words, which attribute should be used at the root node to split the dataset?
Answer: determine the attribute that best classifies the training data and use it at the root of the tree. Repeat this process for each branch.
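To make this a little more concrete, here is a minimal sketch in Python of how the two splitting metrics mentioned above (Gini impurity and entropy/information gain) can be computed for a candidate split. The function names and the toy smoker/disease data are made up for illustration; real libraries such as scikit-learn handle all of this internally.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2) over the class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Entropy: -sum(p_i * log2(p_i)) over the class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, left_mask):
    """Reduction in entropy obtained by splitting `labels` into two groups."""
    left, right = labels[left_mask], labels[~left_mask]
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted

# Toy data (hypothetical): is the patient a smoker (1/0), and which disease do they have?
smoker = np.array([1, 1, 1, 0, 0, 0, 1, 0])
disease = np.array(['A', 'A', 'A', 'B', 'B', 'B', 'A', 'B'])

# A higher information gain means the attribute separates the classes better,
# so the attribute with the highest gain would be chosen for the split.
print(information_gain(disease, smoker == 1))
```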
Advantages of Decision Trees
- Easy to Understand: Unlike many other ML algorithms, which work like black boxes, decision trees mimic human decision-making and are easier to interpret.
- It can fit almost any kind of data, and non-linear relationships between features do not affect the performance of the tree.
- It can handle both numeric and categorical data.
- It requires very little data preprocessing; for example, feature scaling is not needed.
Disadvantages of Decision Trees
- Prone to Overfitting: Decision-tree learners can create over-complex trees that do not generalize well. This is called overfitting. Trees with greater depth tend to overfit more and may even end up isolating outliers or noisy data points.
- High Variance: Decision trees have high variance and can be unstable, because small variations in the data can result in a completely different tree being generated.
- Decision tree algorithms are greedy by nature, and greedy algorithms are not guaranteed to converge to a global optimum.
Methods to avoid overfitting:
- Controlling Tree Size: Since bigger, more complex trees tend to overfit, we try to control the size by limiting the height (depth), specifying the maximum number of splits to be made, etc. (see the sketch after this list).
- Tree Pruning: Another method is to remove decision nodes that are irrelevant and do not contribute to the correct classification of the model.
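As a rough illustration of these two ideas (assuming scikit-learn; the specific parameter values below are arbitrary), tree size can be limited up front through constructor arguments, and scikit-learn also offers cost-complexity pruning via `ccp_alpha`:

```python
from sklearn.tree import DecisionTreeClassifier

# Limit tree size up front: cap the depth and require a minimum number of
# samples before a node may be split or become a leaf.
small_tree = DecisionTreeClassifier(
    max_depth=4,           # limit the height of the tree
    min_samples_split=10,  # a node needs at least 10 samples to be split
    min_samples_leaf=5,    # every leaf must contain at least 5 samples
)

# Prune low-value subtrees instead: a larger ccp_alpha removes more of them
# (cost-complexity pruning).
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.01)
```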
Creating Decision Trees with Sci-kit Learn
Unlike Linear and Logistic Regression, we will not be implementing decision trees from scratch. We will use the scikit-learn library in Python, which provides an in-built implementation of decision trees.
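A minimal end-to-end sketch (using scikit-learn's built-in iris dataset purely as an example; your own data would take its place) looks something like this:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import accuracy_score

# Load a small built-in dataset and hold out a test split.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit a decision tree (max_depth chosen arbitrarily to keep the tree small).
clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(X_train, y_train)

# Evaluate on the held-out data and print the learned tree as text.
print("Test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
print(export_text(clf, feature_names=load_iris().feature_names))
```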
Further Learning
We have briefly covered the concept behind decision trees and have provided an implementation to help you. However, we strongly recommend that you study these topics in depth before applying them on your own.
For more information, refer to the following articles: