9.3.2. Decision Trees

Introduction to Decision Trees


Building a decision tree with the training set

Decision trees are built by splitting the training set into distinct nodes, where one node contains all, or most, of one category of the data.


Question

Which of the following sentences are TRUE about Decision Tree?

  • A Decision Tree is a type of clustering approach that can predict the class of a group, for example, DrugA or DrugB.
  • One node in a Decision Tree contains all, or most, of one category of the data.
  • Decision Trees are built by splitting the training set into distinct nodes.

Correct

Decision tree learning algorithm

  1. Choose an attribute from the dataset
  2. Calculate the significance of the attribute in splitting the data
  3. Split the data based on the value of the best attribute
  4. Go back to step 1 and repeat for each branch (a runnable sketch follows below)
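
As a concrete sketch of these four steps, scikit-learn's DecisionTreeClassifier performs exactly this kind of greedy recursive splitting. The example below is a minimal illustration under our own assumptions (the tiny made-up dataset, its column names, and the drug labels), not the course's lab code; criterion="entropy" tells the tree to rate the significance of each split with entropy and information gain, which the rest of this section explains.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up miniature of the drug dataset discussed in this section.
df = pd.DataFrame({
    "Sex":         ["F", "M", "M", "F", "M", "F"],
    "Cholesterol": ["HIGH", "NORMAL", "HIGH", "NORMAL", "NORMAL", "HIGH"],
    "Drug":        ["drugB", "drugB", "drugA", "drugB", "drugB", "drugA"],
})

# Encode the categorical features as integers so scikit-learn can split on them.
X = df[["Sex", "Cholesterol"]].apply(LabelEncoder().fit_transform)
y = df["Drug"]

# criterion="entropy" chooses each split by information gain (steps 1-3);
# the recursion over the resulting branches is step 4.
tree = DecisionTreeClassifier(criterion="entropy")
tree.fit(X, y)
print(export_text(tree, feature_names=["Sex", "Cholesterol"]))
```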

Building Decision Trees

Which attribute is the best?

What is important in building a decision tree is determining which attribute is the best, that is, the most predictive, one to split the data on.


Let's say we pick cholesterol as the first attribute to split the data. It splits our data into two branches. As you can see, if the patient has high cholesterol, we cannot say with high confidence that drug B is suitable for them. Likewise, if the patient's cholesterol is normal, we still don't have sufficient evidence to determine whether drug A or drug B is suitable. This is an example of poor attribute selection for splitting the data.


Again, we have our 14 cases; this time we pick the sex attribute of the patients. It splits our data into two branches, male and female. As you can see, if the patient is female, we can say with high certainty that drug B is suitable for her. But if the patient is male, we don't have sufficient evidence to determine whether drug A or drug B is suitable. However, this is still a better choice than the cholesterol attribute, because the resulting nodes are more pure, meaning nodes that are mostly drug A or mostly drug B. So we can say the sex attribute is more significant than cholesterol, or in other words, more predictive than the other attributes. Indeed, predictiveness is based on the decrease in the impurity of the nodes.

We're looking for the best feature to decrease the impurity of the patients in the leaves after splitting on that feature. The sex feature is a good candidate in this case because it yields nearly pure nodes.


Let's go one step further. For the male patient branch, we again test the other attributes to split the sub-tree. Testing cholesterol here results in even more pure leaves, so we can easily make a decision. For example, if a patient is male and his cholesterol is high, we can certainly prescribe drug A, but if it is normal, we can prescribe drug B with high confidence. As you might notice, the choice of attribute to split on is very important, and it is all about the purity of the leaves after the split. A node in the tree is considered pure if 100 percent of its cases fall into a single category of the target field.

In fact, the method uses recursive partitioning to split the training records into segments by minimizing the impurity at each step. The impurity of a node is calculated from the entropy of the data in that node.

Entropy

  • A measure of randomness or uncertainty in the data
  • The lower the entropy, the less uniform the distribution, and the purer the node
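
For a two-class node like ours, the standard definition behind all of the numbers below is:

Entropy = −p(A) · log2 p(A) − p(B) · log2 p(B)

where p(A) and p(B) are the proportions of drug A and drug B patients in the node.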


Which attribute is the best one to use?


As an example, let's calculate the entropy of the dataset before splitting it. We have nine occurrences of drug B and five of drug A. Plugging these numbers into the entropy formula gives the impurity of the target attribute before splitting; in this case it is 0.94. So, what is the entropy after splitting?

S: [9 B, 5 A]
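
A few lines of Python reproduce this figure. The entropy helper below is our own sketch of the formula above, not code from the course:

```python
import math

def entropy(counts):
    """Entropy of a node, given the class counts inside it."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Root node before splitting: 9 patients on drug B, 5 on drug A.
print(round(entropy([9, 5]), 2))  # 0.94
```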

Is 'cholesterol' the best attribute?

Now we can test the different attributes to find the most predictive one, that is, the one that results in two purer branches.


Let's first select the cholesterol of the patient and see how the data gets split based on its values. For example, when it is normal, we have six for drug B and two for drug A. We can calculate the entropy of this node from the distribution of drug A and drug B, which is 0.811 in this case. But when cholesterol is high, the data splits into three for drug B and three for drug A. Calculating its entropy, we can see it is 1.0.
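
Using the same entropy helper from the sketch above, both branch values check out:

```python
print(round(entropy([6, 2]), 3))  # cholesterol = normal: 6 drug B, 2 drug A -> 0.811
print(round(entropy([3, 3]), 3))  # cholesterol = high:   3 drug B, 3 drug A -> 1.0
```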

We should go through all the attributes and calculate the entropy after the split and then choose the best attribute.

What about 'Sex' attribute?

Let's try another field. Let's choose the sex attribute for the next check.


As you can see, when we use the sex attribute to split the data, when its value is female, we have three patients that responded to drug B and four that responded to drug A. The entropy for this node is 0.98, which is not very promising. However, on the other side of the branch, when the value of the sex attribute is male, the result is more pure, with six for drug B and only one for drug A. The entropy for this group is 0.59.

Which attribute is the best?

Now the question is: between the cholesterol and sex attributes, which is the better choice? Which one is better as the first attribute to divide the dataset into two branches? In other words, which attribute results in purer nodes for our drugs, and in which tree do we have less entropy after splitting than before?


The sex attribute, with entropies of 0.98 and 0.59, or the cholesterol attribute, with entropies of 0.81 and 1.0 in its branches?

The answer is the tree with the higher information gain after splitting.

What is information gain?

  • Information gain is the increase in the level of certainty after a split.
  • Information gain = (entropy before split) − (weighted entropy after split)


We can think of information gain and entropy as opposites. As entropy (the amount of randomness) decreases, information gain (the amount of certainty) increases, and vice versa. So constructing a decision tree is all about finding the attributes that return the highest information gain.

Which attribute is the best?


  • Gain(S, Sex) = 0.940 − [(7/14) × 0.985 + (7/14) × 0.592] = 0.151
  • Gain(S, Cholesterol) = 0.940 − [(8/14) × 0.811 + (6/14) × 1.0] = 0.048
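
Both gains can be reproduced with the entropy helper sketched earlier; the weights are simply each branch's share of the 14 patients. (The exact value for sex is 0.152; the 0.151 above comes from rounding the branch entropies first.)

```python
def information_gain(parent_counts, branch_counts):
    """Entropy before the split minus the weighted entropy after it."""
    total = sum(parent_counts)
    weighted = sum(sum(b) / total * entropy(b) for b in branch_counts)
    return entropy(parent_counts) - weighted

root = [9, 5]                                              # 9 drug B, 5 drug A
print(round(information_gain(root, [[3, 4], [6, 1]]), 3))  # sex: F, M -> 0.152
print(round(information_gain(root, [[6, 2], [3, 3]]), 3))  # cholesterol -> 0.048
```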

The question is: which attribute is more suitable? As mentioned, it is the one whose tree has the higher information gain after splitting, which means the sex attribute. So we select the sex attribute as the first splitter.

Question

What is the meaning of Entropy in Decision Tree?

  • The entropy in a node is the weighted information in its parent node.
  • The entropy in a node is the number of similar data in that node.
  • The entropy in a node is the amount of information disorder calculated in each node.

Correct

Correct way to build a decision tree


We select the sex attribute as the first splitter. Now, what is the next attribute after branching by sex? As you can guess, we should repeat the process for each branch, testing each of the remaining attributes, until we reach the purest possible leaves.
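
Putting the whole section together, an ID3-style sketch of this recursion might look like the following. It reuses the entropy and information_gain helpers from above and assumes the records are plain dicts with a "Drug" label; it is an illustration of the idea, not the course's reference implementation:

```python
from collections import Counter

def best_attribute(records, attributes, label="Drug"):
    """Steps 1-2: pick the attribute with the highest information gain."""
    parent = list(Counter(r[label] for r in records).values())
    def gain(attr):
        branches = {}
        for r in records:
            branches.setdefault(r[attr], []).append(r[label])
        return information_gain(
            parent, [list(Counter(v).values()) for v in branches.values()]
        )
    return max(attributes, key=gain)

def build_tree(records, attributes, label="Drug"):
    labels = {r[label] for r in records}
    if len(labels) == 1 or not attributes:  # pure leaf, or nothing left to test
        return Counter(r[label] for r in records).most_common(1)[0][0]
    attr = best_attribute(records, attributes, label)
    tree = {attr: {}}
    for value in {r[attr] for r in records}:                  # step 3: split
        subset = [r for r in records if r[attr] == value]
        rest = [a for a in attributes if a != attr]
        tree[attr][value] = build_tree(subset, rest, label)   # step 4: recurse
    return tree
```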