Tree Analysis
Visualization
We can visualize the tree using the export_graphviz function from the tree module. This writes a file in the .dot format, a text format for storing graphs. We set an option to color the nodes to reflect the majority class in each node, and we pass the class names and feature names so the tree can be properly labeled.
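The snippets on this page assume a decision tree classifier already fitted on the Breast Cancer dataset. A minimal setup sketch (the variable names tree and cancer, the depth of 4, and the split settings are assumptions, not fixed by this page):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumed setup: fit a depth-4 tree on the breast cancer training split
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(X_train, y_train)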
Storing graphs
from sklearn.tree import export_graphviz
# Assuming the load_breast_cancer dataset and the fitted tree from above
export_graphviz(tree, out_file="tree.dot", class_names=["malignant", "benign"],
                feature_names=cancer.feature_names, impurity=False, filled=True)
Reading graphs
import graphviz

with open("tree.dot") as f:
    dot_graph = f.read()
# Displays the graph inline when run in a Jupyter notebook
graphviz.Source(dot_graph)
Figure: graph of the depth-4 decision tree classifier
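Outside a notebook, where the Source object does not display inline, the graph can be written to an image file instead. A small sketch (assumes the Graphviz system binaries are installed in addition to the Python package; the output filename is arbitrary):

# Writes the dot source to "tree" and the rendered image to "tree.png"
graphviz.Source(dot_graph, format="png").render("tree")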
The n_samples shown in each node gives the number of samples in that node, while value provides the number of samples per class.
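The same information can also be read programmatically from the fitted estimator's tree_ attribute; a small sketch, assuming tree is the classifier fitted above:

inner_tree = tree.tree_
print("samples per node:", inner_tree.n_node_samples)
# Per-class values for each node (counts or fractions, depending on the scikit-learn version)
print("values per node:\n", inner_tree.value)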
Feature Importance
Instead of looking at the whole tree, which can be taxing, there are some useful properties that we can derive to summarize the workings of the tree. The most commonly used summary is feature importance, which rates how important each feature is for the decision a tree makes. It is a number between 0 and 1 for each feature, where 0 means “not used at all” and 1 means “perfectly predicts the target.” The feature importances always sum to 1.
print("Feature importances:\n{}".format(tree.feature_importances_))
or
import matplotlib.pyplot as plt
import numpy as np

def plot_feature_importances_cancer(model):
    n_features = cancer.data.shape[1]
    plt.barh(range(n_features), model.feature_importances_, align='center')
    plt.yticks(np.arange(n_features), cancer.feature_names)
    plt.xlabel("Feature importance")
    plt.ylabel("Feature")

plot_feature_importances_cancer(tree)
If a feature has a low feature_importance, it doesn’t mean that this feature is uninformative. It only means that the feature was not picked by the tree, likely because another feature encodes the same information.
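A tiny synthetic check of that effect (a sketch; the duplicated-column setup and variable names are assumptions): with two identical columns, a single split only needs one of them, so the other ends up with an importance of 0 even though it is equally informative.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Duplicate the first breast cancer feature so both columns carry the same information
X_dup = np.hstack([cancer.data[:, :1], cancer.data[:, :1]])
tree_dup = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_dup, cancer.target)
print(tree_dup.feature_importances_)  # all importance lands on one of the two columns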
In contrast to the coefficients in linear models, feature importances are always positive and don't encode which class a feature is indicative of. They can tell us that a specific feature is important, but not whether a high value of that feature points to one class or the other, because the relationship between a feature and the class can be nonmonotonic.
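A small illustration of such a nonmonotonic relationship (a sketch; the synthetic data and names are assumptions): the class is positive only for intermediate feature values, so the feature is clearly important, yet neither a high nor a low value indicates the positive class on its own.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X_syn = rng.uniform(-3, 3, size=(200, 1))
y_syn = ((X_syn[:, 0] > -1) & (X_syn[:, 0] < 1)).astype(int)  # positive only in the middle

tree_syn = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_syn, y_syn)
print(tree_syn.feature_importances_)  # ~1.0: the only feature, and an important one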