# 18 01 Classification And Regression Tree
This chapter covers scikit-learn's `DecisionTreeClassifier` and `DecisionTreeRegressor`.
## 01 Decision Tree for Classification
- Sequence of if-else questions about individual features.
- Objective: infer class labels.
- Able to capture non-linear relationships between features and labels.
- Does not require feature scaling (e.g., standardization).
### Decision Regions & Boundary
- Decision region: region in the feature space where all instances are assigned to one class label.
- Decision boundary: surface separating different decision regions.
```python
# Import DecisionTreeClassifier from sklearn.tree
from sklearn.tree import DecisionTreeClassifier

SEED = 1  # seed for reproducibility (value assumed; set earlier in the exercise)

# Instantiate a DecisionTreeClassifier 'dt' with a maximum depth of 6
dt = DecisionTreeClassifier(max_depth=6, random_state=SEED)

# Fit dt to the training set
dt.fit(X_train, y_train)

# Import accuracy_score
from sklearn.metrics import accuracy_score

# Predict test set labels
y_pred = dt.predict(X_test)

# Compute test set accuracy
acc = accuracy_score(y_test, y_pred)
print("Test set accuracy: {:.2f}".format(acc))  # 0.89
```
## 02 Classification-Tree Learning
### Building Blocks of a Decision-Tree
- Decision tree: data structure consisting of a hierarchy of nodes.
- Node: a question or a prediction.
- Root: no parent node; a question giving rise to two children nodes.
- Internal node: one parent node; a question giving rise to two children nodes.
- Leaf: one parent node, no children nodes --> prediction.
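One way to inspect this hierarchy on a fitted tree is scikit-learn's `export_text` helper. A minimal sketch, reusing the `dt` classifier fitted above (the feature names are placeholders for whatever the dataset actually uses):

```python
from sklearn.tree import export_text

# Prints the tree as nested if-else rules: the root question first, internal-node
# questions indented beneath it, and leaves shown as 'class:' lines
print(export_text(dt, feature_names=['feature_0', 'feature_1']))
```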
### Information Gain
- The nodes of a classification tree are grown recursively: whether a node becomes an internal node or a leaf depends on the state of its predecessors.
- To produce the purest leaves possible, at each node a tree asks a question involving one feature f and a split-point sp.
- The tree aims at maximizing the information gain obtained after each split (see the formulas after this list).
- Criteria to measure the impurity of a node I(node):
  - Gini index
  - entropy, ...
- Most of the time, the Gini index and entropy lead to the same results. The Gini index is slightly faster to compute and is the default criterion used by scikit-learn's DecisionTreeClassifier.
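For a split of a parent node on feature f at split-point sp, the information gain is the weighted decrease in impurity (standard definition; notation assumed):

$$
IG(f, sp) = I(\text{parent}) - \left(\frac{N_{\text{left}}}{N}\, I(\text{left}) + \frac{N_{\text{right}}}{N}\, I(\text{right})\right)
$$

where $N_{\text{left}}$, $N_{\text{right}}$, and $N$ count the samples in the left child, the right child, and the parent. With class proportions $p_k$ inside a node, the Gini index is $1 - \sum_k p_k^2$ and the entropy is $-\sum_k p_k \log_2 p_k$.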
### Classification-Tree Learning
- Nodes are grown recursively.
- At each node, the data are split on the feature f and split-point sp that maximize IG(node).
- If IG(node) = 0 (for an unconstrained tree), declare the node a leaf.
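A minimal NumPy sketch of this criterion, with hypothetical helper names and integer class labels, to make the split rule concrete:

```python
import numpy as np

def gini(y):
    # Impurity of a node: 1 minus the sum of squared class proportions
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(y_parent, y_left, y_right):
    # Impurity of the parent minus the weighted impurity of the two children
    n = len(y_parent)
    return gini(y_parent) - (len(y_left) / n * gini(y_left)
                             + len(y_right) / n * gini(y_right))

# A perfectly class-separating split of a 50/50 parent gives the maximum gain
print(information_gain(np.array([0, 0, 1, 1]),
                       np.array([0, 0]), np.array([1, 1])))  # 0.5
```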
```python
# Instantiate dt_entropy, set 'entropy' as the information criterion
dt_entropy = DecisionTreeClassifier(max_depth=8, criterion='entropy',
                                    random_state=1)
```
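To check the claim that the two criteria usually agree, a sketch that fits an otherwise identical Gini tree on the same split and compares test accuracies (reusing the variable names from the exercises above):

```python
# Fit the entropy tree and a Gini-based counterpart to the same training set
dt_entropy.fit(X_train, y_train)
dt_gini = DecisionTreeClassifier(max_depth=8, criterion='gini', random_state=1)
dt_gini.fit(X_train, y_train)

# The two test accuracies typically differ very little
print('Accuracy (entropy):', accuracy_score(y_test, dt_entropy.predict(X_test)))
print('Accuracy (gini):', accuracy_score(y_test, dt_gini.predict(X_test)))
```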
## 03 Decision-Tree for Regression
- Target is continuous.
- Information criterion for a regression tree: MSE(node), defined below.
- Prediction: the mean of the target values y in the leaf.
- Flexible: captures non-linear relationships between features and target.
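Concretely, the impurity of a node is the mean squared error of its samples around the node mean, which is also what a leaf predicts:

$$
MSE(\text{node}) = \frac{1}{N_{\text{node}}} \sum_{i \in \text{node}} \left(y^{(i)} - \hat{y}_{\text{node}}\right)^2,
\qquad
\hat{y}_{\text{node}} = \frac{1}{N_{\text{node}}} \sum_{i \in \text{node}} y^{(i)}
$$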
```python
# Import DecisionTreeRegressor from sklearn.tree
from sklearn.tree import DecisionTreeRegressor

# Instantiate dt
dt = DecisionTreeRegressor(max_depth=8,
                           min_samples_leaf=0.13,  # each leaf must contain at least 13% of the training data
                           random_state=3)

# Fit dt to the training set
dt.fit(X_train, y_train)

# Import mean_squared_error from sklearn.metrics as MSE
from sklearn.metrics import mean_squared_error as MSE

# Compute y_pred
y_pred = dt.predict(X_test)

# Compute mse_dt
mse_dt = MSE(y_test, y_pred)

# Compute rmse_dt
rmse_dt = mse_dt**(1/2)

# Print rmse_dt
print("Test set RMSE of dt: {:.2f}".format(rmse_dt))  # 4.37, smaller than that of a LinearRegression model
```