Experiment 4: Line-by-Line Code Explanation
Aim
Demonstrate the working of the Decision Tree-based ID3 algorithm using the given dataset.
Code Explanation
Here's the step-by-step explanation of the provided code:
Import Libraries
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd
from io import StringIO
from IPython.display import Image
import pydotplus
- `DecisionTreeClassifier`: Classifier for decision trees in scikit-learn.
- `export_graphviz`: Exports a decision tree in DOT format for visualization.
- `train_test_split`: Splits the dataset into training and testing sets.
- `accuracy_score`: Calculates the accuracy of the classifier.
- `pandas`: Library for data manipulation and analysis.
- `StringIO`: In-memory stream for string operations.
- `Image`: Displays images in IPython/Jupyter notebooks.
- `pydotplus`: A Python interface to Graphviz for creating visualizations.
Define and Prepare the Dataset
data = {
'Price': ['Low', 'Low', 'Low', 'Low', 'Low', 'Med', 'Med', 'Med', 'Med', 'High', 'High', 'High', 'High'],
'Maintenance': ['Low', 'Med', 'Low', 'Med', 'High', 'Med', 'Med', 'High', 'High', 'Med', 'Med', 'High', 'High'],
'Capacity': ['2', '4', '4', '4', '4', '4', '4', '2', '5', '4', '2', '2', '5'],
'Airbag': ['No', 'Yes', 'No', 'No', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes'],
'Profitable': [1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1]
}
df = pd.DataFrame(data)
- `data`: Dictionary representing the dataset.
- `df`: DataFrame created from the dictionary.
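A quick sanity check (a sketch, not part of the original program) confirms what was just built:

```python
# Inspect the raw frame and the label distribution before encoding
print(df.head())
print(df['Profitable'].value_counts())  # 8 profitable (1) vs. 5 not profitable (0)
```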
Convert Categorical Variables into Numerical Ones
df = pd.get_dummies(df, columns=['Price', 'Maintenance', 'Airbag'])
- `pd.get_dummies`: Converts categorical variables into dummy/indicator variables (one-hot encoding).
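As an illustration (a sketch, not part of the original program), printing the columns after encoding shows the new indicator variables. Note that `Capacity` is not in the `columns` list, so it keeps its string values (`'2'`, `'4'`, `'5'`); scikit-learn coerces such numeric strings to floats when fitting.

```python
# Each category becomes its own indicator column after one-hot encoding
print(df.columns.tolist())
# Expected along these lines:
# ['Capacity', 'Profitable', 'Price_High', 'Price_Low', 'Price_Med',
#  'Maintenance_High', 'Maintenance_Low', 'Maintenance_Med',
#  'Airbag_No', 'Airbag_Yes']
```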
Separate Features and Target Variable
X = df.drop('Profitable', axis=1)
y = df['Profitable']
- `X`: Features excluding the target variable.
- `y`: Target variable (`Profitable`).
Split the Data into Training and Testing Sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
- `train_test_split`: Splits the dataset into training and testing sets.
- `test_size=0.2`: 20% of the data is used for testing.
- `random_state=42`: Ensures reproducibility of the split.
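With 13 rows and `test_size=0.2`, scikit-learn rounds the test set up to 3 samples, leaving 10 for training. A quick check (a sketch, not in the original program):

```python
# ceil(13 * 0.2) = 3 test rows; the remaining 10 rows train the model
print(X_train.shape, X_test.shape)  # expected: (10, 9) (3, 9)
```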
Create and Train the Decision Tree Classifier
clf = DecisionTreeClassifier(criterion='entropy')
clf.fit(X_train, y_train)
- `DecisionTreeClassifier(criterion='entropy')`: Creates a decision tree classifier that chooses splits by information gain, the same measure ID3 uses. (Strictly, scikit-learn implements an optimized version of CART, so this is an ID3-style tree rather than a literal ID3 implementation; a worked entropy computation follows below.)
- `clf.fit(X_train, y_train)`: Trains the classifier on the training data.
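To connect this to ID3's mechanics, here is a short, self-contained sketch (not part of the original program) that computes the entropy of the `Profitable` labels by hand:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a collection of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# The 13 'Profitable' labels from the dataset above: 8 ones, 5 zeros
profitable = [1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1]
print(entropy(profitable))  # ≈ 0.961 bits; ID3 picks the split that lowers this most
```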
Predict and Evaluate the Model
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
- `clf.predict(X_test)`: Predicts the target values for the test data.
- `accuracy_score(y_test, y_pred)`: Calculates the accuracy of the predictions.
- `print("Accuracy:", accuracy)`: Prints the accuracy of the classifier.
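To see where that number comes from, a minimal sketch (not part of the original program, reusing `y_test` and `y_pred` from above) compares each prediction with its true label; with 3 test samples, 2 correct predictions give the reported 2/3:

```python
# Walk through the test set one sample at a time
for true, pred in zip(y_test, y_pred):
    verdict = 'correct' if true == pred else 'wrong'
    print(f"actual={true}  predicted={pred}  ({verdict})")
```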
Visualize the Decision Tree
dot_data = StringIO()
export_graphviz(clf, out_file=dot_data, filled=True, rounded=True, special_characters=True, feature_names=X.columns)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())
- `StringIO`: Creates an in-memory stream to hold the DOT data.
- `export_graphviz`: Exports the trained decision tree to DOT format.
- `pydotplus.graph_from_dot_data`: Converts the DOT data to a graph.
- `Image(graph.create_png())`: Displays the decision tree as an image.
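If Graphviz or `pydotplus` is unavailable, scikit-learn's built-in `plot_tree` renders the same tree with matplotlib. A minimal sketch (an alternative, not the method used above):

```python
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# Same tree, no external Graphviz dependency
plt.figure(figsize=(12, 8))
plot_tree(clf, feature_names=list(X.columns),
          class_names=['Not Profitable', 'Profitable'],  # class 0, class 1
          filled=True, rounded=True)
plt.show()
```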
Output
Accuracy: 0.6666666666666666
With 13 rows and `test_size=0.2`, the test set has 3 samples; the model classifies 2 of them correctly, giving an accuracy of approximately 66.67%.
Viva Questions and Answers
What is the ID3 algorithm?
- The Iterative Dichotomiser 3 (ID3) algorithm builds a decision tree by recursively selecting, at each node, the attribute that provides the highest information gain (equivalently, the greatest reduction in entropy). It is one of the earliest decision tree learning algorithms.
What is entropy in the context of decision trees?
- Entropy is a measure of uncertainty or impurity in a dataset. In decision trees, entropy helps in deciding which feature to split on by calculating the information gain. Lower entropy indicates a more pure node.
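For reference, the standard definitions behind this answer (not spelled out on the original page): the entropy of a set S, and the information gain from splitting S on an attribute A, are

```math
H(S) = -\sum_i p_i \log_2 p_i,
\qquad
IG(S, A) = H(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\, H(S_v)
```

where p_i is the proportion of class i in S and S_v is the subset of S whose value of A is v.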
What is the purpose of one-hot encoding in this dataset?
- One-hot encoding converts categorical variables into a format that can be provided to ML algorithms to do a better job in prediction. It creates binary columns for each category.
Why do we use train_test_split in machine learning?
- `train_test_split` is used to split the dataset into training and testing sets to evaluate the performance of the model. This helps in assessing how well the model generalizes to unseen data.
What is the criterion='entropy' parameter in DecisionTreeClassifier?
- The `criterion` parameter specifies the function to measure the quality of a split. `'entropy'` uses information gain to select the best feature for splitting.
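As a quick illustration (a sketch reusing the training and test splits defined above), the two built-in criteria can be compared directly; `'gini'` is scikit-learn's default:

```python
from sklearn.tree import DecisionTreeClassifier

# Fit the same tree with each impurity measure and compare test accuracy
for criterion in ('gini', 'entropy'):
    tree = DecisionTreeClassifier(criterion=criterion, random_state=42)
    tree.fit(X_train, y_train)
    print(criterion, tree.score(X_test, y_test))  # score = mean accuracy
```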
How can you interpret the decision tree visualization?
- The decision tree visualization shows the splits made by the model at each node based on different features. It helps to understand how the model makes decisions and the importance of each feature in making those decisions.
What are some potential limitations of using a decision tree?
- Decision trees can be prone to overfitting, especially with complex trees. They may also be unstable with small variations in data, leading to different structures. Pruning or using ensemble methods like Random Forests can help mitigate these issues.