Line-by-line code explanation, Experiment 3b - FarhaKousar1601/DATA-SCIENCE-AND-ITS-APPLICATION-LABORATORY-21AD62- GitHub Wiki

Aim

Train an SVM classifier on the Iris dataset using sklearn. Experiment with different kernels and associated hyperparameters. Train the model with the following set of hyperparameters:

Kernel: RBF
Gamma: 0.5
Decision function shape: One-vs-rest (ovr)
No feature normalization
Values of C: 0.01, 1, 10

Find the best classification accuracy on the test data and the total number of support vectors of the model that achieves it.

Code Explanation

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Set of hyperparameters to try
hyperparameters = [
    {'kernel': 'rbf', 'gamma': 0.5, 'C': 0.01},
    {'kernel': 'rbf', 'gamma': 0.5, 'C': 1},
    {'kernel': 'rbf', 'gamma': 0.5, 'C': 10}
]

best_accuracy = 0
best_model = None
best_support_vectors = None

# Train SVM models with different hyperparameters and find the best accuracy
for params in hyperparameters:
    model = SVC(kernel=params['kernel'], gamma=params['gamma'], C=params['C'], decision_function_shape='ovr')
    model.fit(X_train, y_train)
    accuracy = model.score(X_test, y_test)
    support_vectors = model.n_support_.sum()
    print(f"For hyperparameters: {params}, Accuracy: {accuracy}, Total Support Vectors: {support_vectors}")
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_model = model
        best_support_vectors = support_vectors

print("\nBest accuracy:", best_accuracy)
print("Total support vectors on test data:", best_support_vectors)

Explanation of Each Line

Importing the Required Libraries

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

sklearn.datasets: load_iris is a function that loads the Iris dataset.
sklearn.model_selection: train_test_split is used to split the data into training and testing sets.
sklearn.svm: SVC is used to create the Support Vector Machine (SVM) classifier.

Loading the Iris Dataset

iris = load_iris()
X = iris.data
y = iris.target

load_iris() loads the Iris dataset.
X contains the feature data (sepal length, sepal width, petal length, petal width).
y contains the target labels (species of iris flowers).
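As a quick check of what load_iris() returns, the following minimal sketch prints the array shapes and the names attached to the features and labels. It uses only attributes of the object returned by load_iris().

```python
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

print(X.shape)             # (150, 4): 150 flowers, 4 measurements each
print(y.shape)             # (150,): one integer label per flower
print(iris.feature_names)  # names of the four columns of X
print(iris.target_names)   # species names, indexed by the labels in y
```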

Splitting the Data into Training and Testing Sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

train_test_split(X, y, test_size=0.2, random_state=42) splits the data into training and testing sets.
test_size=0.2 indicates 20% of the data will be used for testing.
random_state=42 ensures reproducibility of the split.
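The effect of the split can be inspected with a short sketch like the one below. The stratify=y variant is not part of the experiment; it is shown only as an optional alternative that keeps the class proportions equal in both subsets.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Same split as in the experiment: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)
print(np.bincount(y_test))          # samples per class in the test set

# Optional variant (an assumption, not used in the experiment):
# stratify=y keeps the three classes equally represented
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(np.bincount(y_te_s))          # [10 10 10]
```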

Setting Hyperparameters to Try

hyperparameters = [
    {'kernel': 'rbf', 'gamma': 0.5, 'C': 0.01},
    {'kernel': 'rbf', 'gamma': 0.5, 'C': 1},
    {'kernel': 'rbf', 'gamma': 0.5, 'C': 10}
]

hyperparameters is a list of dictionaries, each containing a set of hyperparameters to try:
- kernel: Specifies the kernel type to be used in the algorithm. 'rbf' stands for Radial Basis Function.
- gamma: Kernel coefficient for 'rbf'.
- C: Regularization parameter.

Initializing Variables to Track the Best Model

best_accuracy = 0
best_model = None
best_support_vectors = None

best_accuracy stores the highest classification accuracy observed.
best_model stores the model that achieved the best accuracy.
best_support_vectors stores the total number of support vectors of the best model.

Training SVM Models with Different Hyperparameters

for params in hyperparameters:
    model = SVC(kernel=params['kernel'], gamma=params['gamma'], C=params['C'], decision_function_shape='ovr')
    model.fit(X_train, y_train)
    accuracy = model.score(X_test, y_test)
    support_vectors = model.n_support_.sum()
    print(f"For hyperparameters: {params}, Accuracy: {accuracy}, Total Support Vectors: {support_vectors}")
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_model = model
        best_support_vectors = support_vectors

Iterates through each set of hyperparameters.
SVC(kernel=params['kernel'], gamma=params['gamma'], C=params['C'], decision_function_shape='ovr') initializes the SVM model with the specified hyperparameters.
model.fit(X_train, y_train) trains the model using the training data.
accuracy = model.score(X_test, y_test) calculates the accuracy of the model on the testing set.
support_vectors = model.n_support_.sum() calculates the total number of support vectors used by the model. n_support_ holds the number of support vectors for each class, so summing it gives the total.
Prints the accuracy and the total number of support vectors for each set of hyperparameters.
Updates best_accuracy, best_model, and best_support_vectors if the current model has a strictly higher accuracy than any seen so far. On a tie, the earlier model is kept.
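To see where the support-vector count comes from, the sketch below fits one of the models from the loop (C=1) and inspects the fitted attributes n_support_, support_, and support_vectors_. It illustrates that support vectors are rows of the training set, not of the test set.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = SVC(kernel='rbf', gamma=0.5, C=1, decision_function_shape='ovr')
model.fit(X_train, y_train)

print(model.n_support_)              # support vectors per class
print(model.n_support_.sum())        # total number of support vectors
print(model.support_vectors_.shape)  # (total, 4)

# support_ holds row indices into X_train, so the support vectors
# are training samples
same_rows = np.allclose(X_train[model.support_], model.support_vectors_)
print(same_rows)
```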

Printing the Best Accuracy and Total Number of Support Vectors

print("\nBest accuracy:", best_accuracy)
print("Total support vectors on test data:", best_support_vectors)

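Prints the highest classification accuracy and the total number of support vectors of the best model. Despite the label "on test data", support vectors are always training samples; the count describes the model that scored best on the test set.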

Output

For hyperparameters: {'kernel': 'rbf', 'gamma': 0.5, 'C': 0.01}, Accuracy: 0.3, Total Support Vectors: 120
For hyperparameters: {'kernel': 'rbf', 'gamma': 0.5, 'C': 1}, Accuracy: 1.0, Total Support Vectors: 39
For hyperparameters: {'kernel': 'rbf', 'gamma': 0.5, 'C': 10}, Accuracy: 1.0, Total Support Vectors: 31

Best accuracy: 1.0
Total support vectors on test data: 39

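This indicates that the best accuracy is 1.0 (100%), reported with 39 support vectors for the hyperparameters {'kernel': 'rbf', 'gamma': 0.5, 'C': 1}. The model with C = 10 also reaches 1.0 and needs only 31 support vectors, but the loop updates only on a strictly higher accuracy, so it keeps the first model that reaches the top score. With C = 0.01, all 120 training samples become support vectors and the accuracy drops to 0.3.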
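The choice between tied models depends on the selection rule. The sketch below replays the three results printed above and compares the loop's rule (first model with strictly higher accuracy) with a hypothetical alternative that breaks ties in favour of fewer support vectors. The alternative rule is an illustration, not part of the experiment.

```python
# Results copied from the output above
results = [
    {'C': 0.01, 'accuracy': 0.3, 'n_sv': 120},
    {'C': 1,    'accuracy': 1.0, 'n_sv': 39},
    {'C': 10,   'accuracy': 1.0, 'n_sv': 31},
]

# Rule used in the experiment: update only on strictly higher accuracy
first_best = None
for r in results:
    if first_best is None or r['accuracy'] > first_best['accuracy']:
        first_best = r

# Hypothetical alternative: highest accuracy, then fewest support vectors
tie_broken = max(results, key=lambda r: (r['accuracy'], -r['n_sv']))

print(first_best['C'], first_best['n_sv'])  # 1 39
print(tie_broken['C'], tie_broken['n_sv'])  # 10 31
```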

Viva Questions and Answers

What is an SVM classifier?

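Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification and regression tasks. It works by finding the hyperplane that separates the classes with the largest margin. With a kernel such as RBF, this separation happens in an implicit higher-dimensional feature space, which allows non-linear decision boundaries.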

What is the Iris dataset?

The Iris dataset is a classic dataset in the field of machine learning and statistics. It contains 150 samples of iris flowers, with four features (sepal length, sepal width, petal length, petal width) and three target classes (species of iris flowers).

What does the `kernel` parameter in SVC do?

The kernel parameter specifies the kernel type to be used in the SVM algorithm. Common kernels include 'linear', 'poly', 'rbf', and 'sigmoid'. The 'rbf' kernel, or Radial Basis Function kernel, is a popular choice for non-linear data.

What is the purpose of the `gamma` parameter?

The gamma parameter defines how far the influence of a single training example reaches. Low values mean the influence reaches far, which gives a smoother decision boundary; high values mean it stays close to the example, which gives a more complex boundary that can overfit.
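The role of gamma can be made concrete with the RBF kernel formula K(x, z) = exp(-gamma * ||x - z||^2). The sketch below evaluates it by hand for two illustrative points (their values are chosen for the example) and compares the result with scikit-learn's rbf_kernel. The helper name rbf is hypothetical.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def rbf(x, z, gamma):
    """RBF kernel value exp(-gamma * squared Euclidean distance)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

# Two illustrative 4-feature points
x = np.array([5.1, 3.5, 1.4, 0.2])
z = np.array([6.2, 2.9, 4.3, 1.3])

for gamma in (0.05, 0.5, 5.0):
    manual = rbf(x, z, gamma)
    library = rbf_kernel(x.reshape(1, -1), z.reshape(1, -1), gamma=gamma)[0, 0]
    # The similarity shrinks as gamma grows: influence becomes more local
    print(gamma, manual, library)
```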

What does the `C` parameter represent?

The C parameter is the regularization parameter; in scikit-learn the strength of the regularization is inversely proportional to C. It controls the trade-off between a wide margin and few training errors. A small value of C tolerates more margin violations and makes the decision surface smoother, while a large value of C penalizes misclassified training examples more heavily and can lead to overfitting.

Why do we use `train_test_split`?

train_test_split is used to split the dataset into a training set and a testing set. This allows us to train the model on one subset of the data and evaluate its performance on a separate subset, which gives an estimate of how well the model generalizes to unseen data and helps to detect overfitting.

What does `decision_function_shape='ovr'` mean?

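decision_function_shape='ovr' stands for "one-vs-rest". In scikit-learn's SVC it controls only the shape of the array returned by decision_function: with 'ovr' the result has shape (n_samples, n_classes), and with 'ovo' it has shape (n_samples, n_classes * (n_classes - 1) / 2). Internally, SVC always trains one-vs-one classifiers for multi-class problems and builds the one-vs-rest scores from them, so this setting does not change which classifiers are fitted.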
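According to the scikit-learn documentation, SVC always uses one-vs-one training internally, and decision_function_shape only affects the array returned by decision_function. The sketch below fits the same model with both settings and checks that the predictions and support vectors agree. Note that for the three Iris classes both shapes happen to have three columns, because 3 * (3 - 1) / 2 = 3.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

ovr = SVC(kernel='rbf', gamma=0.5, C=1, decision_function_shape='ovr')
ovo = SVC(kernel='rbf', gamma=0.5, C=1, decision_function_shape='ovo')
ovr.fit(X_train, y_train)
ovo.fit(X_train, y_train)

# Same fitted model: identical predictions and support vectors
same_predictions = np.array_equal(ovr.predict(X_test), ovo.predict(X_test))
same_support = np.array_equal(ovr.support_, ovo.support_)
print(same_predictions, same_support)

# Only the decision_function output is built differently
print(ovr.decision_function(X_test).shape)  # (30, 3): one column per class
print(ovo.decision_function(X_test).shape)  # (30, 3): one column per class pair
```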

Why is it important to find the best hyperparameters for a model?

Finding the best hyperparameters for a model is crucial because they significantly impact the model's performance. Properly tuned hyperparameters can improve the model's accuracy, generalization, and overall effectiveness.
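The experiment compares models directly on the test set. A common alternative, shown here only as a sketch and not as the method used above, is to select hyperparameters with cross-validation on the training data using GridSearchCV and to keep the test set for a final check. The choice of cv=5 is an assumption made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Same C values as the experiment, with gamma fixed at 0.5
param_grid = {'C': [0.01, 1, 10], 'gamma': [0.5]}

search = GridSearchCV(
    SVC(kernel='rbf', decision_function_shape='ovr'),
    param_grid,
    cv=5,  # 5-fold cross-validation on the training set (assumed value)
)
search.fit(X_train, y_train)

print(search.best_params_)            # hyperparameters with the best CV score
print(search.best_score_)             # mean cross-validated accuracy
print(search.score(X_test, y_test))   # accuracy of the refit model on the test set
```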

By addressing these questions, the student demonstrates an understanding of SVM classifiers, the importance of hyperparameter tuning, and the practical application of machine learning techniques on the Iris dataset.