explanation each code line by line experiment 3 - FarhaKousar1601/DATA-SCIENCE-AND-ITS-APPLICATION-LABORATORY-21AD62- GitHub Wiki

Experiment - 3

Aim

Train a regularized logistic regression classifier on the Iris dataset (available at UCI Machine Learning Repository or the inbuilt Iris dataset) using sklearn. Train the model with the hyperparameter C = 1e4 and report the best classification accuracy.

Code Explanation

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a pipeline with StandardScaler and LogisticRegression with regularization
pipeline = make_pipeline(StandardScaler(), LogisticRegression(C=1e4, max_iter=1000))

# Train the model
pipeline.fit(X_train, y_train)

# Calculate the accuracy on the testing set
accuracy = pipeline.score(X_test, y_test)
print("Classification accuracy:", accuracy)

Explanation of Each Line

Importing the Required Libraries

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
  • sklearn.datasets: load_iris is a function that loads the Iris dataset.
  • sklearn.model_selection: train_test_split is used to split the data into training and testing sets.
  • sklearn.linear_model: LogisticRegression is used to create the logistic regression model.
  • sklearn.preprocessing: StandardScaler is used to standardize the features by removing the mean and scaling to unit variance.
  • sklearn.pipeline: make_pipeline is used to create a pipeline that sequentially applies a list of transforms and a final estimator.

Loading the Iris Dataset

iris = load_iris()
X = iris.data
y = iris.target
  • load_iris() loads the Iris dataset.
  • X contains the feature data (sepal length, sepal width, petal length, petal width).
  • y contains the target labels (species of the Iris flower).
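
A quick inspection of the loaded object confirms these shapes and names. This is only a minimal sketch for checking the data; it is not part of the experiment's required code.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

print(X.shape)             # (150, 4): 150 flowers, 4 measurements each
print(iris.feature_names)  # names of the four feature columns
print(iris.target_names)   # species names behind the labels 0, 1, 2
print(np.bincount(y))      # [50 50 50]: samples per class
```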

Splitting the Data into Training and Testing Sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
  • train_test_split(X, y, test_size=0.2, random_state=42) splits the data into training and testing sets.
  • test_size=0.2 indicates 20% of the data will be used for testing.
  • random_state=42 ensures reproducibility of the split.
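
The sizes of the two subsets can be checked directly. The sketch below also shows an optional variant using stratify=y, which is an assumption added here for illustration and is not used in the experiment's code.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)

# Optional variant (not in the original code): stratify=y keeps the class
# proportions identical in the training and testing subsets.
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(np.bincount(y_test_strat))    # [10 10 10]
```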

Creating a Pipeline with StandardScaler and LogisticRegression

pipeline = make_pipeline(StandardScaler(), LogisticRegression(C=1e4, max_iter=1000))
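  • make_pipeline(StandardScaler(), LogisticRegression(C=1e4, max_iter=1000)) creates a pipeline that first standardizes the data using StandardScaler and then applies LogisticRegression. C is the inverse of the regularization strength, so C=1e4 applies only weak regularization (an L2 penalty by default). max_iter=1000 raises the solver's iteration limit from its default of 100 so that it has room to converge.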
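
The steps inside the pipeline can be inspected by name, which is handy when checking hyperparameters. A minimal sketch, with the variable names step_names and clf chosen here for illustration:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipeline = make_pipeline(StandardScaler(), LogisticRegression(C=1e4, max_iter=1000))

# make_pipeline names each step after its lowercased class name.
step_names = list(pipeline.named_steps)
print(step_names)  # ['standardscaler', 'logisticregression']

# Individual steps and their hyperparameters stay accessible by name.
clf = pipeline.named_steps["logisticregression"]
print(clf.C, clf.max_iter)
```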

Training the Model

pipeline.fit(X_train, y_train)
  • pipeline.fit(X_train, y_train) fits StandardScaler on the training data, transforms that data, and then trains the logistic regression model on the scaled features.

Calculating the Accuracy on the Testing Set

accuracy = pipeline.score(X_test, y_test)
print("Classification accuracy:", accuracy)
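  • pipeline.score(X_test, y_test) scales the testing set with the statistics learned from the training data and returns the accuracy, that is, the fraction of test samples classified correctly.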
  • print("Classification accuracy:", accuracy) prints the classification accuracy.
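
To make the meaning of the score concrete, the same number can be computed by hand from the predictions. This is a sketch; manual_accuracy and builtin_accuracy are names introduced here for the comparison.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipeline = make_pipeline(StandardScaler(), LogisticRegression(C=1e4, max_iter=1000))
pipeline.fit(X_train, y_train)

# Accuracy is the fraction of test samples whose predicted label is correct.
y_pred = pipeline.predict(X_test)
manual_accuracy = np.mean(y_pred == y_test)
builtin_accuracy = pipeline.score(X_test, y_test)
print(manual_accuracy, builtin_accuracy)
```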

Output

Classification accuracy: 1.0

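This indicates that the model classified every sample in the testing set correctly. The testing set holds only 30 samples (20% of 150), so this figure is a coarse estimate of how the model generalizes.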
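
Because a single split yields only one accuracy figure, cross-validation can give a steadier estimate. The sketch below assumes 5 folds, a choice made here for illustration; the experiment itself asks for a single train/test split.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
pipeline = make_pipeline(StandardScaler(), LogisticRegression(C=1e4, max_iter=1000))

# Each fold refits the whole pipeline, so the scaler never sees the
# held-out fold during fitting.
scores = cross_val_score(pipeline, X, y, cv=5)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```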

Viva Questions and Answers

What is logistic regression?

  • Logistic regression is a statistical method that models the probability of a categorical outcome as a function of one or more independent variables. In its basic form the outcome is dichotomous (only two possible outcomes); for the three Iris species, scikit-learn's LogisticRegression extends the model to the multiclass case.
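
Logistic regression outputs class probabilities rather than only labels. A minimal sketch using predict_proba on the same pipeline; the variable names proba and y_pred are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipeline = make_pipeline(StandardScaler(), LogisticRegression(C=1e4, max_iter=1000))
pipeline.fit(X_train, y_train)

# One row per test sample, one column per class; each row sums to 1.
proba = pipeline.predict_proba(X_test)
print(proba.shape)           # (30, 3)
print(proba[:3].round(3))    # probabilities for the first three test samples

# The predicted label is the class with the highest probability.
y_pred = pipeline.predict(X_test)
```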

What is the Iris dataset?

  • The Iris dataset is a classic dataset in the field of machine learning and statistics. It contains 150 samples of iris flowers, with four features (sepal length, sepal width, petal length, petal width) and three target classes (species of iris flowers).

What does the train_test_split function do?

  • train_test_split randomly splits the dataset into training and testing sets. The model is trained on one subset and evaluated on a separate, unseen subset, which gives an honest estimate of how well it generalizes and helps detect overfitting.
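
One way to see what the held-out set is for is to compare accuracy on the training data with accuracy on the testing data. This is a sketch; a training score far above the testing score would be a sign of overfitting.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipeline = make_pipeline(StandardScaler(), LogisticRegression(C=1e4, max_iter=1000))
pipeline.fit(X_train, y_train)

# Accuracy on data the model has seen versus data it has not seen.
train_accuracy = pipeline.score(X_train, y_train)
test_accuracy = pipeline.score(X_test, y_test)
print("Training accuracy:", train_accuracy)
print("Testing accuracy:", test_accuracy)
```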

Why do we use StandardScaler in the pipeline?

  • StandardScaler standardizes features by removing the mean and scaling to unit variance. Standardizing puts all features on a comparable scale, which helps the solver converge faster and lets the regularization penalty treat every coefficient evenly.
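
The effect of standardization can be verified numerically. In this sketch the scaler is used on its own, outside the pipeline, purely to show what it does to the training features.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Learn the per-feature mean and standard deviation from the training set.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)

print(X_train.mean(axis=0))                  # original feature means
print(X_train_scaled.mean(axis=0).round(6))  # approximately 0 per feature
print(X_train_scaled.std(axis=0).round(6))   # approximately 1 per feature

# The testing set is transformed with the training statistics, not its own.
X_test_scaled = scaler.transform(X_test)
```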

What does the parameter C in LogisticRegression represent?

  • The parameter C in LogisticRegression represents the inverse of regularization strength. A smaller value of C means stronger regularization. In this case, C=1e4 indicates weak regularization.
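
The inverse relationship can be observed by fitting the same pipeline with several values of C and comparing the size of the learned coefficients. The values 1e-2 and 1.0 are chosen here only for comparison; the experiment itself uses C=1e4.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

norms = {}
for C in (1e-2, 1.0, 1e4):
    model = make_pipeline(StandardScaler(), LogisticRegression(C=C, max_iter=1000))
    model.fit(X_train, y_train)
    coef = model.named_steps["logisticregression"].coef_
    # Stronger regularization (smaller C) shrinks the coefficients toward zero.
    norms[C] = np.linalg.norm(coef)
    print(f"C={C:g}  coefficient norm={norms[C]:.3f}  "
          f"test accuracy={model.score(X_test, y_test):.3f}")
```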

What is the purpose of setting random_state=42 in train_test_split?

  • Setting random_state=42 ensures reproducibility of the data split. It guarantees that the same training and testing sets are created each time the code is run.
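
Reproducibility is easy to confirm: two calls with the same seed return identical subsets. A minimal sketch, with the names first and second introduced here for illustration:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Two independent calls with the same seed.
first = train_test_split(X, y, test_size=0.2, random_state=42)
second = train_test_split(X, y, test_size=0.2, random_state=42)

# Every returned array (X_train, X_test, y_train, y_test) matches exactly.
same_split = all(np.array_equal(a, b) for a, b in zip(first, second))
print(same_split)  # True
```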

Why do we use make_pipeline?

  • make_pipeline chains a sequence of data transformations and a final estimator into a single object. The scaler is fitted on the training data only, and the same fitted transformation is then applied to the testing set, which avoids data leakage and keeps the code cleaner and more reliable.
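
For comparison, the steps a pipeline performs can be written out by hand. This sketch assumes the same split and hyperparameters as the experiment and checks that both routes give the same predictions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Manual route: fit the scaler on the training data only, then reuse it.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
clf = LogisticRegression(C=1e4, max_iter=1000).fit(X_train_scaled, y_train)
manual_pred = clf.predict(X_test_scaled)

# Pipeline route: the same steps handled by one object.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(C=1e4, max_iter=1000))
pipe_pred = pipeline.fit(X_train, y_train).predict(X_test)

print(np.array_equal(manual_pred, pipe_pred))
```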