explanation each code line by line experiment 3 - FarhaKousar1601/DATA-SCIENCE-AND-ITS-APPLICATION-LABORATORY-21AD62- GitHub Wiki
Experiment - 3
Aim
Train a regularized logistic regression classifier on the Iris dataset (available at UCI Machine Learning Repository or the inbuilt Iris dataset) using sklearn. Train the model with the hyperparameter C = 1e4 and report the best classification accuracy.
Code Explanation
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create a pipeline with StandardScaler and LogisticRegression with regularization
pipeline = make_pipeline(StandardScaler(), LogisticRegression(C=1e4, max_iter=1000))
# Train the model
pipeline.fit(X_train, y_train)
# Calculate the accuracy on the testing set
accuracy = pipeline.score(X_test, y_test)
print("Classification accuracy:", accuracy)
Explanation of Each Line
Importing the Required Libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
- `sklearn.datasets`: `load_iris` is a function that loads the Iris dataset.
- `sklearn.model_selection`: `train_test_split` is used to split the data into training and testing sets.
- `sklearn.linear_model`: `LogisticRegression` is used to create the logistic regression model.
- `sklearn.preprocessing`: `StandardScaler` is used to standardize the features by removing the mean and scaling to unit variance.
- `sklearn.pipeline`: `make_pipeline` is used to create a pipeline that sequentially applies a list of transforms and a final estimator.
Loading the Iris Dataset
iris = load_iris()
X = iris.data
y = iris.target
- `load_iris()` loads the Iris dataset.
- `X` contains the feature data (sepal length, sepal width, petal length, petal width).
- `y` contains the target labels (species of the Iris flower).
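To make the contents of `X` and `y` concrete, the loaded object can be inspected directly. This is a minimal sketch that assumes the inbuilt sklearn copy of the dataset; the `print` calls are only for illustration and are not part of the experiment code.

```python
from sklearn.datasets import load_iris

# Load the inbuilt Iris dataset (a Bunch object with data, target and metadata)
iris = load_iris()
X, y = iris.data, iris.target

# Shape of the feature matrix: (number of samples, number of features)
print(X.shape)
# Names of the four measured features
print(iris.feature_names)
# Names of the three species encoded as 0, 1, 2 in y
print(iris.target_names)
```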
Splitting the Data into Training and Testing Sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
- `train_test_split(X, y, test_size=0.2, random_state=42)` splits the data into training and testing sets.
- `test_size=0.2` indicates 20% of the data will be used for testing.
- `random_state=42` ensures reproducibility of the split.
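The effect of `test_size=0.2` can be checked by counting the samples in each subset. The sketch below also shows `stratify=y`, which is an optional argument not used in the experiment code; it is included here only as an assumption about how one might keep the three classes balanced in the test set.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Same split as in the experiment: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))

# Optional variant (not in the original code): keep class proportions equal
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Number of test samples per class in the stratified split
print(np.bincount(y_te_s))
```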
Creating a Pipeline with StandardScaler and LogisticRegression
pipeline = make_pipeline(StandardScaler(), LogisticRegression(C=1e4, max_iter=1000))
`make_pipeline(StandardScaler(), LogisticRegression(C=1e4, max_iter=1000))` creates a pipeline that first standardizes the data using `StandardScaler` and then applies `LogisticRegression` with a regularization parameter `C=1e4` and a maximum iteration limit of 1000.
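As a small illustration of what `make_pipeline` builds, the steps of the pipeline can be listed by name. `make_pipeline` names each step after its class in lowercase; the variable names `step_names` and `clf` below are just for this sketch.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipeline = make_pipeline(StandardScaler(), LogisticRegression(C=1e4, max_iter=1000))

# Step names are generated automatically from the class names
step_names = list(pipeline.named_steps)
print(step_names)

# The final estimator can be retrieved by name to check its hyperparameters
clf = pipeline.named_steps["logisticregression"]
print(clf.C, clf.max_iter)
```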
Training the Model
pipeline.fit(X_train, y_train)
`pipeline.fit(X_train, y_train)` fits the scaler on the training data, standardizes that data, and then trains the logistic regression model on the standardized features.
Calculating the Accuracy on the Testing Set
accuracy = pipeline.score(X_test, y_test)
print("Classification accuracy:", accuracy)
- `pipeline.score(X_test, y_test)` calculates the accuracy of the model on the testing set.
- `print("Classification accuracy:", accuracy)` prints the classification accuracy.
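For a classifier, `score` returns the fraction of correctly predicted labels. A minimal sketch of the same calculation done by hand, where `manual_accuracy` is a name introduced only for this example:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipeline = make_pipeline(StandardScaler(), LogisticRegression(C=1e4, max_iter=1000))
pipeline.fit(X_train, y_train)

# Predict labels, then compute the fraction that match the true labels
y_pred = pipeline.predict(X_test)
manual_accuracy = np.mean(y_pred == y_test)

# Both of these should agree with the manual calculation
print(manual_accuracy, accuracy_score(y_test, y_pred), pipeline.score(X_test, y_test))
```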
Output
Classification accuracy: 1.0
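This indicates the model classified every sample in the testing set correctly. The testing set holds only 30 of the 150 samples (20%), so the figure describes this particular split rather than guaranteeing perfect accuracy on new data.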
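Since a single split gives only one accuracy figure, one possible way to get a more stable estimate is k-fold cross-validation. This is a sketch of an optional extension, not part of the original experiment; the choice of `cv=5` is an assumption.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
pipeline = make_pipeline(StandardScaler(), LogisticRegression(C=1e4, max_iter=1000))

# Train and evaluate on 5 different train/test partitions of the full dataset
scores = cross_val_score(pipeline, X, y, cv=5)

# Per-fold accuracies, followed by their mean
print(scores)
print(scores.mean())
```

Passing the whole pipeline to `cross_val_score` means the scaler is refit on the training portion of every fold.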
Viva Questions and Answers
What is logistic regression?
- Logistic regression is a statistical method that models the probability of a categorical outcome from one or more independent variables. In its basic form the outcome is dichotomous (only two possible outcomes); sklearn's `LogisticRegression` also handles multiclass problems such as the three Iris species.
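To illustrate that the model outputs class probabilities rather than only labels, the sketch below calls `predict_proba` on the fitted pipeline from the experiment. The variable name `proba` is introduced only for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipeline = make_pipeline(StandardScaler(), LogisticRegression(C=1e4, max_iter=1000))
pipeline.fit(X_train, y_train)

# One row per test sample, one column per class; each row sums to 1
proba = pipeline.predict_proba(X_test)
print(proba.shape)

# The predicted label is the class with the highest probability
print(np.argmax(proba[:5], axis=1), pipeline.predict(X_test[:5]))
```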
What is the Iris dataset?
- The Iris dataset is a classic dataset in the field of machine learning and statistics. It contains 150 samples of iris flowers, with four features (sepal length, sepal width, petal length, petal width) and three target classes (species of iris flowers).
What does the `train_test_split` function do?
- `train_test_split` splits the dataset into training and testing sets. The model is trained on one subset and evaluated on a separate, held-out subset, so the reported accuracy reflects performance on unseen data and overfitting can be detected.
Why do we use `StandardScaler` in the pipeline?
- `StandardScaler` standardizes features by removing the mean and scaling to unit variance. Standardizing puts all features on a comparable scale, which improves the convergence speed of the solver and lets the regularization penalty treat each coefficient evenly.
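The effect of standardization can be verified numerically. In this sketch the scaler is fit on the full dataset purely for illustration (the experiment fits it on the training set inside the pipeline); `X_scaled` is a name introduced here.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Means and standard deviations of the raw features differ from column to column
print(X.mean(axis=0), X.std(axis=0))

# After scaling, every column has mean close to 0 and standard deviation close to 1
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```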
What does the parameter `C` in `LogisticRegression` represent?
- The parameter `C` in `LogisticRegression` represents the inverse of regularization strength. A smaller value of `C` means stronger regularization. In this case, `C=1e4` indicates weak regularization.
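One way to see the effect of `C` is to compare the size of the learned coefficients for several values. The values `0.01` and `1.0` below are chosen only for comparison and are assumptions of this sketch; the experiment itself uses `C=1e4`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

norms = {}
for C in (0.01, 1.0, 1e4):
    model = make_pipeline(StandardScaler(), LogisticRegression(C=C, max_iter=1000))
    model.fit(X_train, y_train)
    # Overall magnitude of the coefficient matrix for this value of C
    coef = model.named_steps["logisticregression"].coef_
    norms[C] = np.linalg.norm(coef)
    print(C, norms[C], model.score(X_test, y_test))
```

Stronger regularization (smaller `C`) shrinks the coefficients toward zero, so the norm grows as `C` increases.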
What is the purpose of setting `random_state=42` in `train_test_split`?
- Setting `random_state=42` ensures reproducibility of the data split. It guarantees that the same training and testing sets are created each time the code is run.
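The reproducibility claim can be checked by splitting twice with the same seed and once with a different one. The seed `0` and the names `same` and `differs` are assumptions introduced for this sketch.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Two splits with the same seed
a = train_test_split(X, y, test_size=0.2, random_state=42)
b = train_test_split(X, y, test_size=0.2, random_state=42)
# One split with a different seed, for contrast
c = train_test_split(X, y, test_size=0.2, random_state=0)

# Compare all four returned arrays (X_train, X_test, y_train, y_test)
same = all(np.array_equal(p, q) for p, q in zip(a, b))
# Compare only the test features of the two different seeds
differs = not np.array_equal(a[1], c[1])
print(same, differs)
```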
Why do we use `make_pipeline`?
- `make_pipeline` is used to create a sequence of data transformations and a final estimator. It ensures that the same transformations are applied to the training and testing sets, making the code cleaner and more reliable.
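To show what the pipeline saves, here is a sketch of the equivalent manual steps: the scaler is fit on the training set only and then reused on the testing set. The names `manual_pred` and `pipeline_pred` are introduced for this comparison.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Manual version: fit the scaler on training data, then transform both sets
scaler = StandardScaler().fit(X_train)
clf = LogisticRegression(C=1e4, max_iter=1000)
clf.fit(scaler.transform(X_train), y_train)
manual_pred = clf.predict(scaler.transform(X_test))

# Pipeline version: the same steps handled by fit and predict
pipeline = make_pipeline(StandardScaler(), LogisticRegression(C=1e4, max_iter=1000))
pipeline_pred = pipeline.fit(X_train, y_train).predict(X_test)

# Both approaches should produce identical predictions
print(np.array_equal(manual_pred, pipeline_pred))
```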