Lab 2 Wiki

Team Member 1:

Name: Kranthi Kumar Gangineni

Mail Id: [email protected]

Class Id: 7

Contribution: Completed all 4 tasks along with my team partner

Team Member 2:

Name: Venkata Bhavesh Reddy Polareddy

Mail Id: [email protected]

Class Id: 26

Contribution: Completed all 4 tasks along with my team partner

TASK1 : Linear Discriminant Analysis vs Logistic Regression


For this task, we picked the Digits dataset and analyzed it using a Linear Regression model, Logistic Regression, and Linear Discriminant Analysis.

Code Snippets:
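The snippets below assume the Digits data has already been loaded and split; a minimal setup sketch (the test_size and random_state values are assumptions, not the original parameters):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# Loading the Digits data and creating a train/test split
X, Y = load_digits(return_X_y=True)
Xtr, Xte, Ytr, Yte = train_test_split(X, Y, test_size=0.25, random_state=0)
```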

  1. Logistic Regression

Logistic Regression is trained on the train_test_split of the data, and its accuracy is checked on the held-out test set.

```python
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

def call_logistic_regression(Xtr, Xte, Ytr, Yte):
    # Creating Instance
    logRegression = LogisticRegression()
    # Fitting the Data
    logRegression.fit(Xtr, Ytr)
    Ypred = logRegression.predict(Xte)
    # Returning the Accuracy Score
    return metrics.accuracy_score(Yte, Ypred)
```

  2. Linear Discriminant Analysis

LDA is trained and evaluated on the same split so that its accuracy can be compared against Logistic Regression on the Digits dataset.

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
import matplotlib.pyplot as plot

def call_lda(Xtr, Xte, Ytr, Yte):
    lda = LinearDiscriminantAnalysis()
    lda.fit(Xtr, Ytr)
    Ypred = lda.predict(Xte)
    # Plotting predicted vs actual labels
    plot.title("LD Analysis Plot")
    plot.scatter(Yte, Ypred, color='blue', linewidths=3)
    plot.show()
    return metrics.accuracy_score(Yte, Ypred)
```

Plot for LDA:
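For reference, a sketch of how these two helpers might be invoked on the split created above (the print format here is an assumption, not the original driver code):

```python
# Comparing the two classifiers on the same train/test split
print("Logistic Regression Accuracy :", call_logistic_regression(Xtr, Xte, Ytr, Yte))
print("LDA Accuracy :", call_lda(Xtr, Xte, Ytr, Yte))
```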

  3. Linear Regression Model

For the Linear Regression model, we plotted the data and calculated the coefficients and mean squared error for the dataset, as sketched below.
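The wiki shows only the plot and final output for this step; a minimal sketch of what it describes, assuming the same Digits split as above:

```python
from sklearn.linear_model import LinearRegression
from sklearn import metrics
import matplotlib.pyplot as plot

# Fitting a plain Linear Regression model on the same split
linReg = LinearRegression()
linReg.fit(Xtr, Ytr)
Ypred = linReg.predict(Xte)

# Coefficients and Mean Squared Error for the dataset
print("Coefficients : \n", linReg.coef_)
print("Mean Squared Error :", metrics.mean_squared_error(Yte, Ypred))

# Plotting predicted vs actual values
plot.title("Linear Regression Plot")
plot.scatter(Yte, Ypred, color='blue', linewidths=3)
plot.show()
```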

Plot for Linear Regression Model:

Final Output:

Differences:


Linear Discriminant Analysis is easier to compute than Logistic Regression. Logistic Regression is more about estimating odds ratios for the variables, whereas LDA is used for splitting observations into categories. Logistic Regression accepts continuous as well as categorical predictors, whereas LDA accepts only continuous values.

When both models were applied to the Digits dataset, Logistic Regression achieved higher accuracy than LDA.

TASK2 : Support Vector Machine Classification


For this task, we chose the Iris dataset.

Code Snippet:
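The snippet below assumes the Iris data has already been loaded and split; a minimal setup sketch (the test_size and random_state values are assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Loading the Iris data and creating a train/test split
X, Y = load_iris(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=4)
```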

```python
from sklearn.svm import SVC
from sklearn import metrics
import matplotlib.pyplot as plot

""" Linear Kernel """
clf = SVC(kernel='linear', C=1.0, gamma=0.1).fit(X_train, Y_train)
y_pred = clf.predict(X_test)
print("Linear Kernel Accuracy :", metrics.accuracy_score(Y_test, y_pred))

# Changing random_state in train_test_split changes the measured accuracy

""" RBF Kernel """
clf1 = SVC(kernel='rbf', C=1.0, gamma=0.2).fit(X_train, Y_train)
y_pred = clf1.predict(X_test)
print("RBF kernel Accuracy : ", metrics.accuracy_score(Y_test, y_pred))

# Plotting the first two features, colored by target class
plot.scatter(X[:, 0], X[:, 1], c=Y)
plot.show()
```

Output Plot for the Data and Target:

We compared the Linear and RBF kernels to check which gives higher accuracy and which parameters affect it. The output accuracies are shown below:

Which is Better?


Based on the results, the Linear kernel achieved higher accuracy than the RBF kernel on the Iris dataset. However, the accuracy depends on parameters such as random_state and the gamma parameter.

With a change in the random_state or gamma parameters, we can achieve higher accuracy, and in some cases the RBF kernel gives higher accuracy than the Linear one on the same Iris dataset.

In general, the Linear kernel works well when the number of features is comparatively large, and the RBF kernel works better when the number of features is comparatively small.

TASK3 : Summarize Text File


For this task, we performed lemmatization, extracted bigrams (top 5 by frequency), read sentences from a text file, etc.

Text File we read:

Code Snippet:
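The snippets below assume the file contents have already been read into fileData; a minimal reading sketch ('input.txt' is a placeholder name, not the original file):

```python
from nltk.tokenize import sent_tokenize

# Reading the text file into a single string ('input.txt' is a placeholder)
with open('input.txt', 'r') as f:
    fileData = f.read()

# Sentence reading - splitting the text into sentences
sentences = sent_tokenize(fileData)
print("Sentences : \n", sentences)
```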

Lemmatization:

```python
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Word Tokenization - To get each Word from the Text
tokens = word_tokenize(fileData)

# Applying Lemmatization on the Words
lemmatizer = WordNetLemmatizer()
lemmatizerOutput = []
print("Lemmatized Output : \n")
for tok in tokens:
    # Iterating and Lemmatizing each word and Appending it to a list
    lemmatizerOutput.append(lemmatizer.lemmatize(str(tok)))
print(lemmatizerOutput)
```

BiGrams:

print("Bigrams :\n") bigramOutput = [] for big in ngrams(tokens, 2): # Fetching Bigrams using 'ngrams' method and Iterating it bigramOutput.append(big) print(bigramOutput)

# BiGram- Word Frequency # Using bigramOutput fetch the WordFreq Details wordFreq = FreqDist(bigramOutput) # Getting Most Common Words and Printing them - Will get the Counts from top to least mostCommon = wordFreq.most_common() print("BiGrams Frequency (From Top to Least) : \n", mostCommon) # Fetching the Top 5 Bigrams top5 = wordFreq.most_common(5) print("Top 5 BiGrams : \n", top5)

Output:

TASK4 : K Nearest Neighbour


For this task, we calculated the accuracy score as 'K' (n_neighbors) changes.

We added all the required imports, created a train_test_split, fitted the model, and calculated accuracy scores for different values of 'K'.

Code Snippet:
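The original snippet appears here as a screenshot; a minimal sketch of the described approach (the dataset choice and the range of K values are illustrative assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

# Loading data and creating a train/test split (dataset choice is an assumption)
X, Y = load_iris(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)

# Calculating the accuracy score for each value of 'K'
for k in range(1, 11):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, Y_train)
    y_pred = knn.predict(X_test)
    print("K =", k, "Accuracy :", metrics.accuracy_score(Y_test, y_pred))
```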

Output:

Analysis:

As 'K' (n_neighbors) increases, the decision boundary becomes smoother, which leads to low variance (at the cost of higher bias). But if K is low, say 1, the model overfits the training data, which leads to high variance.

The K value has an impact on performance on the test data, which is why the accuracy varies as K changes.