Lab 2 Wiki - KranthiKumarGangineni/Python GitHub Wiki
Team Member 1:
Name: Kranthi Kumar Gangineni
Mail Id: [email protected]
Class Id : 7
Contribution: Completed all 4 tasks together with the team partner
Team Member 2:
Name : Venkata Bhavesh Reddy Polareddy
Mail Id: [email protected]
Class Id : 26
Contribution: Completed all 4 tasks together with the team partner
TASK1 : Linear Discriminant Analysis vs Logistic Regression
For this Task, we picked the Digits dataset and analyzed it using a Linear Regression model, Logistic Regression, and Linear Discriminant Analysis.
Code Snippets:
- Logistic Regression
Logistic Regression accuracy is computed on a train_test_split of the data: the model is fitted on the training split and scored on the test split.
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

def call_logistic_regression(Xtr, Xte, Ytr, Yte):
    # Creating an instance of the classifier
    logRegression = LogisticRegression()
    # Fitting the model on the training data
    logRegression.fit(Xtr, Ytr)
    Ypred = logRegression.predict(Xte)
    # Returning the accuracy score on the test data
    return metrics.accuracy_score(Yte, Ypred)
- Linear Discriminant Analysis
LDA is evaluated on the same train/test split, so its accuracy can be compared directly with Logistic Regression on the Digits dataset.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
import matplotlib.pyplot as plot

def call_lda(Xtr, Xte, Ytr, Yte):
    lda = LinearDiscriminantAnalysis()
    lda.fit(Xtr, Ytr)
    Ypred = lda.predict(Xte)
    # Scatter of true vs predicted labels on the test split
    plot.title("LD Analysis Plot")
    plot.scatter(Yte, Ypred, color='blue', linewidths=3)
    plot.show()
    return metrics.accuracy_score(Yte, Ypred)
Plot for LDA:
- Linear Regression Model
For the Linear Regression model, we plotted the data and computed the coefficients and the Mean Squared Error for the dataset.
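The linear-regression step described above can be sketched as follows. This is a minimal illustration, not the original snippet: the split parameters (test_size, random_state) are assumptions chosen for the example.

```python
# Sketch of the linear-regression step: fit on the training split,
# then report the coefficients and the Mean Squared Error on the test split.
from sklearn.datasets import load_digits
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

digits = load_digits()
# Assumed split parameters (not taken from the original code)
Xtr, Xte, Ytr, Yte = train_test_split(
    digits.data, digits.target, test_size=0.3, random_state=0)

linReg = LinearRegression()
linReg.fit(Xtr, Ytr)
Ypred = linReg.predict(Xte)

print("Coefficients :", linReg.coef_)
print("Mean Squared Error :", metrics.mean_squared_error(Yte, Ypred))
```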
Plot for Linear Regression Model:
Final Output:
Differences:
Linear Discriminant Analysis is cheaper to compute than Logistic Regression. Logistic Regression models the log-odds of class membership, whereas LDA separates the samples into categories by modelling the class densities. Logistic Regression accepts continuous as well as categorical predictor variables, whereas LDA assumes continuous predictors.
When both models are applied to the Digits dataset, Logistic Regression achieves higher accuracy than LDA.
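The accuracy comparison can be reproduced with a quick side-by-side run of the two classifiers on the same split. This is a hedged sketch: the split parameters and max_iter value are assumptions, not taken from the original code.

```python
# Side-by-side accuracy of Logistic Regression and LDA on one split.
from sklearn.datasets import load_digits
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

digits = load_digits()
# Assumed split parameters for illustration
Xtr, Xte, Ytr, Yte = train_test_split(
    digits.data, digits.target, test_size=0.3, random_state=0)

# max_iter raised so the solver converges on this dataset (an assumption)
logAcc = metrics.accuracy_score(
    Yte, LogisticRegression(max_iter=5000).fit(Xtr, Ytr).predict(Xte))
ldaAcc = metrics.accuracy_score(
    Yte, LinearDiscriminantAnalysis().fit(Xtr, Ytr).predict(Xte))

print("Logistic Regression :", logAcc)
print("LDA :", ldaAcc)
```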
TASK2 : Support Vector Machine Classification
For this Task, we chose the IRIS dataset:
Code Snippet:
clf = SVC(kernel='linear', C=1.0, gamma=0.1).fit(X_train, Y_train)
y_pred = clf.predict(X_test)
print("Linear Kernel Accuracy :", metrics.accuracy_score(Y_test, y_pred))
# Changing random_state of the split changes the reported accuracy
"""
RBF Kernel
"""
clf1 = SVC(kernel='rbf', C=1.0, gamma=0.2).fit(X_train, Y_train)
y_pred = clf1.predict(X_test)
print("RBF kernel Accuracy : ", metrics.accuracy_score(Y_test, y_pred))
# Plotting the first two features, coloured by target class
plot.scatter(X[:, 0], X[:, 1], c=Y)
plot.show()
Output Plot for the Data and Target:
We compared the Linear and RBF kernels to see which gives higher accuracy and which parameters affect it. The output accuracies are shown below:
Which is Better ?
Based on these results, the Linear kernel gives higher accuracy than the RBF kernel on the Iris dataset. However, the accuracy depends on parameters such as random_state and gamma.
With different random_state or gamma values the accuracy changes, and in some cases the RBF kernel outperforms the Linear one on the same Iris data.
In general, a Linear kernel tends to work well when the number of features is large, while the RBF kernel tends to do better when the number of features is comparatively small.
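The effect of random_state and gamma mentioned above can be seen with a small parameter sweep. This is an illustrative sketch only; the particular random_state and gamma values are arbitrary example choices.

```python
# Sweep over random_state (split) and gamma (RBF width) to show how
# these parameters move the reported accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn import metrics

iris = load_iris()
for rs in (0, 42):
    # Different random_state -> different train/test split
    X_train, X_test, Y_train, Y_test = train_test_split(
        iris.data, iris.target, test_size=0.3, random_state=rs)
    for gamma in (0.1, 0.2, 1.0):
        clf = SVC(kernel='rbf', C=1.0, gamma=gamma).fit(X_train, Y_train)
        acc = metrics.accuracy_score(Y_test, clf.predict(X_test))
        print("random_state =", rs, "gamma =", gamma, "accuracy =", acc)
```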
TASK3 : Summarize Text File
For this task, we read sentences from a text file and applied lemmatization and bigram extraction (top 5 bigrams by frequency).
Text File we read:
Code Snippet:
Lemmatization:
# Word Tokenization - to get each word from the text
tokens = word_tokenize(fileData)
# Applying lemmatization on the words
lemmatizer = WordNetLemmatizer()
lemmatizerOutput = []
print("Lemmatized Output : \n")
for tok in tokens:
    # Iterating, lemmatizing each word, and appending it to a list
    lemmatizerOutput.append(lemmatizer.lemmatize(str(tok)))
print(lemmatizerOutput)
BiGrams:
print("Bigrams :\n")
bigramOutput = []
for big in ngrams(tokens, 2):
    # Fetching bigrams using the 'ngrams' method and collecting them
    bigramOutput.append(big)
print(bigramOutput)
# Bigram word frequency: count each bigram in bigramOutput
wordFreq = FreqDist(bigramOutput)
# most_common() returns the bigrams sorted from most to least frequent
mostCommon = wordFreq.most_common()
print("BiGrams Frequency (From Top to Least) : \n", mostCommon)
# Fetching the top 5 bigrams
top5 = wordFreq.most_common(5)
print("Top 5 BiGrams : \n", top5)
Output:
Task 4: K Nearest Neighbour
For this Task, we calculated the accuracy score as 'K' (n_neighbors) changes.
We imported the required modules, created a train_test_split, fitted the model on the training data, and computed the accuracy score for each value of 'K'.
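The steps above can be sketched as follows. This is a minimal illustration of the K sweep, assuming the Iris dataset and an arbitrary split; the dataset and split parameters are assumptions, since the original snippet is not shown.

```python
# Sketch: fit a KNN classifier for several values of K and report
# the test accuracy for each.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

iris = load_iris()
# Assumed dataset and split parameters for illustration
Xtr, Xte, Ytr, Yte = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

for k in (1, 3, 5, 7, 9):
    knn = KNeighborsClassifier(n_neighbors=k).fit(Xtr, Ytr)
    acc = metrics.accuracy_score(Yte, knn.predict(Xte))
    print("K =", k, "Accuracy :", acc)
```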
Code Snippet:
Output:
Analysis:
As 'K' (n_neighbors) increases, the decision boundary becomes smoother, which leads to low variance (at the cost of higher bias). If K is low, say 1, the model overfits the training data, which leads to high variance.
Because the K value changes how each test point is classified, the accuracy varies as K changes.