Lab Assignment 2 - VineethaG/Python_CS5590 GitHub Wiki
Team ID: 12
Team Member 1: Vineetha Gummadi, Class ID: 10
Team Member 2: Amulya Kasaraneni, Class ID: 14
- To classify data using Linear Discriminant Analysis (LDA) and to differentiate Logistic Regression from Linear Discriminant Analysis
- To implement SVM classification with the linear kernel and the RBF kernel
- To implement lemmatization and bigram extraction on text using NLTK
- To apply the K nearest neighbor (KNN) algorithm and analyze how the accuracy changes with the value of K
I used the iris dataset available in the scikit-learn library. Iris has 3 labels (classes), 'setosa', 'versicolor' and 'virginica', and four features: Sepal Length, Sepal Width, Petal Length and Petal Width. I partitioned the data into 60% training data and 40% test data, fitted a Linear Discriminant Analysis model on the training data, and predicted the test labels with the LDA classifier. I then calculated the accuracy of the model on the test data.
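The steps above can be sketched as follows. This is a minimal illustration rather than the exact lab code: the `random_state=0` seed and the stratified split are assumptions added here so that the run is reproducible.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load the iris data: 150 samples, 4 features, 3 classes.
iris = load_iris()

# 60% train / 40% test split. random_state and stratify are assumptions
# made for reproducibility; the write-up does not specify them.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0, stratify=iris.target
)

# Fit LDA on the training data and predict the test labels.
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)
y_pred = lda.predict(X_test)

# Fraction of test samples that were classified correctly.
lda_accuracy = accuracy_score(y_test, y_pred)
print(f"LDA test accuracy: {lda_accuracy:.4f}")
```

The exact accuracy depends on how the data happens to be split, so a different seed can give a slightly different number.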
Both Logistic Regression and Linear Discriminant Analysis are linear classification techniques, i.e. models that draw linear boundaries between the groups. Logistic regression in its basic form is a two-class classification algorithm, although multinomial and one-vs-rest extensions allow it to handle more classes. It is often used when we are less interested in the categorization itself than in the odds ratios associated with each variable. If you have more than two classes, Linear Discriminant Analysis is often the preferred linear classification technique because it handles multiple classes natively. The iris dataset has 3 classes.
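To make the comparison concrete, the sketch below scores both models on iris with 5-fold cross-validation. It is an illustration added here, not part of the original lab: the use of `cross_val_score`, the fold count, and `max_iter=1000` are assumptions. Note that scikit-learn's `LogisticRegression` accepts the three iris classes directly through a multi-class extension of the binary model.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Two linear classifiers to compare. max_iter is raised (an assumption)
# so the logistic regression solver converges on unscaled features.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "lda": LinearDiscriminantAnalysis(),
}

# Mean accuracy over 5 stratified folds for each model.
mean_accuracy = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    mean_accuracy[name] = scores.mean()
    print(f"{name}: mean CV accuracy = {mean_accuracy[name]:.4f}")
```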
The iris dataset from scikit-learn is used again. The data is partitioned into 80% training data and 20% test data. Two SVM classifiers are trained, one with the linear kernel and one with the RBF kernel, and each trained model is then used to predict the test labels. The accuracy of each model is calculated.
Accuracy for the linear kernel is 100%, whereas for the RBF kernel it is 96.66%. The linear kernel therefore gives the higher accuracy on this split.
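The linear kernel does well here because the iris classes are close to linearly separable. A linear kernel SVM is a parametric model, whereas an RBF kernel SVM is not. Accuracy depends on the type of dataset and the size of the data. I also analyzed the accuracy on the digits dataset: with the linear kernel it is 99.16%, whereas with the RBF kernel it is 43.61%. The low RBF figure likely reflects an untuned gamma parameter on unscaled pixel values rather than a limitation of the kernel itself.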
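A minimal sketch of the iris experiment described above, assuming default `SVC` hyperparameters and a fixed `random_state=0` (neither is stated in the write-up), could look like this. The accuracies it prints will not necessarily match the figures reported above, since they depend on the split and on the library version's defaults.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# 80% train / 20% test. The seed and stratification are assumptions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Train one SVM per kernel and record its test accuracy.
svm_accuracy = {}
for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel)
    clf.fit(X_train, y_train)
    svm_accuracy[kernel] = accuracy_score(y_test, clf.predict(X_test))
    print(f"{kernel} kernel accuracy: {svm_accuracy[kernel]:.4f}")
```

For data such as the digits images, the RBF result is sensitive to the `gamma` parameter and to feature scaling, so it is worth trying a few `gamma` values before comparing kernels.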
I imported the required resources from NLTK to perform the NLP operations and read the input text from the file ‘input.txt’. For lemmatization, the text is tokenized into words and the lemmatizer is applied to each word. Bigrams are then generated from the input text, and their frequencies are calculated using the Counter() function. The top five bigrams are retrieved using most_common(). To retrieve the sentences containing the most repeated bigrams, each sentence produced by the sentence tokenizer is checked for words from those bigrams, and the matching sentences are concatenated.
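The pipeline above might be sketched as below. Several details are assumptions made for illustration: the helper names (`tokenize`, `lemmatize`, `top_bigrams`, `sentences_with_bigrams`) are hypothetical, a `RegexpTokenizer` stands in for the word tokenizer so that no NLTK data download is needed for tokenizing, and sentences are matched on whole bigrams rather than on individual words.

```python
from collections import Counter

from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
from nltk.util import bigrams

# Regex-based word tokenizer (an assumption): keeps alphabetic words only.
_word_tokenizer = RegexpTokenizer(r"[A-Za-z']+")


def tokenize(text):
    """Split text into lowercase word tokens."""
    return [w.lower() for w in _word_tokenizer.tokenize(text)]


def lemmatize(tokens):
    """Lemmatize each token. Needs the WordNet corpus: nltk.download('wordnet')."""
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(t) for t in tokens]


def top_bigrams(tokens, n=5):
    """Return the n most frequent bigrams as ((w1, w2), count) pairs."""
    return Counter(bigrams(tokens)).most_common(n)


def sentences_with_bigrams(sentences, frequent):
    """Concatenate the sentences that contain at least one frequent bigram."""
    wanted = {bg for bg, _ in frequent}
    kept = [s for s in sentences if wanted & set(bigrams(tokenize(s)))]
    return " ".join(kept)
```

A typical use would read ‘input.txt’, split it into sentences with `nltk.sent_tokenize` (which needs the Punkt tokenizer data), call `top_bigrams(tokenize(text))`, and pass the result together with the sentence list to `sentences_with_bigrams`.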
KNN is a supervised learning algorithm whose accuracy depends on the value of K. A small value of K gives the most flexible fit (low bias but high variance), and the decision boundary becomes smoother as K increases. Below is the accuracy curve for K in the range 1-50. From the graph, the optimal K value lies in the range 5-25. The best K can therefore be found by examining the validation error curve.
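One way to produce such a curve is sketched below. The dataset (iris) and the use of 5-fold cross-validation are assumptions, since the write-up does not say which data or validation scheme was used for the KNN experiment.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Assumption: iris is used here only as a stand-in dataset.
X, y = load_iris(return_X_y=True)

k_values = list(range(1, 51))
cv_accuracy = []

# Mean 5-fold cross-validated accuracy for each K from 1 to 50.
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    cv_accuracy.append(cross_val_score(knn, X, y, cv=5).mean())

# Smallest K that reaches the highest mean accuracy.
best_k = k_values[cv_accuracy.index(max(cv_accuracy))]
print(f"best K = {best_k}, accuracy = {max(cv_accuracy):.4f}")
```

Plotting `cv_accuracy` against `k_values` (for example with matplotlib) gives the accuracy curve, and plotting `1 - accuracy` gives the corresponding validation error curve.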
https://www.kdnuggets.com/2016/06/select-support-vector-machine-kernels.html
http://scikit-learn.org/stable/auto_examples/svm/plot_iris.html