LAB_1 - ntihindukkipati/CS5590_Python_DL GitHub Wiki

CS5590 APS -Python Programming

LAB1

Dukkipati, Sri Sai Nithin Chowdary – 4
Inakollu, Sri Naga Bhuvaneswari - 9
Kolluri, Nikhitha – 12

INTRODUCTION:

In this lab we go through some basics of Python, use the BeautifulSoup package for web scraping, plot patterns and classify data with several classification algorithms, apply tokenization, lemmatization and trigram extraction to a text file, and build a multiple regression model that is evaluated using RMSE and R2. Every dataset is cleaned before it is used.

OBJECTIVES:

Some objectives of this lab are:
• Basics of Python, such as performing operations on subsets, and concatenating two dictionaries and operating on the result.
• Airline Booking Reservation System which takes the customer's details and preferences and then displays the available airlines, from which the customer can select the source, destination, type of airline, class (business or economy) and number of bags to check in. The system should display the fare for the ticket and print the ticket. The system should also display all the details of the employee who is booking the ticket.
• Use the BeautifulSoup package to get all the course names and their descriptions from the school catalogue site.
• Take a dataset which has both numeric and non-numeric data, perform data analysis on it and plot patterns. Before plotting patterns, remove all the null values from the dataset, remove all the features which are unrelated to the target class and encode all the categorical features. Report the classification of the dataset using classification algorithms such as Naïve Bayes, SVM and KNN.
• Choose any dataset, perform K-means on it and then visualize the data using matplotlib or seaborn. Select the best K by applying the elbow method and evaluate the silhouette score. Before evaluating the score, clean the dataset with EDA.
• Take an input file which has a few sentences and apply tokenization and lemmatization, find the trigrams and the most repeated trigrams, extract a few sentences, concatenate them and print the result.
• Choose a dataset of your choice and create a multiple regression model. Use RMSE and R2 to evaluate the model and display the improved results after EDA.

APPROACHES/METHODS:

1. Return all possible subsets for a given collection of integers without including the null subset.
 a. First we accept the number of integers from the user.
 b. Then we ask the user to input the integers one at a time in a for loop and append them to a list.
 c. Using itertools.combinations we generate the subsets of a given size.
 d. Applying set to these combinations eliminates the duplicates.
 e. We loop over every size from 1 to the length of the list to get all the subsets of the given set.
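
The steps above can be sketched as follows. This is a minimal illustration, not the lab's exact code: the function name `non_empty_subsets` is hypothetical, the input is hard-coded instead of read from the user, and duplicates are removed from the input list before calling `itertools.combinations` rather than from the generated combinations.

```python
from itertools import combinations


def non_empty_subsets(numbers):
    """Return every non-empty subset of `numbers` as a list of tuples, smallest first."""
    unique = sorted(set(numbers))  # assumption: duplicates are dropped from the input
    subsets = []
    # Start at size 1 so the empty subset is never produced.
    for size in range(1, len(unique) + 1):
        subsets.extend(combinations(unique, size))
    return subsets


print(non_empty_subsets([1, 2, 2]))
```

A collection of n distinct integers yields 2^n - 1 subsets, so three distinct values give seven.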

2. Concatenate two dictionaries and sort them based on the values.
 a. First declare and define two dictionaries.
 b. Merge the dictionaries using the update method.
 c. Sort the merged dictionary using the sorted function with a lambda function as the key, so that the items are ordered by value.
 d. We use a for loop to print all the key value pairs in sorted order.
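
A short sketch of these steps, assuming two small example dictionaries (the keys and values are made up for illustration):

```python
first = {"apple": 3, "pear": 1}
second = {"kiwi": 2, "plum": 5}

merged = dict(first)   # copy, so that `first` itself is left unchanged
merged.update(second)  # on duplicate keys, the value from `second` wins

# sorted() returns a list of (key, value) pairs ordered by the value field.
by_value = sorted(merged.items(), key=lambda item: item[1])

for key, value in by_value:
    print(key, value)
```

Note that `sorted` returns a list of pairs rather than a dictionary; wrapping it in `dict(...)` would give a dictionary that keeps this order.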


3. Airline Booking Reservation System
 a. The Flight class takes inputs from the user, such as source, destination, airline and class, and generates the flight number. It also has a method that displays these details.
 b. The Employee class displays all the details of the employee. It uses inheritance and overrides the printing method of the Flight class.
 c. The Passenger class takes all the details of the passenger.
 d. The Baggage class takes the number of check-in bags from the passenger and calculates the baggage fare.
 e. The TicketCost class takes the inputs from all the other classes and calculates the fare of the ticket. A method in TicketCost displays all the ticket details.
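
The class layout described above might look roughly like this. It is only a sketch: the attribute names, the flight-number scheme, and the fare values (`BASE_FARE`, `FEE_PER_BAG`) are hypothetical, and the values are hard-coded instead of being read from the user.

```python
import random


class Flight:
    def __init__(self, source, destination, airline, travel_class):
        self.source = source
        self.destination = destination
        self.airline = airline
        self.travel_class = travel_class
        # Hypothetical scheme: two-letter airline prefix plus a random number.
        self.flight_number = f"{airline[:2].upper()}{random.randint(100, 999)}"

    def display(self):
        return (f"{self.flight_number}: {self.source} -> {self.destination} "
                f"({self.airline}, {self.travel_class})")


class Employee(Flight):
    """Inherits from Flight and overrides display(), as described in step b."""

    def __init__(self, name, source, destination, airline, travel_class):
        super().__init__(source, destination, airline, travel_class)
        self.name = name

    def display(self):
        return f"Booked by {self.name} | {super().display()}"


class Passenger:
    def __init__(self, name, age):
        self.name = name
        self.age = age


class Baggage:
    FEE_PER_BAG = 25  # hypothetical fee

    def __init__(self, bags):
        self.bags = bags

    def fare(self):
        return self.bags * self.FEE_PER_BAG


class TicketCost:
    BASE_FARE = {"economy": 200, "business": 500}  # hypothetical fares

    def __init__(self, flight, passenger, baggage):
        self.flight = flight
        self.passenger = passenger
        self.baggage = baggage

    def total(self):
        return self.BASE_FARE[self.flight.travel_class] + self.baggage.fare()

    def display(self):
        return f"{self.passenger.name} | {self.flight.display()} | fare: {self.total()}"


agent = Employee("Sam", "MCI", "ORD", "Delta", "economy")
ticket = TicketCost(agent, Passenger("Ana", 30), Baggage(2))
print(ticket.display())
```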


4. Web scraping using the BeautifulSoup package
 a. First we get the HTML of the given URL for use with the BS4 package.
 b. Then we convert the HTML response into plain text.
 c. We find all the div tags with class='courseblock'.
 d. We then run a loop over these tags to get all the course names and course descriptions.
 e. We extract only the text, without the tags, using the text option.
 f. Finally we display the result.
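
The parsing steps can be illustrated on a small inline HTML snippet instead of the real catalogue page. Only the `courseblock` class comes from the lab; the inner class names `courseblocktitle` and `courseblockdesc` and the sample courses are assumptions made for this sketch, and the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Stand-in for the downloaded page; the lab fetches the catalogue URL instead.
html = """
<div class="courseblock">
  <p class="courseblocktitle">CS 101 Introduction to Programming</p>
  <p class="courseblockdesc">Covers variables, loops and functions.</p>
</div>
<div class="courseblock">
  <p class="courseblocktitle">CS 201 Data Structures</p>
  <p class="courseblockdesc">Covers lists, trees and graphs.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

courses = []
for block in soup.find_all("div", class_="courseblock"):
    title = block.find("p", class_="courseblocktitle")  # assumed class name
    desc = block.find("p", class_="courseblockdesc")    # assumed class name
    if title and desc:
        # get_text() returns the text content with the tags stripped.
        courses.append((title.get_text(strip=True), desc.get_text(strip=True)))

for name, description in courses:
    print(name, "-", description)
```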

5. SVM, KNN, Naïve Bayes classifier
5a.
 a. Took the glass_type.csv file, which consists of both numeric and non-numeric data
 b. Used a label encoder to convert the non-numeric data to numeric data
 c. Used fillna to fill all the null values with the column mean (which adds some noise)
 d. Used the split function to split the data into training and test data

Screenshot (507) Screenshot (508)

5b.
 e. Created instances of the svm, GaussianNB and KNeighborsClassifier classes
 f. Fitted each model with the training data
 g. Computed the accuracy of each model on the test data with the score function
Screenshot (509)

KNN with k=3 gave the highest accuracy of the three classifiers.
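
The workflow in 5a and 5b can be sketched with scikit-learn. Because glass_type.csv is not reproduced here, the sketch runs on a small synthetic table; the column names, the injected nulls, the 70/30 split and the random seeds are all assumptions, so the printed accuracies say nothing about the lab's results.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC

# Synthetic stand-in for the real dataset (hypothetical columns).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "feature_a": rng.normal(size=n),
    "feature_b": rng.normal(size=n),
    "colour": rng.choice(["clear", "tinted"], size=n),
})
df["target"] = (df["feature_a"] + df["feature_b"] > 0).astype(int)
df.loc[df.index[::25], "feature_a"] = np.nan  # inject a few null values

# Encode the categorical column, then fill nulls with the column mean.
df["colour"] = LabelEncoder().fit_transform(df["colour"])
df = df.fillna(df.mean())

X = df.drop(columns="target")
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

models = {
    "SVM": SVC(),
    "Naive Bayes": GaussianNB(),
    "KNN (k=3)": KNeighborsClassifier(n_neighbors=3),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # mean accuracy on the test data
    print(f"{name}: {scores[name]:.2f}")
```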

6. K-means
6a.

 a. Used the cc.csv data set
 b. Visualized the data as a bar graph with the seaborn library
 c. Extracted the columns that are required
 d. Cleaned the data, using fillna to fill all the null values (which adds some noise)
 e. Used StandardScaler to standardize the features (zero mean, unit variance)
 f. Used the DataFrame constructor to convert the scaled data back to a dataframe
 g. Ran KMeans clustering over a range of k values to find the best k

Screenshot (510) Screenshot (511) Screenshot (512)

6b.
 h. Plotted an elbow graph to visualize where the best k lies
 i. Fitted a KMeans model with the best k value (k=3)
 j. Finally calculated the silhouette score
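
A sketch of the scaling, elbow and silhouette steps with scikit-learn follows. It uses synthetic blobs from `make_blobs` in place of cc.csv, and the feature names, range of k and seeds are assumptions; `best_k = 3` simply mirrors the value reported above.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real dataset (hypothetical feature names).
features, _ = make_blobs(n_samples=300, centers=3, n_features=4, random_state=0)
df = pd.DataFrame(features, columns=["f1", "f2", "f3", "f4"])
df = df.fillna(df.mean())  # no-op here, shown to mirror the cleaning step

# Standardize, then wrap the array back into a DataFrame.
scaled = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)

# Elbow method: record the within-cluster sum of squares for each k.
k_values = list(range(1, 8))
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(scaled)
    inertias.append(model.inertia_)
# With matplotlib, plt.plot(k_values, inertias) would draw the elbow graph.

best_k = 3  # read off the elbow; matches the value used in the lab
labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(scaled)
score = silhouette_score(scaled, labels)
print(f"silhouette score for k={best_k}: {score:.2f}")
```

The silhouette score ranges from -1 to 1, with higher values indicating better separated clusters.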

Screenshot (513)

7. Tokenization, lemmatization and trigrams
7a,b.

 a. Read the given file nlp_input.txt with encoding UTF-8
 b. Used nltk library
 c. Used word_tokenize function for tokenizing each word
 d. Used lemmatize function to apply lemmatization on each word

Screenshot (514) Screenshot (515)

7c.
 e. Used the ngrams function with n=3 to get all the trigrams
 f. Used the FreqDist() class to find the frequency of every trigram
Screenshot (516)

7d.
 g. Used the most_common(10) function to extract the top 10 trigrams based on their frequency
Screenshot (517)

7e,f,g,h.
 h. Used sent_tokenize to split the text into sentences
 i. Iterated over the sentences and the trigrams to find the sentences that contain the most frequent trigrams, then concatenated and printed them
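
The trigram steps can be sketched with `nltk.util.ngrams` and `nltk.FreqDist`. To keep the example free of NLTK data downloads, it splits sentences and words with plain string methods as a stand-in for `sent_tokenize` and `word_tokenize`, and it skips lemmatization; the sample text is made up and is not nlp_input.txt.

```python
from nltk import FreqDist
from nltk.util import ngrams

# Made-up sample text standing in for the lab's input file.
text = "the cat sat on the mat. the cat sat on the rug. a dog ran in the park."

# Stand-ins for sent_tokenize / word_tokenize (assumption: simple splitting).
sentences = [s.strip() for s in text.split(".") if s.strip()]
tokens = [word for sentence in sentences for word in sentence.split()]

# Count every trigram (n=3) and look at the most frequent ones.
trigram_counts = FreqDist(ngrams(tokens, 3))
top_ten = trigram_counts.most_common(10)
top_trigram, top_count = top_ten[0]

# Keep the sentences that contain the most frequent trigram and join them.
matching = [s for s in sentences if " ".join(top_trigram) in s]
summary = ". ".join(matching)
print(summary)
```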

Screenshot (518)

8. Multiple Regression (Calculating RMSE, and R2)
 a. Used the winequality-red.csv file
 b. Dropped the target column, quality, from the training data
 c. Used the LinearRegression class to fit the model
 d. Calculated the RMSE and R^2 scores (before cleaning the data)
 e. Cleaned the data by handling all the null values
 f. Used fillna to fill the null values with the column mean (which adds some noise)
 g. Fitted the model again with the cleaned data
 h. Calculated the RMSE and R^2 scores (after cleaning the data)
Screenshot (519) Screenshot (520) Screenshot (521)

Note:

Before cleaning the data
RMSE: 0.42
R^2 score: 0.35

After cleaning the data
RMSE: 0.42
R^2 score: 0.36
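
For reference, the fit-and-score steps can be sketched as below. The data is synthetic rather than winequality-red.csv, so the printed numbers are unrelated to the scores above; the column names other than `quality`, the train/test split and the seeds are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset (hypothetical feature columns).
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "alcohol": rng.normal(10, 1, n),
    "acidity": rng.normal(0.5, 0.1, n),
})
df["quality"] = 0.5 * df["alcohol"] - 2 * df["acidity"] + rng.normal(0, 0.3, n)
df.loc[df.index[::20], "acidity"] = np.nan  # inject a few null values

# Cleaning step: fill nulls with the column mean.
df = df.fillna(df.mean())

X = df.drop(columns="quality")  # features without the target column
y = df["quality"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
predictions = model.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, predictions))  # root of the MSE
r2 = r2_score(y_test, predictions)
print(f"RMSE: {rmse:.2f}")
print(f"R^2 score: {r2:.2f}")
```

A lower RMSE and an R^2 closer to 1 both indicate a better fit.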

DATASETS USED:

• glass_type.csv
• cc.csv
• winequality-red.csv

EVALUATION & DISCUSSION:

  1. A set of integers was taken and all the possible subsets, excluding the empty set, were printed.
  2. Two dictionaries were taken and merged, and the merged dictionary was sorted by value.
  3. The Airline Booking Reservation System was built using the concepts of classes, inheritance and method overriding.
  4. Web scraping was performed using BeautifulSoup, and all the course names and their descriptions were displayed.
  5. A data set was cleaned and split, and each of the three classifiers (KNN, Naive Bayes, SVM) was fitted and evaluated for accuracy.
  6. A data set was cleaned, KMeans was run over a range of k values to find the best k, an elbow graph was plotted, the model was fitted with the best k value, and the silhouette score was calculated.
  7. The given text file was tokenized and lemmatized using the built-in functions. All trigrams were found using the ngrams function with n=3, FreqDist() gave their frequencies, most_common(10) gave the top ten trigrams, and sentence tokenization was used to find the sentences containing the most frequent trigrams.
  8. A data set was taken, the target column was removed from the training data, the model was fitted, and the RMSE and R^2 scores were calculated both before and after cleaning the data.

CONCLUSION:

All the programs were successfully executed according to the objectives specified and have met the evaluation criteria.