Lab 1 Wiki Report

Report for Lab Assignment 1

Class ID - 24

I. Introduction

Module 1 of the CS5590 class provides an introduction to the core data structures of the Python programming language. The focus was on exploring and using Python's built-in data structures such as lists, dictionaries, and tuples, along with web scraping, object-oriented concepts, and scientific packages in Python. We also learned different machine learning algorithms and techniques, such as Linear Regression, Logistic Regression, K-Means Clustering, NLTK, and various classification methods, as well as exploratory data analysis techniques. I used the concepts learned in Module 1 to develop solutions for the problems given in Lab 1.

II. Objectives

  1. Given a collection of integers that might contain duplicates, nums, return all possible subsets. Do not include the null subset.

  2. Concatenate two dictionaries and sort the concatenated dictionary by value.

  3. Create a python program for Airline Booking Reservation System.

  4. Go to https://catalog.umkc.edu/course-offerings/graduate/comp-sci/ and fetch the course name and overview of course using BeautifulSoup package.

  5. Perform exploratory data analysis on the data set and plot different patterns (e.g., handling null values, removing features not correlated to the target class, encoding the categorical features, ...), then apply the three classification algorithms Naïve Bayes, SVM, and KNN on the chosen data set and report which classifier gives the better result.

  6. Apply K-Means on the dataset and visualize the clusters using matplotlib or seaborn. Report which K is best using the elbow method, and evaluate with the silhouette score or other scores relevant for unsupervised approaches (before applying clustering, clean the data set with the EDA techniques learned in class).

  7. Write a program that takes an input file and uses the simple approach below to summarize it: read the data from the file, tokenize the text into words, apply a lemmatization technique to each word, find all the trigrams of the words, extract the top 10 most repeated trigrams based on their count, find all the sentences containing those most repeated trigrams, then extract those sentences, concatenate them, and print the concatenated result.

  8. Create a multiple regression model by choosing a dataset of your choice. Evaluate the model using RMSE and R², and report whether you saw any improvement before and after the EDA.

III. Tools/Software:

  • PyCharm
  • Python3 Interpreter
  • Anaconda

IV. Datasets Used:

  • Boston housing data set (Questions 5, 6, and 8)
  • A plain-text input file for summarization (Question 7)

V. Problems

Question 1: Given a collection of integers that might contain duplicates, nums, return all possible subsets. Do not include the null subset.

*(Screenshot: 1)*

Workflow/Approaches:

  • Create a list to store the values provided
  • Obtain from the user the number of elements the list will contain
  • Read the individual elements provided by the user
  • Evaluate each of the elements provided by the user, creating multiple lists with a varying number of elements each and generating the following lists with no null values (see the code sketch after this list):
  • Lists containing only one unique element each
  • Lists containing a combination of 2 elements from the input provided
  • Lists containing a combination of ‘n’ (all 3) elements of the input
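
A minimal sketch of this approach in Python, using itertools.combinations; the function name and the example input [1, 2, 2] are illustrative, not the lab's actual test data:

```python
from itertools import combinations

def subsets(nums):
    # Sort so that duplicate subsets collapse into one canonical form.
    nums = sorted(nums)
    seen, result = set(), []
    # Sizes run from 1 to n, so the null (empty) subset is never produced.
    for size in range(1, len(nums) + 1):
        for combo in combinations(nums, size):
            if combo not in seen:
                seen.add(combo)
                result.append(list(combo))
    return result

print(subsets([1, 2, 2]))
# [[1], [2], [1, 2], [2, 2], [1, 2, 2]]
```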

*(Screenshot: 2)*

Question 2: Concatenate two dictionaries and sort the concatenated dictionary by value.

*(Screenshot: 2_1)*

Workflow/Approaches:

  • Create two dictionaries containing unique sets of values
  • Merge the two dictionaries into one by appending the contents of the second dictionary to the first
  • Sort the concatenated dictionary to display its items in ascending order of their values (a minimal sketch follows)
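
A minimal sketch of the merge and sort; the example keys and values are illustrative:

```python
# Two example dictionaries with unique sets of values.
d1 = {'a': 300, 'b': 100}
d2 = {'c': 500, 'd': 200}

# Concatenate: keys of the second dictionary are appended to the first.
merged = {**d1, **d2}

# Sort the merged dictionary by its values, in ascending order.
sorted_by_value = dict(sorted(merged.items(), key=lambda item: item[1]))
print(sorted_by_value)   # {'b': 100, 'd': 200, 'a': 300, 'c': 500}
```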

*(Screenshot: 2_2)*

Question 3: Create a python program for Airline Booking Reservation System.

  • We created multiple super- and sub-classes for Airline, Person, Flight, Passenger, and Employee.
  • We then defined the variables and functions within each class and classified them as public or private.
  • We created a main function that drives the flight booking system and presents the pilot and passenger information as output (a sketch of one possible class hierarchy follows).
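
A minimal sketch of one possible class hierarchy, assuming Employee and Passenger derive from Person; the attribute names and the example pilot/passenger data are illustrative, not the lab's actual code:

```python
class Person:
    def __init__(self, name):
        self.name = name                    # public attribute

class Employee(Person):
    def __init__(self, name, employee_id):
        super().__init__(name)
        self.__employee_id = employee_id    # private (name-mangled) attribute

class Passenger(Person):
    def __init__(self, name, seat):
        super().__init__(name)
        self.seat = seat

class Flight:
    def __init__(self, number, pilot):
        self.number = number                # flight number
        self.pilot = pilot                  # an Employee
        self.passengers = []                # Passenger objects on this flight

    def book(self, passenger):
        self.passengers.append(passenger)

class Airline:
    def __init__(self, name):
        self.name = name
        self.flights = []

def main():
    pilot = Employee("J. Smith", 42)
    flight = Flight("UA100", pilot)
    flight.book(Passenger("A. Jones", "12A"))
    print("Pilot:", flight.pilot.name)
    print("Passengers:", [p.name for p in flight.passengers])

if __name__ == "__main__":
    main()
```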

*(Screenshots: 3_1, 3_2, 3_3, 3_4)*

Question 4: Go to https://catalog.umkc.edu/course-offerings/graduate/comp-sci/ and fetch the course name and overview of course using BeautifulSoup package.

*(Screenshot: 4_1)*

Workflow/Approaches:

  • The goal is to extract each course name and the overview portion of its description from the catalog available online.
  • Identify the URL of the catalog whose information needs to be extracted
  • Import the appropriate library (BeautifulSoup) to parse the information from the website
  • Apply the ‘findAll’ function to locate the required information
  • Identify the title and description sections within the information obtained and display them (see the sketch after this list)
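
A minimal sketch of the scraping step; the tag and class names ('courseblocktitle' and 'courseblockdesc') are assumptions about the catalog page's HTML and may need adjusting against the live page:

```python
import requests
from bs4 import BeautifulSoup

url = "https://catalog.umkc.edu/course-offerings/graduate/comp-sci/"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# Each course block is assumed to hold a title paragraph followed by a
# description (overview) paragraph.
for title in soup.findAll("p", class_="courseblocktitle"):
    print("Course:", title.get_text(strip=True))
    desc = title.find_next("p", class_="courseblockdesc")
    if desc:
        print("Overview:", desc.get_text(strip=True))
```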

*(Screenshot: 4_2)*

Question 5: Perform exploratory data analysis on the data set and plot different patterns (e.g., handling null values, removing features not correlated to the target class, encoding the categorical features, ...), then apply the three classification algorithms Naïve Bayes, SVM, and KNN on the chosen data set and report which classifier gives the better result.

*(Screenshots: 5_1, 5_2, 5_3, 5_4)*

Workflow/Approaches:

  • The process of achieving the above goals is as follows.
  • Import the appropriate libraries as needed
  • Identify the data set (Boston housing data set) to be used and read it
  • Evaluate the distribution of the data, identifying the skewness associated with it
  • Normalize the distribution of the data, considering that it is highly skewed
  • Identify the features that are highly (positively) correlated with the target variable versus the ones that are negatively correlated
  • Visualize the median valuation of the homes using the pivotal features, i.e., those with a correlation factor greater than 0.5
  • Identify the null values within the data set and handle them
  • Identify the feature predictors (x) and target (y) variables from the data set
  • Split the data set into training and test sets for fitting the model and evaluating its accuracy
  • Define the model to be used: Gaussian Naïve Bayes, SVM, or KNN
  • Fit the chosen model using the training data set
  • Using the test data set, evaluate the performance of the fitted model based on the mean squared error computed for it
  • Compare the error values across the 3 different methods and identify the most efficient method, i.e., the one with the lowest error (a sketch of this fit-and-compare loop follows)
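
A minimal sketch of that fit-and-compare loop; the file name boston.csv and the target column are hypothetical placeholders, and the models are compared on the mean squared error of their predictions as the bullets describe:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import mean_squared_error

df = pd.read_csv("boston.csv")        # hypothetical file name
X = df.drop(columns=["target"])       # hypothetical target column name
y = df["target"]                      # assumed to be a numeric class label
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Fit each classifier and compare them on the error of their predictions.
for name, model in [("Naive Bayes", GaussianNB()),
                    ("SVM", SVC()),
                    ("KNN", KNeighborsClassifier())]:
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"{name}: MSE = {mse:.3f}")  # the lowest error wins
```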

*(Screenshots: 5_5, 5_6)*

Question 6: Apply K-Means on the dataset and visualize the clusters using matplotlib or seaborn. Report which K is best using the elbow method, and evaluate with the silhouette score or other scores relevant for unsupervised approaches (before applying clustering, clean the data set with the EDA techniques learned in class).

*(Screenshots: 6_1, 6_2, 6_3, 6_4)*

Workflow/Approaches:

  • Import the appropriate libraries as needed
  • Identify the data set (Boston housing data set) to be used and read it
  • Identify the feature predictors (x) and target (y) variables from the data set
  • Identify the features that are highly (positively) correlated with the target variable versus the ones that are negatively correlated
  • Identify the null values within the data set and handle them
  • Identify and visualize the clusters to be generated from the given data set
  • Identify the optimum number of clusters to be generated from the data set using the elbow point method
  • Process the data to standardize (normalize) its distribution
  • Define the number of clusters along with the K-Means method to be used
  • Evaluate the performance of the cluster analysis using the feature predictors
  • Update the number of clusters to be generated and record the score obtained for each
  • Compare the scores obtained across different numbers of clusters and identify the number that maximizes the K-Means clustering performance (a sketch of this evaluation follows)
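
A minimal sketch of the elbow and silhouette evaluation; the file name is a hypothetical placeholder and the feature columns are assumed to be numeric:

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = pd.read_csv("boston.csv")                  # hypothetical file name
X_scaled = StandardScaler().fit_transform(X)   # standardize the distribution

inertias, silhouettes = [], []
ks = range(2, 11)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    inertias.append(km.inertia_)               # within-cluster sum of squares
    silhouettes.append(silhouette_score(X_scaled, km.labels_))

# Elbow plot: the "bend" in the curve marks a good K.
plt.plot(list(ks), inertias, marker="o")
plt.xlabel("K")
plt.ylabel("Inertia (WCSS)")
plt.show()

best_k = list(ks)[silhouettes.index(max(silhouettes))]
print("Best K by silhouette score:", best_k)
```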

*(Screenshots: 6_5, 6_6)*

Question 7: Write a program that takes an input file and uses the simple approach below to summarize it: read the data from the file, tokenize the text into words, apply a lemmatization technique to each word, find all the trigrams of the words, extract the top 10 most repeated trigrams based on their count, find all the sentences containing those most repeated trigrams, then extract those sentences, concatenate them, and print the concatenated result.

*(Screenshots: 7_1, 7_2)*

Workflow/Approaches:

  • The primary objective is to summarize text data made available as a text file. The process of text summarization involves the following steps.
  • Identify and read the input data available in the form of a text file
  • Perform tokenization on the text data, splitting the entire text into individual words (terms)
  • Perform lemmatization on the tokens to derive the root word of each token
  • Identify the list of trigrams (3 consecutive words) present within the data
  • Evaluate the trigrams to identify the top 10 trigrams that appear most frequently
  • Scan the input data to identify the sentences containing the top 10 trigrams
  • Extract the sentences containing the most frequent trigrams
  • Concatenate the extracted sentences and display the result (a sketch of this pipeline follows)
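
A minimal sketch of this pipeline using NLTK; the input file name is a hypothetical placeholder:

```python
from collections import Counter
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.util import ngrams

# First run may require: nltk.download('punkt'); nltk.download('wordnet')
text = open("input.txt").read()                # hypothetical file name
lemmatizer = WordNetLemmatizer()

# Tokenize into words and lemmatize each token to its root form.
words = [lemmatizer.lemmatize(w) for w in word_tokenize(text)]

# All trigrams (3 consecutive words) and the 10 most repeated ones.
trigrams = list(ngrams(words, 3))
top10 = Counter(trigrams).most_common(10)

# Keep every sentence that contains at least one of the top trigrams.
summary = []
for sentence in sent_tokenize(text):
    sent_words = [lemmatizer.lemmatize(w) for w in word_tokenize(sentence)]
    sent_trigrams = set(ngrams(sent_words, 3))
    if any(tri in sent_trigrams for tri, _ in top10):
        summary.append(sentence)

print(" ".join(summary))                       # concatenated result
```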

*(Screenshots: 7_3, 7_4, 7_5, 7_6, 7_7)*

Question 8: Create a multiple regression model by choosing a dataset of your choice. Evaluate the model using RMSE and R², and report whether you saw any improvement before and after the EDA.

*(Screenshots: 8_1, 8_2)*

Workflow/Approaches:

  • The primary objective is to perform multiple regression on a data set of our choice and evaluate its performance.
  • Import the appropriate libraries as needed
  • Identify the data set (Boston housing data set) to be used and read it
  • Identify the feature predictors (x) and target (y) variables from the data set
  • Split the data set into training and test sets for fitting the model and evaluating its accuracy
  • Define the regression model to be used
  • Fit the model using the training data set
  • Using the test data set, evaluate the performance of the fitted model based on the RMSE and R² computed for it
  • Identify the null values present within the data set and handle them using the mean value of the associated feature
  • Fit the model again on the revised data set with the null values handled
  • Using the test data set, evaluate the performance of the re-fitted model on the same metrics
  • Compare the performance of the model with and without the null values processed (a sketch of this before/after comparison follows)
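
A minimal sketch of the before/after-EDA comparison; the file name and target column are hypothetical placeholders, and the mean imputation follows the bullet above:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

def evaluate(df):
    X = df.drop(columns=["target"])            # hypothetical target column
    y = df["target"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)
    model = LinearRegression().fit(X_train, y_train)
    pred = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, pred))
    return rmse, r2_score(y_test, pred)

df = pd.read_csv("boston.csv")                 # hypothetical file name

# Before EDA: drop rows with nulls just so the model can be fit at all.
print("Before EDA (RMSE, R2):", evaluate(df.dropna()))

# After EDA: impute nulls with the mean of the associated feature.
print("After EDA  (RMSE, R2):", evaluate(df.fillna(df.mean())))
```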

*(Screenshots: 8_3, 8_4)*