Lab 1 Wiki Report

Report for Lab Assignment 1

Class ID - 24

I. Introduction

Module 1 of the CS5590 class provides an introduction to the core data structures of the Python programming language. The focus was on exploring and using Python's built-in data structures such as lists, dictionaries, and tuples, along with web scraping, object-oriented concepts, and scientific packages in Python. We also learned different machine learning algorithms and techniques, such as Linear Regression, Logistic Regression, K-Means Clustering, NLTK, and various classification methods, as well as exploratory data analysis techniques. I used the concepts learned in Module 1 to develop solutions for the problems given in Lab 1.

II. Objectives

  1. Given a collection of integers that might contain duplicates, nums, return all possible subsets. Do not include the null subset.

  2. Concatenate two dictionaries and sort the concatenated dictionary by value.

  3. Create a python program for Airline Booking Reservation System.

  4. Go to https://catalog.umkc.edu/course-offerings/graduate/comp-sci/ and fetch the course name and overview of course using BeautifulSoup package.

  5. Perform exploratory data analysis on the data set and plot different patterns (e.g., handling null values, removing features not correlated to the target class, encoding the categorical features, ...), then apply the three classification algorithms Naïve Bayes, SVM, and KNN on the chosen data set and report which classifier gives the better result.

  6. Apply K-Means on the dataset and visualize the clusters using matplotlib or seaborn. Report which K is best using the elbow method, and evaluate with the silhouette score or other scores relevant for unsupervised approaches (before applying clustering, clean the data set with the EDA techniques learned in class).

  7. Write a program that takes an input file and uses the simple approach below to summarize it: read the data from the file, tokenize the text into words, apply a lemmatization technique to each word, find all the trigrams of the words, extract the top 10 most repeated trigrams based on their count, find all the sentences containing those most repeated trigrams, then extract those sentences, concatenate them, and print the concatenated result.

  8. Create a multiple regression model by choosing a dataset of your choice. Evaluate the model using RMSE and R², and report whether you saw any improvement before and after the EDA.

III. Tools/Software:

  • PyCharm
  • Python3 Interpreter
  • Anaconda

IV. Datasets Used:

  • Boston housing data set (Questions 5, 6, and 8)
  • A plain-text input file for summarization (Question 7)

V. Problems

Question 1: Given a collection of integers that might contain duplicates, nums, return all possible subsets. Do not include the null subset.

*(Screenshot: 1)*

Workflow/Approaches:

  • Create a list to store the values provided
  • Obtain from the user the number of elements the list will contain
  • Read the individual elements provided by the user
  • Evaluate each of the elements provided by the user, creating multiple lists with a varying number of elements each and generating the following lists with no null values (see the code sketch after this list):
  • Lists containing only one unique element each
  • Lists containing a combination of 2 elements from the input provided
  • Lists containing a combination of ‘n’ (all 3) elements of the input
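
A minimal sketch of this approach in Python, using itertools.combinations; the function name and the example input [1, 2, 2] are illustrative, not the lab's actual test data:

```python
from itertools import combinations

def subsets(nums):
    # Sort so that duplicate subsets collapse into one canonical form.
    nums = sorted(nums)
    seen, result = set(), []
    # Sizes run from 1 to n, so the null (empty) subset is never produced.
    for size in range(1, len(nums) + 1):
        for combo in combinations(nums, size):
            if combo not in seen:
                seen.add(combo)
                result.append(list(combo))
    return result

print(subsets([1, 2, 2]))
# [[1], [2], [1, 2], [2, 2], [1, 2, 2]]
```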

*(Screenshot: 2)*

Question 2: Concatenate two dictionaries and sort the concatenated dictionary by value.

*(Screenshot: 2_1)*

Workflow/Approaches:

  • Create two dictionaries containing unique sets of values
  • Merge the two dictionaries into one by appending the contents of the second dictionary to the first
  • Sort the concatenated dictionary to display its items in ascending order of their values (a minimal sketch follows)
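
A minimal sketch of the merge and sort; the example keys and values are illustrative:

```python
# Two example dictionaries with unique sets of values.
d1 = {'a': 300, 'b': 100}
d2 = {'c': 500, 'd': 200}

# Concatenate: keys of the second dictionary are appended to the first.
merged = {**d1, **d2}

# Sort the merged dictionary by its values, in ascending order.
sorted_by_value = dict(sorted(merged.items(), key=lambda item: item[1]))
print(sorted_by_value)   # {'b': 100, 'd': 200, 'a': 300, 'c': 500}
```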

*(Screenshot: 2_2)*

Question 3: Create a python program for Airline Booking Reservation System.

  • We created multiple super- and sub-classes for Airline, Person, Flight, Passenger, and Employee.
  • We then defined the variables and functions within each class and classified them as public or private.
  • We created a main function that drives the flight booking system and presents the pilot and passenger information as output (a sketch of one possible class hierarchy follows).
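
A minimal sketch of one possible class hierarchy, assuming Employee and Passenger derive from Person; the attribute names and the example pilot/passenger data are illustrative, not the lab's actual code:

```python
class Person:
    def __init__(self, name):
        self.name = name                    # public attribute

class Employee(Person):
    def __init__(self, name, employee_id):
        super().__init__(name)
        self.__employee_id = employee_id    # private (name-mangled) attribute

class Passenger(Person):
    def __init__(self, name, seat):
        super().__init__(name)
        self.seat = seat

class Flight:
    def __init__(self, number, pilot):
        self.number = number                # flight number
        self.pilot = pilot                  # an Employee
        self.passengers = []                # Passenger objects on this flight

    def book(self, passenger):
        self.passengers.append(passenger)

class Airline:
    def __init__(self, name):
        self.name = name
        self.flights = []

def main():
    pilot = Employee("J. Smith", 42)
    flight = Flight("UA100", pilot)
    flight.book(Passenger("A. Jones", "12A"))
    print("Pilot:", flight.pilot.name)
    print("Passengers:", [p.name for p in flight.passengers])

if __name__ == "__main__":
    main()
```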

*(Screenshots: 3_1, 3_2, 3_3, 3_4)*

Question 4: Go to https://catalog.umkc.edu/course-offerings/graduate/comp-sci/ and fetch the course name and overview of course using BeautifulSoup package.

*(Screenshot: 4_1)*

Workflow/Approaches:

  • The goal is to extract each course name and the overview portion of its description from the catalog available online.
  • Identify the URL of the catalog whose information needs to be extracted
  • Import the appropriate library (BeautifulSoup) to parse the information from the website
  • Apply the ‘findAll’ function to locate the required information
  • Identify the title and description sections within the information obtained and display them (see the sketch after this list)
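
A minimal sketch of the scraping step; the tag and class names ('courseblocktitle' and 'courseblockdesc') are assumptions about the catalog page's HTML and may need adjusting against the live page:

```python
import requests
from bs4 import BeautifulSoup

url = "https://catalog.umkc.edu/course-offerings/graduate/comp-sci/"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# Each course block is assumed to hold a title paragraph followed by a
# description (overview) paragraph.
for title in soup.findAll("p", class_="courseblocktitle"):
    print("Course:", title.get_text(strip=True))
    desc = title.find_next("p", class_="courseblockdesc")
    if desc:
        print("Overview:", desc.get_text(strip=True))
```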

*(Screenshot: 4_2)*

Question 5: Perform exploratory data analysis on the data set and plot different patterns (e.g., handling null values, removing features not correlated to the target class, encoding the categorical features, ...), then apply the three classification algorithms Naïve Bayes, SVM, and KNN on the chosen data set and report which classifier gives the better result.

*(Screenshots: 5_1, 5_2, 5_3, 5_4)*

Workflow/Approaches:

  • The process of achieving the above goals is as follows.
  • Import the appropriate libraries as needed
  • Identify the data set (Boston housing data set) to be used and read it
  • Evaluate the distribution of the data, identifying the skewness associated with it
  • Normalize the distribution of the data, considering that it is highly skewed
  • Identify the features that are highly (positively) correlated with the target variable versus the ones that are negatively correlated
  • Visualize the median valuation of the homes using the pivotal features, i.e., those with a correlation factor greater than 0.5
  • Identify the null values within the data set and handle them
  • Identify the feature predictors (x) and target (y) variables from the data set
  • Split the data set into training and test sets for fitting the model and evaluating its accuracy
  • Define the model to be used: Gaussian Naïve Bayes, SVM, or KNN
  • Fit the chosen model using the training data set
  • Using the test data set, evaluate the performance of the fitted model based on the mean squared error computed for it
  • Compare the error values across the 3 different methods and identify the most efficient method, i.e., the one with the lowest error (a sketch of this fit-and-compare loop follows)
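
A minimal sketch of that fit-and-compare loop; the file name boston.csv and the target column are hypothetical placeholders, and the models are compared on the mean squared error of their predictions as the bullets describe:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import mean_squared_error

df = pd.read_csv("boston.csv")        # hypothetical file name
X = df.drop(columns=["target"])       # hypothetical target column name
y = df["target"]                      # assumed to be a numeric class label
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Fit each classifier and compare them on the error of their predictions.
for name, model in [("Naive Bayes", GaussianNB()),
                    ("SVM", SVC()),
                    ("KNN", KNeighborsClassifier())]:
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"{name}: MSE = {mse:.3f}")  # the lowest error wins
```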

*(Screenshots: 5_5, 5_6)*

Question 6: Apply K-Means on the dataset and visualize the clusters using matplotlib or seaborn. Report which K is best using the elbow method, and evaluate with the silhouette score or other scores relevant for unsupervised approaches (before applying clustering, clean the data set with the EDA techniques learned in class).

*(Screenshots: 6_1, 6_2, 6_3, 6_4)*

Workflow/Approaches:

  • Import the appropriate libraries as needed
  • Identify the data set (Boston housing data set) to be used and read it
  • Identify the feature predictors (x) and target (y) variables from the data set
  • Identify the features that are highly (positively) correlated with the target variable versus the ones that are negatively correlated
  • Identify the null values within the data set and handle them
  • Identify and visualize the clusters to be generated from the given data set
  • Identify the optimum number of clusters to be generated from the data set using the elbow point method
  • Process the data to standardize (normalize) its distribution
  • Define the number of clusters along with the K-Means method to be used
  • Evaluate the performance of the cluster analysis using the feature predictors
  • Update the number of clusters to be generated and record the score obtained for each
  • Compare the scores obtained across different numbers of clusters and identify the number that maximizes the K-Means clustering performance (a sketch of this evaluation follows)
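
A minimal sketch of the elbow and silhouette evaluation; the file name is a hypothetical placeholder and the feature columns are assumed to be numeric:

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = pd.read_csv("boston.csv")                  # hypothetical file name
X_scaled = StandardScaler().fit_transform(X)   # standardize the distribution

inertias, silhouettes = [], []
ks = range(2, 11)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    inertias.append(km.inertia_)               # within-cluster sum of squares
    silhouettes.append(silhouette_score(X_scaled, km.labels_))

# Elbow plot: the "bend" in the curve marks a good K.
plt.plot(list(ks), inertias, marker="o")
plt.xlabel("K")
plt.ylabel("Inertia (WCSS)")
plt.show()

best_k = list(ks)[silhouettes.index(max(silhouettes))]
print("Best K by silhouette score:", best_k)
```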

*(Screenshots: 6_5, 6_6)*

Question 7: Write a program that takes an input file and uses the simple approach below to summarize it: read the data from the file, tokenize the text into words, apply a lemmatization technique to each word, find all the trigrams of the words, extract the top 10 most repeated trigrams based on their count, find all the sentences containing those most repeated trigrams, then extract those sentences, concatenate them, and print the concatenated result.

*(Screenshots: 7_1, 7_2)*

Workflow/Approaches:

  • The primary objective is to summarize text data made available as a text file. The process of text summarization involves the following steps.
  • Identify and read the input data available in the form of a text file
  • Perform tokenization on the text data, splitting the entire text into individual words (terms)
  • Perform lemmatization on the tokens to derive the root word of each token
  • Identify the list of trigrams (3 consecutive words) present within the data
  • Evaluate the trigrams to identify the top 10 trigrams that appear most frequently
  • Scan the input data to identify the sentences containing the top 10 trigrams
  • Extract the sentences containing the most frequent trigrams
  • Concatenate the extracted sentences and display the result (a sketch of this pipeline follows)
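
A minimal sketch of this pipeline using NLTK; the input file name is a hypothetical placeholder:

```python
from collections import Counter
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.util import ngrams

# First run may require: nltk.download('punkt'); nltk.download('wordnet')
text = open("input.txt").read()                # hypothetical file name
lemmatizer = WordNetLemmatizer()

# Tokenize into words and lemmatize each token to its root form.
words = [lemmatizer.lemmatize(w) for w in word_tokenize(text)]

# All trigrams (3 consecutive words) and the 10 most repeated ones.
trigrams = list(ngrams(words, 3))
top10 = Counter(trigrams).most_common(10)

# Keep every sentence that contains at least one of the top trigrams.
summary = []
for sentence in sent_tokenize(text):
    sent_words = [lemmatizer.lemmatize(w) for w in word_tokenize(sentence)]
    sent_trigrams = set(ngrams(sent_words, 3))
    if any(tri in sent_trigrams for tri, _ in top10):
        summary.append(sentence)

print(" ".join(summary))                       # concatenated result
```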

*(Screenshots: 7_3, 7_4, 7_5, 7_6, 7_7)*

Question 8: Create a multiple regression model by choosing a dataset of your choice. Evaluate the model using RMSE and R², and report whether you saw any improvement before and after the EDA.

*(Screenshots: 8_1, 8_2)*

Workflow/Approaches:

  • The primary objective is to perform multiple regression on a data set of our choice and evaluate its performance.
  • Import the appropriate libraries as needed
  • Identify the data set (Boston housing data set) to be used and read it
  • Identify the feature predictors (x) and target (y) variables from the data set
  • Split the data set into training and test sets for fitting the model and evaluating its accuracy
  • Define the regression model to be used
  • Fit the model using the training data set
  • Using the test data set, evaluate the performance of the fitted model based on the RMSE and R² computed for it
  • Identify the null values present within the data set and handle them using the mean value of the associated feature
  • Fit the model again on the revised data set with the null values handled
  • Using the test data set, evaluate the performance of the re-fitted model on the same metrics
  • Compare the performance of the model with and without the null values processed (a sketch of this before/after comparison follows)
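
A minimal sketch of the before/after-EDA comparison; the file name and target column are hypothetical placeholders, and the mean imputation follows the bullet above:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

def evaluate(df):
    X = df.drop(columns=["target"])            # hypothetical target column
    y = df["target"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)
    model = LinearRegression().fit(X_train, y_train)
    pred = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, pred))
    return rmse, r2_score(y_test, pred)

df = pd.read_csv("boston.csv")                 # hypothetical file name

# Before EDA: drop rows with nulls just so the model can be fit at all.
print("Before EDA (RMSE, R2):", evaluate(df.dropna()))

# After EDA: impute nulls with the mean of the associated feature.
print("After EDA  (RMSE, R2):", evaluate(df.fillna(df.mean())))
```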

*(Screenshots: 8_3, 8_4)*