Lab assignment #1 - SaitejaswiK/CSEE5590_Python-DL_Lab GitHub Wiki

Team #5

Name: Saitejaswi K

Class ID: 16

Name: Kartheek Katta

Class ID: 12

Name: Aanchal Tiwari

Class ID: 38

Objective:

The main objective of this lab tutorial is to gain a basic understanding of Python and of several machine learning concepts: classification algorithms such as SVM, Naive Bayes, and K-Nearest Neighbors; regression algorithms such as multiple linear regression; and clustering techniques such as K-Means clustering, along with their corresponding evaluation metrics.

Technologies/IDE's used:

  • Python 3.7
  • Pycharm IDE

Libraries used:

  • Numpy
  • Pandas
  • Scikit learn library

Workflow:

The workflow of any machine learning algorithm in this lab assignment is as follows:

  • First, a dataset is taken and pre-processed: null values are counted and replaced, and correlations between features are computed.
  • Next, the dataset is split into two parts, training and testing; the model is trained on the training set and validated on the test set.
  • Finally, the model is fit and the corresponding metrics are calculated.

Program 1:

Suppose you have a list of tuples as follows:

[('John', ('Physics', 80)), ('Daniel', ('Science', 90)), ('John', ('Science', 95)), ('Mark', ('Maths', 100)), ('Daniel', ('History', 75)), ('Mark', ('Social', 95))]

Create a dictionary with keys as names and values as a list of (subject, marks) tuples in sorted order.

{'John': [('Physics', 80), ('Science', 95)], 'Daniel': [('History', 75), ('Science', 90)], 'Mark': [('Maths', 100), ('Social', 95)]}

Python code:

  • First, the given tuples are placed in a list and an empty dictionary is initialized.
  • Then, iterating through the list, each item is added to the dictionary in the desired form, and finally the dictionary is printed.
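The steps above can be sketched as follows (a minimal illustration; the original code is not shown in this page, so variable names here are assumptions):

```python
# The given data as a list of (name, (subject, marks)) tuples
records = [('John', ('Physics', 80)), ('Daniel', ('Science', 90)),
           ('John', ('Science', 95)), ('Mark', ('Maths', 100)),
           ('Daniel', ('History', 75)), ('Mark', ('Social', 95))]

# Start from an empty dictionary and append each (subject, marks) pair
# under the student's name
result = {}
for name, subject_marks in records:
    result.setdefault(name, []).append(subject_marks)

# Sort each student's list of (subject, marks) tuples by subject name
for name in result:
    result[name].sort()

print(result)
```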

Output:

Program 2:

Given a string, find the longest substrings without repeating characters, along with their length, as tuples.

Input: "pwwkew"

Output: ('wke', 3), ('kew', 3)

Python code:

  • First, create a temp string that holds the current run of non-repeating characters.
  • Then, iterate through the input; if a character already appears in temp, drop everything up to and including its previous occurrence before appending it.
  • Finally, print the longest substrings along with their length.
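A sketch of the sliding-window approach described above (the function name and return format are assumptions, chosen to match the expected output):

```python
def longest_unique_substrings(s):
    # temp holds the current run of non-repeating characters;
    # best collects all substrings of the maximum length seen so far
    best = []
    best_len = 0
    temp = ""
    for ch in s:
        if ch in temp:
            # drop everything up to and including the previous occurrence
            temp = temp[temp.index(ch) + 1:]
        temp += ch
        if len(temp) > best_len:
            best_len = len(temp)
            best = [temp]
        elif len(temp) == best_len and temp not in best:
            best.append(temp)
    return [(sub, best_len) for sub in best]

print(longest_unique_substrings("pwwkew"))  # [('wke', 3), ('kew', 3)]
```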

Output:

Program 3:

Write a python program to create any one of the following management systems.

1. Airline Booking Reservation System (e.g. classes Flight, Person, Employee, Passenger etc.)

2. Library Management System(e.g: Student, Book, Faculty, Department etc.)

  • I created a Library management system with 5 classes: Person, Student, Librarian, Book, and Borrow_book.
  • Person is the base class; the Student and Librarian classes inherit from Person (single inheritance).
  • The Borrow_book class implements multiple inheritance with base classes Student and Book.
  • A private data member StudentCount in the Student class counts the number of Student objects created.
  • A super() call in the Librarian class initializes the Person part of the object.
  • A private member __numBooks keeps track of the number of books.
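The class design described above could be sketched like this (the exact attributes and constructor signatures are assumptions, since the original code is not shown; the class names follow the write-up):

```python
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

class Student(Person):
    # private class-level counter of Student objects created
    __StudentCount = 0

    def __init__(self, name, age, student_id):
        # call Person.__init__ directly so this also works under
        # the multiple inheritance in Borrow_book below
        Person.__init__(self, name, age)
        self.student_id = student_id
        Student.__StudentCount += 1

class Librarian(Person):
    def __init__(self, name, age, employee_id):
        # super() call initializes the Person part of the object
        super().__init__(name, age)
        self.employee_id = employee_id

class Book:
    # private member keeping track of the number of books
    __numBooks = 0

    def __init__(self, title):
        self.title = title
        Book.__numBooks += 1

class Borrow_book(Student, Book):
    # multiple inheritance: a borrow record ties a student to a book
    def __init__(self, name, age, student_id, title):
        Student.__init__(self, name, age, student_id)
        Book.__init__(self, title)

# Creating instances
s = Student("John", 20, "S1")
borrow = Borrow_book("Mark", 21, "S2", "Python 101")
print(borrow.name, "borrowed", borrow.title)
```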

Python code:

The main class(Person class) is as follows:

Concept of inheritance by Student class:

Super call in the librarian Class:

Creating a private number:

Multiple inheritance concept for Borrow_book class:

Creating instances:

Output:

Program 4:

Create Multiple Regression by choosing a dataset of your choice (again before evaluating, clean the data set with the EDA learned in the class). Evaluate the model using RMSE and R2 and also report if you saw any improvement before and after the EDA.

The dataset I have chosen is the diabetes dataset provided by sklearn.datasets.load_diabetes.

Dataset link: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_diabetes.html#sklearn.datasets.load_diabetes

Python code:

Importing the necessary libraries:

Creating the features and the target:

So, there are no null values.

  • Encoding the categorical features: from the result above, there is no categorical data in this dataset that needs to be encoded to numeric.

  • Removing the features not correlated with the target class

The first five features are the most positively correlated with target and the next five are the most negatively correlated.

Correlating the features:

Output:

  • Based on this result, the features bmi, s5, bp, s4, s6, and s3 should be given more consideration (because their correlation score is above 0.3).
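The correlation check above might look like this sketch (using the dataset as a frame; the 0.3 cut-off follows the write-up):

```python
from sklearn.datasets import load_diabetes

# Load the diabetes dataset as a DataFrame (features plus 'target' column)
data = load_diabetes(as_frame=True)
df = data.frame

# Absolute correlation of each feature with the target, strongest first
corr = df.corr()['target'].drop('target').abs().sort_values(ascending=False)
print(corr[corr > 0.3])  # features above the 0.3 cut-off
```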

A pivot table is built for s1, age, s2, and sex, because we are still deciding whether to keep them in the model.

Creating the pivot plots:

Output of pivot plots: s1

age

s2

From the plots, these features are not correlated with the target, so we should remove them from the dataset.

Splitting the data into train and test sets:

Fitting the model and calculating the scores:
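The fitting and scoring steps might be sketched as follows (a minimal version using LinearRegression on the full feature set; the split ratio and random state are assumptions):

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Load features and target, then split into train and test sets
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Fit a multiple linear regression model
model = LinearRegression()
model.fit(X_train, y_train)
pred = model.predict(X_test)

# Evaluate with RMSE and R2
rmse = np.sqrt(mean_squared_error(y_test, pred))
r2 = r2_score(y_test, pred)
print("RMSE:", rmse, "R2:", r2)
```

Re-running the same fit after dropping the weakly correlated columns gives the before/after comparison reported below.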

Output before eliminating the features:

Output after eliminating the features:

We can observe a slight increase in the score because we eliminated the features that are less correlated with the target variable.

Program 5:

Pick any dataset from the dataset sheet in the class sheet or online which includes both numeric and non-numeric features

a. Perform exploratory data analysis on the data set (like Handling null values, removing the features not correlated to the target class, encoding the categorical features, ...)

b. Apply the three classification algorithms Naïve Bayes, SVM, and KNN on the chosen dataset and report which classifier gives the better result.

Dataset: Car Evaluation Data Set from UC Irvine Machine Learning Repository

Dataset Link: http://archive.ics.uci.edu/ml/datasets/Car+Evaluation

Features:

buying - v-high, high, med, low

maint - v-high, high, med, low

doors - 2, 3, 4, 5-more

persons - 2, 4, more

lug_boot - small, med, big

safety - low, med, high

Target variable:

class values - unacc, acc, good, vgood

Python code:

Importing required libraries

Output:

Label encoder:

So, there are no null values in the dataset that need to be handled.

  • Encoding the categorical features: all columns in this dataset are of object type, even doors and persons, which already contain numbers.
  • So, I transform them into numeric features.
  • Based on the dataset features, I assign numbers to them as follows:
      • buying: 0, 1, 2, 3 for low, med, high, v-high
      • maint: 0, 1, 2, 3 for low, med, high, v-high
      • doors: 6 for 5-more
      • persons: 6 for more
      • lug_boot: 0, 1, 2 for small, med, big
      • safety: 0, 1, 2 for low, med, high
      • target class value: 0, 1, 2, 3 for unacc, acc, good, vgood
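The manual encoding described above might be sketched like this (a tiny inline sample stands in for the downloaded file; note the actual UCI file spells the values `vhigh` and `5more`, so the spellings here are assumptions following the write-up):

```python
import pandas as pd

# Small stand-in sample with the same columns as the car evaluation data
df = pd.DataFrame({
    'buying':   ['low', 'v-high', 'med'],
    'maint':    ['high', 'low', 'med'],
    'doors':    ['2', '5-more', '4'],
    'persons':  ['2', 'more', '4'],
    'lug_boot': ['small', 'big', 'med'],
    'safety':   ['low', 'high', 'med'],
    'class':    ['unacc', 'vgood', 'acc'],
})

# Ordinal scale shared by buying and maint
scale = {'low': 0, 'med': 1, 'high': 2, 'v-high': 3}
df['buying'] = df['buying'].map(scale)
df['maint'] = df['maint'].map(scale)

# doors and persons are already numbers except for the open-ended category
df['doors'] = df['doors'].replace({'5-more': '6'}).astype(int)
df['persons'] = df['persons'].replace({'more': '6'}).astype(int)

df['lug_boot'] = df['lug_boot'].map({'small': 0, 'med': 1, 'big': 2})
df['safety'] = df['safety'].map({'low': 0, 'med': 1, 'high': 2})
df['class'] = df['class'].map({'unacc': 0, 'acc': 1, 'good': 2, 'vgood': 3})

print(df.dtypes)  # every column is now numeric
```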

Correlation:

  • Removing the features not correlated with the target class
  • In this part, I calculated the correlation between each (numeric) feature and the target.

Output plots of correlating features:

Creating features and target variable:

Gaussian NB Model:

SVM Model:

KNN Model:
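The three models above could be fit and compared with a loop like this sketch (a synthetic numeric dataset stands in for the encoded car data, which is an assumption for this illustration; default hyperparameters are used throughout):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 4-class dataset with 6 features, mirroring the encoded car data
X, y = make_classification(n_samples=500, n_features=6, n_classes=4,
                           n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit each classifier and report train/test accuracy
scores = {}
for model in (GaussianNB(), SVC(), KNeighborsClassifier()):
    model.fit(X_train, y_train)
    name = type(model).__name__
    scores[name] = (model.score(X_train, y_train),
                    model.score(X_test, y_test))
    print(name, "train: %.3f test: %.3f" % scores[name])
```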

Output scores of all the models:

From the results, we can see that the SVM model is the best, with the highest accuracy on both the training and testing datasets!

Program 6:

Choose any dataset of your choice. Apply K-means on the dataset and visualize the clusters using matplotlib or seaborn.

a. Report which K is the best using the elbow method.

b. Evaluate with silhouette score or other scores relevant for unsupervised approaches (before applying clustering clean the data set with the EDA learned in the class)

Python code:

Importing libraries:

Checking nulls:

Output:

Creating features and target:

Elbow Method:

Output:

Performing PCA:
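The elbow method, silhouette scoring, and PCA steps might be sketched as follows (the Iris data stands in for the chosen dataset, which is an assumption; k = 3 and 2 PCA components are likewise illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Scale the features so no single feature dominates the distances
X = StandardScaler().fit_transform(load_iris().data)

# Elbow method: inertia for a range of k; plotting k vs. inertia
# shows the "elbow" where adding clusters stops paying off
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 9)]
print(dict(zip(range(1, 9), inertias)))

# Silhouette score before PCA, then after reducing to 2 components
km = KMeans(n_clusters=3, n_init=10, random_state=0)
score_before = silhouette_score(X, km.fit_predict(X))

X2 = PCA(n_components=2).fit_transform(X)
score_after = silhouette_score(X2, km.fit_predict(X2))
print("silhouette before PCA:", score_before, "after PCA:", score_after)
```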

Output of clustering:

Before PCA result:

After PCA score and result: