PROJECT 1
TEAM - 5
INTRODUCTION:
Python group project with the following team:
- Bindu Gadiparthi -49
- Harinireddy Anumandla -03
- Rajeshwari Sai Aishwarya Puppala -35
- Koushik Reddy Sama -42
- Sai Prasad Raju -07
Objective:
The objective of this project is to implement different classification and clustering models on the given datasets and visualize the results.
Approaches/Methods:
Both supervised and unsupervised methods are used in the approaches of this project.
Workflow:
In this project, we examine a credit-card dataset which contains 492 frauds out of 284,807 transactions. Our objective is to fit the dataset to our machine learning models (Random Forest, Logistic Regression, Decision Tree, and Naive Bayes) so as to predict fraud correctly while dealing with the highly imbalanced dataset. Our next step is to handle the imbalance problem: we use the over-sampling technique SMOTE to resample the dataset so that the numbers of fraudulent and normal transactions become even. The last step is to compare the machine learning methods, and we found that Logistic Regression returned the highest AUC score. A hedged sketch of this workflow is shown below.
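The sketch below illustrates this workflow (it is not the project's exact code): it resamples the training split with SMOTE and compares the four classifiers by AUC. The file name creditcard.csv, the column name Class, the 70/30 split, and the hyperparameters are assumptions.

```python
# Illustrative sketch: SMOTE over-sampling on the training split, then AUC comparison.
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_auc_score

df = pd.read_csv("creditcard.csv")                      # file name assumed
X, y = df.drop("Class", axis=1), df["Class"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Over-sample only the training split so the test set keeps the real class ratio.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Naive Bayes": GaussianNB(),
}
for name, model in models.items():
    model.fit(X_res, y_res)
    scores = model.predict_proba(X_test)[:, 1]
    print(f"{name}: AUC = {roc_auc_score(y_test, scores):.3f}")
```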
TASK 1:
CODE:
- Input code: Imported all the libraries and loaded the dataset using pandas. To reduce computation time we considered a fraction of the data, reshaped the dataset, plotted it, and printed the number of fraud and non-fraud transactions. Using the correlation matrix we plotted the heatmap, then found the accuracy and printed the classification report.
- Output:
Visualising the number of transactions for each class.
- The imbalance would hurt the precision of classifying the minority class, since there was very little data in it. Hence, we decided to deal with the imbalanced data by resampling the dataset. Handling the imbalanced data involved randomly dropping a number of instances from the majority class and shuffling the two classes together to get a new dataset to train with. The result came out with very positive scores (see the sketch below).
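A minimal sketch of this step, assuming the creditcard.csv file, a 10% sample fraction, and a Random Forest classifier (none of which are confirmed by the wiki): it prints the class counts, plots the correlation heatmap, undersamples the majority class as described above, and prints the accuracy and classification report.

```python
# Illustrative sketch only: load, sample a fraction, undersample the majority class,
# plot a correlation heatmap, and print a classification report.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

df = pd.read_csv("creditcard.csv").sample(frac=0.1, random_state=1)  # fraction assumed
print(df["Class"].value_counts())            # number of fraud / non-fraud transactions

sns.heatmap(df.corr(), cmap="coolwarm")      # correlation-matrix heatmap
plt.show()

# Random undersampling: keep all frauds, sample an equal number of normal rows.
fraud = df[df["Class"] == 1]
normal = df[df["Class"] == 0].sample(n=len(fraud), random_state=1)
balanced = pd.concat([fraud, normal]).sample(frac=1, random_state=1)  # shuffle

X, y = balanced.drop("Class", axis=1), balanced["Class"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
clf = RandomForestClassifier(random_state=1).fit(X_train, y_train)
pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, pred))
print(classification_report(y_test, pred))
```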
Task 2
Applying K-means clustering on the data set
Data Set
Description:
The data consists of information such as Customer ID, Annual Income, Spending Score, and Gender. Given these fields, we apply clustering mainly on Annual Income and Spending Score.
Workflow:
At first we clean the data set, removing any nulls if present. Then we focus on the features we require; the number of clusters is obtained with the elbow method on those features. The data is trained, the 5 clusters are plotted, and the silhouette score is calculated.
Objective
- The elbow method shows that the change in slope occurs at 5 clusters.
- The silhouette score is about 0.40, and the data did not need further cleaning after checking. A silhouette score of 0.40 indicates that the individual clusters are neither too close together nor too dense: a score of 1 would be interpreted as dense, well-separated clusters and 0 as too spread out, so a value around 0.5 would be ideal for clustering.
- The clustering results show that the data was split well between the 5 clustered groups. We have clustered the data set accordingly and displayed it. A hedged sketch of this workflow follows below.
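The following sketch shows what the elbow method, the 5-cluster fit, and the silhouette score could look like; the file name Mall_Customers.csv and the column names Annual Income (k$) and Spending Score (1-100) are assumptions based on the description above.

```python
# Sketch of the Task 2 workflow: elbow curve, 5-cluster K-means fit, silhouette score.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

df = pd.read_csv("Mall_Customers.csv").dropna()          # file/column names assumed
X = df[["Annual Income (k$)", "Spending Score (1-100)"]]

# Elbow method: plot within-cluster sum of squares (inertia) for k = 1..10.
inertias = [KMeans(n_clusters=k, random_state=0, n_init=10).fit(X).inertia_
            for k in range(1, 11)]
plt.plot(range(1, 11), inertias, marker="o")
plt.xlabel("Number of clusters")
plt.ylabel("Inertia")
plt.show()

# Fit the chosen 5 clusters and report the silhouette score.
km = KMeans(n_clusters=5, random_state=0, n_init=10)
labels = km.fit_predict(X)
print("Silhouette score:", silhouette_score(X, labels))

plt.scatter(X.iloc[:, 0], X.iloc[:, 1], c=labels)
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], c="red", marker="x")
plt.show()
```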
Code
Output
Task 3
Predicted the Temperature values using the remaining features such as Humidity and Apparent Temperature; we implemented a Linear Regression model to predict Temperature.
Here we first imported all the required libraries and explored the data; next we checked for null values and replaced them with 'rain'.
Next we dropped Summary, Daily Summary, Formatted Date, and Loud Cover from the data (which are less relevant).
Building a Linear Regression Model
Model Implementation
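A minimal sketch of such a linear regression model, assuming the weatherHistory.csv file and its usual column names (for example Temperature (C) and Precip Type); the 80/20 split and the one-hot encoding of the remaining categorical column are also assumptions.

```python
# Illustrative sketch of the linear-regression step described above.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

df = pd.read_csv("weatherHistory.csv")                   # file name assumed
df["Precip Type"] = df["Precip Type"].fillna("rain")     # replace nulls with 'rain'
df = df.drop(columns=["Summary", "Daily Summary", "Formatted Date", "Loud Cover"])
df = pd.get_dummies(df, drop_first=True)                 # encode remaining categoricals

X = df.drop("Temperature (C)", axis=1)
y = df["Temperature (C)"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)
print("R^2:", r2_score(y_test, pred))
print("MSE:", mean_squared_error(y_test, pred))
```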
Output:
Task 4
The objective of the fourth task is to clean the given dataset and evaluate models by applying different techniques (TF-IDF and CountVectorizer), and then determine the best algorithm by analyzing the results.
Code explanation: Imported the required libraries, imported the given spam.csv and encoded it, then dropped the unnecessary columns. We then described the data, as shown in the screen capture below.
Initialize the count vectorizer, fit the data, and store the result in a variable word_count_vector.
Using the TfidfTransformer we obtained the most frequently used words in the document; idf_weights gives the lowest value for the most frequently used words.
Transform spam_data.Text into count_vector. We declared a variable 'h' to loop through the documents to get the frequency of the most important words, sorted the values by tf-idf, and restricted the loop to 3 iterations as there are a lot of rows in the document (see the sketch below).
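The sketch below illustrates these steps under assumed column names (the raw v1/v2 columns of spam.csv renamed to Class and Text); it is not the project's exact code.

```python
# Sketch of the TF-IDF exploration: fit CountVectorizer, inspect idf weights,
# and list the top tf-idf terms for the first 3 documents.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

spam_data = pd.read_csv("spam.csv", encoding="latin-1")[["v1", "v2"]]  # layout assumed
spam_data.columns = ["Class", "Text"]

cv = CountVectorizer()
word_count_vector = cv.fit_transform(spam_data.Text)      # raw term counts

tfidf = TfidfTransformer()
tfidf.fit(word_count_vector)

# idf weights: the most frequent words across the corpus get the lowest idf values.
idf_weights = pd.DataFrame(tfidf.idf_, index=cv.get_feature_names_out(),
                           columns=["idf_weight"]).sort_values("idf_weight")
print(idf_weights.head())

# For the first few documents, show the terms with the highest tf-idf scores.
count_vector = cv.transform(spam_data.Text)
tfidf_vectors = tfidf.transform(count_vector)
for h in range(3):                                        # restricted to 3 documents
    row = pd.DataFrame(tfidf_vectors[h].T.todense(),
                       index=cv.get_feature_names_out(), columns=["tfidf"])
    print(row.sort_values("tfidf", ascending=False).head())
```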
Finally we trained and tested the data: split it into Text and Class, applied the three components CountVectorizer, TfidfTransformer, and MultinomialNB, fit the data into the model, predicted by passing x_test, and printed the classification report of the document with the precision values for each class.
After analyzing the results, CountVectorizer gives better results compared to the TF-IDF vectorizer, and MultinomialNB performed best.
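A hedged sketch of this training step, chaining the three components in a scikit-learn Pipeline; the 80/20 split and random_state are assumptions.

```python
# Sketch: train/test split on Text/Class and a CountVectorizer -> TfidfTransformer ->
# MultinomialNB pipeline, then print the classification report.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

spam_data = pd.read_csv("spam.csv", encoding="latin-1")[["v1", "v2"]]  # layout assumed
spam_data.columns = ["Class", "Text"]

x_train, x_test, y_train, y_test = train_test_split(
    spam_data.Text, spam_data.Class, test_size=0.2, random_state=0)

model = Pipeline([
    ("count", CountVectorizer()),        # raw term counts
    ("tfidf", TfidfTransformer()),       # re-weight by inverse document frequency
    ("clf", MultinomialNB()),            # Naive Bayes classifier
])
model.fit(x_train, y_train)
pred = model.predict(x_test)
print(classification_report(y_test, pred))   # precision per class
```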
Task 5
Here we use the Telco-Customer-Churn.csv file for a classification problem including both numeric and non-numeric features; we perform exploratory analysis with 3 classifier algorithms and determine the best classifier. First we import all the necessary packages.
Now we load the dataset and print the data in a tabular form.
We print the shape of the dataset and the data types of all the columns.
We print the sum of null values in each column.
Now we print the columns whose data type is object.
Here we print the value counts of a column, with each value and its percentage of the total.
We assign the true (boolean) values to the dataset.
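A rough sketch of these loading and inspection steps; the file name WA_Fn-UseC_-Telco-Customer-Churn.csv (the usual Kaggle name) and the use of the Churn column for the value counts are assumptions.

```python
# Sketch of the loading/inspection steps: head, shape, dtypes, nulls, object columns,
# and value counts with percentages.
import pandas as pd

telecom = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")   # file name assumed
print(telecom.head())                            # first rows in tabular form
print(telecom.shape)                             # number of rows and columns
print(telecom.dtypes)                            # data type of each column
print(telecom.isnull().sum())                    # null count per column
print(telecom.select_dtypes("object").head())    # columns stored as object

# Value counts of the target with percentages of the total.
print(telecom["Churn"].value_counts())
print(telecom["Churn"].value_counts(normalize=True) * 100)
```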
Now we visualize the existing data by churn value, depicting churners and remainers in a pie chart.
Now we compute the correlation between MonthlyCharges, TotalCharges, and tenure and display the values in a grid with different colors (a heatmap).
Now we analyze TotalCharges, replacing values in place with 1, and display them in the form of table values.
Now we analyze different fields and compute the data correlation.
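The sketch below shows what the pie chart and the correlation heatmap could look like under the standard Telco column names; treating TotalCharges as numeric with pd.to_numeric is an assumption.

```python
# Sketch: churn pie chart and correlation heatmap for MonthlyCharges, TotalCharges, tenure.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

telecom = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")
telecom["TotalCharges"] = pd.to_numeric(telecom["TotalCharges"], errors="coerce")

# Pie chart of churners vs. remainers.
telecom["Churn"].value_counts().plot.pie(autopct="%1.1f%%")
plt.show()

# Correlation of the three numeric features, shown as a colored grid.
corr = telecom[["MonthlyCharges", "TotalCharges", "tenure"]].corr()
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()
```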
Now we drop columns from the telecom dataset.
Now we normalize the data and analyze it using gender and the other columns.
Now we select features using a feature-selection method.
Now we split the data using these column values and their ranking.
Now we split the data into train and test sets using this ranking (a sketch follows below).
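A hedged sketch of the preparation, ranking, and split steps; MinMaxScaler for the normalization and RFE with a logistic regression estimator for the feature ranking are assumptions made to illustrate the description, not the project's confirmed choices.

```python
# Sketch: drop columns, encode categoricals, normalize, rank features, train/test split.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

telecom = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")
telecom["TotalCharges"] = pd.to_numeric(telecom["TotalCharges"], errors="coerce")
telecom = telecom.drop(columns=["customerID"]).dropna()

# One-hot encode gender and the other categorical columns, map the target to 0/1.
y = telecom.pop("Churn").map({"No": 0, "Yes": 1})
X = pd.get_dummies(telecom, drop_first=True)

# Normalize all features to the [0, 1] range.
X = pd.DataFrame(MinMaxScaler().fit_transform(X), columns=X.columns)

# Rank the features with recursive feature elimination and keep the top ones.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10).fit(X, y)
print(pd.Series(rfe.ranking_, index=X.columns).sort_values())
X_selected = X.loc[:, rfe.support_]

X_train, X_test, y_train, y_test = train_test_split(X_selected, y,
                                                    test_size=0.3, random_state=0)
```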
Now we depict the data in a pie chart with churners and remainers.
Now we run the KNN classification algorithm and plot its score against a range of k values, which shows a slightly increasing curve.
Now the accuracy of the KNN model is computed and shown, together with the values.
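A sketch of this KNN step, assuming a k range of 1-20 and k = 11 for the final model; it reuses the X_train/X_test/y_train/y_test split from the previous sketch.

```python
# Sketch: KNN score across a range of k values, then accuracy of one chosen model.
# Assumes X_train, X_test, y_train, y_test from the preceding split sketch.
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

scores = []
for k in range(1, 21):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    scores.append(knn.score(X_test, y_test))

plt.plot(range(1, 21), scores)          # score vs. number of neighbours
plt.xlabel("k")
plt.ylabel("Test accuracy")
plt.show()

best_knn = KNeighborsClassifier(n_neighbors=11).fit(X_train, y_train)  # k assumed
print("KNN accuracy:", accuracy_score(y_test, best_knn.predict(X_test)))
```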
Now we train the SVM algorithm with a grid search on the training data.
Now the accuracy of the SVM is computed and the values are shown.
Now the precision and support values of the classification report are displayed.
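A hedged sketch of the SVM grid search and evaluation; the parameter grid is an assumption and the train/test split is reused from the earlier sketch.

```python
# Sketch: grid search over SVM hyperparameters, then accuracy and classification report.
# Assumes X_train, X_test, y_train, y_test from the preceding split sketch.
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, classification_report

param_grid = {"C": [0.1, 1, 10], "kernel": ["rbf", "linear"]}   # grid assumed
grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(X_train, y_train)

pred = grid.predict(X_test)
print("Best parameters:", grid.best_params_)
print("SVM accuracy:", accuracy_score(y_test, pred))
print(classification_report(y_test, pred))   # precision, recall, f1, support per class
```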