Project 1 - Saiaishwaryapuppala/Python-Project-1 GitHub Wiki

PROJECT 1

TEAM – 5

INTRODUCTION:

Python group project with the below team:

  • Bindu Gadiparthi -49
  • Harinireddy Anumandla -03
  • Rajeshwari Sai Aishwarya Puppala -35
  • Koushik Reddy Sama -42
  • Sai Prasad Raju -07

Objective:

The objective of this project is to implement the different classification models on the given datasets and visualize the results.

Approaches/Methods:

Both supervised and unsupervised methods are used in this project.

Workflow:

In this project, we examine a dataset that contains 492 frauds out of 284,807 transactions. Our objective is to fit the dataset to our machine learning models (Random Forest classifier, Logistic Regression, Decision Tree, and Naive Bayes) so as to predict correctly while handling the highly imbalanced dataset. Our next step is to deal with the imbalance problem: we use the over-sampling technique SMOTE to resample the dataset so that the numbers of fraudulent and normal transactions are even. The final step is to compare the machine learning methods, and we found that Logistic Regression returned the highest AUC score.
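A minimal sketch of this pipeline, assuming the standard creditcard.csv layout with a binary Class column (1 = fraud) and the imbalanced-learn implementation of SMOTE; the file name and column name are assumptions rather than details taken from the project code:

```python
# Hedged sketch: resample with SMOTE, then score a model by ROC-AUC.
# Assumes creditcard.csv with a binary "Class" column (1 = fraud).
import pandas as pd
from imblearn.over_sampling import SMOTE          # pip install imbalanced-learn
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

df = pd.read_csv("creditcard.csv")
X, y = df.drop(columns=["Class"]), df["Class"]

# Hold out a test set first so SMOTE only sees the training data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Over-sample the minority (fraud) class until both classes are even.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

model = LogisticRegression(max_iter=1000).fit(X_res, y_res)
probs = model.predict_proba(X_test)[:, 1]
print("Logistic Regression AUC:", roc_auc_score(y_test, probs))
```

Resampling only the training split keeps the test set representative of the real class imbalance, so the AUC comparison between models stays meaningful.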

TASK 1:

CODE:

  • Input code: Imported all the libraries and loaded the dataset using pandas. A fraction of the data is used to reduce computation time; the dataset is reshaped and plotted, and the numbers of fraud and non-fraud transactions are printed. A heatmap is plotted from the correlation matrix, and then the accuracy and the classification report are printed.
  • Output:

Visualising the number of transactions for each class.

  • The imbalance would hurt the precision of classifying the minority class, since it contains very little data. We therefore decided to handle the imbalanced data by resampling the data set. Handling the imbalance involved randomly dropping instances from the majority class and shuffling the two classes together to obtain a new data set to train on. The result came out with very positive scores. A minimal sketch of this random under-sampling is shown below.
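The sketch below illustrates the random under-sampling just described, under the same assumptions about the creditcard.csv layout as above:

```python
# Hedged sketch: randomly drop majority-class rows so both classes are even,
# then shuffle the combined frame. Assumes a binary "Class" column (1 = fraud).
import pandas as pd

df = pd.read_csv("creditcard.csv")
fraud = df[df["Class"] == 1]
normal = df[df["Class"] == 0].sample(n=len(fraud), random_state=42)

balanced = pd.concat([fraud, normal]).sample(frac=1, random_state=42)  # shuffle
print(balanced["Class"].value_counts())
```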

Task 2

Applying K-means and clustering on the data set

Data Set

Description:

The data consists of information such as Customer ID, Annual Income, Spending Score, and Gender. Given these attributes, we apply clustering mainly to Annual Income and Spending Score.

Workflow:

First we clean the data set, removing any nulls if present. We then focus on the required columns and use the elbow method to determine the number of clusters from those attributes. The data is fitted, the resulting 5 clusters are plotted, and the silhouette score is calculated.
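A minimal sketch of this workflow, assuming the common Mall_Customers.csv file with 'Annual Income (k$)' and 'Spending Score (1-100)' columns; the file and column names are assumptions:

```python
# Hedged sketch: elbow method, KMeans with 5 clusters, silhouette score.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

df = pd.read_csv("Mall_Customers.csv").dropna()
X = df[["Annual Income (k$)", "Spending Score (1-100)"]]

# Elbow method: plot inertia for k = 1..10 and look for the bend.
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
            for k in range(1, 11)]
plt.plot(range(1, 11), inertias, marker="o")
plt.xlabel("k")
plt.ylabel("inertia")
plt.show()

# Fit the chosen 5 clusters and evaluate them.
km = KMeans(n_clusters=5, n_init=10, random_state=42)
labels = km.fit_predict(X)
print("silhouette score:", silhouette_score(X, labels))

plt.scatter(X.iloc[:, 0], X.iloc[:, 1], c=labels)
plt.xlabel("Annual Income (k$)")
plt.ylabel("Spending Score (1-100)")
plt.show()
```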

Objective

  • The elbow method shows that the change in slope occurs at 5 clusters.

  • The silhouette score is about 0.40, and the data did not need further cleaning after checking. A silhouette score of 0.40 indicates that the individual clusters are neither too close together nor very dense: a score near 1 would indicate dense, well-separated clusters, while a score near 0 would indicate overlapping, spread-out clusters, so a value around 0.5 is a reasonable target for this clustering.

  • The clustering results show that the data was split well between the 5 clustered groups. We have clustered the data set accordingly and displayed the results.

Code

Output

Task 3

We predicted the Temperature values using the remaining features, such as Humidity and Apparent Temperature, by implementing a Linear Regression model.

Here we first imported all the required libraries and explored the data; next we checked for null values and replaced them with 'rain'.

image

Next we dropped Summary, Daily Summary, Formatted Date, and Loud Cover from the data (columns which are less relevant).

image
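A minimal sketch of these cleaning steps, assuming the Kaggle weatherHistory.csv column names ('Precip Type', 'Loud Cover', and so on); the file name and exact column names are assumptions:

```python
# Hedged sketch: fill nulls with "rain" and drop the less relevant columns.
import pandas as pd

weather = pd.read_csv("weatherHistory.csv")
print(weather.isnull().sum())                        # inspect null counts

weather["Precip Type"] = weather["Precip Type"].fillna("rain")
weather = weather.drop(columns=["Summary", "Daily Summary",
                                "Formatted Date", "Loud Cover"])
print(weather.head())
```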

Building a Linear Regression Model

image

Model Implementation

image

image
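A minimal sketch of building and evaluating the model on the remaining features; the target column name 'Temperature (C)' and the preprocessing repeated here are assumptions based on the dataset's usual layout:

```python
# Hedged sketch: fit a LinearRegression model to predict temperature.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

weather = pd.read_csv("weatherHistory.csv")
weather["Precip Type"] = weather["Precip Type"].fillna("rain")
weather = weather.drop(columns=["Summary", "Daily Summary",
                                "Formatted Date", "Loud Cover"])

y = weather["Temperature (C)"]                        # target (assumed name)
X = pd.get_dummies(weather.drop(columns=["Temperature (C)"]),
                   drop_first=True)                   # encode Precip Type

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

reg = LinearRegression().fit(X_train, y_train)
pred = reg.predict(X_test)
print("R^2:", r2_score(y_test, pred))
print("MSE:", mean_squared_error(y_test, pred))
```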

Output:

image

image

image

image

image

Task 4

The objective of the fourth task is to clean the given dataset and evaluate the model by applying different techniques (TF-IDF and CountVectorizer), then determine the best algorithm by analyzing the results.

Code explanation: Imported the required libraries, imported the given spam.csv and encoded it, then dropped the unnecessary columns and described the data as shown in the screen capture below.

image
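A minimal sketch of this loading and cleanup step, assuming the common spam.csv layout (latin-1 encoding, columns v1/v2 plus unnamed empty columns); these details are assumptions:

```python
# Hedged sketch: load spam.csv, drop the empty extra columns, rename, describe.
import pandas as pd

spam_data = pd.read_csv("spam.csv", encoding="latin-1")
spam_data = spam_data.drop(columns=["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"])
spam_data = spam_data.rename(columns={"v1": "Class", "v2": "Text"})
print(spam_data.describe())
```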

Initialize the count vectorizer and fit the data, then store the result in a variable word_count_vector.

image

Using the TfidfTransformer, we obtained the most frequently used words in the document; the idf_weights give the lowest value for the most frequently used words.

image
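A minimal sketch of fitting the count vectorizer and inspecting the idf weights (the lowest weight corresponds to the most frequent word); the file layout and column names continue the assumptions above:

```python
# Hedged sketch: fit CountVectorizer, then TfidfTransformer, and list idf weights.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

spam_data = (pd.read_csv("spam.csv", encoding="latin-1")
               .rename(columns={"v1": "Class", "v2": "Text"}))

cv = CountVectorizer()
word_count_vector = cv.fit_transform(spam_data.Text)

tfidf = TfidfTransformer()
tfidf.fit(word_count_vector)

# Lower idf weight -> word appears in more documents (more frequent).
idf_weights = pd.DataFrame(tfidf.idf_, index=cv.get_feature_names_out(),
                           columns=["idf_weight"])
print(idf_weights.sort_values("idf_weight").head(10))
```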

Transform spam_data.Text into a count vector. A variable 'h' is declared to loop through the document and get the frequency of the most important words; the values are sorted by tf-idf, and the loop is restricted to 3 iterations since there are a lot of rows in the document.

image

Finally, trained and tested the data: split it into Text and Class, applied the three components CountVectorizer, TfidfTransformer, and MultinomialNB, fit the data to the model, predicted by passing x_test, and printed the classification report for the document with the precision values for each class.

image
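A minimal sketch of this training and evaluation step, using a scikit-learn Pipeline to chain the three components; the split ratio and column names are assumptions:

```python
# Hedged sketch: split the data, chain CountVectorizer -> TfidfTransformer ->
# MultinomialNB, predict on x_test, and print the classification report.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

spam_data = (pd.read_csv("spam.csv", encoding="latin-1")
               .rename(columns={"v1": "Class", "v2": "Text"}))

x_train, x_test, y_train, y_test = train_test_split(
    spam_data.Text, spam_data.Class, test_size=0.2, random_state=42)

model = Pipeline([("count", CountVectorizer()),
                  ("tfidf", TfidfTransformer()),
                  ("nb", MultinomialNB())])
model.fit(x_train, y_train)

y_pred = model.predict(x_test)
print(classification_report(y_test, y_pred))
```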

After analyzing the results, CountVectorizer gives better results than the TF-IDF vectorizer, and MultinomialNB performed best.

Task 5

Here we use the 'Telco-Customer-Churn.csv' file for a classification problem that includes both numeric and non-numeric features; we perform exploratory analysis with 3 classifier algorithms and determine the best classifier. First we import all the necessary packages.

image

Now we load the dataset and display the necessary data in tabular form.

image

We check the shape of the data and print the data types of all the columns.

image

We print the sum of null values in each column.

image

Now we print the columns whose data type is object.

image

Here we print the value counts of the column along with each value's percentage of the total.

image
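A minimal sketch of these exploration steps (shape, dtypes, null counts, object columns, and value counts with percentages); the file and column names are assumptions based on the usual Telco dataset:

```python
# Hedged sketch: load the Telco data and run the basic checks described above.
import pandas as pd

telco = pd.read_csv("Telco-Customer-Churn.csv")
print(telco.shape)                                    # shape of the data
print(telco.dtypes)                                   # column data types
print(telco.isnull().sum())                           # null counts per column
print(telco.select_dtypes(include="object").head())   # object-typed columns
print(telco["Churn"].value_counts(normalize=True) * 100)  # counts as percentages
```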

We are assigning the true value to the dataset.

image

Now we visualize the existing data by churn value, depicting the churned and remaining customers in a pie chart.

image

image
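A minimal sketch of this pie chart, assuming the Churn column holds Yes/No values:

```python
# Hedged sketch: pie chart of churners vs. remaining customers.
import pandas as pd
import matplotlib.pyplot as plt

telco = pd.read_csv("Telco-Customer-Churn.csv")
telco["Churn"].value_counts().plot.pie(autopct="%1.1f%%")
plt.ylabel("")
plt.show()
```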

Now we compute the correlations among monthly charges, total charges, and tenure, and display the values as a colored grid.

image

image

image
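A minimal sketch of this correlation grid; using seaborn's heatmap here is an assumption about how the colored grid was drawn, as are the exact column names:

```python
# Hedged sketch: correlation grid for tenure, MonthlyCharges, and TotalCharges.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

telco = pd.read_csv("Telco-Customer-Churn.csv")
telco["TotalCharges"] = pd.to_numeric(telco["TotalCharges"], errors="coerce")

corr = telco[["tenure", "MonthlyCharges", "TotalCharges"]].corr()
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()
```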

Now we analyze the total charges column, replacing its values in place, and display them in table form.

image

image
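The exact replacement shown in the screenshot is not recoverable from the text; a common equivalent step on this dataset, sketched here purely as an assumption, coerces TotalCharges to numeric and encodes the Churn label as 1/0 in place:

```python
# Hedged sketch (assumed equivalent of the step above): coerce TotalCharges to
# numeric and encode the Churn label as 1/0 in place.
import pandas as pd

telco = pd.read_csv("Telco-Customer-Churn.csv")
telco["TotalCharges"] = pd.to_numeric(telco["TotalCharges"], errors="coerce")
telco["Churn"] = telco["Churn"].map({"Yes": 1, "No": 0})
print(telco[["TotalCharges", "Churn"]].head())
```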

Now we perform analysis on different attributes and compute the data correlations.

image

Now we drop the unneeded columns from the telecom dataset.

image

Now we normalize the data and perform the analysis using the gender and other columns.

image

Now we analyze the different methods through a feature-selection process.

image

Now we split the data by these values, using the columns and their rankings.

image

Now we split the data into train and test sets using these rankings.

image
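A minimal sketch covering the normalization, encoding, and train/test split described above; the scaled columns, the split ratio, and the column names follow the usual Telco layout and are assumptions (the ranking-based feature selection shown in the screenshots is not recoverable from the text, so it is omitted here):

```python
# Hedged sketch: encode the categorical columns, scale the numeric ones, and
# split into train and test sets.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

telco = pd.read_csv("Telco-Customer-Churn.csv")
telco["TotalCharges"] = pd.to_numeric(telco["TotalCharges"], errors="coerce")
telco = telco.dropna().drop(columns=["customerID"])

y = telco["Churn"].map({"Yes": 1, "No": 0})
X = pd.get_dummies(telco.drop(columns=["Churn"]), drop_first=True)
X[["tenure", "MonthlyCharges", "TotalCharges"]] = MinMaxScaler().fit_transform(
    X[["tenure", "MonthlyCharges", "TotalCharges"]])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
print(X_train.shape, X_test.shape)
```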

Now we depict the data in a pie chart of churners and remainers.

image

Now we run the KNN classification algorithm, plotting the score over a range of k values as a slightly increasing graph.

image

image

image
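A minimal sketch of the KNN score over a range of k values, continuing from the train/test split sketch above; the range of k is an assumption:

```python
# Hedged sketch: KNN test accuracy over a range of k values.
# Uses X_train, X_test, y_train, y_test from the split sketch above.
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

k_values = range(1, 21)
scores = [KNeighborsClassifier(n_neighbors=k)
          .fit(X_train, y_train)
          .score(X_test, y_test)
          for k in k_values]

plt.plot(k_values, scores, marker="o")
plt.xlabel("k (number of neighbours)")
plt.ylabel("test accuracy")
plt.show()
```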

Now the accuracy is computed and shown in 2 blocks with the accuracy values.

image

Now we run the SVM algorithm with a grid search, fitting it on the training data.

image
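A minimal sketch of the SVM grid search, continuing from the train/test split sketch above; the parameter grid is an assumption:

```python
# Hedged sketch: SVM with a grid search over C and kernel.
# Uses X_train, X_test, y_train, y_test from the split sketch above.
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(X_train, y_train)

print("best params:", grid.best_params_)
print("test accuracy:", grid.score(X_test, y_test))
```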

Now the accuracy of the SVM is computed and the values are shown.

image

image

image

image

Now the precision and support values are displayed.

image

image

image