PROJECT 1
TEAM - 5
INTRODUCTION:
Python group project with the following team:
- Bindu Gadiparthi -49
- Harinireddy Anumandla -03
- Rajeshwari Sai Aishwarya Puppala -35
- Koushik Reddy Sama -42
- Sai Prasad Raju -07
Objective:
The objective of this project is to implement different classification and clustering models on the given datasets and visualize the results.
Approaches/Methods:
Both supervised and unsupervised methods are used in the approaches of this project.
Workflow:
In this project, we examine a credit-card dataset which contains 492 frauds out of 284,807 transactions. Our objective is to fit the dataset to our machine learning models (Random Forest, Logistic Regression, Decision Tree, and Naive Bayes) so as to predict fraud correctly while dealing with the highly imbalanced dataset. Our next step is to handle the imbalance problem: we use the over-sampling technique SMOTE to resample the dataset so that the numbers of fraudulent and normal transactions become even. The last step is to compare the machine learning methods, and we found that Logistic Regression returned the highest AUC score. A hedged sketch of this workflow is shown below.
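The sketch below illustrates this workflow (it is not the project's exact code): it resamples the training split with SMOTE and compares the four classifiers by AUC. The file name creditcard.csv, the column name Class, the 70/30 split, and the hyperparameters are assumptions.

```python
# Illustrative sketch: SMOTE over-sampling on the training split, then AUC comparison.
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_auc_score

df = pd.read_csv("creditcard.csv")                      # file name assumed
X, y = df.drop("Class", axis=1), df["Class"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Over-sample only the training split so the test set keeps the real class ratio.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Naive Bayes": GaussianNB(),
}
for name, model in models.items():
    model.fit(X_res, y_res)
    scores = model.predict_proba(X_test)[:, 1]
    print(f"{name}: AUC = {roc_auc_score(y_test, scores):.3f}")
```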
TASK 1:
CODE:
- Input code: Imported all the libraries and loaded the dataset using pandas. To reduce computation time we considered a fraction of the data, reshaped the dataset, plotted it, and printed the number of fraud and non-fraud transactions. Using the correlation matrix we plotted the heatmap, then found the accuracy and printed the classification report.
- Output:
Visualising the number of transactions for each class.
- The imbalance would hurt the precision of classifying the minority class, since there was very little data in it. Hence, we decided to deal with the imbalanced data by resampling the dataset. Handling the imbalanced data involved randomly dropping a number of instances from the majority class and shuffling the two classes together to get a new dataset to train with. The result came out with very positive scores (see the sketch below).
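A minimal sketch of this step, assuming the creditcard.csv file, a 10% sample fraction, and a Random Forest classifier (none of which are confirmed by the wiki): it prints the class counts, plots the correlation heatmap, undersamples the majority class as described above, and prints the accuracy and classification report.

```python
# Illustrative sketch only: load, sample a fraction, undersample the majority class,
# plot a correlation heatmap, and print a classification report.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

df = pd.read_csv("creditcard.csv").sample(frac=0.1, random_state=1)  # fraction assumed
print(df["Class"].value_counts())            # number of fraud / non-fraud transactions

sns.heatmap(df.corr(), cmap="coolwarm")      # correlation-matrix heatmap
plt.show()

# Random undersampling: keep all frauds, sample an equal number of normal rows.
fraud = df[df["Class"] == 1]
normal = df[df["Class"] == 0].sample(n=len(fraud), random_state=1)
balanced = pd.concat([fraud, normal]).sample(frac=1, random_state=1)  # shuffle

X, y = balanced.drop("Class", axis=1), balanced["Class"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
clf = RandomForestClassifier(random_state=1).fit(X_train, y_train)
pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, pred))
print(classification_report(y_test, pred))
```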
Task 2
Applying K-means clustering on the data set
Data Set
Description:
The data consists of information such as Customer ID, Annual Income, Spending Score, and Gender. Given these fields, we apply clustering mainly on Annual Income and Spending Score.
Workflow:
At first we clean the data set, removing any nulls if present. Then we focus on the features we require; the number of clusters is obtained with the elbow method on those features. The data is trained, the 5 clusters are plotted, and the silhouette score is calculated.
Objective
- The elbow method shows that the change in slope occurs at 5 clusters.
- The silhouette score is about 0.40, and the data did not need further cleaning after checking. A silhouette score of 0.40 indicates that the individual clusters are neither too close together nor too dense: a score of 1 would be interpreted as dense, well-separated clusters and 0 as too spread out, so a value around 0.5 would be ideal for clustering.
- The clustering results show that the data was split well between the 5 clustered groups. We have clustered the data set accordingly and displayed it. A hedged sketch of this workflow follows below.
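The following sketch shows what the elbow method, the 5-cluster fit, and the silhouette score could look like; the file name Mall_Customers.csv and the column names Annual Income (k$) and Spending Score (1-100) are assumptions based on the description above.

```python
# Sketch of the Task 2 workflow: elbow curve, 5-cluster K-means fit, silhouette score.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

df = pd.read_csv("Mall_Customers.csv").dropna()          # file/column names assumed
X = df[["Annual Income (k$)", "Spending Score (1-100)"]]

# Elbow method: plot within-cluster sum of squares (inertia) for k = 1..10.
inertias = [KMeans(n_clusters=k, random_state=0, n_init=10).fit(X).inertia_
            for k in range(1, 11)]
plt.plot(range(1, 11), inertias, marker="o")
plt.xlabel("Number of clusters")
plt.ylabel("Inertia")
plt.show()

# Fit the chosen 5 clusters and report the silhouette score.
km = KMeans(n_clusters=5, random_state=0, n_init=10)
labels = km.fit_predict(X)
print("Silhouette score:", silhouette_score(X, labels))

plt.scatter(X.iloc[:, 0], X.iloc[:, 1], c=labels)
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], c="red", marker="x")
plt.show()
```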
Code
Output
Task 3
Predicted the Temperature values using the remaining features such as Humidity and Apparent Temperature; we implemented a Linear Regression model to predict Temperature.
Here we first imported all the required libraries and explored the data; next we checked for null values and replaced them with 'rain'.
Next we dropped Summary, Daily Summary, Formatted Date, and Loud Cover from the data (which are less relevant).
Building a Linear Regression Model
Model Implementation
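A minimal sketch of such a linear regression model, assuming the weatherHistory.csv file and its usual column names (for example Temperature (C) and Precip Type); the 80/20 split and the one-hot encoding of the remaining categorical column are also assumptions.

```python
# Illustrative sketch of the linear-regression step described above.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

df = pd.read_csv("weatherHistory.csv")                   # file name assumed
df["Precip Type"] = df["Precip Type"].fillna("rain")     # replace nulls with 'rain'
df = df.drop(columns=["Summary", "Daily Summary", "Formatted Date", "Loud Cover"])
df = pd.get_dummies(df, drop_first=True)                 # encode remaining categoricals

X = df.drop("Temperature (C)", axis=1)
y = df["Temperature (C)"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)
print("R^2:", r2_score(y_test, pred))
print("MSE:", mean_squared_error(y_test, pred))
```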
Output:
Task 4
The objective of the fourth task is to clean the given dataset and evaluate models by applying different techniques (TF-IDF and CountVectorizer), and then determine the best algorithm by analyzing the results.
Code explanation: Imported the required libraries, imported the given spam.csv and encoded it, then dropped the unnecessary columns. We then described the data, as shown in the screen capture below.
Initialize the count vectorizer, fit the data, and store the result in a variable word_count_vector.
Using the TfidfTransformer we obtained the most frequently used words in the document; idf_weights gives the lowest value for the most frequently used words.
Transform spam_data.Text into count_vector. We declared a variable 'h' to loop through the documents to get the frequency of the most important words, sorted the values by tf-idf, and restricted the loop to 3 iterations as there are a lot of rows in the document (see the sketch below).
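The sketch below illustrates these steps under assumed column names (the raw v1/v2 columns of spam.csv renamed to Class and Text); it is not the project's exact code.

```python
# Sketch of the TF-IDF exploration: fit CountVectorizer, inspect idf weights,
# and list the top tf-idf terms for the first 3 documents.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

spam_data = pd.read_csv("spam.csv", encoding="latin-1")[["v1", "v2"]]  # layout assumed
spam_data.columns = ["Class", "Text"]

cv = CountVectorizer()
word_count_vector = cv.fit_transform(spam_data.Text)      # raw term counts

tfidf = TfidfTransformer()
tfidf.fit(word_count_vector)

# idf weights: the most frequent words across the corpus get the lowest idf values.
idf_weights = pd.DataFrame(tfidf.idf_, index=cv.get_feature_names_out(),
                           columns=["idf_weight"]).sort_values("idf_weight")
print(idf_weights.head())

# For the first few documents, show the terms with the highest tf-idf scores.
count_vector = cv.transform(spam_data.Text)
tfidf_vectors = tfidf.transform(count_vector)
for h in range(3):                                        # restricted to 3 documents
    row = pd.DataFrame(tfidf_vectors[h].T.todense(),
                       index=cv.get_feature_names_out(), columns=["tfidf"])
    print(row.sort_values("tfidf", ascending=False).head())
```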
Finally we trained and tested the data: split it into Text and Class, applied the three components CountVectorizer, TfidfTransformer, and MultinomialNB, fit the data into the model, predicted by passing x_test, and printed the classification report of the document with the precision values for each class.
After analyzing the results, CountVectorizer gives better results compared to the TF-IDF vectorizer, and MultinomialNB performed best.
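A hedged sketch of this training step, chaining the three components in a scikit-learn Pipeline; the 80/20 split and random_state are assumptions.

```python
# Sketch: train/test split on Text/Class and a CountVectorizer -> TfidfTransformer ->
# MultinomialNB pipeline, then print the classification report.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

spam_data = pd.read_csv("spam.csv", encoding="latin-1")[["v1", "v2"]]  # layout assumed
spam_data.columns = ["Class", "Text"]

x_train, x_test, y_train, y_test = train_test_split(
    spam_data.Text, spam_data.Class, test_size=0.2, random_state=0)

model = Pipeline([
    ("count", CountVectorizer()),        # raw term counts
    ("tfidf", TfidfTransformer()),       # re-weight by inverse document frequency
    ("clf", MultinomialNB()),            # Naive Bayes classifier
])
model.fit(x_train, y_train)
pred = model.predict(x_test)
print(classification_report(y_test, pred))   # precision per class
```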
Task 5
Here we use the Telco-Customer-Churn.csv file for a classification problem including both numeric and non-numeric features; we perform exploratory analysis with 3 classifier algorithms and determine the best classifier. First we import all the necessary packages.
Now we load the dataset and print the data in a tabular form.
We print the shape of the dataset and the data types of all the columns.
We print the sum of null values in each column.
Now we print the columns whose data type is object.
Here we print the value counts of a column, with each value and its percentage of the total.
We assign the true (boolean) values to the dataset.
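A rough sketch of these loading and inspection steps; the file name WA_Fn-UseC_-Telco-Customer-Churn.csv (the usual Kaggle name) and the use of the Churn column for the value counts are assumptions.

```python
# Sketch of the loading/inspection steps: head, shape, dtypes, nulls, object columns,
# and value counts with percentages.
import pandas as pd

telecom = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")   # file name assumed
print(telecom.head())                            # first rows in tabular form
print(telecom.shape)                             # number of rows and columns
print(telecom.dtypes)                            # data type of each column
print(telecom.isnull().sum())                    # null count per column
print(telecom.select_dtypes("object").head())    # columns stored as object

# Value counts of the target with percentages of the total.
print(telecom["Churn"].value_counts())
print(telecom["Churn"].value_counts(normalize=True) * 100)
```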
Now we visualize the existing data by churn value, depicting churners and remainers in a pie chart.
Now we compute the correlation between MonthlyCharges, TotalCharges, and tenure and display the values in a grid with different colors (a heatmap).
Now we analyze TotalCharges, replacing values in place with 1, and display them in the form of table values.
Now we analyze different fields and compute the data correlation.
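The sketch below shows what the pie chart and the correlation heatmap could look like under the standard Telco column names; treating TotalCharges as numeric with pd.to_numeric is an assumption.

```python
# Sketch: churn pie chart and correlation heatmap for MonthlyCharges, TotalCharges, tenure.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

telecom = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")
telecom["TotalCharges"] = pd.to_numeric(telecom["TotalCharges"], errors="coerce")

# Pie chart of churners vs. remainers.
telecom["Churn"].value_counts().plot.pie(autopct="%1.1f%%")
plt.show()

# Correlation of the three numeric features, shown as a colored grid.
corr = telecom[["MonthlyCharges", "TotalCharges", "tenure"]].corr()
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()
```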
Now we drop columns from the telecom dataset.
Now we normalize the data and analyze it using gender and the other columns.
Now we select features using a feature-selection method.
Now we split the data using these column values and their ranking.
Now we split the data into train and test sets using this ranking (a sketch follows below).
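A hedged sketch of the preparation, ranking, and split steps; MinMaxScaler for the normalization and RFE with a logistic regression estimator for the feature ranking are assumptions made to illustrate the description, not the project's confirmed choices.

```python
# Sketch: drop columns, encode categoricals, normalize, rank features, train/test split.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

telecom = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")
telecom["TotalCharges"] = pd.to_numeric(telecom["TotalCharges"], errors="coerce")
telecom = telecom.drop(columns=["customerID"]).dropna()

# One-hot encode gender and the other categorical columns, map the target to 0/1.
y = telecom.pop("Churn").map({"No": 0, "Yes": 1})
X = pd.get_dummies(telecom, drop_first=True)

# Normalize all features to the [0, 1] range.
X = pd.DataFrame(MinMaxScaler().fit_transform(X), columns=X.columns)

# Rank the features with recursive feature elimination and keep the top ones.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10).fit(X, y)
print(pd.Series(rfe.ranking_, index=X.columns).sort_values())
X_selected = X.loc[:, rfe.support_]

X_train, X_test, y_train, y_test = train_test_split(X_selected, y,
                                                    test_size=0.3, random_state=0)
```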
Now we depict the data in a pie chart with churners and remainers.
Now we run the KNN classification algorithm and plot its score against a range of k values, which shows a slightly increasing curve.
Now the accuracy of the KNN model is computed and shown, together with the values.
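A sketch of this KNN step, assuming a k range of 1-20 and k = 11 for the final model; it reuses the X_train/X_test/y_train/y_test split from the previous sketch.

```python
# Sketch: KNN score across a range of k values, then accuracy of one chosen model.
# Assumes X_train, X_test, y_train, y_test from the preceding split sketch.
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

scores = []
for k in range(1, 21):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    scores.append(knn.score(X_test, y_test))

plt.plot(range(1, 21), scores)          # score vs. number of neighbours
plt.xlabel("k")
plt.ylabel("Test accuracy")
plt.show()

best_knn = KNeighborsClassifier(n_neighbors=11).fit(X_train, y_train)  # k assumed
print("KNN accuracy:", accuracy_score(y_test, best_knn.predict(X_test)))
```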
Now we train the SVM algorithm with a grid search on the training data.
Now the accuracy of the SVM is computed and the values are shown.
Now the precision and support values of the classification report are displayed.
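A hedged sketch of the SVM grid search and evaluation; the parameter grid is an assumption and the train/test split is reused from the earlier sketch.

```python
# Sketch: grid search over SVM hyperparameters, then accuracy and classification report.
# Assumes X_train, X_test, y_train, y_test from the preceding split sketch.
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, classification_report

param_grid = {"C": [0.1, 1, 10], "kernel": ["rbf", "linear"]}   # grid assumed
grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(X_train, y_train)

pred = grid.predict(X_test)
print("Best parameters:", grid.best_params_)
print("SVM accuracy:", accuracy_score(y_test, pred))
print(classification_report(y_test, pred))   # precision, recall, f1, support per class
```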