ICP_10 - Girees737/KDM_Projects GitHub Wiki

Name : Gireesh Kumar Muppalla

Email : [email protected]

Software Required

Python, Colab, Jupyter Notebook

Summary

In this lesson, I learned how to preprocess text with the pandas library for sentiment analysis, how to apply sampling techniques when the class distribution is imbalanced, and how to extract features using the count vectorizer and inverse document frequency (IDF) from the sklearn library. I also learned how to leverage deep learning to build classification models on text data.

Implementation:

Task-1 : Run the given source code and explain it

Loaded the required libraries and downloaded the data

Read the data downloaded in the previous step and dropped the unnecessary columns.

Calculated the polarity (sentiment) based on the rating column.
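A minimal sketch of how the polarity could be derived from the rating. The thresholds (4 and 5 stars as positive, 3 as neutral, 1 and 2 as negative), the column names, and the sample rows are assumptions for illustration and may differ from the given source code.

```python
import pandas as pd


def rating_to_polarity(rating: int) -> str:
    """Map a star rating to a sentiment label (thresholds are assumed)."""
    if rating >= 4:
        return "positive"
    if rating == 3:
        return "neutral"
    return "negative"


# Hypothetical sample rows; the real dataset is much larger.
reviews = pd.DataFrame({
    "text": ["Great product", "It is okay", "Broke in a day"],
    "rating": [5, 3, 1],
})
reviews["sentiment"] = reviews["rating"].apply(rating_to_polarity)
print(reviews)
```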

Plotted the count distribution of each sentiment using the seaborn library.

Because there are more positive samples than the other classes, kept only the top 8000 positive samples to reduce the class imbalance.

Oversampled the minority classes (neutral and negative) to 8000 samples each, with replacement.
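The two balancing steps above can be sketched with pandas as follows. The function name `balance_classes`, the column names, and the tiny demo data are hypothetical; the demo uses a target of 5 instead of 8000 so that it runs instantly.

```python
import pandas as pd


def balance_classes(df, label_col="sentiment", majority="positive", n=8000, seed=42):
    """Keep the first n majority rows and oversample every other class to n rows."""
    parts = []
    for label, group in df.groupby(label_col):
        if label == majority:
            # Downsample the majority class by keeping the top n rows.
            parts.append(group.head(n))
        else:
            # Oversample minority classes by drawing n rows with replacement.
            parts.append(group.sample(n=n, replace=True, random_state=seed))
    # Shuffle so the classes are not grouped together.
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)


# Hypothetical imbalanced data: 10 positive, 3 neutral, 2 negative.
demo = pd.DataFrame({
    "sentiment": ["positive"] * 10 + ["neutral"] * 3 + ["negative"] * 2,
    "text": [f"review {i}" for i in range(15)],
})
balanced = balance_classes(demo, n=5)
print(balanced["sentiment"].value_counts())
```

Note that sampling with replacement produces exact duplicate rows, which matters later when the data is split into train and test sets.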

Defined a method to clean the text data: it removes stopwords, converts the text to lower case, and removes unwanted strings. Applied it to the text column.
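A rough sketch of such a cleaning method is below. The short stopword list is only a stand-in (the given source code likely uses a fuller list, for example from NLTK), and treating everything except letters as an "unwanted string" is my assumption.

```python
import re

# Tiny illustrative stopword list; an assumption, not the list used in the notebook.
STOPWORDS = {"a", "an", "the", "is", "it", "this", "and", "to", "of", "i"}


def clean_text(text: str) -> str:
    """Lowercase the text, drop non-letter characters, and remove stopwords."""
    text = text.lower()
    # Replace digits, punctuation, and other symbols with spaces.
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = [token for token in text.split() if token not in STOPWORDS]
    return " ".join(tokens)


print(clean_text("This is the BEST phone, I love it!!!"))
```

With a pandas DataFrame, the method could be applied with something like `df["text"].apply(clean_text)`.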

One-hot encoded the output labels so that the data can be fitted to the neural network.
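One way to do the one-hot encoding is with `pd.get_dummies`, shown here on a few made-up labels. The given source code may use a different utility for this step.

```python
import pandas as pd

# Hypothetical labels for four reviews.
labels = pd.Series(["positive", "negative", "neutral", "positive"])

# One column per class; exactly one 1.0 in every row.
one_hot = pd.get_dummies(labels).astype("float32")
print(one_hot)
```

Each row now has one column per class, which matches a softmax output layer with three units.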

Split the text data into 70 percent train data and 30 percent test data.

Applied the count vectorizer and fitted the IDF transformer on the train data only, then transformed the test data with the objects fitted on the train data to prevent data leakage.
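The split and feature extraction steps can be sketched as below. The ten sample texts and labels are invented for illustration; the important part is that `fit_transform` is called only on the train data and `transform` only on the test data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split

# Hypothetical cleaned reviews and labels.
texts = [
    "great phone love", "bad battery", "okay screen", "love camera",
    "terrible service", "average sound", "great value", "poor quality",
    "fine product", "excellent build",
]
labels = ["positive", "negative", "neutral", "positive", "negative",
          "neutral", "positive", "negative", "neutral", "positive"]

# 70 percent train, 30 percent test.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=42
)

vectorizer = CountVectorizer()
tfidf = TfidfTransformer()

# Vocabulary and IDF weights are learned from the train data only.
X_train_tfidf = tfidf.fit_transform(vectorizer.fit_transform(X_train))
# The test data reuses the fitted objects, so nothing leaks from test to train.
X_test_tfidf = tfidf.transform(vectorizer.transform(X_test))

print(X_train_tfidf.shape, X_test_tfidf.shape)
```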

Defined a sequential deep neural network with one input layer, two hidden layers, and one output layer using the tensorflow library.
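A network of this shape could look roughly like the sketch below. The layer sizes (512 and 128), the activations, the optimizer, and the function name `build_model` are assumptions for illustration; only the overall structure (input, two hidden layers, output) and the dropout value of 0.60 come from this write-up.

```python
import tensorflow as tf


def build_model(n_features: int, n_classes: int = 3, dropout: float = 0.6) -> tf.keras.Model:
    """Input layer, two hidden layers with dropout, and a softmax output layer."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(n_features,)),
        tf.keras.layers.Dense(512, activation="relu"),   # hidden layer 1 (size assumed)
        tf.keras.layers.Dropout(dropout),
        tf.keras.layers.Dense(128, activation="relu"),   # hidden layer 2 (size assumed)
        tf.keras.layers.Dropout(dropout),
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(
        optimizer="adam",
        loss="categorical_crossentropy",  # matches one-hot encoded labels
        metrics=["accuracy"],
    )
    return model


# n_features would be the vocabulary size produced by the count vectorizer.
model = build_model(n_features=1000)
model.summary()
```

Keeping the dropout value and layer list as parameters makes it easier to try the variations in Task 2, such as a different dropout ratio or an extra hidden layer.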

Fitted the data to the deep neural network model with training parameters such as a batch size of 256 and 100 epochs, and evaluated the model on the test data.

Task-2 : Change the parameters and observe the change in performance relative to the original model parameters.

a. Change in train test ratio.

I changed the train test ratio to 75:25 and observed a slight change in accuracy on the test data.

b. Change in number of layers in deep neural network.

Added a new hidden layer with 1400 neurons and did not see much difference: accuracy was 92%, which is close to the result with the original parameters.

Removed a hidden layer and likewise did not observe much difference: accuracy was 92%, which is close to the result with the original parameters.

c. Change in dropout ratio.

I changed the dropout ratio, which is a regularization parameter, from 0.60 to 0.35 and saw the accuracy decrease to 0.91.

d. Reduction in sample size of minority classes while oversampling.

Reduced the oversampling target for neutral and negative feedback to 4000 samples each (with replacement, using the pandas library), which is 4000 fewer than the majority class, and observed that accuracy fell to 0.85. A possible explanation is that oversampling with replacement creates duplicate rows. With the larger 8000-sample target, the model may have been overfitting to these repeated samples, which would make the earlier accuracy look better than it really was.
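To check how much duplication oversampling with replacement can introduce, here is a small experiment on made-up data (50 unique reviews oversampled to 400 rows; all numbers are hypothetical). It measures what fraction of test rows also appear in the train set when the split is done after oversampling. This is only one possible reading of the accuracy drop, not a confirmed cause.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical minority class with 50 unique reviews.
minority = pd.DataFrame({"text": [f"review {i}" for i in range(50)]})

# Oversample with replacement, as done before the train test split.
oversampled = minority.sample(n=400, replace=True, random_state=0)

train, test = train_test_split(oversampled, test_size=0.3, random_state=0)

# Fraction of test rows whose exact text is also present in the train set.
leaked = test["text"].isin(set(train["text"])).mean()
print(f"{leaked:.0%} of test rows also appear in the training set")
```

If most test rows are copies of train rows, the test accuracy partly measures memorization, so a smaller oversampling target can lower the reported accuracy without the model actually getting worse on new reviews.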