Model Information - SubhaNAG2001/sentiment-analysis GitHub Wiki

Model Information

This page provides detailed information about the machine learning model used for sentiment analysis in this application.

Overview

The sentiment analysis model is trained to classify text as Positive, Negative, or Neutral. It uses a combination of text preprocessing, feature extraction, and machine learning classification.

Model Architecture

The model follows a standard NLP pipeline:

  1. Text Preprocessing: Cleaning and normalizing text
  2. Feature Extraction: Converting text to numerical features
  3. Classification: Predicting sentiment based on features
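These three stages map naturally onto a scikit-learn Pipeline. The sketch below is illustrative only — the toy texts and labels are invented for the example, and the actual training code lives in the model.ipynb notebook:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy labeled data, invented for illustration
texts = ["i love this phone", "great experience",
         "i hate this phone", "terrible experience"]
labels = ["Positive", "Positive", "Negative", "Negative"]

# Preprocessing happens before this point (see clean_text() in app.py);
# the pipeline then chains feature extraction and classification
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])
pipeline.fit(texts, labels)

prediction = pipeline.predict(["what a great phone"])[0]
```

Chaining the vectorizer and classifier in one estimator ensures the exact same feature extraction is applied at training and prediction time.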

Components

Text Preprocessing

The preprocessing pipeline includes:

  • Converting text to lowercase
  • Removing Twitter-specific elements (mentions, hashtags, RT)
  • Removing URLs
  • Removing punctuation
  • Removing stopwords (common words like "the", "and", etc.)
  • Normalizing whitespace

The implementation can be found in the clean_text() function in app.py.
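The steps above can be sketched as follows. This is a minimal stand-in, not the actual clean_text() from app.py — the stopword list here is abbreviated for illustration, and the real function may differ in detail:

```python
import re
import string

# Abbreviated stopword list for illustration; a full list (e.g. NLTK's)
# would normally be used
STOPWORDS = {"the", "and", "a", "i", "is", "in", "it", "of", "to"}

def clean_text(text: str) -> str:
    """Sketch of the preprocessing pipeline described above."""
    text = text.lower()                                  # lowercase
    text = re.sub(r"\brt\b", " ", text)                  # retweet marker
    text = re.sub(r"@\w+|#\w+", " ", text)               # mentions, hashtags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # URLs
    text = text.translate(
        str.maketrans("", "", string.punctuation))       # punctuation
    words = [w for w in text.split() if w not in STOPWORDS]  # stopwords
    return " ".join(words)                               # normalize whitespace

cleaned = clean_text("RT @user: I LOVE the new phone!! https://t.co/xyz #tech")
# cleaned == "love new phone"
```

Note the ordering: URLs are stripped before punctuation removal, since deleting punctuation first would break the URL pattern apart.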

Feature Extraction

The model uses Term Frequency-Inverse Document Frequency (TF-IDF) vectorization to convert text into numerical features. This technique:

  • Captures the importance of words in a document relative to a corpus
  • Reduces the impact of common words
  • Creates a sparse matrix representation of the text

The TF-IDF vectorizer is trained on the training dataset and saved as tfidf_vectorizer.pkl.

Classification Algorithm

The model uses Logistic Regression for classification. This algorithm was chosen for its:

  • Good performance on text classification tasks
  • Ability to provide probability estimates (confidence scores)
  • Interpretability
  • Efficiency in both training and prediction

The trained model is saved as sentiment_model.pkl.
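A sketch of how the classifier is trained and how confidence scores fall out of it (the toy data is invented for illustration; the actual training happens in model.ipynb):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy three-class data, invented for illustration
texts = ["love it", "best app ever", "hate it",
         "worst app ever", "it is okay", "just fine"]
labels = ["Positive", "Positive", "Negative",
          "Negative", "Neutral", "Neutral"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

model = LogisticRegression(max_iter=1000)
model.fit(X, labels)

# predict_proba returns one probability per class; the app can report the
# highest probability as the confidence score
probs = model.predict_proba(vectorizer.transform(["i love this app"]))[0]
confidence = probs.max()
```

This ability to expose calibrated-looking per-class probabilities, rather than just a hard label, is the "confidence scores" advantage listed above.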

Training Data

The model was trained on a dataset of tweets with labeled sentiments. The dataset includes a diverse range of topics and expressions to ensure the model generalizes well to different types of text.

Performance Metrics

The model was evaluated on a held-out test set with the following metrics:

  • Accuracy: the fraction of predictions that match the true sentiment
  • Precision: the ratio of true positive predictions to all positive predictions for a class
  • Recall: the ratio of true positive predictions to all actual instances of a class
  • F1 Score: the harmonic mean of precision and recall

Detailed performance metrics can be found in the model.ipynb notebook.
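The metrics above can be computed with scikit-learn. The labels below are hypothetical, purely to show the calls; note that for a three-class model, precision, recall, and F1 must be averaged across classes (macro averaging shown here):

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

# Hypothetical true vs. predicted labels, for illustration only
y_true = ["Positive", "Negative", "Neutral", "Positive", "Negative"]
y_pred = ["Positive", "Negative", "Neutral", "Negative", "Negative"]

accuracy = accuracy_score(y_true, y_pred)
# Macro averaging computes each metric per class, then takes the mean
precision = precision_score(y_true, y_pred, average="macro")
recall = recall_score(y_true, y_pred, average="macro")
f1 = f1_score(y_true, y_pred, average="macro")
```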

Limitations

The model has some limitations to be aware of:

  • It may not perform well on domain-specific text that wasn't represented in the training data
  • It may struggle with sarcasm, irony, and other complex language features
  • It's trained primarily on English text and may not work well with other languages

Model Training

The complete model training process is documented in the model.ipynb Jupyter notebook, which includes:

  • Data loading and exploration
  • Text preprocessing
  • Feature extraction
  • Model training and tuning
  • Evaluation
  • Model serialization

Retraining the Model

If you want to retrain the model with your own data or different parameters:

  1. Modify the model.ipynb notebook as needed
  2. Run all cells to train and evaluate the model
  3. The notebook will save the new model and vectorizer files
  4. Replace the existing .pkl files with the newly generated ones
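After step 4, it is worth confirming that the replaced files load and predict end-to-end. The sketch below trains a stand-in model first so it runs on its own; it assumes the artifacts are serialized with joblib (adjust if the notebook uses pickle directly):

```python
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Stand-in training so this sketch is self-contained; in practice the
# notebook writes these two files during retraining
texts = ["love it", "hate it", "it is okay"]
labels = ["Positive", "Negative", "Neutral"]
vec = TfidfVectorizer().fit(texts)
clf = LogisticRegression(max_iter=1000).fit(vec.transform(texts), labels)
joblib.dump(vec, "tfidf_vectorizer.pkl")
joblib.dump(clf, "sentiment_model.pkl")

# Sanity check: load the files the way the app would, then predict
vectorizer = joblib.load("tfidf_vectorizer.pkl")
model = joblib.load("sentiment_model.pkl")
prediction = model.predict(vectorizer.transform(["I really enjoyed this"]))[0]
```

Both files must always be replaced together: a model predicts on the feature space of the vectorizer it was trained with, so mixing an old vectorizer with a new model (or vice versa) will produce wrong or failing predictions.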