Model Information - SubhaNAG2001/sentiment-analysis GitHub Wiki
Model Information
This page provides detailed information about the machine learning model used for sentiment analysis in this application.
Overview
The sentiment analysis model is trained to classify text as Positive, Negative, or Neutral. It uses a combination of text preprocessing, feature extraction, and machine learning classification.
Model Architecture
The model follows a standard NLP pipeline:
- Text Preprocessing: Cleaning and normalizing text
- Feature Extraction: Converting text to numerical features
- Classification: Predicting sentiment based on features
Components
Text Preprocessing
The preprocessing pipeline includes:
- Converting text to lowercase
- Removing Twitter-specific elements (mentions, hashtags, RT)
- Removing URLs
- Removing punctuation
- Removing stopwords (common words like "the", "and", etc.)
- Normalizing whitespace
The implementation can be found in the `clean_text()` function in `app.py`.
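The steps above can be sketched as follows. This is a minimal, illustrative version, not the actual `clean_text()` from `app.py`; the stopword list here is a small assumed subset, and the real function may differ in detail.

```python
import re

# Illustrative subset only -- the real pipeline likely uses a full stopword list
# (e.g. from NLTK).
STOPWORDS = {"the", "and", "a", "is", "to", "of", "in", "it", "this"}

def clean_text(text: str) -> str:
    text = text.lower()                                  # lowercase
    text = re.sub(r"\brt\b", " ", text)                  # remove retweet marker
    text = re.sub(r"[@#]\w+", " ", text)                 # remove mentions/hashtags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # remove URLs
    text = re.sub(r"[^a-z\s]", " ", text)                # remove punctuation/digits
    words = [w for w in text.split() if w not in STOPWORDS]
    return " ".join(words)                               # normalize whitespace

print(clean_text("RT @user Check this out! https://t.co/xyz #cool"))
# prints: check out
```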
Feature Extraction
The model uses Term Frequency-Inverse Document Frequency (TF-IDF) vectorization to convert text into numerical features. This technique:
- Captures the importance of words in a document relative to a corpus
- Reduces the impact of common words
- Creates a sparse matrix representation of the text
The TF-IDF vectorizer is trained on the training dataset and saved as `tfidf_vectorizer.pkl`.
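Assuming scikit-learn's `TfidfVectorizer` (a common choice for this kind of pipeline), the vectorization step looks like this on a toy corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "great product love it",
    "bad product hate it",
    "product arrived today",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)   # sparse matrix: one row per document

print(X.shape)                         # (3 documents, number of unique terms)
print(sorted(vectorizer.vocabulary_))  # the learned vocabulary
```

Note that `fit_transform()` is called only on the training corpus; at prediction time the saved vectorizer's `transform()` must be used so that features align with what the classifier was trained on.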
Classification Algorithm
The model uses Logistic Regression for classification. This algorithm was chosen for its:
- Good performance on text classification tasks
- Ability to provide probability estimates (confidence scores)
- Interpretability
- Efficiency in both training and prediction
The trained model is saved as `sentiment_model.pkl`.
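A minimal sketch of the training step, assuming scikit-learn's `LogisticRegression` with toy data standing in for the real tweet dataset. The probability estimates mentioned above come from `predict_proba()`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data -- illustrative only, not the repo's actual training set.
texts = ["love this so much", "worst thing ever", "it arrived on tuesday",
         "really happy with this", "absolutely awful service", "package came today"]
labels = ["Positive", "Negative", "Neutral",
          "Positive", "Negative", "Neutral"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
model = LogisticRegression(max_iter=1000).fit(X, labels)

# predict_proba() yields one probability per class -- the confidence scores.
probs = model.predict_proba(vectorizer.transform(["really love this"]))[0]
print(dict(zip(model.classes_, probs.round(3))))
```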
Training Data
The model was trained on a dataset of tweets with labeled sentiments. The dataset includes a diverse range of topics and expressions to ensure the model generalizes well to different types of text.
Performance Metrics
The model was evaluated on a held-out test set with the following metrics:
- Accuracy: How often the model predicts the correct sentiment
- Precision: The ratio of true positive predictions to all positive predictions
- Recall: The ratio of true positive predictions to all actual positives
- F1 Score: The harmonic mean of precision and recall
Detailed performance metrics can be found in the `model.ipynb` notebook.
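The four metrics above can be computed with scikit-learn's metrics module (assumed here; `y_true` and `y_pred` below are toy values, not the repo's actual evaluation results). For a three-class problem, precision, recall, and F1 need an averaging strategy such as `"macro"`:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Toy labels for illustration only.
y_true = ["Positive", "Negative", "Neutral", "Positive", "Negative", "Neutral"]
y_pred = ["Positive", "Negative", "Positive", "Positive", "Neutral", "Neutral"]

acc  = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred, average="macro", zero_division=0)
rec  = recall_score(y_true, y_pred, average="macro", zero_division=0)
f1   = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")
```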
Limitations
The model has some limitations to be aware of:
- It may not perform well on domain-specific text that wasn't represented in the training data
- It may struggle with sarcasm, irony, and other complex language features
- It's trained primarily on English text and may not work well with other languages
Model Training
The complete model training process is documented in the `model.ipynb` Jupyter notebook, which includes:
- Data loading and exploration
- Text preprocessing
- Feature extraction
- Model training and tuning
- Evaluation
- Model serialization
Retraining the Model
If you want to retrain the model with your own data or different parameters:
- Modify the `model.ipynb` notebook as needed
- Run all cells to train and evaluate the model
- The notebook will save the new model and vectorizer files
- Replace the existing `.pkl` files with the newly generated ones
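A sketch of the save/reload round trip, assuming the `.pkl` files are written with the standard `pickle` module (the toy training here stands in for running `model.ipynb`):

```python
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy retraining -- stand-in for the real notebook run.
texts = ["love it", "hate it", "it shipped today",
         "great stuff", "awful stuff", "arrived monday"]
labels = ["Positive", "Negative", "Neutral", "Positive", "Negative", "Neutral"]
vec = TfidfVectorizer().fit(texts)
clf = LogisticRegression(max_iter=1000).fit(vec.transform(texts), labels)

# Save under the filenames the app expects.
with open("tfidf_vectorizer.pkl", "wb") as f:
    pickle.dump(vec, f)
with open("sentiment_model.pkl", "wb") as f:
    pickle.dump(clf, f)

# Reload and sanity-check, as app.py would at startup.
with open("tfidf_vectorizer.pkl", "rb") as f:
    vectorizer = pickle.load(f)
with open("sentiment_model.pkl", "rb") as f:
    model = pickle.load(f)

pred = model.predict(vectorizer.transform(["love it"]))[0]
print(pred)
```

Because the vectorizer and classifier are pickled separately, always replace both files together: a model predicted against features from a mismatched vectorizer will silently give wrong results.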