Model Information - SubhaNAG2001/sentiment-analysis GitHub Wiki

Model Information

This page provides detailed information about the machine learning model used for sentiment analysis in this application.

Overview

The sentiment analysis model is trained to classify text as Positive, Negative, or Neutral. It uses a combination of text preprocessing, feature extraction, and machine learning classification.

Model Architecture

The model follows a standard NLP pipeline:

  1. Text Preprocessing: Cleaning and normalizing text
  2. Feature Extraction: Converting text to numerical features
  3. Classification: Predicting sentiment based on features
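These three stages map naturally onto a scikit-learn Pipeline. The sketch below is illustrative only — the toy texts and labels are invented for the example, and the actual training code lives in the model.ipynb notebook:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy labeled data, invented for illustration
texts = ["i love this phone", "great experience",
         "i hate this phone", "terrible experience"]
labels = ["Positive", "Positive", "Negative", "Negative"]

# Preprocessing happens before this point (see clean_text() in app.py);
# the pipeline then chains feature extraction and classification
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])
pipeline.fit(texts, labels)

prediction = pipeline.predict(["what a great phone"])[0]
```

Chaining the vectorizer and classifier in one estimator ensures the exact same feature extraction is applied at training and prediction time.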

Components

Text Preprocessing

The preprocessing pipeline includes:

  • Converting text to lowercase
  • Removing Twitter-specific elements (mentions, hashtags, RT)
  • Removing URLs
  • Removing punctuation
  • Removing stopwords (common words like "the", "and", etc.)
  • Normalizing whitespace

The implementation can be found in the clean_text() function in app.py.
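The steps above can be sketched as follows. This is a minimal stand-in, not the actual clean_text() from app.py — the stopword list here is abbreviated for illustration, and the real function may differ in detail:

```python
import re
import string

# Abbreviated stopword list for illustration; a full list (e.g. NLTK's)
# would normally be used
STOPWORDS = {"the", "and", "a", "i", "is", "in", "it", "of", "to"}

def clean_text(text: str) -> str:
    """Sketch of the preprocessing pipeline described above."""
    text = text.lower()                                  # lowercase
    text = re.sub(r"\brt\b", " ", text)                  # retweet marker
    text = re.sub(r"@\w+|#\w+", " ", text)               # mentions, hashtags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # URLs
    text = text.translate(
        str.maketrans("", "", string.punctuation))       # punctuation
    words = [w for w in text.split() if w not in STOPWORDS]  # stopwords
    return " ".join(words)                               # normalize whitespace

cleaned = clean_text("RT @user: I LOVE the new phone!! https://t.co/xyz #tech")
# cleaned == "love new phone"
```

Note the ordering: URLs are stripped before punctuation removal, since deleting punctuation first would break the URL pattern apart.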

Feature Extraction

The model uses Term Frequency-Inverse Document Frequency (TF-IDF) vectorization to convert text into numerical features. This technique:

  • Captures the importance of words in a document relative to a corpus
  • Reduces the impact of common words
  • Creates a sparse matrix representation of the text

The TF-IDF vectorizer is trained on the training dataset and saved as tfidf_vectorizer.pkl.

Classification Algorithm

The model uses Logistic Regression for classification. This algorithm was chosen for its:

  • Good performance on text classification tasks
  • Ability to provide probability estimates (confidence scores)
  • Interpretability
  • Efficiency in both training and prediction

The trained model is saved as sentiment_model.pkl.
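A sketch of how the classifier is trained and how confidence scores fall out of it (the toy data is invented for illustration; the actual training happens in model.ipynb):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy three-class data, invented for illustration
texts = ["love it", "best app ever", "hate it",
         "worst app ever", "it is okay", "just fine"]
labels = ["Positive", "Positive", "Negative",
          "Negative", "Neutral", "Neutral"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

model = LogisticRegression(max_iter=1000)
model.fit(X, labels)

# predict_proba returns one probability per class; the app can report the
# highest probability as the confidence score
probs = model.predict_proba(vectorizer.transform(["i love this app"]))[0]
confidence = probs.max()
```

This ability to expose calibrated-looking per-class probabilities, rather than just a hard label, is the "confidence scores" advantage listed above.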

Training Data

The model was trained on a dataset of tweets with labeled sentiments. The dataset includes a diverse range of topics and expressions to ensure the model generalizes well to different types of text.

Performance Metrics

The model was evaluated on a held-out test set with the following metrics:

  • Accuracy: the fraction of predictions that match the true sentiment
  • Precision: the ratio of true positive predictions to all positive predictions for a class
  • Recall: the ratio of true positive predictions to all actual instances of a class
  • F1 Score: the harmonic mean of precision and recall

Detailed performance metrics can be found in the model.ipynb notebook.
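The metrics above can be computed with scikit-learn. The labels below are hypothetical, purely to show the calls; note that for a three-class model, precision, recall, and F1 must be averaged across classes (macro averaging shown here):

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

# Hypothetical true vs. predicted labels, for illustration only
y_true = ["Positive", "Negative", "Neutral", "Positive", "Negative"]
y_pred = ["Positive", "Negative", "Neutral", "Negative", "Negative"]

accuracy = accuracy_score(y_true, y_pred)
# Macro averaging computes each metric per class, then takes the mean
precision = precision_score(y_true, y_pred, average="macro")
recall = recall_score(y_true, y_pred, average="macro")
f1 = f1_score(y_true, y_pred, average="macro")
```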

Limitations

The model has some limitations to be aware of:

  • It may not perform well on domain-specific text that wasn't represented in the training data
  • It may struggle with sarcasm, irony, and other complex language features
  • It's trained primarily on English text and may not work well with other languages

Model Training

The complete model training process is documented in the model.ipynb Jupyter notebook, which includes:

  • Data loading and exploration
  • Text preprocessing
  • Feature extraction
  • Model training and tuning
  • Evaluation
  • Model serialization

Retraining the Model

If you want to retrain the model with your own data or different parameters:

  1. Modify the model.ipynb notebook as needed
  2. Run all cells to train and evaluate the model
  3. The notebook will save the new model and vectorizer files
  4. Replace the existing .pkl files with the newly generated ones
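After step 4, it is worth confirming that the replaced files load and predict end-to-end. The sketch below trains a stand-in model first so it runs on its own; it assumes the artifacts are serialized with joblib (adjust if the notebook uses pickle directly):

```python
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Stand-in training so this sketch is self-contained; in practice the
# notebook writes these two files during retraining
texts = ["love it", "hate it", "it is okay"]
labels = ["Positive", "Negative", "Neutral"]
vec = TfidfVectorizer().fit(texts)
clf = LogisticRegression(max_iter=1000).fit(vec.transform(texts), labels)
joblib.dump(vec, "tfidf_vectorizer.pkl")
joblib.dump(clf, "sentiment_model.pkl")

# Sanity check: load the files the way the app would, then predict
vectorizer = joblib.load("tfidf_vectorizer.pkl")
model = joblib.load("sentiment_model.pkl")
prediction = model.predict(vectorizer.transform(["I really enjoyed this"]))[0]
```

Both files must always be replaced together: a model predicts on the feature space of the vectorizer it was trained with, so mixing an old vectorizer with a new model (or vice versa) will produce wrong or failing predictions.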