Comparing Emotional Classification of Text with Different Neural Networks

#task-driven, #analysis-driven, #evaluation-driven

Kannon Bhattacharyya, Kruthi Gollapudi

Abstract

Emotional classification is a common task in the NLP literature; however, researchers address it with many different types of emotion classification models. We compared five types of emotion classification models: a Vanilla RNN, a Unidirectional (forward) LSTM, a Bi-Directional LSTM, an Attention-Based RNN, and a pre-trained BERT, on their performance classifying emotions from two types of datasets: Tweets and Reddit comments. While BERT acts as a pre-trained control, we trained the other four models on each dataset, testing their emotion classification performance both within each dataset and on the other dataset, which provides information on the emotion classification ability and cross-applicability of the different models. More specifically, we analyzed how well each model classifies each emotion, providing insight into how different models pick up on different emotions in the context of online posting. We found that the Unidirectional LSTM is the best model for both within- and across-dataset analyses; the Vanilla RNNs tend to output only one emotion, while the Bi-Directional LSTMs and Attention-Based RNNs tend to overfit and yield reduced accuracy. Ultimately, the pre-trained BERT did not perform as well as hypothesized and was often outperformed by the Unidirectional LSTM, though it did perform the best in across-dataset analyses. We additionally conclude that the emotional speech used on Twitter and Reddit is significantly different, that a larger set of finely defined labels is more effective for emotion classification than broad categories, and that bi-directionality inhibits LSTMs on datasets with uneven emotion labels.

What this project is about

The goal of this project is to further the knowledge of emotion classification models in NLP research. We trained four types of NLP models (Vanilla RNN, Unidirectional LSTM, Bi-Directional LSTM, Attention-Based RNN) independently on two different datasets, one containing Tweets and the other containing Reddit comments. We also used a fifth, pre-trained BERT model as a baseline, which we additionally trained on each training dataset. This research highlights the pros and cons of certain NLP models for emotion classification, providing the literature with reasons to use or avoid them for similar tasks, as well as examining how interchangeable these models are with one another. First, our work demonstrates the relative performance of these models in classifying emotions from a held-out split of the dataset each was trained on. Second, we test these models on the opposite dataset (Twitter-to-Reddit and Reddit-to-Twitter) to demonstrate the cross-applicability of each model. Third, the comparison of each model to the pre-trained BERT highlights the role of external vs. internal training in NLP emotion classification tasks. Finally, we compare each model's predictions to the actual emotions, analyzing which emotions were confused with one another and which emotion overlaps are specific to each model. We also analyze specific test cases from both datasets, showing how their classifications under each model reflect the inner workings of, and differences between, the model types. While we are interested in evaluating the performance of these models, we are also interested in how emotion is classified when it takes the form of online slang. Since Twitter and Reddit have distinct styles of speech, both from one another and from the rest of the internet, we are curious which trends each model will pick up on during training and whether they will differ. Specifically, will models trained on one dataset be able to translate to the other social media platform's "language" style and still pick up on emotions?

Approach

We used basic PyTorch implementations of our models in Google Colab, since we want the models to differ mainly in their type and to be relatively controlled beyond that. We used the following parameters in our models (adjusted per model type; for example, the effective hidden dimension is doubled in the Bi-Directional LSTM): batch size = 48, word and hidden embedding dimensions = 256, model layers = 1 for RNNs and 2 for LSTMs, dropout rate = 0.5, gradient clipping value = 0.25, and backpropagation-through-time length = 50. Each model was trained on the "training" split of its dataset. For our BERT model, we used Hugging Face's BERT-Emotions-Classifier, which includes the Twitter dataset labels in its pre-trained classification. Since the models learn different emotion sets, we further simplify the labels into positive and negative: the Reddit dataset is already divided this way, and we map the Twitter dataset's sadness, fear, and anger to negative, surprise to neutral, and joy and love to positive. In addition, we used a mapping from the 28 Reddit emotions to Twitter's Ekman emotions to enable cross-dataset analyses of the different emotions. The choice to use datasets with different emotion labels was inspired by "Emotion Analysis in NLP: Trends, Gaps and Roadmap for Future Directions" by Plaza-del-Arco et al. (2024) and its argument to tailor emotions to one's specific task and to test emotion classification with a variety of label sets. Here we have reproduced the dataset authors' Reddit-to-Ekman mapping:

To match the Twitter labels and BERT emotions, we had to adjust the mapping slightly, using a combination of the emotion-translation literature and our own discretion:

Therefore, we can use statistical analyses to predict the classification of Tweets and Reddit comments as positive or negative in the across-dataset predictions, providing us with scores for the cross-applicability of each of the five models. We can compare these results against our models as a baseline, using them to further analyze the successes and shortcomings of our four RNNs. Our model-evaluation metrics are overall accuracy, cross-entropy loss (H(p,q) = -Σ p(x) log q(x)), F1-score, and McNemar's test.
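To make these metrics concrete, the snippet below is a minimal illustrative sketch (not the project's actual code) of how accuracy, F1, and cross-entropy loss could be computed with scikit-learn from a model's predicted class probabilities. The variable names, the toy values, and the choice of macro averaging for F1 are our assumptions; the report does not specify them.

```python
# Minimal sketch (our own illustration): computing the evaluation metrics
# described above with scikit-learn, assuming `y_true` holds integer emotion
# labels and `probs` holds a model's predicted class probabilities.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, log_loss

y_true = np.array([0, 3, 1, 1, 2])                      # hypothetical gold labels
probs = np.array([[0.7, 0.1, 0.1, 0.05, 0.05, 0.0],     # hypothetical model outputs
                  [0.1, 0.1, 0.1, 0.6,  0.05, 0.05],
                  [0.2, 0.5, 0.1, 0.1,  0.05, 0.05],
                  [0.1, 0.6, 0.1, 0.1,  0.05, 0.05],
                  [0.1, 0.1, 0.6, 0.1,  0.05, 0.05]])
y_pred = probs.argmax(axis=1)

accuracy = accuracy_score(y_true, y_pred)                     # overall accuracy
macro_f1 = f1_score(y_true, y_pred, average="macro")          # F1 across emotions
ce_loss = log_loss(y_true, probs, labels=list(range(6)))      # H(p,q) = -sum p(x) log q(x)
print(f"accuracy={accuracy:.3f}, macro-F1={macro_f1:.3f}, CE={ce_loss:.3f}")
```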

Experiments

Data:

The first dataset we used is the Twitter Emotion Classification dataset from Kaggle by Andrey Shtrauss (https://kaggle.com/code/shtrausslearning/twitter-emotion-classification). This dataset contains Tweets, each classified with one of six emotion labels: joy, love, anger, fear, surprise, and sadness. The training dataset contains 16,000 tweets, and the validation and testing datasets each contain 2,000 tweets, for a total of 20,000 tweets. Each split is stored as a CSV containing the Tweets (strings) in the first column and the classification labels (integers) in the second column. The dataset is positively reviewed for its clarity and presentation, with no overt problems. The vocabulary size is 30,522.

Second, we used the Go Emotions: Google Emotions dataset from Kaggle by Shivam Bansal (https://www.kaggle.com/datasets/shivamb/go-emotions-google-emotions-dataset/data). This dataset contains Reddit comments classified with a scheme of 28 total emotion labels. The training set contains 43,410 Reddit comments, the testing set contains 5,427, and the validation set contains 5,426; the full GoEmotions corpus contains 58,009 comments overall. Each split is downloadable from GitHub as a .tsv file, where each row corresponds to a single rater's annotation of a comment, including the text, a unique id, and metadata about the text entry. Both datasets were designed with emotion classification in mind. We preprocessed each dataset by removing URLs, "@" mentions, and punctuation. We later decided to remove all "neutral"-labeled comments to make the data more usable for both the BERT analyses and the across-Twitter analyses. This choice lowered the comment counts to 29,191 training comments, 3,660 validation comments, and 3,640 testing comments.
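The snippet below is a rough sketch, under assumptions of our own, of the preprocessing described above: stripping URLs, "@" mentions, and punctuation, and dropping comments whose label set includes "neutral." The file name, column layout, and the neutral label id (27 in the GoEmotions label list) are assumptions and would need to match the actual export used.

```python
# Minimal preprocessing sketch (our own illustration, not the authors' exact code).
import re
import pandas as pd

def clean_text(text: str) -> str:
    """Remove URLs, @-mentions, and punctuation, as described above."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # strip URLs
    text = re.sub(r"@\w+", " ", text)                    # strip @ mentions
    text = re.sub(r"[^\w\s]", " ", text)                 # strip punctuation
    return re.sub(r"\s+", " ", text).strip().lower()

# Assumed column layout for the GoEmotions split TSVs: text, comma-separated
# label ids, comment id. Adjust to the actual file used.
df = pd.read_csv("train.tsv", sep="\t", header=None,
                 names=["text", "labels", "comment_id"])
df["text"] = df["text"].astype(str).map(clean_text)

# Drop comments whose label set includes "neutral" (assumed to be label id 27),
# matching the filtering decision described above.
NEUTRAL_ID = "27"
df = df[~df["labels"].astype(str).str.split(",").map(lambda ids: NEUTRAL_ID in ids)]
```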

First we plotted the training split emotion distribution of each dataset.

Twitter

We plotted the Twitter emotions by their original dataset labels. The data is most heavily skewed towards sadness and joy, with the other emotions being much less represented.

Reddit

We plotted the Reddit emotions by their original labels, by Ekman labels, and by our adjusted Twitter labels. The Reddit data is most heavily skewed towards "neutral" comments. We initially believed this to be the case because many Reddit comments are simply explanatory and cannot be assigned a particular emotion.

Upon converting the emotions to their Ekman mapping, it becomes clear that there are overall more “joy” labels than neutral.

We have also plotted the updated counts excluding any Reddit comments that originally contained a “neutral label.” This change slightly affects the distribution of other labels, since some comments contained “neutral” with some other label.

Finally, converting the labels from Ekman to Twitter-dataset labels evened out the distribution by splitting the "joy" labels into "joy" and "love," which should make our analysis more effective.

Cross-Dataset Observations

We noted a couple of aspects of each dataset to keep in mind for our across-dataset observations. First, the Twitter dataset contains one label per sample, while Reddit samples can carry multiple labels. In terms of label distributions, the Twitter data is most skewed towards sadness and joy, while the Reddit data is most skewed towards neutral and joy. Due to the lack of a "neutral" class in the Twitter labels, we suspect that many samples that would be "neutral" for Twitter are absorbed into another category, so we expect some "neutral"-related categorization error even after removing the "neutral" Reddit labels in the cross-categorization analysis. Additionally, the proportion of "sadness" training samples is much larger for Twitter than for Reddit, so we expect the Reddit models not to perform exceptionally well when classifying Twitter's "sadness." After the decision to remove all "neutral" comments, our data remained skewed towards "joy." Since both of our datasets are skewed towards two labels, we can use the analyses to investigate how models perform when labels are unevenly represented, and whether this disparity results in overfitting for the more complex models.

Evaluation Method:

At a basic level, we use the percentage of correctly classified samples per model, and we also examine the percentage of correctly classified samples per emotion. Our main loss metric is Cross-Entropy Loss, the standard measure of success for a classification model, calculated as H(p,q) = -Σ p(x) log q(x), where p(x) is the true probability distribution and q(x) is the model's predicted probability distribution. By comparing the Cross-Entropy Loss of each model, we can tell which of our four models most effectively classified the emotions portrayed in the Tweets. In addition, we calculate the F1-score for each model in each type of analysis, providing a measure of successful classification for both within-dataset and across-dataset analyses. We finally use McNemar's test, which compares each pair of models' error distributions and lets us determine whether the performance differences between models are significant. These evaluation metrics are effective for both within-dataset and across-dataset analyses.
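As an illustration of these two evaluation steps, the sketch below (our own, not the original code) computes per-emotion accuracy from a confusion matrix and runs McNemar's test on the 2x2 agreement table of two models' correct/incorrect predictions, using statsmodels. Function and variable names are ours.

```python
# Sketch (our illustration) of per-emotion accuracy and McNemar's test,
# given gold labels `y_true` and two models' predictions.
import numpy as np
from sklearn.metrics import confusion_matrix
from statsmodels.stats.contingency_tables import mcnemar

def per_emotion_accuracy(y_true, y_pred, n_labels=6):
    """Fraction of correctly classified samples for each gold emotion (per-class recall)."""
    cm = confusion_matrix(y_true, y_pred, labels=list(range(n_labels)))
    return cm.diagonal() / cm.sum(axis=1).clip(min=1)

def mcnemar_p(y_true, pred_a, pred_b):
    """McNemar's test on the 2x2 table of (model A correct?, model B correct?)."""
    a_correct = np.asarray(pred_a) == np.asarray(y_true)
    b_correct = np.asarray(pred_b) == np.asarray(y_true)
    table = [[np.sum(a_correct & b_correct), np.sum(a_correct & ~b_correct)],
             [np.sum(~a_correct & b_correct), np.sum(~a_correct & ~b_correct)]]
    return mcnemar(table, exact=False, correction=True).pvalue
```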

Experimental Details:

To run our experiments, we trained four new models (Vanilla RNN, LSTM, BiLSTM, Attention-Based RNN) with PyTorch on the Twitter dataset using the following parameters: epochs = 10, batch size = 48, word and hidden embedding dimensions = 256, model layers = 1 for RNNs and 2 for LSTMs, dropout rate = 0.5, gradient clipping value = 0.25, and backpropagation-through-time length = 50. We used the default settings of the Adam optimizer and Cross-Entropy loss and trained the models on a Tesla T4 GPU through Google Colab. The same parameters were used when training the models on the Reddit dataset to maintain consistency. The training process (sketched below) involves initializing the embedding with the correct vocabulary size, initializing the hidden state for each batch, passing the input batch through the model, computing the loss, backpropagating and updating the weights, and then validating performance every epoch. The models were also saved locally to allow for later use.
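The following is a condensed sketch of this setup, written by us as an illustration rather than a reproduction of the original code. The hyperparameters come from the text; the data tensors, padding scheme, and the interpretation of "backpropagation through time = 50" as a fixed sequence length are assumptions.

```python
# Condensed training sketch (our own reconstruction, not the authors' exact code).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

VOCAB_SIZE, EMBED_DIM, HIDDEN_DIM = 30_522, 256, 256
NUM_CLASSES, NUM_LAYERS, DROPOUT = 6, 2, 0.5
BATCH_SIZE, EPOCHS, CLIP, SEQ_LEN = 48, 10, 0.25, 50

class LSTMClassifier(nn.Module):
    def __init__(self, bidirectional=False):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM, padding_idx=0)
        self.lstm = nn.LSTM(EMBED_DIM, HIDDEN_DIM, num_layers=NUM_LAYERS,
                            dropout=DROPOUT, batch_first=True,
                            bidirectional=bidirectional)
        out_dim = HIDDEN_DIM * (2 if bidirectional else 1)   # doubled for the BiLSTM
        self.fc = nn.Linear(out_dim, NUM_CLASSES)

    def forward(self, x):
        out, _ = self.lstm(self.embed(x))       # hidden state starts at zero each batch
        return self.fc(out[:, -1, :])           # classify from the final time step

device = "cuda" if torch.cuda.is_available() else "cpu"
model = LSTMClassifier().to(device)
optimizer = torch.optim.Adam(model.parameters())             # default Adam settings
criterion = nn.CrossEntropyLoss()

# Hypothetical tensors standing in for the tokenized, padded Tweets and labels.
train_x = torch.randint(1, VOCAB_SIZE, (1024, SEQ_LEN))
train_y = torch.randint(0, NUM_CLASSES, (1024,))
loader = DataLoader(TensorDataset(train_x, train_y), batch_size=BATCH_SIZE, shuffle=True)

for epoch in range(EPOCHS):
    model.train()
    for xb, yb in loader:
        xb, yb = xb.to(device), yb.to(device)
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)                             # compute loss
        loss.backward()                                             # backpropagate
        torch.nn.utils.clip_grad_norm_(model.parameters(), CLIP)    # gradient clipping
        optimizer.step()                                            # update weights
    # ...validation on the held-out split would run here each epoch...
```

Setting `bidirectional=True` doubles the effective hidden dimension feeding the classifier, which is the per-model adjustment mentioned in the Approach section.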

Results

Within-Twitter Results

We first compared the overall accuracy, F1-score, and cross-entropy loss of each model when trained on the training split and tested on the testing split:

While the Vanilla RNN performed poorly and classified only 34.76% of Tweets correctly, the other four models performed very well. The Bi-Directional LSTM performed worse than the Unidirectional LSTM, contradicting our hypothesis that the bi-directionality of the LSTM would aid its classification. The Unidirectional LSTM, Attention-Based RNN, and Pre-Trained BERT all performed well, each classifying roughly 93% of Tweets correctly. Surprisingly, the Pre-Trained BERT performed only as well as the Attention-Based RNN and Unidirectional LSTM, contradicting another hypothesis that the pre-trained BERT would perform significantly better than all the other models. The F1-score results are proportional to the accuracy results, roughly matching the accuracy percentage for every model except the Vanilla RNN. Despite the similarities in accuracy, the Cross-Entropy Loss differs across the models, giving us more insight. Specifically, the Attention-Based RNN and Pre-Trained BERT have very similar Cross-Entropy Loss, which we attribute to the attention mechanism in both models yielding similar results across all three metrics. Meanwhile, the Unidirectional LSTM has the lowest Cross-Entropy Loss, with the Bi-Directional LSTM higher at 0.44 and the Vanilla RNN the highest at 1.58.

Some of our results become clearer when analyzing the accurately classified samples by emotion and model:

These results indicate that the Vanilla RNN's low classification score was a byproduct of it classifying everything as "joy." Therefore, the Vanilla RNN's accuracy is simply the percentage of "joy" Tweets in the testing split. We believe this result is due to the skewed number of "joy" Tweets in the training data; since many of the Tweets were labeled as joy, the Vanilla RNN, which is simpler than the other models, learned to classify most things as "joy." Surprisingly, it does not classify anything as "sadness," even though the number of "sadness" Tweets in the training split is almost as large as the number of "joy" Tweets. From this result, we conclude that Vanilla RNNs are not sufficient for emotion classification tasks.

The main difference between the other four models seems to be in the classification of “surprise.” Mainly, the Bi-Directional LSTM’s lower accuracy in comparison to the other three models stems from its inability to classify “surprise” at all, even though the classification accuracy for the other five emotions is relatively similar. Our conclusion is that the Bi-Directional LSTM is overfitting due to the imbalance of emotions, and the lack of “surprise” Tweets does not give the model enough basis to create a representation to predict the label. Our results indicate that Unidirectional LSTMs may be more effective than Bi-Directional LSTMs when input data is not even.

In addition, the Attention-Based RNN and Pre-Trained BERT classify surprise significantly worse than the other emotions, as well as significantly worse than the Unidirectional LSTM does. All models also classify "love" worse than the other four emotions. Because "love" and "surprise" are the most underrepresented labels in the input data, we believe that classification accuracy is roughly proportional to the amount of input data per label, with the main exception being the Unidirectional LSTM, which classifies "surprise" well. In summary, we believe the Unidirectional LSTM is the most effective emotion classification model for within-dataset analysis: it is more complex than the Vanilla RNN, which allows it to create more advanced representations, but less complex than the Bi-Directional LSTM, Attention-Based RNN, and Pre-Trained BERT, which allows it to avoid overfitting when the input data is limited and skewed.

Our final analysis was McNemar's test across each pair of models. In McNemar's test, smaller values indicate more dissimilar error distributions and larger values indicate more similar error distributions, providing a metric for model similarity. In this analysis, we took the log of the values, since the original values span several orders of magnitude and could not be displayed effectively in a heatmap. Therefore, the following table contains the original McNemar's significance values, and the heatmap contains the log of each McNemar's value:
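A heatmap of this kind could be produced roughly as in the sketch below (our illustration; the values here are random placeholders, not the actual McNemar's results), assuming a pandas DataFrame of pairwise McNemar's values between the five models.

```python
# Sketch (our illustration) of the log-scaled pairwise heatmap described above.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

models = ["VanillaRNN", "LSTM", "BiLSTM", "AttnRNN", "BERT"]
# Placeholder pairwise McNemar's values; the real matrix would come from the tests above.
mcnemar_vals = pd.DataFrame(np.random.rand(5, 5), index=models, columns=models)

sns.heatmap(np.log(mcnemar_vals), annot=True, fmt=".1f", cmap="viridis")
plt.title("log(McNemar's value) between model pairs")
plt.tight_layout()
plt.show()
```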

As expected, the Vanilla RNN is the most dissimilar from each of the other models. It is most dissimilar from the Unidirectional LSTM, since the Vanilla RNN is the worst-performing model while the Unidirectional LSTM is the best-performing model. The other results are as expected, except that the Bi-Directional LSTM has a high similarity to the Unidirectional LSTM despite their different accuracies (relative to the other models). This result indicates that the two models make similar predictions as a result of their shared underlying LSTM architecture, despite the disparity in accuracy.

Within-Reddit Results

As with the Twitter models, we compared the overall accuracy, F1-score, and cross-entropy loss of each model when trained on the training split and tested on the testing split. We did this analysis both for the original 28 labels and for the mapped six Twitter labels. First are the 28-label models:

These results do not include the Pre-Trained BERT, which is better compared against the six-label models below. Overall, the results are higher and less varied across the board than with the Twitter models. We hypothesize that the more finely defined labels (28 as opposed to 6) allow the models to be trained more accurately, a hypothesis we test below by retraining the models with six labels. Notably, the Vanilla RNN no longer shows the drastic failure present in the Twitter data. The comparative results are the same, though on a much smaller scale: the Vanilla RNN performs the worst, the Unidirectional LSTM performs the best, and the Attention-Based RNN and the Bi-Directional LSTM fall in between. Though the differences are much smaller, the results indicate that the performance trends are similar across datasets. The F1-score matches the accuracy across the board. Interestingly, the LSTMs have significantly lower test loss than the RNNs, even though the Twitter Attention-Based RNN had lower test loss than its Bi-Directional LSTM.

The McNemar’s tests are similar to the results from the Twitter models, where the Vanilla RNN is the most dissimilar model. Interestingly, the most similar models are the Bi-Directional LSTM and the Attention-Based RNN, while for Twitter it was the Unidirectional LSTM and the Attention-Based RNN. Though the Vanilla RNN is the most different, the difference is not as drastic as it was for the Twitter data, which makes sense considering the high performance of the Reddit Vanilla RNN.

To further compare with the Twitter data and BERT, we then re-trained each model on the six Twitter labels rather than the 28 Reddit labels.

Here we see much lower scores across the board. Interestingly, the Vanilla RNN returns to a performance level similar to its Twitter result, which indicates that the same one-label classification issue is happening here. The Unidirectional LSTM, Bi-Directional LSTM, and Attention-Based RNN are strikingly similar, with the Unidirectional LSTM and Bi-Directional LSTM having identical accuracy. When taking the F1-score into account, it becomes clear that the Unidirectional LSTM is once again the most effective model. The sharp decrease in classification performance from 28 labels to six corroborates Plaza-del-Arco et al.'s argument that labels should be chosen specifically for one's emotion classification task; when trying to apply labels that are not suited to a certain dataset, emotion classification becomes much more difficult. These results also indicate that Twitter and Reddit portray emotion differently in their content, since the labels that were effective for Twitter classification are not effective for Reddit classification. Interestingly, the Pre-Trained BERT performed best among the within-dataset models here, but significantly worse than it had performed within-Twitter. These results indicate that emotion classification for this task not only relies on finely defined labels, but also that the larger emotion category mappings are not effective for certain datasets.

As suspected, the Vanilla RNN is once again classifying everything as "joy," just as it did in the Twitter dataset. These results corroborate our previous finding that the Vanilla RNN classifies everything as one label, generally the label the data is skewed towards, and is therefore not effective for emotion classification. These results also support our earlier argument that model accuracy is proportional to the prominence of each label in the training data, as seen in the LSTM, BiLSTM, and Attention RNN models: "fear," which is underrepresented, has low accuracy scores, while "joy" and "love," which are well represented, score relatively higher. We suspect that, even though most of these categorizations were given by the dataset authors, certain mapping choices, such as "curiosity" belonging to "surprise," drive some of the models' failures, since such comments might not be classified as "surprise" by the BERT model. For example, a fairly neutral comment, "Ok, fair enough. From your original post, it wasn't really clear that you looked at sources other than YouTube.", is classified purely as "confusion" in the dataset. However, the BERT prediction for this sample is overwhelmingly "joy," with 93.5% confidence.
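A single-comment check like this one could be reproduced roughly as in the sketch below (our illustration). The checkpoint identifier is a placeholder and would need to be replaced with the actual BERT-Emotions-Classifier model id used in the project.

```python
# Sketch (our illustration) of obtaining all emotion scores for one Reddit comment
# from the pre-trained BERT classifier via the Hugging Face pipeline API.
from transformers import pipeline

MODEL_ID = "path-or-hub-id/BERT-Emotions-Classifier"   # hypothetical identifier
classifier = pipeline("text-classification", model=MODEL_ID, top_k=None)

comment = ("Ok, fair enough. From your original post, it wasn't really clear "
           "that you looked at sources other than YouTube.")
preds = classifier([comment])      # one list of label scores per input comment
for pred in preds[0]:
    print(f"{pred['label']}: {pred['score']:.3f}")
```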

This error reveals a disparity between how the BERT model and the Reddit data classify text: the data suggests that the BERT model defaults to "joy" for "emotionless" text, whereas the finer categories of the Reddit scheme leave more room for such texts to be classified as something other than joy. Therefore, the most effective models will be trained on those specific labels, or more ideally, on data emotionally classified by the same system. As with Twitter, the hypothesis that the Pre-Trained BERT would be the most effective model has proven wrong. However, the more complex models trained on these data and labels would learn to associate that "confusion" with "surprise," yielding a more effective and overall more accurate classifier.

In addition, we were surprised by how poorly "sadness" was categorized across every model, despite being well classified in Twitter and well represented in the Reddit dataset. These results suggest that sadness in this specific dataset is not well captured either by models trained on the dataset itself or by models pre-trained on other data.

We once again computed a logarithmic heatmap of McNemar's significance scores. The Vanilla RNN is once again the most "different" from the other models, and here the Pre-Trained BERT is also quite different due to the disparity in results. The three successful self-trained models are all relatively close to each other, while the Vanilla RNN and BERT are relatively far from one another. Overall, the McNemar's scores track the accuracy of the models themselves.

Across-Dataset Results

First, we evaluated the models that had been trained on the Reddit dataset, and we tested them on the Twitter testing split, using the same evaluation metrics as before.

We have plotted the original within-Reddit test results in blue and the Reddit-to-Twitter test results in red. As expected, the accuracies are lower across the board. The F1-score is proportional to the original model accuracies, with the Vanilla RNN the lowest, the three other trained models in the middle, and BERT the highest. However, accuracy is very consistent across the four trained models, indicating few significant differences in how these models perform outside their native datasets. The difference between the Vanilla RNN's F1-score and accuracy is driven by its tendency to classify only joy; while the other models have similar accuracies, they do not share this problem, leading to their higher F1-scores. As expected, BERT performs very well; we hypothesized that the Pre-Trained BERT would have the highest across-dataset scores because it is, by nature, an across-dataset model. Though the differences between our trained models are very small, the Unidirectional LSTM has the highest F1-score and accuracy, further bolstering our ongoing argument that the Unidirectional LSTM performs the best in these tasks.

Next we analyzed the Twitter models on the Reddit dataset:

These models show similarly low results to the other across-dataset analysis, though slightly lower. The Pre-Trained BERT has an accuracy of 41%, meaning the additional training on Twitter negatively affected classification accuracy compared to the original within-Reddit analysis. Once again, the Unidirectional LSTM is the most effective non-pretrained emotion classification model, in line with the rest of our results. More strikingly, the Vanilla RNN performs drastically better than the Bi-Directional LSTM. We know from previous analyses, as well as from the F1-score, that this Vanilla RNN is still classifying everything as "joy," meaning that the Bi-Directional LSTM is performing at an extremely low level. These results provide further evidence that the bi-directionality of the LSTM inhibits its performance and that Unidirectional LSTMs are better suited for emotion classification. The poor performance of all of these models is surprising, since the models other than the Vanilla RNN performed very well on within-dataset classification (~90%). These results indicate that the type of text and the emotion labels in the Reddit dataset are substantially different from how emotions are classified and written on Twitter.

Analyzing the accuracy by emotion, it becomes clear that the Vanilla RNN is in fact classifying everything as joy. In addition, the Bi-Directional LSTM, which could not classify "surprise" in the within-dataset analysis, still cannot classify surprise. Moreover, "love" and "surprise" are the two categories that are classified very poorly, which supports the argument that these two categories are defined differently in the Reddit dataset than in the Twitter dataset. Though some emotions stand out as strikingly low, none of the normally well-classified emotions reach high accuracy either: sadness and joy tend to be around the 50% mark.

We additionally performed a paired t-test to verify that the accuracy results between the across-dataset and within-dataset analyses were significantly different. Within-Twitter differed from Twitter-to-Reddit with t = 4.20, p = 0.013, and Within-Reddit differed from Reddit-to-Twitter with t = 2.92, p = 0.043. We believe the smaller difference for Reddit is not because the Reddit models are more generalizable, but rather because the Reddit models perform worse on within-dataset analyses than the Twitter models do.
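A sketch of this comparison (our illustration, with placeholder accuracy values rather than the project's actual numbers) using a paired t-test from SciPy:

```python
# Sketch (our illustration) of the paired t-test described above, comparing each
# model's within-dataset accuracy to its across-dataset accuracy.
from scipy.stats import ttest_rel

within_twitter = [0.35, 0.93, 0.88, 0.93, 0.93]     # placeholder per-model accuracies
twitter_to_reddit = [0.33, 0.45, 0.30, 0.42, 0.41]  # placeholder per-model accuracies

t_stat, p_value = ttest_rel(within_twitter, twitter_to_reddit)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```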

Discussion and Conclusion

In our emotion classification research, we have arrived at the following conclusions. In their basic implementations, Vanilla RNNs are not suitable for social media emotion classification tasks, especially when the training dataset has an uneven label distribution. Unidirectional LSTMs are the most suitable for these datasets compared to Bi-Directional LSTMs and Attention-Based RNNs, which tend to be more prone to overfitting due to their more complex parameters. Pre-Trained BERT models have varied results: they work well when the dataset was intended to be sorted into their labels, but they fall flat when trying to encompass mapped labels (e.g., recognizing "confusion" as "surprise"). Furthermore, datasets with intricately defined labels provide better results for within-dataset model predictions, increasing accuracy and decreasing overfitting and loss. Model accuracy by emotion was found to be proportional to label representation within the training dataset. Our success with the within-dataset Reddit models not only corroborates Plaza-del-Arco et al.'s argument for using more varied and finely grained label sets for emotion classification in NLP, but also shows that Unidirectional LSTMs are less prone to errors resulting from few labels, while models like Bi-Directional LSTMs, and especially Vanilla RNNs, can fail to recognize certain emotions.

Our across-dataset analyses have shown that models trained on the same labels can still falter in accuracy depending on the medium. Even though the Twitter and Reddit models were trained on the same labels, they are not as accurate when classifying samples from the opposite dataset. These results indicate that the language used in Tweets and Reddit comments differs enough to inhibit emotion classification across the platforms. Even with these differences, the Unidirectional LSTM proved to be the most effective emotion classification model that we trained ourselves, with the Bi-Directional LSTM falling short in comparison once again. We therefore argue that bi-directionality in LSTMs for basic emotion classification on datasets with skewed labels tends to inhibit both within- and across-dataset predictions.

Our Pre-Trained BERT was the most effective model in across-dataset predictions. For within-Reddit predictions, BERT performs significantly better (75%) than the LSTMs and Attention-Based RNN (64-65%). BERT trained on Twitter and tested on Twitter gives 93% accuracy, while BERT trained on Reddit and tested on Twitter gives 75% accuracy. However, BERT trained on Reddit and tested on Reddit gives 75% accuracy, and BERT trained on Twitter and tested on Reddit gives 40% accuracy. Ultimately our results show that additional training on a pre-trained model can improve results, though not necessarily its cross-dataset applicability. Additionally, our models trained on our datasets matched BERT's performance only when trained on the original labels from that dataset, indicating that mapping emotions to larger categories can significantly hurt the performance of within-dataset emotion classification.

References

Olusegun, R., Oladunni, T., Audu, H., Houkpati, Y., & Bengesi, S. (2023). Text Mining and Emotion Classification on Monkeypox Twitter Dataset: A Deep Learning-Natural Language Processing (NLP) Approach. IEEE Access, 11, 49882–49894. IEEE Access. https://doi.org/10.1109/ACCESS.2023.3277868

Plaza-del-Arco, F. M., Curry, A., Curry, A. C., & Hovy, D. (2024). Emotion Analysis in NLP: Trends, Gaps and Roadmap for Future Directions (No. arXiv:2403.01222; Version 1). arXiv. https://doi.org/10.48550/arXiv.2403.01222

Twitter Emotion Classification. (n.d.). Retrieved February 1, 2025, from https://kaggle.com/code/shtrausslearning/twitter-emotion-classification

Go Emotions: Google Emotions Dataset. (n.d.). Retrieved February 17, 2025, from https://www.kaggle.com/datasets/shivamb/go-emotions-google-emotions-dataset/data

Repository

Here is a link to our code and data: https://github.com/kruthig03/emotionclassification