Machine Learning: Model - QEDK/clarity GitHub Wiki
We used TensorFlow Keras to build the machine learning model for emotion analysis. It is a bi-directional LSTM model trained on 40,000 tweets with crowdsourced mood data. As expected, a lot of data cleaning is required before the data is usable for training and prediction. The cleaning steps, in order, are:
- Correct spellings (using GNU Aspell dataset)
- Expand contractions (using custom regex and Kaggle dataset)
- Remove mentions and URLs (using the Python `tweet-preprocessor` package)
- Parse emojis into text (using the Python `emoji` package)
- Remove punctuation (from `string.punctuation`)
- Remove stop words (with spaCy-defined stopwords)
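
The cleaning steps above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: `clean_tweet`, `CONTRACTIONS`, `EMOJI_NAMES`, `STOP_WORDS` and `MENTION_OR_URL` are hypothetical names, the tiny tables stand in for the Kaggle contraction list, the `emoji` package and the spaCy stop words, a plain regex stands in for `tweet-preprocessor`, and spelling correction is left out. Lowercasing is an added assumption so that the toy tables match.

```python
import re
import string

# Hypothetical, tiny stand-ins for the real resources (Kaggle contraction list,
# emoji package, spaCy stop words); the real tables are much larger.
CONTRACTIONS = {"can't": "cannot", "i'm": "i am", "it's": "it is", "don't": "do not"}
EMOJI_NAMES = {"\U0001F600": "grinning face", "\U0001F622": "crying face"}
STOP_WORDS = {"a", "an", "the", "is", "am", "i", "it", "to", "do", "so"}

# Simplified stand-in for tweet-preprocessor: matches @mentions and http(s) URLs.
MENTION_OR_URL = re.compile(r"@\w+|https?://\S+")


def clean_tweet(text: str) -> str:
    """Apply the cleaning steps in the listed order (spell correction omitted)."""
    # Assumption: lowercase first so the toy lookup tables match.
    text = text.lower()
    # Expand contractions while the apostrophes are still present.
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Remove mentions and URLs.
    text = MENTION_OR_URL.sub(" ", text)
    # Parse emojis into text.
    for symbol, name in EMOJI_NAMES.items():
        text = text.replace(symbol, f" {name} ")
    # Remove punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Remove stop words.
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)
```

The order matters: contractions have to be expanded before punctuation is stripped, otherwise "can't" collapses to "cant" and no longer matches the lookup table.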
We then build an embedding matrix from the GloVe vectors (pretrained on 6 billion tokens, with a vocabulary of 400K words and 300 dimensions per vector) and use it as the input layer. Its output is fed to our bi-directional LSTM, the features are compressed, and the softmax layer finally produces a probabilistic outcome.
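
To make the two ends of this pipeline concrete, here is a small standard-library sketch of building an embedding matrix from GloVe-format text and of the softmax step that turns scores into probabilities. It is an illustration under assumptions, not the project's code: `build_embedding_matrix`, `softmax`, `word_index` and the toy 3-dimensional vectors are hypothetical, and reserving row 0 for padding is a common convention assumed here.

```python
import io
import math


def build_embedding_matrix(glove_file, word_index, dim):
    """Return one row per vocabulary index; row 0 is reserved for padding.

    glove_file yields lines in the GloVe text format: a word followed by
    `dim` space-separated floats. Words missing from the file keep a zero row.
    """
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for line in glove_file:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        row = word_index.get(word)
        if row is not None and len(values) == dim:
            matrix[row] = [float(v) for v in values]
    return matrix


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy 3-dimensional "GloVe" file (the real vectors have 300 dimensions).
toy_glove = io.StringIO("happy 0.1 0.2 0.3\nsad -0.1 -0.2 -0.3\n")
word_index = {"happy": 1, "sad": 2, "meh": 3}  # hypothetical tokenizer output
matrix = build_embedding_matrix(toy_glove, word_index, dim=3)
probs = softmax([2.0, 1.0, 0.1])  # made-up scores for three emotion classes
```

In Keras, a matrix like this is typically converted to an array and used to initialize the weights of an `Embedding` layer; only words that actually occur in the training corpus need a row, which keeps the matrix much smaller than the full 400K-word GloVe vocabulary.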
The entirety of the training data comes to ~1 GiB and we get a model of ~86 MB.