Machine Learning: Model - QEDK/clarity GitHub Wiki

We used TensorFlow Keras to build the machine learning model for emotion analysis. It is a bi-directional LSTM model trained on 40,000 tweets with crowdsourced mood labels. As expected, the data needs a lot of cleaning before it is usable for training and prediction. The steps, in order:

  1. Correct spellings (using GNU Aspell dataset)
  2. Expand contractions (using custom regex and Kaggle dataset)
  3. Remove mentions and URLs (using the Python tweet-preprocessor package)
  4. Parse emojis into text (using the Python emoji package)
  5. Remove punctuation (from string.punctuation)
  6. Remove stop words (with spaCy-defined stopwords)
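The cleaning steps above can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the project's actual code: `clean_tweet`, `CONTRACTIONS`, `EMOJI_NAMES` and `STOP_WORDS` are hypothetical names, and the tiny lookup tables stand in for the Kaggle contractions dataset, the emoji package and the spaCy stop-word list. Plain regular expressions stand in for the tweet-preprocessor package, spelling correction (step 1) is omitted, and lowercasing is an added assumption that keeps the lookups simple.

```python
import re
import string

# Hypothetical, tiny stand-ins for the real resources named in the list above.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "i'm": "i am", "it's": "it is"}
EMOJI_NAMES = {"\U0001F600": "grinning face", "\U0001F622": "crying face"}
STOP_WORDS = {"a", "an", "the", "is", "am", "i", "to", "it", "and", "of"}

# Matches whole-word contractions from the table above.
CONTRACTION_RE = re.compile(
    r"\b(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b"
)
# Matches @mentions and http(s) URLs.
MENTION_OR_URL_RE = re.compile(r"@\w+|https?://\S+")
# Translation table that deletes every character in string.punctuation.
PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def clean_tweet(text):
    """Apply steps 2-6 of the cleaning list, in order, to one tweet."""
    text = text.lower()  # assumption: not listed in the steps above
    # Step 2: expand contractions.
    text = CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(0)], text)
    # Step 3: remove mentions and URLs.
    text = MENTION_OR_URL_RE.sub(" ", text)
    # Step 4: replace each known emoji with a textual description.
    for symbol, name in EMOJI_NAMES.items():
        text = text.replace(symbol, f" {name} ")
    # Step 5: remove punctuation.
    text = text.translate(PUNCTUATION_TABLE)
    # Step 6: remove stop words.
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)
```

With these toy tables, `clean_tweet("@bob I can't believe it's raining \U0001F622 https://t.co/xyz !!!")` returns `"cannot believe raining crying face"`. The order matters: contractions are expanded before punctuation is stripped, since removing the apostrophe first would turn "can't" into "cant".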

We then build an embedding matrix from the pre-trained GloVe vectors (trained on 6 billion tokens, with a vocabulary of 400K words and 300 dimensions per vector) and use it as the input layer. Its output is fed to our bi-directional LSTM, the resulting features are compressed, and a final softmax layer produces a probability distribution over the emotion classes.
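As a rough illustration of the embedding-matrix step, the sketch below parses vectors in the GloVe text format (one word per line, followed by its space-separated components) and lays them out by word index. The names `load_glove` and `build_embedding_matrix`, the reserved padding row 0, and the choice of zero vectors for out-of-vocabulary words are assumptions made for this example, not details taken from the project.

```python
import io


def load_glove(lines, dim):
    """Parse GloVe text lines into a {word: vector} dictionary."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def build_embedding_matrix(word_index, vectors, dim):
    """Return a matrix whose row i is the vector of the word with index i.

    Row 0 is left as zeros for padding (an assumption), and words that
    have no pre-trained vector also keep an all-zero row.
    """
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, idx in word_index.items():
        vec = vectors.get(word)
        if vec is not None:
            matrix[idx] = vec
    return matrix


# Toy 3-dimensional data; the real vectors have 300 dimensions.
toy_glove = io.StringIO("happy 0.1 0.2 0.3\nsad -0.1 -0.2 -0.3\n")
toy_vectors = load_glove(toy_glove, dim=3)
toy_word_index = {"happy": 1, "sad": 2, "zzz": 3}  # hypothetical corpus vocabulary
toy_matrix = build_embedding_matrix(toy_word_index, toy_vectors, dim=3)
```

A matrix built this way could then be converted to an array and used to initialise a frozen Keras `Embedding` layer placed in front of the bi-directional LSTM.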

The training data totals ~1 GiB, and the resulting model is ~86 MB.