Home - Patrisimo/Reddit GitHub Wiki

This is the directory for my investigation into what information can be extracted from a large collection of Reddit comments from March 2017 (roughly 500k comments per day).

Project list:

  • Word2vec transformations
  • Topic modeling
  • Phrase wordcount
  • Word embeddings between languages

Initial approach: Identify several hundred (on the order of 500) common, meaningful words and check how well the embeddings agree on them. Then use those anchor words to align the embeddings, and apply optimal transport to the rest of the vocabulary.
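One standard way to perform the anchor-based alignment is orthogonal Procrustes: find the rotation that best maps the source anchors onto the target anchors, then apply it to the whole space. The sketch below is illustrative, not the project's code; the embeddings are synthetic random matrices standing in for two trained word2vec spaces restricted to the anchor words.

```python
# Sketch of anchor-based embedding alignment via orthogonal Procrustes.
# The matrices here are synthetic stand-ins for two embedding spaces
# restricted to a shared list of anchor words (one row per anchor).
import numpy as np

def procrustes_align(src_anchors, tgt_anchors):
    """Return the orthogonal W minimizing ||src_anchors @ W - tgt_anchors||_F."""
    u, _, vt = np.linalg.svd(src_anchors.T @ tgt_anchors)
    return u @ vt

rng = np.random.default_rng(0)
dim, n_anchors = 50, 500
tgt = rng.normal(size=(n_anchors, dim))
true_rot, _ = np.linalg.qr(rng.normal(size=(dim, dim)))  # hidden rotation
src = tgt @ true_rot.T  # source space is a rotated copy of the target

W = procrustes_align(src, tgt)
print(np.allclose(src @ W, tgt, atol=1e-8))  # the rotation is recovered
```

Restricting the solution to orthogonal matrices preserves distances and cosine similarities within each space, so the alignment cannot distort the geometry it is trying to compare; the same `W` is then applied to the non-anchor words.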

  • How do word embeddings differ in different contexts?

Initial approach: Take word embeddings learned from different subreddits and evaluate where they align well and where they align poorly.
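A simple per-word diagnostic, once the two spaces have been brought into a common coordinate system (e.g. by the Procrustes step above), is the cosine similarity between each word's two vectors: words with low similarity are candidates for context-dependent usage. The vocabulary and vectors below are invented for illustration.

```python
# Sketch of per-word alignment diagnostics between two already-aligned
# embedding spaces (e.g. trained on different subreddits). The vocabulary
# and vectors are synthetic; in practice rows come from two word2vec models.
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(1)
vocab = ["game", "player", "win", "market", "price"]
dim = 20
emb_a = {w: rng.normal(size=dim) for w in vocab}
# emb_b: a noisy copy of emb_a, except "market", which drifts to simulate
# a word used differently in the second subreddit
emb_b = {w: emb_a[w] + 0.05 * rng.normal(size=dim) for w in vocab}
emb_b["market"] = rng.normal(size=dim)

scores = {w: cosine(emb_a[w], emb_b[w]) for w in vocab}
for w, s in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{w:8s} {s:+.3f}")  # low-similarity words flag divergent usage
```

Sorting by similarity surfaces the poorly-aligned words first; here the deliberately drifted "market" ends up at the bottom of the list.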

  • Can word embeddings pick up on changes in language usage across shorter time scales (a month)?

Initial approach: Take word embeddings learned from different time periods and evaluate where they align well and where they align poorly.

  • Can we translate a classification model from one language to another?

Initial approach: Train a classifier (initially Naive Bayes, since it is more straightforward to construct one artificially) to predict whether a comment is "controversial" (a flag indicating a large and roughly equal number of upvotes and downvotes) in one language. Then use some method (such as the cross-lingual optimal transport from above) to run this model on comments from another language.
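Naive Bayes is convenient here because its parameters are per-word log-probabilities, so translating the model reduces to remapping its feature dictionary. The sketch below trains a tiny multinomial Naive Bayes on English comments and scores Spanish text through a hypothetical word-translation table; the corpus, labels, and dictionary are all invented, and in the project the mapping would come from the cross-lingual alignment described above.

```python
# Sketch of porting a Naive Bayes "controversial" classifier across languages
# by remapping its word features. Corpus, labels, and the es->en dictionary
# are invented for illustration.
import math
from collections import Counter

# label 1 = controversial, 0 = not
train = [("this is bad and wrong", 1), ("great comment good point", 0),
         ("bad take wrong and bad", 1), ("good good great", 0)]

# Per-class word counts for a multinomial NB with Laplace smoothing
counts = {0: Counter(), 1: Counter()}
class_n = Counter()
for text, y in train:
    counts[y].update(text.split())
    class_n[y] += 1
vocab = set(counts[0]) | set(counts[1])

def log_prob(word, y):
    return math.log((counts[y][word] + 1) / (sum(counts[y].values()) + len(vocab)))

# Hypothetical translation table mapping Spanish words onto English features
es_to_en = {"malo": "bad", "bueno": "good", "genial": "great", "punto": "point"}

def classify_es(text):
    words = [es_to_en[w] for w in text.split() if w in es_to_en]
    scores = {y: math.log(class_n[y] / len(train))
              + sum(log_prob(w, y) for w in words) for y in (0, 1)}
    return max(scores, key=scores.get)

print(classify_es("malo malo"))     # -> 1 (leans controversial)
print(classify_es("bueno genial"))  # -> 0 (leans non-controversial)
```

A soft version of the same idea would, instead of a one-to-one dictionary, distribute each foreign word's count over English features according to the optimal-transport coupling.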