Home - Patrisimo/Reddit GitHub Wiki

This is the directory for my investigation into what information can be extracted from a large collection of Reddit comments from March 2017 (roughly 500k comments per day).

Project list:

  • Word2vec transformations
  • Topic modeling
  • Phrase wordcount
  • Word embeddings between languages

Initial approach: Identify several hundred (on the order of 500) common, meaningful words and check how well the embeddings agree on them. Then use those anchor words to align the embeddings, and apply optimal transport to the rest of the vocabulary.
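One standard way to perform the anchor-based alignment is orthogonal Procrustes: find the rotation that best maps the source anchors onto the target anchors, then apply it to the whole space. The sketch below is illustrative, not the project's code; the embeddings are synthetic random matrices standing in for two trained word2vec spaces restricted to the anchor words.

```python
# Sketch of anchor-based embedding alignment via orthogonal Procrustes.
# The matrices here are synthetic stand-ins for two embedding spaces
# restricted to a shared list of anchor words (one row per anchor).
import numpy as np

def procrustes_align(src_anchors, tgt_anchors):
    """Return the orthogonal W minimizing ||src_anchors @ W - tgt_anchors||_F."""
    u, _, vt = np.linalg.svd(src_anchors.T @ tgt_anchors)
    return u @ vt

rng = np.random.default_rng(0)
dim, n_anchors = 50, 500
tgt = rng.normal(size=(n_anchors, dim))
true_rot, _ = np.linalg.qr(rng.normal(size=(dim, dim)))  # hidden rotation
src = tgt @ true_rot.T  # source space is a rotated copy of the target

W = procrustes_align(src, tgt)
print(np.allclose(src @ W, tgt, atol=1e-8))  # the rotation is recovered
```

Restricting the solution to orthogonal matrices preserves distances and cosine similarities within each space, so the alignment cannot distort the geometry it is trying to compare; the same `W` is then applied to the non-anchor words.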

  • How do word embeddings differ in different contexts?

Initial approach: Take word embeddings learned from different subreddits and evaluate where they align well and where they align poorly.
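A simple per-word diagnostic, once the two spaces have been brought into a common coordinate system (e.g. by the Procrustes step above), is the cosine similarity between each word's two vectors: words with low similarity are candidates for context-dependent usage. The vocabulary and vectors below are invented for illustration.

```python
# Sketch of per-word alignment diagnostics between two already-aligned
# embedding spaces (e.g. trained on different subreddits). The vocabulary
# and vectors are synthetic; in practice rows come from two word2vec models.
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(1)
vocab = ["game", "player", "win", "market", "price"]
dim = 20
emb_a = {w: rng.normal(size=dim) for w in vocab}
# emb_b: a noisy copy of emb_a, except "market", which drifts to simulate
# a word used differently in the second subreddit
emb_b = {w: emb_a[w] + 0.05 * rng.normal(size=dim) for w in vocab}
emb_b["market"] = rng.normal(size=dim)

scores = {w: cosine(emb_a[w], emb_b[w]) for w in vocab}
for w, s in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{w:8s} {s:+.3f}")  # low-similarity words flag divergent usage
```

Sorting by similarity surfaces the poorly-aligned words first; here the deliberately drifted "market" ends up at the bottom of the list.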

  • Can word embeddings pick up on changes in language usage across shorter time scales (a month)?

Initial approach: Take word embeddings learned from different time periods and evaluate where they align well and where they align poorly.

  • Can we translate a classification model from one language to another?

Initial approach: Train a classifier (initially Naive Bayes, since it is more straightforward to construct one artificially) to predict whether a comment is "controversial" (a flag indicating a large and roughly equal number of upvotes and downvotes) in one language. Then use some method (such as the cross-lingual optimal transport from above) to run this model on comments from another language.
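Naive Bayes is convenient here because its parameters are per-word log-probabilities, so translating the model reduces to remapping its feature dictionary. The sketch below trains a tiny multinomial Naive Bayes on English comments and scores Spanish text through a hypothetical word-translation table; the corpus, labels, and dictionary are all invented, and in the project the mapping would come from the cross-lingual alignment described above.

```python
# Sketch of porting a Naive Bayes "controversial" classifier across languages
# by remapping its word features. Corpus, labels, and the es->en dictionary
# are invented for illustration.
import math
from collections import Counter

# label 1 = controversial, 0 = not
train = [("this is bad and wrong", 1), ("great comment good point", 0),
         ("bad take wrong and bad", 1), ("good good great", 0)]

# Per-class word counts for a multinomial NB with Laplace smoothing
counts = {0: Counter(), 1: Counter()}
class_n = Counter()
for text, y in train:
    counts[y].update(text.split())
    class_n[y] += 1
vocab = set(counts[0]) | set(counts[1])

def log_prob(word, y):
    return math.log((counts[y][word] + 1) / (sum(counts[y].values()) + len(vocab)))

# Hypothetical translation table mapping Spanish words onto English features
es_to_en = {"malo": "bad", "bueno": "good", "genial": "great", "punto": "point"}

def classify_es(text):
    words = [es_to_en[w] for w in text.split() if w in es_to_en]
    scores = {y: math.log(class_n[y] / len(train))
              + sum(log_prob(w, y) for w in words) for y in (0, 1)}
    return max(scores, key=scores.get)

print(classify_es("malo malo"))     # -> 1 (leans controversial)
print(classify_es("bueno genial"))  # -> 0 (leans non-controversial)
```

A soft version of the same idea would, instead of a one-to-one dictionary, distribute each foreign word's count over English features according to the optimal-transport coupling.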