# Patrisimo/Reddit Wiki: Home
This is the directory for my investigation of what kinds of information can be extracted from a large collection of reddit comments from March 2017 (roughly 500k comments per day).
Project list:
- Word2vec transformations
- Topic modeling
- Phrase wordcount
- Word embeddings between languages
Initial approach: Identify several hundred (on the order of 500) common, meaningful words and see how well the embeddings match on them. Then use those words as anchors to align the embeddings, and look at optimal transport for the rest of the vocabulary.
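One standard way to do the anchor-based alignment is orthogonal Procrustes: find the rotation that best maps the anchor vectors of one space onto the other. Below is a minimal numpy sketch on synthetic vectors; the dimensions, anchor count, and noise level are placeholder assumptions, not actual project data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for two independently trained embedding spaces
# (synthetic data; real vectors would come from word2vec models).
dim = 5
n_anchors = 8
X = rng.normal(size=(n_anchors, dim))  # anchor vectors in space A

# Space B is space A rotated by a random orthogonal matrix, plus noise.
Q_true, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
Y = X @ Q_true + 0.01 * rng.normal(size=(n_anchors, dim))

# Orthogonal Procrustes: the rotation W minimizing ||XW - Y||_F is
# W = U V^T, where U S V^T is the SVD of X^T Y.
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

# Frobenius residual after alignment (small, since the noise was small).
residual = np.linalg.norm(X @ W - Y)
```

Once `W` is estimated from the anchors, it can be applied to every vector in space A, after which non-anchor words can be compared across spaces (e.g. via optimal transport on the aligned clouds).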
- How do word embeddings differ in different contexts?
Initial approach: Take word embeddings learned from different subreddits and evaluate where they align well and where they align poorly.
- Can word embeddings pick up on changes in language usage across shorter time scales (a month)?
Initial approach: Take word embeddings learned from different time periods and evaluate where they align well and where they align poorly.
- Can we translate a classification model from one language to another?
Initial approach: Train a classifier (initially Naive Bayes, since it is more transparent how to construct one by hand) to predict whether a comment is "controversial" (a reddit flag indicating that a comment received many upvotes and downvotes in roughly equal numbers) in one language. Then use some method (such as the cross-lingual optimal transport from above) to apply this model to comments in another language.
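The first step can be sketched with a hand-rolled, presence-based Naive Bayes on toy labeled comments. The comments, labels, and helper names below are all made up for illustration; real training data would use reddit's controversiality field, and the per-word log-probabilities are exactly the kind of per-word parameters one could later try to map across languages.

```python
import math
from collections import Counter

# Hypothetical toy data: (comment text, controversial flag).
train = [
    ("ban all pineapple pizza", 1),
    ("pineapple pizza should be illegal", 1),
    ("upvote if you love pizza", 0),
    ("what a lovely cat photo", 0),
    ("cats are lovely", 0),
    ("ban cats from the sub", 1),
]

def fit_nb(data, alpha=1.0):
    """Presence-based Naive Bayes with Laplace smoothing.

    Returns the vocabulary, per-class word log-probabilities, and
    log-priors. (Simplified: absent-word terms are ignored.)
    """
    vocab = {w for text, _ in data for w in text.split()}
    counts = {0: Counter(), 1: Counter()}
    n = {0: 0, 1: 0}
    for text, y in data:
        n[y] += 1
        counts[y].update(set(text.split()))
    logp = {y: {w: math.log((counts[y][w] + alpha) / (n[y] + 2 * alpha))
                for w in vocab} for y in (0, 1)}
    prior = {y: math.log(n[y] / len(data)) for y in (0, 1)}
    return vocab, logp, prior

def predict(text, vocab, logp, prior):
    """Pick the class maximizing log-prior plus summed word log-probs."""
    words = set(text.split()) & vocab
    scores = {y: prior[y] + sum(logp[y][w] for w in words) for y in (0, 1)}
    return max(scores, key=scores.get)

vocab, logp, prior = fit_nb(train)
pred = predict("pineapple pizza ban", vocab, logp, prior)
```

Because the model is just a table of per-word weights, "translating" it amounts to carrying those weights across a word-to-word mapping between languages, which is why Naive Bayes is a convenient starting point before trying heavier classifiers.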