# Topic Modelling Experiments
This project focuses on the question: can a topic model distinguish between contentful and non-contentful words? Topic models tend to suffer from overly common words taking up a large portion of the probability space, and from rare words adding noise without contributing meaning. Given a topic model, can we distinguish these three classes of words: contentful words, overly common words, and rare words? If we can, can this help us with other tasks?
- If we remove the non-contentful words and train a new topic model, does it perform better?
- Can we identify synonyms better?
- Can we cluster topics better?
Of course, these approaches also need to be compared against baselines. For removing words, the baseline will be simple usage-frequency thresholds; for the other tasks, the default topic model.
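The usage-threshold baseline can be sketched with a plain document-frequency filter. This is a minimal stand-in (toy documents and made-up thresholds, not the project's actual settings; gensim's `Dictionary.filter_extremes` does the equivalent on a real corpus):

```python
from collections import Counter

# Toy corpus standing in for tokenized reddit comments.
docs = [
    "the cat sat on the mat".split(),
    "the dog ate the bone".split(),
    "the cat and the dog".split(),
]

# Document frequency: in how many documents each word appears.
df = Counter(w for doc in docs for w in set(doc))
n_docs = len(docs)

# Hypothetical thresholds: drop words appearing in every document
# (too common) or in only one document (too rare).
keep = {w for w, c in df.items() if 1 < c < n_docs}

filtered = [[w for w in doc if w in keep] for doc in docs]
print(filtered)  # only mid-frequency words survive
```

A new topic model trained on `filtered` can then be compared against one trained on the raw corpus.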
## The Data
Let's go with a couple of different datasets:
- 500k reddit comments, randomly chosen from March 2017
- 50k reddit comments from the 20 most common subreddits from March 2017:
- AskReddit
- politics
- The_Donald
- worldnews
- nba
- RocketLeagueExchange
- pics
- funny
- videos
- gaming
- NintendoSwitch
- leagueoflegends
- news
- Overwatch
- Cricket
- todayilearned
- movies
- SquaredCircle
- pcmasterrace
- gifs
## Synonyms
This is maybe the best place to start, since it produces the most easily interpreted output. The experiments here are based on a 100-topic model. First, the gensim model's `.get_topics()` method returns the probability distribution of words within each topic, i.e. for each topic the probabilities sum to 1 across the vocabulary. We might ask ourselves what happens if we instead sum across topics, which yields more or less a usage list:
| Word | Summed probability |
|------|--------------------|
| the  | 3.055 |
| and  | 1.156 |
| for  | 0.784 |
| this | 0.755 |
| that | 0.601 |
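Concretely, `get_topics()` returns a `(num_topics, num_terms)` array whose rows each sum to 1, so summing down the columns gives one score per word. A minimal numpy sketch, using a random stand-in matrix in place of a trained gensim model:

```python
import numpy as np

# Stand-in for LdaModel.get_topics(): a (num_topics, num_terms) matrix
# whose rows are per-topic word distributions (each row sums to 1).
rng = np.random.default_rng(0)
topics = rng.dirichlet(alpha=np.full(5, 0.5), size=3)  # 3 topics, 5 words

# Sanity check: each topic is a probability distribution.
assert np.allclose(topics.sum(axis=1), 1.0)

# Summing across topics (axis=0) gives one score per word; words that
# soak up probability mass in many topics score highest.
word_scores = topics.sum(axis=0)

vocab = ["the", "and", "for", "this", "that"]
usage = sorted(zip(vocab, word_scores), key=lambda p: -p[1])
for word, score in usage:
    print(f"{word}\t{score:.3f}")
```

On a real model the same two lines (`model.get_topics().sum(axis=0)` plus a sort over the dictionary) reproduce the table above.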
Actually, you know what, I'm just going to switch to keeping these posts as LaTeX docs in their respective folders.