# Topic Modelling Experiments
This project focuses on the question: can a topic model distinguish between contentful and non-contentful words? Topic models tend to suffer from overly common words taking up a large portion of the probability space, and from rare words adding noise without contributing meaning. Given a topic model, can we distinguish these three classes of words: contentful words, overly common words, and rare words? If we can, can this help us with other tasks?
- If we remove the non-contentful words and train a new topic model, does it perform better?
- Can we identify synonyms better?
- Can we cluster topics better?
Of course, these approaches also need to be compared against baselines. For removing words, the baseline will be simple usage-frequency thresholds; for the other tasks, the default topic model.
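The usage-threshold baseline can be sketched with a plain document-frequency filter. This is a minimal stand-in (toy documents and made-up thresholds, not the project's actual settings; gensim's `Dictionary.filter_extremes` does the equivalent on a real corpus):

```python
from collections import Counter

# Toy corpus standing in for tokenized reddit comments.
docs = [
    "the cat sat on the mat".split(),
    "the dog ate the bone".split(),
    "the cat and the dog".split(),
]

# Document frequency: in how many documents each word appears.
df = Counter(w for doc in docs for w in set(doc))
n_docs = len(docs)

# Hypothetical thresholds: drop words appearing in every document
# (too common) or in only one document (too rare).
keep = {w for w, c in df.items() if 1 < c < n_docs}

filtered = [[w for w in doc if w in keep] for doc in docs]
print(filtered)  # only mid-frequency words survive
```

A new topic model trained on `filtered` can then be compared against one trained on the raw corpus.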
## The Data
Let's go with a couple of different datasets:
- 500k reddit comments, randomly chosen from March 2017
- 50k reddit comments from the 20 most common subreddits from March 2017:
- AskReddit
- politics
- The_Donald
- worldnews
- nba
- RocketLeagueExchange
- pics
- funny
- videos
- gaming
- NintendoSwitch
- leagueoflegends
- news
- Overwatch
- Cricket
- todayilearned
- movies
- SquaredCircle
- pcmasterrace
- gifs
## Synonyms
This is maybe the best place to start, since it produces the most easily interpreted output. The experiments here are based on a 100-topic model. First, the gensim model's `.get_topics()` method returns the probability distribution of words within each topic, i.e. for each topic the probabilities sum to 1 across the vocabulary. We might ask ourselves what happens if we instead sum across topics, which yields more or less a usage list:
| Word | Summed probability |
|------|--------------------|
| the  | 3.055 |
| and  | 1.156 |
| for  | 0.784 |
| this | 0.755 |
| that | 0.601 |
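Concretely, `get_topics()` returns a `(num_topics, num_terms)` array whose rows each sum to 1, so summing down the columns gives one score per word. A minimal numpy sketch, using a random stand-in matrix in place of a trained gensim model:

```python
import numpy as np

# Stand-in for LdaModel.get_topics(): a (num_topics, num_terms) matrix
# whose rows are per-topic word distributions (each row sums to 1).
rng = np.random.default_rng(0)
topics = rng.dirichlet(alpha=np.full(5, 0.5), size=3)  # 3 topics, 5 words

# Sanity check: each topic is a probability distribution.
assert np.allclose(topics.sum(axis=1), 1.0)

# Summing across topics (axis=0) gives one score per word; words that
# soak up probability mass in many topics score highest.
word_scores = topics.sum(axis=0)

vocab = ["the", "and", "for", "this", "that"]
usage = sorted(zip(vocab, word_scores), key=lambda p: -p[1])
for word, score in usage:
    print(f"{word}\t{score:.3f}")
```

On a real model the same two lines (`model.get_topics().sum(axis=0)` plus a sort over the dictionary) reproduce the table above.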
Actually, you know what, I'm just going to switch to keeping these posts as LaTeX docs in their respective folders.