Topic Modeling - SeanTater/uncc2014watsonsim GitHub Wiki
We are looking into using topic modeling for several aspects of the project in the coming months. Specifically, we think it can give us the following features:
- Search the document corpus using topic modeling similarity.
- Find the similarity between Q-A, A-P, and Q-P.
Beyond those goals, we are also simply exploring the topic space (a bit of a pun).
Vector generation methods
- Gensim methods
  - LSA: available in Gensim; search index
  - LDA: available in Gensim; search index
  - word2vec: available in Gensim; search index
- GloVe vectors: we use the 300-dimensional Wikipedia vectors (GWP300) and the 300-dimensional Common Crawl 840B-token vectors (GCC840)
- Composes Best Predict Vectors (CBP)
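As a rough sketch of how pretrained vectors such as GWP300 or GCC840 can be read: the published GloVe files are plain text, with one token per line followed by its space-separated components. The function name `load_glove_text` and the sample lines below are made up for illustration; this is not code from the repository.

```python
from typing import Dict, Iterable, List


def load_glove_text(lines: Iterable[str], dim: int) -> Dict[str, List[float]]:
    """Parse GloVe-style text lines: a token followed by `dim` floats.

    Hypothetical helper for illustration; not part of the project scripts.
    """
    vectors: Dict[str, List[float]] = {}
    for line in lines:
        parts = line.rstrip("\n").split(" ")
        if len(parts) <= dim:
            continue  # skip blank or malformed lines
        # The last `dim` fields are the components; everything before them
        # is the token, so a token that itself contains a space still parses.
        token = " ".join(parts[:-dim])
        vectors[token] = [float(x) for x in parts[-dim:]]
    return vectors


# Tiny in-memory sample (invented 3-dimensional values, not real GloVe data).
sample = ["king 0.5 0.1 -0.2", "queen 0.4 0.2 -0.1", ""]
vectors = load_glove_text(sample, dim=3)
print(sorted(vectors))
```

For a real file, the same function can be fed an open file handle (for example `open(path, encoding="utf-8")`) with `dim=300`.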
Simple evaluations
These use the expression format from scripts/gensim/analog.py.
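To make the expressions in this section easier to read, here is a minimal sketch of what `w` and `compare` could look like. It assumes `w` looks up a word vector and `compare` is cosine similarity; the actual definitions in scripts/gensim/analog.py may differ. The `Vec` class and the toy 3-dimensional vectors are invented for illustration only.

```python
import math


class Vec(tuple):
    """Tiny immutable vector with + and -, so expressions read like the ones here."""

    def __add__(self, other):
        return Vec(a + b for a, b in zip(self, other))

    def __sub__(self, other):
        return Vec(a - b for a, b in zip(self, other))


# Invented toy vectors; a real run would use LSA, word2vec, or GloVe vectors.
TOY = {
    "king": Vec((0.9, 0.8, 0.1)),
    "queen": Vec((0.9, 0.1, 0.8)),
    "man": Vec((0.1, 0.9, 0.1)),
    "woman": Vec((0.1, 0.1, 0.9)),
}


def w(word):
    """Look up the vector for a word (assumed meaning of `w`)."""
    return TOY[word]


def compare(a, b):
    """Cosine similarity (assumed meaning of `compare`)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


queenish = w('king') - w('man') + w('woman')
score = compare(queenish, w('queen'))
margin = score - compare(queenish, w('king'))
print(score, margin)
```

Cosine similarity ignores vector length, which is why the sum and difference of word vectors can be compared directly against a single word vector.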
Analogical
compare(w('king') - w('man') + w('woman'), w('queen'))
- Should be high (e.g. > 0.6)
queenish=w('king') - w('man') + w('woman'); compare(queenish, w('queen')) - compare(queenish, w('king'))
- Should be positive (a small margin is OK)
compare(w('putin') - w('russia') + w('usa'), w('obama'))
- Should be high
potusish = w('putin') - w('russia') + w('usa'); compare(potusish, w('obama')) - compare(potusish, w('putin'))
- Should be positive
Synonymical
compare(w('democrat'), w('republican'))
- Should be high
compare(w('party'), w('republican')) - compare(w('party'), w('democrat'))
- Should have low magnitude
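The expectations above ("high", "positive", "low magnitude") can be collected into a small pass/fail harness. This is only a sketch: the helper names are hypothetical, the 0.6 threshold comes from the example above, the 0.1 limit for "low magnitude" is an assumption, and the scores are stand-in constants so the block runs without a model.

```python
def high(threshold=0.6):
    """Predicate for 'should be high'; 0.6 follows the example threshold."""
    return lambda score: score > threshold


def positive():
    """Predicate for 'should be positive'."""
    return lambda score: score > 0


def low_magnitude(limit=0.1):
    """Predicate for 'should have low magnitude'; the 0.1 limit is an assumption."""
    return lambda score: abs(score) < limit


def run_checks(checks):
    """Evaluate each (label, thunk, predicate) and return (label, score, passed)."""
    results = []
    for label, thunk, predicate in checks:
        score = thunk()
        results.append((label, score, predicate(score)))
    return results


# Stand-in constants, NOT measured results. Replace each lambda body with the
# matching expression, e.g. lambda: compare(w('democrat'), w('republican')).
checks = [
    ("king - man + woman vs. queen", lambda: 0.72, high()),
    ("queenish: queen margin over king", lambda: 0.05, positive()),
    ("party: republican minus democrat", lambda: 0.02, low_magnitude()),
]

results = run_checks(checks)
for label, score, passed in results:
    print(f"{'PASS' if passed else 'FAIL'}  {score:+.2f}  {label}")
```

Wrapping each expression in a lambda delays evaluation until the harness runs, so the same list can be reused across LSA, LDA, word2vec, and GloVe models by swapping what `w` returns.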