Topic Modeling - SeanTater/uncc2014watsonsim GitHub Wiki

We're looking into using topic modeling for several aspects of our project in the upcoming months. Specifically, we think it can give us the following features:

Search the document corpus using topic modeling similarity.
Find the similarity between Q-A, A-P, and Q-P pairs.

Beyond those, we are also simply exploring the topic space (a bit of a pun).

Vector generation methods

Gensim methods
- LSA, available in Gensim, with a search index
- LDA, same
- word2vec, same
GloVe vectors: we use the 300-dimensional Wikipedia vectors (GWP300) and the 300-dimensional Common Crawl 840B-token vectors (GCC840)
COMPOSES best-predict vectors (CBP)

Simple evaluations

These use the equation format from scripts/gensim/analog.py

Three Semantic Equations

Analogical Quality

Analogical

compare(w('king') - w('man') + w('woman'), w('queen'))
- Should be high (e.g. > 0.6)
queenish = w('king') - w('man') + w('woman'); compare(queenish, w('queen')) - compare(queenish, w('king'))
- Should be positive (a small value is OK)
compare(w('putin') - w('russia') + w('usa'), w('obama'))
- Should be high
potusish = w('putin') - w('russia') + w('usa'); compare(potusish, w('obama')) - compare(potusish, w('putin'))
- Should be positive
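The analogical checks above can be sketched in a self-contained way. This is only an illustration: it assumes `w` returns a vector that supports `+` and `-` and that `compare` is cosine similarity, which may not match the real definitions in scripts/gensim/analog.py. The `Vec` class, the `TOY_VECTORS` table and its values, and the `analogy_scores` helper are all hypothetical and exist only to make the example runnable without a trained model.

```python
import math


class Vec(tuple):
    """Minimal vector with element-wise + and - (a stand-in for a NumPy array)."""

    def __add__(self, other):
        return Vec(a + b for a, b in zip(self, other))

    def __sub__(self, other):
        return Vec(a - b for a, b in zip(self, other))


# Hypothetical 3-dimensional toy embeddings, invented for illustration only.
TOY_VECTORS = {
    "king": Vec((0.9, 0.8, 0.1)),
    "man": Vec((0.1, 0.9, 0.2)),
    "woman": Vec((0.1, -0.8, 0.3)),
    "queen": Vec((0.8, -0.7, 0.2)),
}


def w(word):
    """Look up the vector for a word (assumed behavior of w)."""
    return TOY_VECTORS[word]


def compare(a, b):
    """Cosine similarity of two vectors (assumed behavior of compare)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy_scores(source, minus, plus, target):
    """Return (similarity to the target, margin over the source word)."""
    candidate = w(source) - w(minus) + w(plus)
    score = compare(candidate, w(target))
    margin = score - compare(candidate, w(source))
    return score, margin


score, margin = analogy_scores("king", "man", "woman", "queen")
print(f"score={score:.3f} margin={margin:.3f}")
```

The margin matters because the combined vector often stays close to the word it started from, so a high score alone does not show that the analogy moved toward the target.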

Synonymical

compare(w('democrat'), w('republican'))
- Should be high
compare(w('party'), w('republican')) - compare(w('party'), w('democrat'))
- Should have low magnitude (close to zero in either direction)
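The two checks in this group can be sketched the same way. Again this is a hedged illustration, not the project's code: it assumes cosine similarity as the comparison, and the toy vectors, the helper names (`cosine`, `relatedness`, `neutrality_gap`), and both thresholds are assumptions, since the text above only says "high" and "low magnitude".

```python
import math

# Hypothetical toy embeddings, invented for illustration only.
TOY_VECTORS = {
    "democrat": (0.9, 0.3, 0.1),
    "republican": (0.9, -0.2, 0.1),
    "party": (0.8, 0.0, 0.3),
}

MIN_SIMILARITY = 0.6  # assumed cutoff for "high"
MAX_GAP = 0.1         # assumed cutoff for "low magnitude"


def cosine(a, b):
    """Cosine similarity of two equal-length sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def relatedness(word_a, word_b):
    """Similarity between two words in the toy table."""
    return cosine(TOY_VECTORS[word_a], TOY_VECTORS[word_b])


def neutrality_gap(anchor, word_a, word_b):
    """Signed difference in how close anchor sits to word_a versus word_b."""
    return relatedness(anchor, word_a) - relatedness(anchor, word_b)


pair_similarity = relatedness("democrat", "republican")
gap = neutrality_gap("party", "republican", "democrat")

print("pair similarity is high:", pair_similarity > MIN_SIMILARITY)
print("gap has low magnitude:", abs(gap) < MAX_GAP)
```

Taking the absolute value of the gap makes the check indifferent to which of the two words the anchor leans toward, which is what "low magnitude" asks for.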