Topic Modeling - SeanTater/uncc2014watsonsim GitHub Wiki

We are looking into using topic modeling for several aspects of our project in the coming months. Specifically, we think it can give us the following features:

  1. Search the document corpus using topic modeling similarity.
  2. Find the similarity between Q-A, A-P, and Q-P pairs.

Beyond those, we are also simply exploring the topic space (bit of a pun).
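
As a rough illustration of the first feature, here is a minimal sketch of ranking documents by topic similarity. It assumes each document and the query have already been reduced to a vector of topic weights, and that similarity means cosine similarity; the function names, document ids, and weights are all hypothetical and not taken from our code. The same similarity function could in principle be applied to Q-A, A-P, and Q-P pairs.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def rank_by_topic_similarity(query_topics, doc_topics):
    """Return document ids ordered from most to least similar to the query."""
    return sorted(
        doc_topics,
        key=lambda doc_id: cosine(query_topics, doc_topics[doc_id]),
        reverse=True,
    )


# Hypothetical topic weights over three topics (not real model output).
doc_topics = {
    "doc_politics": (0.8, 0.1, 0.1),
    "doc_sports": (0.1, 0.8, 0.1),
    "doc_mixed": (0.4, 0.4, 0.2),
}
query_topics = (0.7, 0.2, 0.1)

ranking = rank_by_topic_similarity(query_topics, doc_topics)
print(ranking)
```

With a real corpus, the topic vectors would come from a trained model rather than being written by hand.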

Vector generation methods

Simple evaluations

These use the equation format from scripts/gensim/analog.py.

Three Semantic Equations

Analogical Quality

Analogical

  • compare(w('king') - w('man') + w('woman'), w('queen'))
    • Should be high (e.g. > 0.6)
  • queenish = w('king') - w('man') + w('woman'); compare(queenish, w('queen')) - compare(queenish, w('king'))
    • Should be positive (a small value is OK)
  • compare(w('putin') - w('russia') + w('usa'), w('obama'))
    • Should be high
  • potusish = w('putin') - w('russia') + w('usa'); compare(potusish, w('obama')) - compare(potusish, w('putin'))
    • Should be positive
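
To make the equation format concrete, here is a minimal sketch of the king/queen checks. It assumes that w(word) looks up a word vector and that compare(a, b) is cosine similarity; the actual definitions in scripts/gensim/analog.py may differ. The Vec class and the three-dimensional toy vectors are hypothetical stand-ins for numpy arrays from a trained model.

```python
import math


class Vec(tuple):
    """Tiny vector type with + and -, standing in for a numpy array."""

    def __add__(self, other):
        return Vec(a + b for a, b in zip(self, other))

    def __sub__(self, other):
        return Vec(a - b for a, b in zip(self, other))


# Hypothetical toy vectors with dimensions (royalty, male, female).
# These are hand-written for illustration, not real model output.
TOY_VECTORS = {
    "king": Vec((1.0, 1.0, 0.0)),
    "queen": Vec((1.0, 0.0, 1.0)),
    "man": Vec((0.0, 1.0, 0.0)),
    "woman": Vec((0.0, 0.0, 1.0)),
}


def w(word):
    """Look up the vector for a word (assumed behavior of w)."""
    return TOY_VECTORS[word]


def compare(a, b):
    """Cosine similarity of two vectors (assumed behavior of compare)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


queenish = w("king") - w("man") + w("woman")

# First check: the analogy result should be close to 'queen'.
analogy_score = compare(queenish, w("queen"))

# Second check: it should be closer to 'queen' than to 'king'.
margin = analogy_score - compare(queenish, w("king"))

print(analogy_score > 0.6, margin > 0)
```

The margin check matters because an analogy result often stays close to its starting word, so a high score against 'queen' alone does not show that the arithmetic moved the vector in the right direction.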

Synonymical

  • compare(w('democrat'), w('republican'))
    • Should be high
  • compare(w('party'), w('republican')) - compare(w('party'), w('democrat'))
    • Should have low magnitude ('party' should be about equally close to both)
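
The two checks above can be sketched the same way. This again assumes that w(word) returns a word vector and compare(a, b) is cosine similarity, which may not match scripts/gensim/analog.py exactly. The toy vectors and both thresholds are hypothetical: 0.6 is borrowed from the analogical example, and 0.1 is an arbitrary cutoff for "low magnitude".

```python
import math

# Hypothetical toy vectors with dimensions (politics, left, right).
# These are hand-written for illustration, not real model output.
TOY_VECTORS = {
    "democrat": (3.0, 1.0, 0.0),
    "republican": (3.0, 0.0, 1.0),
    "party": (2.0, 0.0, 0.0),
}


def w(word):
    """Look up the vector for a word (assumed behavior of w)."""
    return TOY_VECTORS[word]


def compare(a, b):
    """Cosine similarity of two vectors (assumed behavior of compare)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


HIGH = 0.6  # hypothetical threshold for "should be high"
LOW = 0.1   # hypothetical threshold for "should have low magnitude"

# Words used in the same contexts should have similar vectors.
relatedness = compare(w("democrat"), w("republican"))

# A neutral word should not lean noticeably toward either one.
bias = compare(w("party"), w("republican")) - compare(w("party"), w("democrat"))

print(relatedness > HIGH, abs(bias) < LOW)
```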