Disadvantages of LDA and the Biterm Topic Model

Probabilistic models such as LDA use statistical inference to discover latent patterns in data. In short, they infer model parameters from observations. For instance, imagine a black box containing many balls of different colors. You draw some balls from the box and then infer the distribution of colors from your sample. That is a typical process of statistical inference, and its accuracy depends on the number of observations.
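As a minimal sketch of this idea (the box contents and color counts below are made up for illustration), the snippet estimates a color distribution from repeated draws and shows how the estimate sharpens as the number of observations grows:

```python
import random
from collections import Counter

# Hypothetical "black box": the true (unknown) color distribution is 50/30/20
true_colors = ["red"] * 50 + ["blue"] * 30 + ["green"] * 20

def estimate_distribution(n_draws):
    """Draw n_draws balls with replacement and estimate color proportions."""
    sample = random.choices(true_colors, k=n_draws)
    counts = Counter(sample)
    return {color: counts[color] / n_draws for color in counts}

# Few observations -> noisy estimate; many observations -> close to 0.5/0.3/0.2
print(estimate_distribution(10))
print(estimate_distribution(10000))
```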

Now consider applying LDA to short texts. LDA models a document as a mixture of topics, where each word is drawn from one of those topics. You can imagine a black box containing tons of words generated by such a model. Now you observe a short document with only a few words. These observations are obviously too few to infer the parameters reliably. This is the data sparsity problem mentioned earlier.
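To make the generative view concrete, here is a minimal sketch of LDA's per-document process, with made-up topic names and word distributions: each word first picks a topic from the document's mixture, then a word from that topic. With only a handful of observed words, the counts carry very little information about the underlying per-document mixture.

```python
import random

# Hypothetical topics, each a distribution over a tiny vocabulary
topics = {
    "sports": {"game": 0.5, "team": 0.3, "win": 0.2},
    "tech":   {"code": 0.4, "data": 0.4, "model": 0.2},
}

def generate_document(topic_mixture, n_words):
    """LDA-style generation: pick a topic per word, then a word from that topic."""
    words = []
    for _ in range(n_words):
        topic = random.choices(list(topic_mixture),
                               weights=list(topic_mixture.values()))[0]
        word_dist = topics[topic]
        words.append(random.choices(list(word_dist),
                                    weights=list(word_dist.values()))[0])
    return words

# A short document: only 5 observed words from which to infer a whole mixture
print(generate_document({"sports": 0.7, "tech": 0.3}, n_words=5))
```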

Actually, besides the lack of observations, the problem also comes from the over-complexity of the model: a more flexible model generally requires more observations to infer. The Biterm Topic Model (BTM) tries to make topic inference easier by reducing model complexity. First, it models the whole corpus as a mixture of topics, since inferring a single topic mixture over the corpus is easier than inferring a topic mixture over each short document. Second, it assumes each biterm, i.e. an unordered pair of words co-occurring in the same short context, is drawn from a single topic. Inferring the topic of a biterm is also easier than inferring the topic of a single word in LDA, since more context is available.
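For illustration, a minimal sketch of biterm extraction under the definition above (the example tokens are made up), showing how even a very short text yields multiple co-occurrence observations:

```python
from itertools import combinations

def extract_biterms(tokens):
    """All unordered pairs of distinct word positions in a short text."""
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]

# A 4-word document yields 6 biterms, giving the model more co-occurrence
# evidence than the 4 isolated word-topic assignments LDA would work with.
print(extract_biterms(["apple", "orange", "fruit", "juice"]))
```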