LDA and Topic Modeling
LDA is a generative probabilistic model of a corpus. The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words.
LDA generative process for each document w in a corpus D:
1. Choose N ∼ Poisson(ξ).
2. Choose θ ∼ Dir(α).
3. For each of the N words w_n:
   (a) Choose a topic z_n ∼ Multinomial(θ).
   (b) Choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n.
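A minimal sketch of this process with NumPy may help make it concrete. The vocabulary, topic-word probabilities, and Poisson rate below are illustrative values chosen for this sketch, not anything specified in this wiki:

```python
# A minimal sketch of the LDA generative process.
# vocab, K, alpha, beta, and the Poisson rate are illustrative values.
import numpy as np

rng = np.random.default_rng(0)

vocab = ["broccoli", "bananas", "spinach", "kitten", "puppy", "hamster"]
K = 2                                  # number of topics
alpha = np.ones(K) * 0.5               # Dirichlet prior over topic mixtures
# beta[k] is topic k's distribution over the vocabulary
beta = np.array([
    [0.30, 0.15, 0.35, 0.10, 0.05, 0.05],   # a "food" topic
    [0.05, 0.05, 0.05, 0.30, 0.30, 0.25],   # a "cute animals" topic
])

def generate_document():
    N = rng.poisson(10)                     # 1. N ~ Poisson(xi), here xi = 10
    theta = rng.dirichlet(alpha)            # 2. theta ~ Dir(alpha)
    words = []
    for _ in range(N):                      # 3. for each of the N words:
        z = rng.choice(K, p=theta)          #    (a) z_n ~ Multinomial(theta)
        w = rng.choice(vocab, p=beta[z])    #    (b) w_n ~ p(w_n | z_n, beta)
        words.append(w)
    return words

print(generate_document())
```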
A good explanation of LDA:
http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/
Graphical Model corresponding to LDA
[Figure: LDA model (plate diagram)]
In more detail, LDA represents documents as mixtures of topics that spit out words with certain probabilities. It assumes that documents are produced in the following fashion. When writing each document, you:
- Decide on the number of words N the document will have (say, according to a Poisson distribution).
- Choose a topic mixture for the document (according to a Dirichlet distribution over a fixed set of K topics). For example, given two topics, food and cute animals, you might choose the document to consist of 1/3 food and 2/3 cute animals.
- Generate each word w_i in the document by:
  - First picking a topic (according to the multinomial distribution that you sampled above; for example, you might pick the food topic with 1/3 probability and the cute animals topic with 2/3 probability).
  - Using the topic to generate the word itself (according to the topic’s multinomial distribution). For example, if we selected the food topic, we might generate the word “broccoli” with 30% probability, “bananas” with 15% probability, and so on.
Assuming this generative model for a collection of documents, LDA then tries to backtrack from the documents to find a set of topics that are likely to have generated the collection.
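In practice this backtracking is done with approximate posterior inference. As a sketch, one common option is scikit-learn's LatentDirichletAllocation, which fits the model by variational inference; the toy corpus below is invented for illustration:

```python
# Sketch: recovering topics from documents with scikit-learn's
# LatentDirichletAllocation. The corpus is a made-up toy example.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "broccoli bananas spinach broccoli salad",
    "kitten puppy hamster kitten cute",
    "broccoli spinach salad bananas smoothie",
    "puppy kitten cute hamster puppy",
]

# Bag-of-words counts: the representation LDA assumes
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)            # per-document topic mixtures
print(doc_topic.round(2))

# Top words in each recovered topic
terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = topic.argsort()[::-1][:3]
    print(f"topic {k}:", [terms[i] for i in top])
```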
Key Assumptions behind the LDA Topic Model
Documents exhibit multiple topics (but typically not many)
LDA is a probabilistic model with a corresponding generative process; each document is assumed to be generated by this (simple) process.
A topic is a distribution over a fixed vocabulary – these topics are assumed to be generated first, before the documents.
Generative Process
To generate a document:
1. Randomly choose a distribution over topics.
2. For each word in the document:
   (a) randomly choose a topic from the distribution over topics;
   (b) randomly choose a word from the corresponding topic (a distribution over the vocabulary).
- Note that we need a distribution over distributions (for step 1).
- Note that words are generated independently of other words (unigram bag-of-words model; see the sketch below).
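To make the second note concrete: because the likelihood of a document given θ and β is a product over its words, permuting the words changes nothing. A small sketch with made-up values of θ and β:

```python
# Sketch: under the unigram bag-of-words assumption, a document's
# likelihood given theta and beta is a product over its words,
# so word order does not matter. theta and beta are illustrative.
import numpy as np

vocab = {"broccoli": 0, "kitten": 1}
theta = np.array([0.4, 0.6])                 # this document's topic mixture
beta = np.array([[0.9, 0.1],                 # topic 0 over the vocabulary
                 [0.2, 0.8]])                # topic 1 over the vocabulary

def doc_likelihood(words):
    # p(w | theta, beta) = prod_n sum_k theta_k * beta[k, w_n]
    word_probs = theta @ beta                # mixture distribution over words
    return np.prod([word_probs[vocab[w]] for w in words])

print(doc_likelihood(["broccoli", "kitten", "kitten"]))
print(doc_likelihood(["kitten", "broccoli", "kitten"]))   # identical
```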
The Generative Process More Formally
Please see the formulas at the following site: https://www.cl.cam.ac.uk/teaching/1213/L101/clark_lectures/lect7.pdf
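For reference, the joint distribution defined by the generative process above, as given in Blei, Ng, and Jordan (2003), is, for a single document of N words:

```latex
% Joint distribution for one document: a topic mixture \theta,
% topic assignments z = (z_1, ..., z_N), and words w = (w_1, ..., w_N),
% given Dirichlet parameter \alpha and topic-word distributions \beta.
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)
  = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)
```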