LDA and Topic Modeling
LDA is a generative probabilistic model of a corpus. The basic idea is that documents are represented as random mixtures over latent topics, and each topic is characterized by a distribution over words.
LDA's generative process for each document w in a corpus D:
1. Choose N ∼ Poisson(ξ)
2. Choose θ ∼ Dir(α)
3. For each of the N words w_n:
   (a) Choose a topic z_n ∼ Multinomial(θ).
   (b) Choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n.
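As a concrete illustration, here is a minimal NumPy sketch of this generative process. The vocabulary, the two topics, and all parameter values (ξ, α, β) are made-up toy assumptions, not part of the original model description:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy parameters (illustrative only).
vocab = ["broccoli", "bananas", "spinach", "kitten", "puppy", "hamster"]
K = 2                      # number of topics
alpha = np.full(K, 0.5)    # Dirichlet prior over topic mixtures
xi = 8                     # Poisson mean for the document length
# beta[k] is topic k's distribution over the vocabulary.
beta = np.array([
    [0.30, 0.25, 0.25, 0.10, 0.05, 0.05],   # a "food" topic
    [0.05, 0.10, 0.05, 0.30, 0.30, 0.20],   # a "cute animals" topic
])

def generate_document():
    N = rng.poisson(xi)                   # 1. N ~ Poisson(xi)
    theta = rng.dirichlet(alpha)          # 2. theta ~ Dir(alpha)
    words = []
    for _ in range(N):                    # 3. for each of the N words:
        z = rng.choice(K, p=theta)        #    (a) z_n ~ Multinomial(theta)
        w = rng.choice(vocab, p=beta[z])  #    (b) w_n ~ p(w_n | z_n, beta)
        words.append(w)
    return theta, words

theta, doc = generate_document()
print("topic mixture:", theta)
print("document:", doc)
```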

A good explanation of LDA:
http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/
Graphical Model corresponding to LDA
(figure omitted: plate-notation diagram of the LDA graphical model)
LDA Model
In more detail, LDA represents documents as mixtures of topics that spit out words with certain probabilities. It assumes that documents are produced in the following fashion. When writing each document, you:
1. Decide on the number of words N the document will have (say, according to a Poisson distribution).
2. Choose a topic mixture for the document (according to a Dirichlet distribution over a fixed set of K topics). For example, assuming that we have the two food and cute animal topics above, you might choose the document to consist of 1/3 food and 2/3 cute animals.
3. Generate each word w_i in the document by:
   - First picking a topic (according to the multinomial distribution that you sampled above; for example, you might pick the food topic with 1/3 probability and the cute animals topic with 2/3 probability).
   - Using the topic to generate the word itself (according to the topic's multinomial distribution). For example, if we selected the food topic, we might generate the word "broccoli" with 30% probability, "bananas" with 15% probability, and so on.
Assuming this generative model for a collection of documents, LDA then tries to backtrack from the documents to find a set of topics that are likely to have generated the collection.
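To make this "backtracking" (inference) step concrete, here is a minimal sketch using scikit-learn's LatentDirichletAllocation, which fits LDA by variational inference; the toy corpus and the choice of n_components=2 are illustrative assumptions:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy corpus (illustrative only).
docs = [
    "broccoli is good to eat with bananas",
    "my kitten chased the hamster all day",
    "spinach and broccoli smoothies with bananas",
    "puppies and kittens are such cute animals",
]

# Bag-of-words counts (the unigram representation LDA assumes).
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Fit a 2-topic model and inspect the top words of each inferred topic.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = topic.argsort()[::-1][:4]
    print(f"topic {k}:", [terms[i] for i in top])
```

With more documents, the recovered topics would separate into food-like and animal-like word distributions; on a corpus this small the split is only suggestive.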
Key Assumptions behind the LDA Topic Model
- Documents exhibit multiple topics (but typically not many).
- LDA is a probabilistic model with a corresponding generative process; each document is assumed to be generated by this (simple) process.
- A topic is a distribution over a fixed vocabulary; these topics are assumed to be generated first, before the documents.
Generative Process
To generate a document:
1. Randomly choose a distribution over topics
2. For each word in the document:
   (a) randomly choose a topic from the distribution over topics
   (b) randomly choose a word from the corresponding topic (a distribution over the vocabulary)
- Note that we need a distribution over a distribution for step 1; the Dirichlet sketch below illustrates this.
- Note that words are generated independently of other words (unigram bag-of-words model).
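A minimal sketch of the "distribution over a distribution" note: each draw from a Dirichlet is itself a probability distribution over the K topics (the α values here are a made-up assumption):

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = [0.5, 0.5, 0.5]   # hypothetical Dirichlet prior for K = 3 topics

# Each sample is a valid topic mixture: non-negative, sums to 1.
for theta in rng.dirichlet(alpha, size=3):
    print(np.round(theta, 2), "sums to", round(theta.sum(), 6))
```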
The Generative Process, More Formally
See the formulas in the following lecture notes: https://www.cl.cam.ac.uk/teaching/1213/L101/clark_lectures/lect7.pdf
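For reference, the joint distribution that this generative process defines, as given in Blei, Ng, and Jordan (2003), is:

```latex
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)
  = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)
```

Marginalizing over θ and z gives the probability of a single document, which is the quantity inference has to work with:

```latex
p(\mathbf{w} \mid \alpha, \beta)
  = \int p(\theta \mid \alpha)
    \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta
```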