LDA and Topic Modeling
LDA is a generative probabilistic model of a corpus. The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words.
LDA generative process for each document w in a corpus D:
1. Choose N ∼ Poisson(ξ).
2. Choose θ ∼ Dir(α).
3. For each of the N words w_n:
   (a) Choose a topic z_n ∼ Multinomial(θ).
   (b) Choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n.
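A minimal sketch of this process with NumPy may help make it concrete. The vocabulary, topic-word probabilities, and Poisson rate below are illustrative values chosen for this sketch, not anything specified in this wiki:

```python
# A minimal sketch of the LDA generative process.
# vocab, K, alpha, beta, and the Poisson rate are illustrative values.
import numpy as np

rng = np.random.default_rng(0)

vocab = ["broccoli", "bananas", "spinach", "kitten", "puppy", "hamster"]
K = 2                                  # number of topics
alpha = np.ones(K) * 0.5               # Dirichlet prior over topic mixtures
# beta[k] is topic k's distribution over the vocabulary
beta = np.array([
    [0.30, 0.15, 0.35, 0.10, 0.05, 0.05],   # a "food" topic
    [0.05, 0.05, 0.05, 0.30, 0.30, 0.25],   # a "cute animals" topic
])

def generate_document():
    N = rng.poisson(10)                     # 1. N ~ Poisson(xi), here xi = 10
    theta = rng.dirichlet(alpha)            # 2. theta ~ Dir(alpha)
    words = []
    for _ in range(N):                      # 3. for each of the N words:
        z = rng.choice(K, p=theta)          #    (a) z_n ~ Multinomial(theta)
        w = rng.choice(vocab, p=beta[z])    #    (b) w_n ~ p(w_n | z_n, beta)
        words.append(w)
    return words

print(generate_document())
```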
A good explanation of LDA:
http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/
Graphical Model corresponding to LDA
[Figure: LDA model (plate diagram)]
In more detail, LDA represents documents as mixtures of topics that spit out words with certain probabilities. It assumes that documents are produced in the following fashion. When writing each document, you:
- Decide on the number of words N the document will have (say, according to a Poisson distribution).
- Choose a topic mixture for the document (according to a Dirichlet distribution over a fixed set of K topics). For example, given two topics, food and cute animals, you might choose the document to consist of 1/3 food and 2/3 cute animals.
- Generate each word w_i in the document by:
  - First picking a topic (according to the multinomial distribution that you sampled above; for example, you might pick the food topic with 1/3 probability and the cute animals topic with 2/3 probability).
  - Using the topic to generate the word itself (according to the topic’s multinomial distribution). For example, if we selected the food topic, we might generate the word “broccoli” with 30% probability, “bananas” with 15% probability, and so on.
Assuming this generative model for a collection of documents, LDA then tries to backtrack from the documents to find a set of topics that are likely to have generated the collection.
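In practice this backtracking is done with approximate posterior inference. As a sketch, one common option is scikit-learn's LatentDirichletAllocation, which fits the model by variational inference; the toy corpus below is invented for illustration:

```python
# Sketch: recovering topics from documents with scikit-learn's
# LatentDirichletAllocation. The corpus is a made-up toy example.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "broccoli bananas spinach broccoli salad",
    "kitten puppy hamster kitten cute",
    "broccoli spinach salad bananas smoothie",
    "puppy kitten cute hamster puppy",
]

# Bag-of-words counts: the representation LDA assumes
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)            # per-document topic mixtures
print(doc_topic.round(2))

# Top words in each recovered topic
terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = topic.argsort()[::-1][:3]
    print(f"topic {k}:", [terms[i] for i in top])
```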
Key Assumptions behind the LDA Topic Model
Documents exhibit multiple topics (but typically not many)
LDA is a probabilistic model with a corresponding generative process; each document is assumed to be generated by this (simple) process.
A topic is a distribution over a fixed vocabulary – these topics are assumed to be generated first, before the documents.
Generative Process
To generate a document:
1. Randomly choose a distribution over topics.
2. For each word in the document:
   (a) randomly choose a topic from the distribution over topics;
   (b) randomly choose a word from the corresponding topic (a distribution over the vocabulary).
- Note that we need a distribution over distributions (for step 1).
- Note that words are generated independently of other words (unigram bag-of-words model; see the sketch below).
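To make the second note concrete: because the likelihood of a document given θ and β is a product over its words, permuting the words changes nothing. A small sketch with made-up values of θ and β:

```python
# Sketch: under the unigram bag-of-words assumption, a document's
# likelihood given theta and beta is a product over its words,
# so word order does not matter. theta and beta are illustrative.
import numpy as np

vocab = {"broccoli": 0, "kitten": 1}
theta = np.array([0.4, 0.6])                 # this document's topic mixture
beta = np.array([[0.9, 0.1],                 # topic 0 over the vocabulary
                 [0.2, 0.8]])                # topic 1 over the vocabulary

def doc_likelihood(words):
    # p(w | theta, beta) = prod_n sum_k theta_k * beta[k, w_n]
    word_probs = theta @ beta                # mixture distribution over words
    return np.prod([word_probs[vocab[w]] for w in words])

print(doc_likelihood(["broccoli", "kitten", "kitten"]))
print(doc_likelihood(["kitten", "broccoli", "kitten"]))   # identical
```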
The Generative Process More Formally
Please see the formulas at the following site: https://www.cl.cam.ac.uk/teaching/1213/L101/clark_lectures/lect7.pdf
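For reference, the joint distribution defined by the generative process above, as given in Blei, Ng, and Jordan (2003), is, for a single document of N words:

```latex
% Joint distribution for one document: a topic mixture \theta,
% topic assignments z = (z_1, ..., z_N), and words w = (w_1, ..., w_N),
% given Dirichlet parameter \alpha and topic-word distributions \beta.
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)
  = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)
```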