pLSA (Probabilistic Latent Semantic Analysis) - SoojungHong/TextMining GitHub Wiki
Good reference: https://cs.stanford.edu/~ppasupat/a9online/1140.html
LSA, pLSA, LDA
LSA, PLSA, and LDA are methods for modeling the semantics of words based on topics. Main idea: words with similar meanings will occur in similar documents.
|    | cat | dog | computer | internet | rabbit |
|----|-----|-----|----------|----------|--------|
| D1 | 1   | 1   | 0        | 0        | 2      |
| D2 | 3   | 2   | 0        | 0        | 1      |
| D3 | 0   | 0   | 3        | 4        | 0      |
| D4 | 5   | 0   | 2        | 5        | 0      |
| D5 | 2   | 1   | 0        | 0        | 1      |
The dot product of two row vectors gives a similarity between the corresponding documents, while the dot product of two column vectors gives a similarity between the corresponding words.
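As a quick illustration, here is a minimal numpy sketch (variable names are my own) that computes these dot-product similarities on the toy matrix above:

```python
import numpy as np

# Toy document-term matrix from the table above:
# rows = documents D1..D5, columns = [cat, dog, computer, internet, rabbit]
X = np.array([
    [1, 1, 0, 0, 2],   # D1
    [3, 2, 0, 0, 1],   # D2
    [0, 0, 3, 4, 0],   # D3
    [5, 0, 2, 5, 0],   # D4
    [2, 1, 0, 0, 1],   # D5
])

doc_similarity = X @ X.T    # (5 x 5): entry [i, j] = dot product of documents i and j
word_similarity = X.T @ X   # (5 x 5): entry [v, u] = dot product of words v and u

print(doc_similarity[0, 1])   # D1 . D2      = 1*3 + 1*2 + 2*1 = 7
print(word_similarity[0, 4])  # cat . rabbit = 1*2 + 3*1 + 2*1 = 7
```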
To reduce the dimensionality of $X$, apply truncated SVD: $X \approx U_t \Sigma_t V_t^\top$.
Latent Semantic Analysis
The "latent" in Latent Semantic Analysis (LSA) means latent topics. Basically, LSA finds low-dimensional representations of documents and words. Given documents $d_1,\dots,d_m$ and vocabulary words $w_1,\dots,w_n$, we construct a document-term matrix $X \in \mathbb{R}^{m \times n}$, where $x_{ij}$ describes the occurrence of word $w_j$ in document $d_i$. (For example, $x_{ij}$ can be the raw count, a 0-1 indicator, or the TF-IDF weight.)
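A short sketch of the truncated SVD step in plain numpy, reusing the toy matrix from above; keeping t = 2 latent topics is an arbitrary choice for illustration:

```python
import numpy as np

# Same toy document-term matrix as above (documents x words)
X = np.array([
    [1, 1, 0, 0, 2],
    [3, 2, 0, 0, 1],
    [0, 0, 3, 4, 0],
    [5, 0, 2, 5, 0],
    [2, 1, 0, 0, 1],
], dtype=float)

t = 2                                             # number of latent topics to keep (illustrative)
U, s, Vt = np.linalg.svd(X, full_matrices=False)

U_t  = U[:, :t]                # (m x t) left singular vectors  -> document space
S_t  = np.diag(s[:t])          # (t x t) top singular values
Vt_t = Vt[:t, :]               # (t x n) right singular vectors -> word space

X_approx = U_t @ S_t @ Vt_t    # rank-t approximation X ≈ U_t Σ_t V_t^T

docs_lowdim  = U_t @ S_t       # low-dimensional document representations (rows)
words_lowdim = (S_t @ Vt_t).T  # low-dimensional word representations (rows)
```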
LDA (Latent Dirichlet Allocation)
LDA generalizes PLSA (described in the next section) by placing a Dirichlet prior over the per-document topic distributions instead of fitting them as free parameters tied to a fixed document index $d$.
The generative process for the words $w_{i,j}$ (from a vocabulary of size $V$) in documents $i = 1,\dots,M$ is as follows (a sampling sketch is given after the list):
Choose $\theta_i \sim \mathrm{Dir}(\alpha)$ for each document $i = 1,\dots,M$, where $\theta_i \in \Delta^K$.
$\theta_{i,k}$ = probability that document $i \in \{1,\dots,M\}$ has topic $k \in \{1,\dots,K\}$.
Choose $\phi_k \sim \mathrm{Dir}(\beta)$ for each topic $k = 1,\dots,K$, where $\phi_k \in \Delta^V$.
$\phi_{k,v}$ = probability of word $v \in \{1,\dots,V\}$ in topic $k \in \{1,\dots,K\}$.
Choose the topic assignment $c_{i,j} \sim \mathrm{Multinomial}(\theta_i)$, where $c_{i,j} \in \{1,\dots,K\}$.
Choose the word $w_{i,j} \sim \mathrm{Multinomial}(\phi_{c_{i,j}})$, where $w_{i,j} \in \{1,\dots,V\}$.
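A small simulation of this generative process with numpy; the corpus sizes and Dirichlet hyperparameters below are illustrative choices, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

M, K, V = 5, 3, 10      # documents, topics, vocabulary size (illustrative)
N = 20                  # words per document (illustrative)
alpha, beta = 0.5, 0.1  # Dirichlet hyperparameters (illustrative)

phi = rng.dirichlet(np.full(V, beta), size=K)     # phi_k   ~ Dir(beta): one word distribution per topic
theta = rng.dirichlet(np.full(K, alpha), size=M)  # theta_i ~ Dir(alpha): one topic distribution per document

docs = []
for i in range(M):
    words = []
    for j in range(N):
        c_ij = rng.choice(K, p=theta[i])   # topic assignment c_{i,j} drawn from theta_i
        w_ij = rng.choice(V, p=phi[c_ij])  # word w_{i,j} drawn from phi_{c_{i,j}}
        words.append(int(w_ij))
    docs.append(words)

print(docs[0])  # word ids of the first simulated document
```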
pLSA (Probabilistic Latent Semantic Analysis)
Instead of using matrices, Probabilistic Latent Semantic Analysis (PLSA) uses a probabilistic method. Its graphical model corresponds to the factorization
$P(d, w) = P(d) \sum_c P(c \mid d)\, P(w \mid c)$
where
$d$ = document index
$c$ = the word's topic, drawn from $P(c \mid d)$
$w$ = the word, drawn from $P(w \mid c)$
Both $P(c \mid d)$ and $P(w \mid c)$ are modeled as multinomial distributions, and the parameters can be trained with EM (a sketch is given below).
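A bare-bones sketch of those EM updates on a document-term count matrix; the function name plsa_em and the details below are my own simplified version, not a reference implementation:

```python
import numpy as np

def plsa_em(X, K, n_iter=100, seed=0):
    """Fit P(c|d) and P(w|c) by EM on an (m x n) document-term count matrix X."""
    rng = np.random.default_rng(seed)
    m, n = X.shape

    # Random initialization, normalized so each row is a valid distribution
    p_c_given_d = rng.random((m, K))
    p_c_given_d /= p_c_given_d.sum(axis=1, keepdims=True)
    p_w_given_c = rng.random((K, n))
    p_w_given_c /= p_w_given_c.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # E-step: responsibilities P(c | d, w) ∝ P(c|d) P(w|c), shape (m, n, K)
        joint = p_c_given_d[:, None, :] * p_w_given_c.T[None, :, :]
        resp = joint / joint.sum(axis=2, keepdims=True)

        # M-step: re-estimate both multinomials from expected counts n(d, w) * P(c|d,w)
        weighted = X[:, :, None] * resp
        p_w_given_c = weighted.sum(axis=0).T
        p_w_given_c /= p_w_given_c.sum(axis=1, keepdims=True)
        p_c_given_d = weighted.sum(axis=1)
        p_c_given_d /= p_c_given_d.sum(axis=1, keepdims=True)

    return p_c_given_d, p_w_given_c

# Example: fit two topics to the toy matrix from the LSA section
X = np.array([
    [1, 1, 0, 0, 2],
    [3, 2, 0, 0, 1],
    [0, 0, 3, 4, 0],
    [5, 0, 2, 5, 0],
    [2, 1, 0, 0, 1],
], dtype=float)
p_c_d, p_w_c = plsa_em(X, K=2)
```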
One pitfall is the lack of parameters for $P(d)$, so we don't know how to assign a probability to a new, unseen document. Another is that the number of parameters for $P(c \mid d)$ grows linearly with the number of documents, which leads to overfitting.