pLSA (Probabilistic Latent Semantic Analysis)

Good reference:

https://cs.stanford.edu/~ppasupat/a9online/1140.html

LSA, pLSA, LDA

LSA, pLSA, and LDA are methods for modeling the semantics of words based on topics. Main idea: words with similar meaning will occur in similar documents. The example document-term matrix below illustrates this:

|    | cat | dog | computer | internet | rabbit |
|----|-----|-----|----------|----------|--------|
| D1 | 1   | 1   | 0        | 0        | 2      |
| D2 | 3   | 2   | 0        | 0        | 1      |
| D3 | 0   | 0   | 3        | 4        | 0      |
| D4 | 5   | 0   | 2        | 5        | 0      |
| D5 | 2   | 1   | 0        | 0        | 1      |

The dot product of two row vectors measures the similarity of the corresponding documents, while the dot product of two column vectors measures the similarity of the corresponding words.
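
As a concrete illustration, here is a minimal numpy sketch (the variable names are my own) that builds the matrix above and computes both kinds of similarity:

```python
import numpy as np

# Document-term matrix from the table above:
# rows = documents D1..D5, columns = (cat, dog, computer, internet, rabbit).
X = np.array([
    [1, 1, 0, 0, 2],   # D1
    [3, 2, 0, 0, 1],   # D2
    [0, 0, 3, 4, 0],   # D3
    [5, 0, 2, 5, 0],   # D4
    [2, 1, 0, 0, 1],   # D5
])

# Document similarities: dot products of row vectors, i.e. X X^T.
doc_sim = X @ X.T
print(doc_sim[0, 1])   # D1 . D2 = 1*3 + 1*2 + 2*1 = 7

# Word similarities: dot products of column vectors, i.e. X^T X.
word_sim = X.T @ X
print(word_sim[0, 1])  # cat . dog = 1*1 + 3*2 + 2*1 = 9
```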

Latent Semantic Analysis

The latent in Latent Semantic Analysis (LSA) means latent topics. Basically, LSA finds low-dimensional representations of documents and words. Given documents d_1, …, d_m and vocabulary words w_1, …, w_n, we construct a document-term matrix X ∈ R^(m×n) where x_ij describes the occurrence of word w_j in document d_i. (For example, x_ij can be the raw count, a 0-1 indicator, or the TF-IDF weight.)

To reduce the dimensionality of X, apply truncated SVD: X ≈ U_t Σ_t V_t^T. The rows of U_t Σ_t then give t-dimensional document representations, and the rows of V_t Σ_t give t-dimensional word representations.
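
A minimal sketch of this step with plain numpy, reusing X from the snippet above (t = 2 is an arbitrary choice for this small example):

```python
import numpy as np

t = 2
U, S, Vt = np.linalg.svd(X.astype(float), full_matrices=False)
U_t, S_t, V_t = U[:, :t], S[:t], Vt[:t, :].T

# Rank-t approximation X ≈ U_t Σ_t V_t^T.
X_approx = U_t @ np.diag(S_t) @ V_t.T

# Rows are t-dimensional embeddings of documents and words.
doc_vecs  = U_t @ np.diag(S_t)   # shape (5, t)
word_vecs = V_t @ np.diag(S_t)   # shape (5, t)
```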

LDA (Latent Dirichlet Allocation)

LDA generalizes pLSA (described in the next section) by replacing the fixed per-document topic distribution P(c∣d) with a Dirichlet prior.

The generative process for each word w_{i,j} (from a vocabulary of size V) in document d_i is as follows (a sampling sketch follows the list):

Choose θ_i ∼ Dir(α) (where i = 1, …, M; θ_i ∈ Δ^K)
    θ_{i,k} = probability that document i ∈ {1, …, M} has topic k ∈ {1, …, K}.
Choose ϕ_k ∼ Dir(β) (where k = 1, …, K; ϕ_k ∈ Δ^V)
    ϕ_{k,v} = probability of word v ∈ {1, …, V} in topic k ∈ {1, …, K}.
Choose c_{i,j} ∼ Multinomial(θ_i) (where c_{i,j} ∈ {1, …, K})
Choose w_{i,j} ∼ Multinomial(ϕ_{c_{i,j}}) (where w_{i,j} ∈ {1, …, V})
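
To make the process concrete, here is a minimal numpy sketch that samples a toy corpus from this model (all sizes and hyperparameters below are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: M documents, K topics, V vocabulary words, N words/document.
M, K, V, N = 5, 2, 10, 20
alpha, beta = 0.5, 0.1

# phi_k ~ Dir(beta): one word distribution per topic, shape (K, V).
phi = rng.dirichlet(np.full(V, beta), size=K)

docs = []
for i in range(M):
    theta = rng.dirichlet(np.full(K, alpha))  # theta_i ~ Dir(alpha)
    words = []
    for j in range(N):
        c = rng.choice(K, p=theta)    # c_{i,j} ~ Multinomial(theta_i)
        w = rng.choice(V, p=phi[c])   # w_{i,j} ~ Multinomial(phi_{c_{i,j}})
        words.append(w)
    docs.append(words)
```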

pLSA (Probabilistic Latent Semantic Analysis)

Instead of using matrices, Probabilistic Latent Semantic Analysis (pLSA) takes a probabilistic approach: it models the joint probability of a document and a word through a latent topic c:

P(w, d) = P(d) ∑_c P(c∣d) P(w∣c)

d = document index
c = the word's topic, drawn from P(c∣d)
w = the word, drawn from P(w∣c)

Both P(c∣d) and P(w∣c) are modeled as multinomial distributions, and the parameters can be trained with the EM algorithm.
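
As a sketch of that training loop in numpy (the function name and defaults are my own; X is a document-term count matrix such as the one above):

```python
import numpy as np

def plsa_em(X, K, n_iter=100, seed=0):
    """EM for pLSA: returns P(c|d) of shape (M, K) and P(w|c) of shape (K, V)."""
    rng = np.random.default_rng(seed)
    M, V = X.shape

    # Random initialization of the two multinomial parameter tables.
    p_c_d = rng.random((M, K)); p_c_d /= p_c_d.sum(axis=1, keepdims=True)
    p_w_c = rng.random((K, V)); p_w_c /= p_w_c.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # E-step: P(c|d,w) ∝ P(c|d) P(w|c), shape (M, K, V).
        post = p_c_d[:, :, None] * p_w_c[None, :, :]
        post /= post.sum(axis=1, keepdims=True)

        # M-step: reweight the posteriors by the observed counts n(d, w).
        weighted = X[:, None, :] * post          # n(d,w) * P(c|d,w)
        p_w_c = weighted.sum(axis=0)             # sum over documents d
        p_w_c /= p_w_c.sum(axis=1, keepdims=True)
        p_c_d = weighted.sum(axis=2)             # sum over words w
        p_c_d /= p_c_d.sum(axis=1, keepdims=True)

    return p_c_d, p_w_c

# Usage with the example matrix above: p_c_d, p_w_c = plsa_em(X, K=2)
```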

One pitfall is that d is merely an index over the training documents, with no parameters for P(d) that carry over, so pLSA cannot assign a probability to a new document. Another is that the number of parameters for P(c∣d) grows linearly with the number of documents, which leads to overfitting.