Building a document recommendation system with LDA - derlin/bda-lsa-project GitHub Wiki

The use cases for LDA are a bit different from those of SVD: it is more a clustering technique than a recommendation tool.

Anyway, as explained in the Spark MLlib documentation:

[The generated] topic distributions can be used in many ways:

  • Clustering: Topics are cluster centers and documents are associated with multiple clusters (topics). This clustering can help organize or summarize document collections.
  • Feature generation: LDA can generate features for other ML algorithms to use. As mentioned above, LDA infers a distribution over topics for each document; with k topics, this gives k numerical features. These features can then be plugged into algorithms such as Logistic Regression or Decision Trees for prediction tasks.
  • Dimensionality reduction: Each document’s distribution over topics gives a concise summary of the document. Comparing documents in this reduced feature space can be more meaningful than comparing in the original feature space of words.

To get the same document-to-documents recommendation as in the SVD model, we could do one of two things.

a) basic way

To find related documents, we could simply take the most meaningful topics for the input document, take the x most meaningful documents for each of those topics, and finally sort those documents by their weight within each topic. The result is a list of documents sorted by topic weight.

This is the approach we used, but it is not the most interesting one: it only takes into account the top topics, not the whole distribution.
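The basic approach can be sketched in plain Scala. The data structures below (in-memory maps of topic weights per document and document weights per topic) are illustrative stand-ins for what the LDA model provides, not the project's actual code:

```scala
// Sketch of the basic approach, assuming (hypothetical) in-memory maps
// extracted from the LDA model.
object BasicRecommender {
  // docTopics(docId)  = topic distribution of that document
  // topicDocs(topic)  = (docId, weight) pairs for that topic
  def recommend(docId: Long,
                docTopics: Map[Long, Array[Double]],
                topicDocs: Map[Int, Seq[(Long, Double)]],
                nTopics: Int, nDocs: Int): Seq[Long] = {
    // 1) pick the input document's most meaningful topics
    val topTopics = docTopics(docId).zipWithIndex.sortBy(-_._1).take(nTopics).map(_._2)
    // 2) gather the top documents of each of those topics, keeping their weight
    val candidates = topTopics.flatMap(t => topicDocs.getOrElse(t, Seq.empty).sortBy(-_._2).take(nDocs))
    // 3) sort candidates by topic weight, drop the input doc itself, deduplicate
    candidates.sortBy(-_._2).map(_._1).filter(_ != docId).distinct.take(nDocs)
  }
}
```

Note that a document appearing in several top topics is only counted once, at its best rank; this is one of the ways the basic approach discards information from the full distribution.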

b) cosine similarity

A more interesting option would be to compute the similarity between documents directly from their topic distributions.

Since the LDAModel builds a topic distribution for each document of the corpus, we could apply a cosine-based similarity measure to each pair of documents. Then, given a document a as input, we retrieve all the pairs (a, _), sort them by similarity and take the x most relevant ones.

The formula would be:

sim(i, j) = cos(vec_i, vec_j) = (vec_i · vec_j) / (‖vec_i‖ × ‖vec_j‖)

where i and j are two documents and vec_i, vec_j their topic-distribution vectors.
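Translated into plain Scala (outside Spark), the formula is a few lines; the function name is illustrative:

```scala
// Cosine similarity between two topic-distribution vectors:
// dot product divided by the product of the Euclidean norms.
def cosineSimilarity(a: Array[Double], b: Array[Double]): Double = {
  require(a.length == b.length, "vectors must have the same dimension")
  val dot   = a.zip(b).map { case (x, y) => x * y }.sum
  val normA = math.sqrt(a.map(x => x * x).sum)
  val normB = math.sqrt(b.map(x => x * x).sum)
  dot / (normA * normB)
}
```

Since topic distributions have non-negative components, the result always lies in [0, 1], with 1 meaning identical topic mixtures.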

Spark's DistributedLDAModel already provides the topic distribution vector for each document:

val distrib: RDD[(Long, Vector)] = model.topicDistributions

Also, the mllib RowMatrix class has a columnSimilarities method that computes the cosine similarity between column vectors.

Computing the cosine similarity should thus be easy. Nonetheless, we didn't have the time to implement it. Some of the difficulties are:

  • the size of the corpus: more than 4 million documents, so the result would be a 4-million-by-4-million similarity matrix...
  • we want the row similarities, not the column similarities. Transforming the RDD into a matrix is trivial, but transposing it is not: we might lose the docId information along the way
  • Spark has many Matrix classes, each with different methods and behaviors. Finding the proper one to use at a given moment, and converting between them, makes it all complicated...
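Setting the scaling problem aside, the intended pipeline can be sketched in plain Scala on an in-memory map of topic distributions (a stand-in for the collected RDD; names are illustrative). This brute-force all-pairs scan is exactly what becomes intractable at 4 million documents:

```scala
// Small-scale sketch of approach b): score every other document against the
// input document by cosine similarity and keep the x most similar ones.
object CosineRecommender {
  private def cosine(a: Array[Double], b: Array[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    dot / (math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum))
  }

  def mostSimilar(docId: Long,
                  distributions: Map[Long, Array[Double]],
                  x: Int): Seq[Long] = {
    val input = distributions(docId)
    distributions.toSeq
      .collect { case (id, vec) if id != docId => (id, cosine(input, vec)) }
      .sortBy(-_._2)   // most similar first
      .take(x)
      .map(_._1)
  }
}
```

On a real corpus this per-query scan is O(n) per document and O(n²) for all pairs, which is why a distributed primitive such as columnSimilarities (plus the transposition issue above) would be needed instead.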

Resources