Conclusion
In the end, we learned a lot about the different concepts involved in Latent Semantic Analysis. Here is what we got out of it:
Preprocessing
- Using a pipeline and saving intermediate outputs on HDFS saves a lot of time. Most computations are expensive, so saving hours can really make a difference in how fast the work advances in the long run (see the sketch after this list).
- The mapping between document IDs and titles is fragile in the model we used, so it is important to save it alongside the model.
- Everything takes time! Planning ahead and launching long-running jobs as early as possible is a best practice.
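As a minimal sketch of the first two points, assuming Parquet files on HDFS (the helper object and the paths are hypothetical, not the project's actual code):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper: persist an intermediate result on HDFS so that
// later stages can reload it instead of recomputing hours of work.
object Checkpoints {
  def save(df: DataFrame, path: String): Unit =
    df.write.mode("overwrite").parquet(path)

  def load(spark: SparkSession, path: String): DataFrame =
    spark.read.parquet(path)
}

// Usage: save the term-frequency matrix and the docId -> title mapping
// right after preprocessing, then reload them in the modelling steps.
// Checkpoints.save(termFrequencies, "hdfs:///lsa/tf")
// Checkpoints.save(docIdTitles, "hdfs:///lsa/docid-titles")
```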
Results
SVD
SVD is rather powerful and works quite well in several use cases. For example, it is really effective as a recommendation system, using methods such as topDocsForDoc.
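As a simplified sketch of such a lookup, assuming the term-document matrix is an mllib RowMatrix whose row order matches the document IDs (variable names are illustrative; a fuller query engine would scale U by the singular values and normalize rows before comparing):

```scala
import org.apache.spark.mllib.linalg.Matrices
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Assumes `termDocMatrix: RowMatrix` was built during preprocessing.
// Compute a rank-k SVD, keeping U so each row is a document in concept space.
val k = 200
val svd = termDocMatrix.computeSVD(k, computeU = true)

// topDocsForDoc-style query: score every document against the concept-space
// representation of `docId` and keep the highest-scoring matches.
def topDocsForDoc(docId: Long, numDocs: Int = 10): Array[(Double, Long)] = {
  val docRow = svd.U.rows.zipWithUniqueId()
    .filter { case (_, id) => id == docId }
    .first()._1.toArray
  val query = Matrices.dense(docRow.length, 1, docRow)
  svd.U.multiply(query).rows
    .map(_.toArray(0)).zipWithUniqueId()
    .filter { case (score, _) => !score.isNaN }
    .top(numDocs)
}
```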
LDA
LDA overlaps with SVD in functionality, but also differs in some respects. It is, for example, useful for inspecting topics and for classifying new documents, but harder to use as a recommendation system because everything has to go through the topic distributions.
About our results: we first tried 1'000 topics, but the results were unsatisfying; using 200 topics yields much more interesting and concise topics. The number of iterations also plays an important role in the final quality of the topics: the default of 10 iterations is unusable.
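With Spark ML's LDA, these settings translate roughly to the following sketch (the input DataFrame and its column names are assumptions):

```scala
import org.apache.spark.ml.clustering.LDA

// Assumes `countVectors` is a DataFrame with a "features" column of
// term-count vectors produced during preprocessing.
val lda = new LDA()
  .setK(200)       // 200 topics gave us much better results than 1'000
  .setMaxIter(100) // the default of 10 iterations produces unusable topics

val ldaModel = lda.fit(countVectors)

// Inspect the top terms of each topic...
ldaModel.describeTopics(10).show()

// ...and classify documents through their topic distribution.
val withTopics = ldaModel.transform(countVectors)
```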
Comparison
SVD and LDA have their own characteristics:
- SVD => ordered and consistent topics; multiple executions give the same results.
- LDA => topic order is random due to the clustering, and each execution can give different results.
In the end, these two models don't give the same topics, but both seem interesting in their own way.
Perspectives
About pre-processing: even after lemmatization and the other cleanup steps, we still find words like "new" that could be considered stopwords. One improvement could be to change the vocabulary size (we took 20'000); a higher or lower number could lead to more interesting results. HashingTF may also be useful to improve performance, as sketched below.
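A sketch of both options with Spark ML (the "lemmas" and "features" column names are assumptions):

```scala
import org.apache.spark.ml.feature.{CountVectorizer, HashingTF}

// What we did: build an exact vocabulary, capped at 20'000 terms.
val countVectorizer = new CountVectorizer()
  .setInputCol("lemmas")
  .setOutputCol("features")
  .setVocabSize(20000)

// Possible alternative: HashingTF skips building a vocabulary entirely,
// trading some accuracy (hash collisions) for speed and memory.
val hashingTF = new HashingTF()
  .setInputCol("lemmas")
  .setOutputCol("features")
  .setNumFeatures(20000)
```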
Model tuning is very important for LDA: alpha, beta and k are important parameters. We could have tried more values to see their impact on the results, but we were still quite happy with the results obtained with the default values.
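In Spark ML's LDA, alpha and beta correspond to the docConcentration and topicConcentration parameters; a sketch of setting them (the values are placeholders, not tuned recommendations):

```scala
import org.apache.spark.ml.clustering.LDA

val tunedLda = new LDA()
  .setK(200)
  .setDocConcentration(1.1)   // alpha: prior on document-topic mixtures
  .setTopicConcentration(1.1) // beta: prior on topic-term distributions
```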
The QueryEngine can also be improved. We used a simple algorithm to get the results we presented; it could be improved by using cosine similarity to recommend documents, for example, as sketched below.
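A minimal, self-contained version of that similarity measure (a hypothetical helper, not the QueryEngine's actual code):

```scala
import org.apache.spark.mllib.linalg.Vector

// Cosine similarity between two document vectors: 1.0 means identical
// direction, 0.0 means orthogonal (no shared terms or concepts).
def cosineSimilarity(a: Vector, b: Vector): Double = {
  val (x, y) = (a.toArray, b.toArray)
  val dot = x.zip(y).map { case (u, v) => u * v }.sum
  val norms = math.sqrt(x.map(u => u * u).sum) * math.sqrt(y.map(v => v * v).sum)
  if (norms == 0.0) 0.0 else dot / norms
}
```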
And finally, an implementation improvement would be to use Spark ML Pipelines. They would make the whole work of saving and reloading models easier.
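A sketch of what that could look like (the stages and paths are illustrative, not the project's actual setup):

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.clustering.LDA
import org.apache.spark.ml.feature.{CountVectorizer, RegexTokenizer, StopWordsRemover}

// Chain the preprocessing steps and the model into a single pipeline...
val pipeline = new Pipeline().setStages(Array(
  new RegexTokenizer().setInputCol("text").setOutputCol("tokens"),
  new StopWordsRemover().setInputCol("tokens").setOutputCol("filtered"),
  new CountVectorizer().setInputCol("filtered").setOutputCol("features").setVocabSize(20000),
  new LDA().setK(200).setMaxIter(100)
))

// ...so the whole fitted chain can be saved and reloaded in one call.
// val fitted = pipeline.fit(corpus)
// fitted.write.overwrite().save("hdfs:///lsa/pipeline-model")
// val reloaded = PipelineModel.load("hdfs:///lsa/pipeline-model")
```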