ICP6 - smqhw/kdm1 GitHub Wiki
- What have you learned in this ICP?
In this ICP I learned that topic modeling is a technique for extracting the hidden topics from large volumes of text. Latent Dirichlet Allocation (LDA) is a popular algorithm for topic modeling, implemented in Python's Gensim package. I set three hyperparameters, chunksize, passes, and random_state; by changing their values I observed that the words in each topic shift from the baseline values, and that coherence measures the relative distance between words within a topic. In the visualization, the topics were either overlapping with one another or separated from each other. There are two major coherence measures: C_v, typically 0 < x < 1, and UMass, typically -14 < x < 14. It is rare to see a coherence of 1 or above 0.9 unless the words being measured are either identical words or bigrams: "United" and "States" would likely return a coherence score of ~0.94, while "hero" and "hero" would return a coherence of 1. The overall coherence score of a topic is the average of the distances between its words. I try to attain 0.7 in my LDAs when using C_v, since I think that indicates a strong topic correlation. Roughly: 0.3 is bad, 0.4 is low, 0.55 is okay, 0.65 might be as good as it is going to get, 0.7 is nice, 0.8 is unlikely, and 0.9 is probably wrong.
The main objective is to observe the coherence score while changing the hyperparameter values, and to display how each topic performs in terms of its size and overlap. In natural language processing, latent Dirichlet allocation (LDA) is a generative statistical model that allows sets of observations to be explained by unobserved groups which explain why some parts of the data are similar. For example, if the observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word's presence is attributable to one of the document's topics. LDA is an example of a topic model and belongs to the machine learning toolbox and, in a wider sense, to the artificial intelligence toolbox.
Design / Implementation
<img src="https://user-images.githubusercontent.com/77996454/109592403-35912d80-7add-11eb-856e-d780b7cfb40f.png" alt="alt text" width="700" height="350"> <img src="https://user-images.githubusercontent.com/77996454/109592416-3e81ff00-7add-11eb-94e8-32346a224058.png" alt="alt text" width="700" height="350">
- A video is included in the code as the ICP6 video file
5. Conclusion
Overall, when the hyperparameter values are changed, the coherence score and the visualization change along with them. Running the code multiple times shows the topics in the visualization either overlapping or separating from each other. A coherence score above 0.4 is reasonable, since the score reflects the average of the distances between words within a topic.