ICP_6 - Girees737/KDM_Projects GitHub Wiki

ICP-6

Name: Gireesh Kumar Muppalla Email: [email protected]

Software Required:

Jupyter Notebook, Python

Tasks:

  1. Download the data and source code from the following link and make sure the source code is running at your end: https://umkc.box.com/s/yatsidy8xqg3qwo9gm6ao03dvq4vjnjg
  2. Run the source code multiple times (7 to 10 times minimum) and in each of these runs do the following: a. Pick 3 hyperparameters and change their values (you may need to find the allowable values for those hyperparameters). Note that once you have picked the 3 hyperparameters, you must use the same ones for all the other runs and change only their values. b. Find the coherence score and record it. c. Show your pyLDAvis visualization.
  3. Write down the coherence scores, the 3 hyperparameters and their values, and compare them with each other and with the baseline source code values (coherence score = 0.423).
  4. Justify your selection of 3 hyperparameters and their respective values in each of these runs.
  5. In your opinion, for each of these 3 hyperparameters, what is the best value and why?
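
The bookkeeping asked for in tasks 2b and 3 could be sketched as a small helper that ranks recorded coherence scores and reports how far each one is from the baseline. This is only an illustrative sketch: the name `compare_to_baseline` and the shape of the `runs` mapping are assumptions and are not part of the provided source code, and no measured scores are filled in here.

```python
BASELINE_COHERENCE = 0.423  # baseline value stated in task 3


def compare_to_baseline(runs, baseline=BASELINE_COHERENCE):
    """Rank recorded runs by coherence and report the change from the baseline.

    `runs` is a hypothetical mapping from a hyperparameter tuple
    (num_topics, passes, alpha) to the coherence score recorded for that run.
    Returns a list of (params, score, delta) tuples, best score first.
    """
    rows = [
        (params, score, round(score - baseline, 4))
        for params, score in runs.items()
    ]
    # Higher coherence is better, so sort in descending order of score.
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

A positive delta means the run scored above the baseline and a negative delta means it scored below.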

Implementation:

  1. Installed the required libraries to perform LDA.
  2. Selected 7 sets of hyperparameter values (num_topics, passes, alpha), one set per run.
  3. Ran the model once for each set and observed the coherence score, the topic similarities and the LDA plot.
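
The steps above can be sketched roughly as follows, using the seven hyperparameter sets listed on this page. This is a minimal sketch, not the provided source code: it assumes the model is gensim's `LdaModel`, that `texts` is a list of tokenized documents, and that coherence is measured with the `c_v` measure. The function name `run_lda` and the `seed` argument are hypothetical.

```python
# The seven hyperparameter sets used in the runs on this page.
RUNS = [
    {"num_topics": 7, "passes": 22, "alpha": "auto"},
    {"num_topics": 5, "passes": 18, "alpha": 0.1},
    {"num_topics": 8, "passes": 20, "alpha": "symmetric"},
    {"num_topics": 4, "passes": 22, "alpha": "auto"},
    {"num_topics": 3, "passes": 10, "alpha": 0.01},
    {"num_topics": 10, "passes": 15, "alpha": "symmetric"},
    {"num_topics": 8, "passes": 20, "alpha": "auto"},
]


def run_lda(texts, params, seed=0):
    """Train one LDA model and return it with its corpus, dictionary and coherence.

    Assumptions: `texts` is a list of token lists; coherence uses 'c_v'.
    gensim is imported lazily so the grid above can be inspected without it.
    """
    from gensim.corpora import Dictionary
    from gensim.models import CoherenceModel, LdaModel

    dictionary = Dictionary(texts)
    corpus = [dictionary.doc2bow(doc) for doc in texts]
    # A fixed random_state keeps runs comparable when only the
    # hyperparameters change.
    lda = LdaModel(corpus=corpus, id2word=dictionary,
                   random_state=seed, **params)
    coherence = CoherenceModel(model=lda, texts=texts,
                               dictionary=dictionary,
                               coherence="c_v").get_coherence()
    return lda, corpus, dictionary, coherence
```

With the objects returned by `run_lda`, the plot for a run could then be produced with `pyLDAvis.gensim_models.prepare(lda, corpus, dictionary)` (the module is named `pyLDAvis.gensim` in older pyLDAvis releases).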

With the given hyperparameters num_topics = 7, passes=22, alpha='auto':

LDA Plot:

With hyperparameters num_topics = 5, passes=18, alpha=0.1:

LDA Plot:

With hyperparameters num_topics = 8, passes=20, alpha='symmetric':

Coherence score:

LDA Plot:

With hyperparameters num_topics = 4, passes=22, alpha='auto':

Coherence score:

LDA Plot:

With hyperparameters num_topics = 3, passes=10, alpha=0.01:

Coherence score:

LDA Plot:

With hyperparameters num_topics = 10, passes=15, alpha='symmetric':

Coherence score:

LDA Plot:

With hyperparameters num_topics = 8, passes=20, alpha='auto':

Coherence score:

Justification of the selection of 3 hyperparameters and their respective values in each of these runs:

I varied num_topics over a wide range (3 to 10) across the runs, together with passes and alpha, to find the setting in which the resulting topics are most dissimilar to each other.

In your opinion, for each of these 3 hyperparameters, what is the best value and why?

Based on the runs above, the configuration with 5 topics (num_topics = 5, passes=18, alpha=0.1) appears to be the best for the given text, as most of its topics are non-overlapping and dissimilar.
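
The judgment above was made by looking at the plots. One possible way to put a number on "non-overlapping" is the average Jaccard overlap between the top words of every pair of topics, where a value near 0 means the topics share few words. This is a swapped-in illustration, not the method used in these runs, and the names `jaccard` and `mean_topic_overlap` are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two word collections: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mean_topic_overlap(topics):
    """Average pairwise Jaccard overlap across topics.

    `topics` is a list of top-word lists, one list per topic.
    Returns 0.0 when there are fewer than two topics.
    """
    pairs = list(combinations(topics, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

With gensim, the top words of a topic could be taken from `lda.show_topic(topic_id, topn=10)`, which returns (word, probability) pairs. A lower mean overlap would support the visual impression that the topics are well separated.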