Using LargeLDA - sameerwadkar/largelda GitHub Wiki

Using Large LDA

Create a file named TopicModeling.properties. See the sample at src/main/resources/TopicModeling.properties.
Create the following directories, which are referenced in the properties file above. Change the paths as required, and remember to update TopicModeling.properties accordingly. The paths below are only examples.

a. A working directory. Ex. C:/topicmodel/wd/
b. A stoplist directory. Ex. C:/topicmodel/data/stoplists/ Copy your stoplist file into it. A sample is provided in the src/main/resources/stoplists folder.
c. A training data folder. Your input text file goes here. Ex. C:/topicmodel/data/training/
   i. The training data has three tab-separated fields:
      NAME - Name of the file
      MARKER - X
      DATA - Text for each file
      See the samples src/main/resources/ap.txt (about 2000 documents) or src/main/resources/patentabstracts.txt (half a million documents).
d. A testing data folder. The file containing the documents to be tested against the generated model goes here. Ex. C:/topicmodel/data/testing/ The format is tab-separated fields as follows:
   i. NAME
   ii. DATA
   A sample file is available at src/main/testing/ap.txt.
e. A folder to save the generated model. Ex. C:/topicmodel/model/ (the model is saved there as a file, e.g. C:/topicmodel/model/patents.ser)
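Before a long training run, it can help to confirm that the input files really follow the tab-separated layouts described above. The following is a minimal sketch, not part of LargeLDA: the class name `TrainingLineCheck` and its methods are hypothetical, and the sample lines assume that the MARKER field holds the literal `X`.

```java
public class TrainingLineCheck {

    /** Splits a training line into NAME, MARKER, DATA (tab-separated). */
    public static String[] parseTrainingLine(String line) {
        return splitFields(line, 3);
    }

    /** Splits a testing line into NAME, DATA (tab-separated). */
    public static String[] parseTestingLine(String line) {
        return splitFields(line, 2);
    }

    private static String[] splitFields(String line, int expected) {
        // The limit keeps any additional tabs inside the last (DATA) field.
        String[] fields = line.split("\t", expected);
        if (fields.length != expected) {
            throw new IllegalArgumentException(
                "expected " + expected + " tab-separated fields, got " + fields.length);
        }
        return fields;
    }

    public static void main(String[] args) {
        // Hypothetical sample lines, not taken from ap.txt.
        String[] train = parseTrainingLine("doc1\tX\tthe quick brown fox");
        System.out.println("NAME=" + train[0] + " MARKER=" + train[1] + " DATA=" + train[2]);

        String[] test = parseTestingLine("doc2\tjumps over the lazy dog");
        System.out.println("NAME=" + test[0] + " DATA=" + test[1]);
    }
}
```

Running each line of a training or testing file through these checks surfaces rows with missing fields early, rather than partway through model training.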
Update the following properties:

    #The working directory for the program. Once created, do not delete this folder
    PLDA_WORKING_DIR=C:/topicmodel/wd/

    #Path to the stoplist file
    PLDA_STOPLIST_FILE=C:/topicmodel/data/stoplists/en.txt

    #Iteration count. The higher the number of iterations, the better the quality of the model. Use a number between 500 and 2000
    PLDA_ITERATION_CNT=500

    #Input file path
    PLDA_MODEL_TRAINING_DATA_FILE=C:/topicmodel/data/training/patentabstracts.txt

    #No of topics (use your judgement)
    PLDA_TOPIC_CNT=400

    #No of parallel threads to use
    PLDA_THREAD_CNT=5

    #Size of the thread pool
    PLDA_THREADPOOL_SIZE=5

    #LDA parameters (next two properties)
    PLDA_ALPHA=1.0
    PLDA_BETA=0.1

    #No of words to print per topic
    PLDA_NO_OF_WORDS_PER_TOPIC=10

    #Prints to console once every n iterations
    PLDA_PRINT_EVERY_N_ITERATIONS=10

    #All probabilities below this number are ignored. Recommended threshold 10%
    PLDA_PROB_THRESHOLD=0.1

    #Path where the model is saved. Only saved at the end of the program.
    PLDA_MODEL_PATH=C:/topicmodel/model/patents.ser

    #Testing file containing new documents against which to use the model
    PLDA_MODEL_TESTING_DATA_FILE=C:/topicmodel/data/testing/patents.txt

    #Output file for the testing results
    PLDA_MODEL_TESTING_OUT_FILE=C:/topicmodel/data/testing/patent_results.csv
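A quick sanity check of the edited properties can catch typos before training starts. The sketch below uses the standard `java.util.Properties` class; whether LargeLDA itself reads the file this way is an assumption, and the class `TopicModelingConfigCheck`, its helper methods, and the 0 to 1 range check on the threshold are hypothetical illustrations rather than part of the project.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class TopicModelingConfigCheck {

    /** Parses properties text; lines starting with '#' are treated as comments. */
    public static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Returns the value of key as an int, rejecting missing or non-positive values. */
    public static int positiveInt(Properties props, String key) {
        String raw = props.getProperty(key);
        if (raw == null) {
            throw new IllegalArgumentException("missing property: " + key);
        }
        int value = Integer.parseInt(raw.trim());
        if (value <= 0) {
            throw new IllegalArgumentException(key + " must be > 0, got " + value);
        }
        return value;
    }

    /** Returns the value of key as a double in [0, 1] (assumed range for a probability). */
    public static double fraction(Properties props, String key) {
        String raw = props.getProperty(key);
        if (raw == null) {
            throw new IllegalArgumentException("missing property: " + key);
        }
        double value = Double.parseDouble(raw.trim());
        if (value < 0.0 || value > 1.0) {
            throw new IllegalArgumentException(key + " must be in [0, 1], got " + value);
        }
        return value;
    }

    public static void main(String[] args) {
        // A few of the sample values from this page, inlined for illustration.
        String sample = String.join("\n",
            "#No of topics",
            "PLDA_TOPIC_CNT=400",
            "PLDA_ITERATION_CNT=500",
            "PLDA_PROB_THRESHOLD=0.1",
            "PLDA_WORKING_DIR=C:/topicmodel/wd/");
        Properties props = parse(sample);
        System.out.println("topics=" + positiveInt(props, "PLDA_TOPIC_CNT"));
        System.out.println("iterations=" + positiveInt(props, "PLDA_ITERATION_CNT"));
        System.out.println("threshold=" + fraction(props, "PLDA_PROB_THRESHOLD"));
        System.out.println("workingDir=" + props.getProperty("PLDA_WORKING_DIR"));
    }
}
```

When a file is read with `java.util.Properties`, a backslash is an escape character, which is one reason to keep the forward-slash style used in the sample paths (C:/topicmodel/...).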