Using LargeLDA - sameerwadkar/largelda GitHub Wiki
Using Large LDA
Create a file named TopicModeling.properties. A sample is available at src/main/resources/TopicModeling.properties
Create the following directories, which the properties file above references. Change the paths as required, and remember to update TopicModeling.properties accordingly. The paths below are only an example.
a. C:/topicmodel/wd/
b. C:/topicmodel/data/stoplists/ - Copy your stoplist file into it. A sample is provided in the src/main/resources/stoplists folder
c. Create a sample training data folder. Your input text file goes here. Ex. C:/topicmodel/data/training/
i. The training data has three tab-separated fields.
NAME - Name of the file
MARKER - X
DATA - Text of each file
See the sample files src/main/resources/ap.txt (about 2000 documents) and src/main/resources/patentabstracts.txt (half a million documents)
d. Create a testing data folder. The file containing the documents to be tested against the generated model goes here. Ex. C:/topicmodel/data/testing/. The file has the following tab-separated fields
i. NAME
ii. DATA
A sample file is available at src/main/testing/ap.txt.
e. Create a folder to save the generated model. Ex. C:/topicmodel/model/, which will hold a model file such as C:/topicmodel/model/patents.ser
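The folder layout and the two tab-separated formats described above can be sketched in Java as follows. This is only an illustration under stated assumptions: the class TopicModelLayout and its methods are hypothetical helpers, not part of LargeLDA; the sketch assumes one document per line and that the MARKER field is the literal character X.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration; it is not part of LargeLDA.
final class TopicModelLayout {

    // Creates the working, stoplist, training, testing, and model folders under a base directory.
    static void createFolders(Path base) throws IOException {
        String[] folders = {"wd", "data/stoplists", "data/training", "data/testing", "model"};
        for (String folder : folders) {
            Files.createDirectories(base.resolve(folder));
        }
    }

    // Builds one training line: NAME <tab> MARKER <tab> DATA.
    // Assumption: the marker is the literal "X".
    static String trainingLine(String name, String text) {
        return name + "\t" + "X" + "\t" + clean(text);
    }

    // Builds one testing line: NAME <tab> DATA.
    static String testingLine(String name, String text) {
        return name + "\t" + clean(text);
    }

    // Tabs and line breaks inside the text would break the field layout,
    // so runs of them are replaced with a single space.
    private static String clean(String text) {
        return text.replaceAll("[\\t\\r\\n]+", " ").trim();
    }

    public static void main(String[] args) throws IOException {
        // Base directory is taken from the first argument, e.g. C:/topicmodel
        Path base = Paths.get(args.length > 0 ? args[0] : "topicmodel");
        createFolders(base);
        System.out.println(trainingLine("doc1.txt", "first line\nsecond line"));
    }
}
```

Replacing embedded tabs and line breaks before writing a record is a design choice of this sketch; it keeps each document on one line so that splitting on tabs yields exactly three fields for training data and two for testing data.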
Update the following properties:
#The working directory for the program. Once created do not delete this folder
PLDA_WORKING_DIR=C:/topicmodel/wd/
#Path to the stoplist file
PLDA_STOPLIST_FILE=C:/topicmodel/data/stoplists/en.txt
#Iteration count. The higher the number of iterations, the better the quality of the model. Use a number between 500 and 2000
PLDA_ITERATION_CNT=500
#Input file path
PLDA_MODEL_TRAINING_DATA_FILE=C:/topicmodel/data/training/patentabstracts.txt
#No of topics (Use your judgement)
PLDA_TOPIC_CNT=400
#No of parallel threads to use
PLDA_THREAD_CNT=5
#Size of the thread pool
PLDA_THREADPOOL_SIZE=5
#LDA parameters (next two properties)
PLDA_ALPHA=1.0
PLDA_BETA=0.1
#No of words to print per topic
PLDA_NO_OF_WORDS_PER_TOPIC=10
#Prints to console once every n iterations
PLDA_PRINT_EVERY_N_ITERATIONS=10
#All probabilities below this number are ignored. Recommended threshold 10%
PLDA_PROB_THRESHOLD=0.1
#Path where the model is saved. Only saved at the end of the program.
PLDA_MODEL_PATH=C:/topicmodel/model/patents.ser
#Testing file containing the new documents to run the model against
PLDA_MODEL_TESTING_DATA_FILE=C:/topicmodel/data/testing/patents.txt
#Output file for the results of the testing file
PLDA_MODEL_TESTING_OUT_FILE=C:/topicmodel/data/testing/patent_results.csv
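Before starting a long training run, it can help to check that the properties file parses and that the numeric values are sensible. The sketch below reads a subset of the properties listed above with the standard java.util.Properties class. The class name TopicModelingConfig and the validation rules are assumptions made for this example; LargeLDA's own loading code may differ.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;

// Hypothetical reader for illustration; it is not LargeLDA's own loader.
final class TopicModelingConfig {
    final String workingDir;
    final int iterationCount;
    final int topicCount;
    final double alpha;
    final double beta;
    final double probThreshold;

    private TopicModelingConfig(Properties p) {
        workingDir = required(p, "PLDA_WORKING_DIR");
        iterationCount = Integer.parseInt(required(p, "PLDA_ITERATION_CNT"));
        topicCount = Integer.parseInt(required(p, "PLDA_TOPIC_CNT"));
        alpha = Double.parseDouble(required(p, "PLDA_ALPHA"));
        beta = Double.parseDouble(required(p, "PLDA_BETA"));
        probThreshold = Double.parseDouble(required(p, "PLDA_PROB_THRESHOLD"));
        // Sanity checks chosen for this sketch, not rules taken from LargeLDA.
        if (iterationCount <= 0 || topicCount <= 0) {
            throw new IllegalArgumentException("Iteration and topic counts must be positive");
        }
        if (probThreshold < 0.0 || probThreshold > 1.0) {
            throw new IllegalArgumentException("PLDA_PROB_THRESHOLD must be between 0 and 1");
        }
    }

    // Parses the properties from any character stream (a file, a string, ...).
    static TopicModelingConfig load(Reader reader) throws IOException {
        Properties p = new Properties();
        p.load(reader);
        return new TopicModelingConfig(p);
    }

    // Returns the trimmed value of a key, or fails if it is missing or blank.
    private static String required(Properties p, String key) {
        String value = p.getProperty(key);
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalArgumentException("Missing property: " + key);
        }
        return value.trim();
    }

    public static void main(String[] args) throws IOException {
        // Usage: java TopicModelingConfig path/to/TopicModeling.properties
        try (Reader reader = Files.newBufferedReader(Paths.get(args[0]))) {
            TopicModelingConfig c = load(reader);
            System.out.println("Topics: " + c.topicCount + ", iterations: " + c.iterationCount);
        }
    }
}
```

Note that a backslash is an escape character in Java properties files, so the forward-slash Windows paths used above (C:/topicmodel/...) are read as written, whereas a path written with single backslashes would not be.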