Code Walkthrough - sailuh/termite GitHub Wiki
# Code Walkthrough
## Pipeline
- execute.py: Runs the entire data processing pipeline and sets up the client.
- tokenize.py: Tokenizes the corpus.
- train_stmt/mallet.py: Trains the model.
- compute_saliency.py: Computes term saliency.
- compute_similarity.py: Computes term similarity.
- compute_seriation.py: Seriates terms.
- prepare_data_for_client.py: Generates data files for the client.
- prepare_vis_for_client.py: Copies necessary scripts for the client.
- `./execute.py --corpus-path <corpus_file> example_lda.cfg --model-path <any_path_for_model> --data-path <any_path_for_output>`
Objective 1: Inspect the main() function to find out which flags are activated when running the file with the above command line.
The main function performs these steps:
### Argument Parsing
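The exact parser lives in execute.py's main(); as a minimal sketch, the flags seen in the command line above could be parsed like this (help strings and the name of the positional config argument are assumptions, the flag names come from the invocation):

```python
import argparse

def build_parser():
    """Sketch of the flag parsing implied by the example command line.

    The real parser is in execute.py; only the flag names below are taken
    from the invocation, everything else is illustrative.
    """
    parser = argparse.ArgumentParser(description="Run the termite pipeline")
    parser.add_argument("config_file", help="configuration file, e.g. example_lda.cfg")
    parser.add_argument("--corpus-path", help="path to the input corpus file")
    parser.add_argument("--model-path", help="directory where the trained model is written")
    parser.add_argument("--data-path", help="directory where pipeline output is written")
    return parser

# Parse the same arguments as the example invocation.
args = build_parser().parse_args(
    ["example_lda.cfg", "--corpus-path", "corpus.txt",
     "--model-path", "model", "--data-path", "data"]
)
print(args.corpus_path, args.model_path, args.data_path)
```

Note that argparse maps `--corpus-path` to the attribute `args.corpus_path`, and so on for the other hyphenated flags.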
### Logging
### Tokens
- 2.1 If the tokenization setting is NONE, the execute function sets the tokenization value accordingly.
### STMT/Mallet
- 1.7 execute.py then checks model_library and verifies whether it is stmt or mallet. (I am not sure about the difference between the two; I will look further into their differences and significance.)
- 3.1 train_mallet.sh expects three arguments: input-file, output-path, and num-topics.
- 3.3 It then calls the mallet executable, which lives inside mallet-2.0.7/bin.
- 3.4 The mallet model vectorizes the tokens.txt file and stores the result in the text.vectors file.
- 3.6 The mallet model creates the output files below using the Hierarchical LDA model:
- output.model
- output-topic-keys.txt
- topic-word-weights.txt
- word-topic-counts.txt
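Steps 3.3 to 3.6 amount to two Mallet command-line invocations. The sketch below only constructs the argument lists (the flag names follow standard Mallet 2.0.7 usage; the exact commands inside train_mallet.sh may differ):

```python
# Sketch of the two Mallet invocations train_mallet.sh presumably wraps.
# Flag names follow standard Mallet 2.0.7 usage; paths are assumptions.
MALLET = "mallet-2.0.7/bin/mallet"

def mallet_commands(input_file, output_path, num_topics):
    # Step 3.4: vectorize the token file into text.vectors
    import_cmd = [
        MALLET, "import-file",
        "--input", input_file,
        "--output", f"{output_path}/text.vectors",
        "--keep-sequence",
    ]
    # Step 3.6: train the topic model and write the output files listed above
    train_cmd = [
        MALLET, "train-topics",
        "--input", f"{output_path}/text.vectors",
        "--num-topics", str(num_topics),
        "--output-model", f"{output_path}/output.model",
        "--output-topic-keys", f"{output_path}/output-topic-keys.txt",
        "--topic-word-weights-file", f"{output_path}/topic-word-weights.txt",
        "--word-topic-counts-file", f"{output_path}/word-topic-counts.txt",
    ]
    return import_cmd, train_cmd

# subprocess.run(import_cmd) followed by subprocess.run(train_cmd) would execute them.
import_cmd, train_cmd = mallet_commands("tokens.txt", "topic-model", 20)
print(" ".join(import_cmd))
```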
- 1.9 execute.py then calls the import_mallet.execute function.
- 3.8 The extractTopicWordWeights function reads the file contents and returns the matrix and indexes below:
- term_topic_matrix
- term_index
- topic_index
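The parsing in step 3.8 can be sketched as follows. The tab-separated "topic, term, weight" row layout matches Mallet's topic-word-weights output; the real extractTopicWordWeights may differ in detail:

```python
from collections import defaultdict

def extract_topic_word_weights(lines):
    """Sketch of extractTopicWordWeights: parse Mallet's topic-word-weights
    rows ('topic<TAB>term<TAB>weight') into a term-topic matrix plus the
    row (term) and column (topic) indexes. Illustrative, not the real code."""
    weights = defaultdict(dict)
    term_index, topic_index = [], []
    for line in lines:
        topic, term, weight = line.strip().split("\t")
        if term not in term_index:
            term_index.append(term)
        if topic not in topic_index:
            topic_index.append(topic)
        weights[term][topic] = float(weight)
    # Rows follow term_index, columns follow topic_index; missing pairs are 0.
    term_topic_matrix = [
        [weights[term].get(topic, 0.0) for topic in topic_index]
        for term in term_index
    ]
    return term_topic_matrix, term_index, topic_index

sample = ["0\tcode\t3.5", "0\tmodel\t1.5", "1\tcode\t0.5"]
matrix, terms, topics = extract_topic_word_weights(sample)
print(terms, topics, matrix)
```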
- 3.9 Similarly, stmt also expects three arguments: input-file, output-path, and num-topics.
- 3.11 It then performs the steps below:
- Unpack topic-term-distribution.csv
- Generate the topic-index list
- Copy the term-index list
- Extract the doc-index list
- Extract the list of term frequencies
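A rough sketch of the unpacking in step 3.11. The layout assumed here (one CSV row per topic, comma-separated term probabilities, transposed so rows are terms) is a guess, not the confirmed STMT output format, and the generated topic labels are made up:

```python
import csv
import io

def unpack_topic_term_distribution(csv_text):
    """Illustrative unpacking of topic-term-distribution.csv.

    Assumption (not verified against STMT): one row per topic holding
    comma-separated term probabilities. The matrix is transposed so rows
    are terms, matching the term-topic layout used downstream.
    """
    rows = [list(map(float, r)) for r in csv.reader(io.StringIO(csv_text)) if r]
    topic_index = [f"Topic {i}" for i in range(len(rows))]   # generated labels
    term_topic_matrix = [list(col) for col in zip(*rows)]    # topic-major -> term-major
    return term_topic_matrix, topic_index

matrix, topics = unpack_topic_term_distribution("0.6,0.4\n0.1,0.9\n")
print(matrix, topics)
```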
### Compute Saliency
- 1.10 execute.py runs the ComputeSaliency.execute function and passes the data path as an argument.
- 4.1 The data path should contain the term-topic probability distribution stored in three separate files:
- term-topic-matrix.txt: contains the entries of the matrix.
- term-index.txt: contains the terms corresponding to the rows of the matrix.
- topic-index.txt: contains the topic labels corresponding to the columns of the matrix.
- 4.2 It then computes topic info and term info and ranks the results.
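The term info computed in 4.2 centers on saliency. The Termite paper defines saliency(w) as frequency(w) times distinctiveness(w), where distinctiveness is the KL divergence between P(T|w) and the overall topic distribution P(T); the sketch below follows that definition, though the script's exact normalization and ranking may differ:

```python
import math

def term_saliency(term_freqs, term_topic_matrix, topic_probs):
    """Sketch of saliency as defined in the Termite paper:
    saliency(w) = frequency(w) * KL(P(T|w) || P(T)).
    Illustrative only; compute_saliency.py may normalize differently."""
    saliency = []
    for freq, row in zip(term_freqs, term_topic_matrix):
        total = sum(row)
        p_topic_given_term = [w / total for w in row]
        # KL divergence between the term's topic distribution and P(T)
        distinctiveness = sum(
            p * math.log(p / q)
            for p, q in zip(p_topic_given_term, topic_probs)
            if p > 0
        )
        saliency.append(freq * distinctiveness)
    return saliency

# Two terms over two topics; the second term matches P(T) exactly,
# so its distinctiveness (and therefore its saliency) is zero.
scores = term_saliency(
    term_freqs=[10, 5],
    term_topic_matrix=[[9, 1], [5, 5]],
    topic_probs=[0.5, 0.5],
)
print(scores)
```

Ranking the results then amounts to sorting terms by these scores in descending order.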
### Compute Similarity
### Compute Seriation
### Prepare Data
### Prepare Visualization
- 8.1 prepare_vis_for_client.sh performs the actions below:
- Navigates to the public_html directory
- Copies the js and css files to this folder
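In Python terms, step 8.1 boils down to filtering a directory for .js and .css files and copying them into public_html (a sketch with made-up file names; the real script is a shell one-liner or two):

```python
import pathlib
import shutil
import tempfile

def copy_client_assets(src, public_html):
    """Sketch of prepare_vis_for_client.sh: copy the js and css assets
    into the client's public_html folder. File names here are invented."""
    public_html.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in src.iterdir():
        if path.suffix in {".js", ".css"}:
            shutil.copy(path, public_html / path.name)
            copied.append(path.name)
    return sorted(copied)

# Demonstrate with a throwaway directory tree.
root = pathlib.Path(tempfile.mkdtemp())
src = root / "vis"
src.mkdir()
for name in ["termite.js", "style.css", "notes.txt"]:
    (src / name).write_text("/* asset */")
copied = copy_client_assets(src, root / "public_html")
print(copied)
```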
## Output Folder Structure
The output folder structure, specified as a parameter when executing the code, has the following format:

    ├── model
    │   ├── term-index.txt
    │   ├── term-topic-matrix.txt
    │   └── topic-index.txt
    ├── saliency
    │   ├── term-info.json
    │   ├── term-info.txt
    │   ├── topic-info.json
    │   └── topic-info.txt
    ├── similarity
    │   └── combined-g2.txt
    ├── tokens
    │   └── tokens.txt
    └── topic-model
        ├── output-topic-keys.txt
        ├── output.model
        ├── text.vectors
        ├── topic-word-weights.txt
        └── word-topic-counts.txt