Code Walkthrough - sailuh/termite GitHub Wiki
# Code Walkthrough
## Pipeline
- execute.py: Runs the entire data processing pipeline and sets up the client.
- tokenize.py: Tokenizes the corpus.
- train_stmt/mallet.py: Trains the model.
- compute_saliency.py: Computes term saliency.
- compute_similarity.py: Computes term similarity.
- compute_seriation.py: Seriates terms.
- prepare_data_for_client.py: Generates data files for the client.
- prepare_vis_for_client.py: Copies necessary scripts for the client.
- `./execute.py --corpus-path <corpus_file> example_lda.cfg --model-path <any_path_for_model> --data-path <any_path_for_output>`
Objective 1: Inspect the main() function to find out which flags are activated when running the file with the above command line.
The main function performs these steps:
### Argument Parsing
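The exact parser lives in execute.py's main(); as a minimal sketch, the flags seen in the command line above could be parsed like this (help strings and the name of the positional config argument are assumptions, the flag names come from the invocation):

```python
import argparse

def build_parser():
    """Sketch of the flag parsing implied by the example command line.

    The real parser is in execute.py; only the flag names below are taken
    from the invocation, everything else is illustrative.
    """
    parser = argparse.ArgumentParser(description="Run the termite pipeline")
    parser.add_argument("config_file", help="configuration file, e.g. example_lda.cfg")
    parser.add_argument("--corpus-path", help="path to the input corpus file")
    parser.add_argument("--model-path", help="directory where the trained model is written")
    parser.add_argument("--data-path", help="directory where pipeline output is written")
    return parser

# Parse the same arguments as the example invocation.
args = build_parser().parse_args(
    ["example_lda.cfg", "--corpus-path", "corpus.txt",
     "--model-path", "model", "--data-path", "data"]
)
print(args.corpus_path, args.model_path, args.data_path)
```

Note that argparse maps `--corpus-path` to the attribute `args.corpus_path`, and so on for the other hyphenated flags.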
### Logging
### Tokens
- 2.1 If the tokenization setting is NONE, the execute function sets the tokenization value accordingly.
### STMT/Mallet
- 1.7 execute.py then checks model_library and verifies whether it is stmt or mallet. (I am not sure about the difference between the two; I will look further into their differences and significance.)
- 3.1 train_mallet.sh expects three arguments: input-file, output-path, and num-topics.
- 3.3 It then calls the mallet executable, which lives inside mallet-2.0.7/bin.
- 3.4 The mallet model vectorizes the tokens.txt file and stores the result in the text.vectors file.
- 3.6 The mallet model creates the output files below using the Hierarchical LDA model:
- output.model
- output-topic-keys.txt
- topic-word-weights.txt
- word-topic-counts.txt
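Steps 3.3 to 3.6 amount to two Mallet command-line invocations. The sketch below only constructs the argument lists (the flag names follow standard Mallet 2.0.7 usage; the exact commands inside train_mallet.sh may differ):

```python
# Sketch of the two Mallet invocations train_mallet.sh presumably wraps.
# Flag names follow standard Mallet 2.0.7 usage; paths are assumptions.
MALLET = "mallet-2.0.7/bin/mallet"

def mallet_commands(input_file, output_path, num_topics):
    # Step 3.4: vectorize the token file into text.vectors
    import_cmd = [
        MALLET, "import-file",
        "--input", input_file,
        "--output", f"{output_path}/text.vectors",
        "--keep-sequence",
    ]
    # Step 3.6: train the topic model and write the output files listed above
    train_cmd = [
        MALLET, "train-topics",
        "--input", f"{output_path}/text.vectors",
        "--num-topics", str(num_topics),
        "--output-model", f"{output_path}/output.model",
        "--output-topic-keys", f"{output_path}/output-topic-keys.txt",
        "--topic-word-weights-file", f"{output_path}/topic-word-weights.txt",
        "--word-topic-counts-file", f"{output_path}/word-topic-counts.txt",
    ]
    return import_cmd, train_cmd

# subprocess.run(import_cmd) followed by subprocess.run(train_cmd) would execute them.
import_cmd, train_cmd = mallet_commands("tokens.txt", "topic-model", 20)
print(" ".join(import_cmd))
```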
- 1.9 execute.py then calls the import_mallet.execute function.
- 3.8 The extractTopicWordWeights function reads the file contents and returns the matrix and indexes below:
- term_topic_matrix
- term_index
- topic_index
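The parsing in step 3.8 can be sketched as follows. The tab-separated "topic, term, weight" row layout matches Mallet's topic-word-weights output; the real extractTopicWordWeights may differ in detail:

```python
from collections import defaultdict

def extract_topic_word_weights(lines):
    """Sketch of extractTopicWordWeights: parse Mallet's topic-word-weights
    rows ('topic<TAB>term<TAB>weight') into a term-topic matrix plus the
    row (term) and column (topic) indexes. Illustrative, not the real code."""
    weights = defaultdict(dict)
    term_index, topic_index = [], []
    for line in lines:
        topic, term, weight = line.strip().split("\t")
        if term not in term_index:
            term_index.append(term)
        if topic not in topic_index:
            topic_index.append(topic)
        weights[term][topic] = float(weight)
    # Rows follow term_index, columns follow topic_index; missing pairs are 0.
    term_topic_matrix = [
        [weights[term].get(topic, 0.0) for topic in topic_index]
        for term in term_index
    ]
    return term_topic_matrix, term_index, topic_index

sample = ["0\tcode\t3.5", "0\tmodel\t1.5", "1\tcode\t0.5"]
matrix, terms, topics = extract_topic_word_weights(sample)
print(terms, topics, matrix)
```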
- 3.9 Similarly, stmt also expects three arguments: input-file, output-path, and num-topics.
- 3.11 It then performs the steps below:
- Unpack topic-term-distribution.csv
- Generate the topic-index list
- Copy the term-index list
- Extract the doc-index list
- Extract the list of term frequencies
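A rough sketch of the unpacking in step 3.11. The layout assumed here (one CSV row per topic, comma-separated term probabilities, transposed so rows are terms) is a guess, not the confirmed STMT output format, and the generated topic labels are made up:

```python
import csv
import io

def unpack_topic_term_distribution(csv_text):
    """Illustrative unpacking of topic-term-distribution.csv.

    Assumption (not verified against STMT): one row per topic holding
    comma-separated term probabilities. The matrix is transposed so rows
    are terms, matching the term-topic layout used downstream.
    """
    rows = [list(map(float, r)) for r in csv.reader(io.StringIO(csv_text)) if r]
    topic_index = [f"Topic {i}" for i in range(len(rows))]   # generated labels
    term_topic_matrix = [list(col) for col in zip(*rows)]    # topic-major -> term-major
    return term_topic_matrix, topic_index

matrix, topics = unpack_topic_term_distribution("0.6,0.4\n0.1,0.9\n")
print(matrix, topics)
```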
### Compute Saliency
- 1.10 execute.py runs the ComputeSaliency.execute function and passes the data path as an argument.
- 4.1 The data path should contain the term-topic probability distribution stored in three separate files:
- term-topic-matrix.txt: contains the entries of the matrix.
- term-index.txt: contains the terms corresponding to the rows of the matrix.
- topic-index.txt: contains the topic labels corresponding to the columns of the matrix.
- 4.2 It then computes topic info and term info and ranks the results.
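The term info computed in 4.2 centers on saliency. The Termite paper defines saliency(w) as frequency(w) times distinctiveness(w), where distinctiveness is the KL divergence between P(T|w) and the overall topic distribution P(T); the sketch below follows that definition, though the script's exact normalization and ranking may differ:

```python
import math

def term_saliency(term_freqs, term_topic_matrix, topic_probs):
    """Sketch of saliency as defined in the Termite paper:
    saliency(w) = frequency(w) * KL(P(T|w) || P(T)).
    Illustrative only; compute_saliency.py may normalize differently."""
    saliency = []
    for freq, row in zip(term_freqs, term_topic_matrix):
        total = sum(row)
        p_topic_given_term = [w / total for w in row]
        # KL divergence between the term's topic distribution and P(T)
        distinctiveness = sum(
            p * math.log(p / q)
            for p, q in zip(p_topic_given_term, topic_probs)
            if p > 0
        )
        saliency.append(freq * distinctiveness)
    return saliency

# Two terms over two topics; the second term matches P(T) exactly,
# so its distinctiveness (and therefore its saliency) is zero.
scores = term_saliency(
    term_freqs=[10, 5],
    term_topic_matrix=[[9, 1], [5, 5]],
    topic_probs=[0.5, 0.5],
)
print(scores)
```

Ranking the results then amounts to sorting terms by these scores in descending order.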
### Compute Similarity
### Compute Seriation
### Prepare Data
### Prepare Visualization
- 8.1 prepare_vis_for_client.sh performs the actions below:
- Navigates to the public_html directory
- Copies the js and css files to this folder
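In Python terms, step 8.1 boils down to filtering a directory for .js and .css files and copying them into public_html (a sketch with made-up file names; the real script is a shell one-liner or two):

```python
import pathlib
import shutil
import tempfile

def copy_client_assets(src, public_html):
    """Sketch of prepare_vis_for_client.sh: copy the js and css assets
    into the client's public_html folder. File names here are invented."""
    public_html.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in src.iterdir():
        if path.suffix in {".js", ".css"}:
            shutil.copy(path, public_html / path.name)
            copied.append(path.name)
    return sorted(copied)

# Demonstrate with a throwaway directory tree.
root = pathlib.Path(tempfile.mkdtemp())
src = root / "vis"
src.mkdir()
for name in ["termite.js", "style.css", "notes.txt"]:
    (src / name).write_text("/* asset */")
copied = copy_client_assets(src, root / "public_html")
print(copied)
```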
## Output Folder Structure
The output folder structure, specified as a parameter when executing the code, has the following format:

    ├── model
    │   ├── term-index.txt
    │   ├── term-topic-matrix.txt
    │   └── topic-index.txt
    ├── saliency
    │   ├── term-info.json
    │   ├── term-info.txt
    │   ├── topic-info.json
    │   └── topic-info.txt
    ├── similarity
    │   └── combined-g2.txt
    ├── tokens
    │   └── tokens.txt
    └── topic-model
        ├── output-topic-keys.txt
        ├── output.model
        ├── text.vectors
        ├── topic-word-weights.txt
        └── word-topic-counts.txt