
Code Walkthrough

Pipeline

  1. execute.py: Runs the entire data processing pipeline and sets up the client.
  2. tokenize.py: Tokenizes the corpus.
  3. train_stmt/mallet.py: Trains the topic model.
  4. compute_saliency.py: Computes term saliency.
  5. compute_similarity.py: Computes term similarity.
  6. compute_seriation.py: Seriates the terms.
  7. prepare_data_for_client.py: Generates the data files for the client.
  8. prepare_vis_for_client.py: Copies the necessary scripts for the client.
The full pipeline is run with a command of the form:

./execute.py --corpus-path <corpus_file> example_lda.cfg --model-path <any_path_for_model> --data-path <any_path_for_output>
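
If the pipeline needs to be driven from another Python script, something like the following mirrors the command above. The concrete paths are placeholders, and the argument order simply follows the example command; this is a sketch, not code taken from the repository.

```python
# Hedged sketch: invoke execute.py programmatically with the same flags as the
# command above. The concrete paths are placeholders, not repository files.
import subprocess

subprocess.run(
    [
        "./execute.py",
        "--corpus-path", "data/corpus.txt",   # placeholder corpus file
        "example_lda.cfg",                    # model configuration file
        "--model-path", "output/model",       # placeholder model directory
        "--data-path", "output",              # placeholder output directory
    ],
    check=True,
)
```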

Objective 1: Inspect the main() function to find out which flags are activated when running the file with the above command line

The main() function performs these steps, in order (a minimal sketch of this layout follows the list):

  1. Argument Parsing: read the config file and the --corpus-path, --model-path, and --data-path flags
  2. Logging: set up logging for the run
  3. Tokens: tokenize the corpus (tokenize.py)
  4. STMT/Mallet: train the topic model (train_stmt/mallet.py)
  5. Compute Saliency: compute term saliency (compute_saliency.py)
  6. Compute Similarity: compute term similarity (compute_similarity.py)
  7. Compute Seriation: seriate the terms (compute_seriation.py)
  8. Prepare Data: generate the data files for the client (prepare_data_for_client.py)
  9. Prepare Visualization: copy the necessary scripts for the client (prepare_vis_for_client.py)
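
To make Objective 1 concrete, here is a minimal, self-contained sketch of an argument-parsing and dispatch layout consistent with the command line above. Only the flag names (--corpus-path, --model-path, --data-path) come from that command; the structure and every other name below are assumptions and may differ from the actual execute.py.

```python
# Illustrative sketch only: the flags match the example command line, but the
# structure and names below are assumptions, not the actual execute.py code.
import argparse
import logging

def main():
    parser = argparse.ArgumentParser(description="Run the termite data processing pipeline.")
    parser.add_argument("config", help="model configuration file, e.g. example_lda.cfg")
    parser.add_argument("--corpus-path", required=True, help="input corpus file")
    parser.add_argument("--model-path", required=True, help="directory for the trained model")
    parser.add_argument("--data-path", required=True, help="root directory for pipeline output")
    args = parser.parse_args()

    # Logging step
    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("pipeline")

    # Remaining steps, in the order listed above; each would call the
    # corresponding module (tokenize, train_stmt/mallet, compute_saliency, ...).
    stages = [
        "tokenize corpus",
        "train topic model (STMT/MALLET)",
        "compute saliency",
        "compute similarity",
        "compute seriation",
        "prepare data for client",
        "prepare vis for client",
    ]
    for stage in stages:
        log.info("would run stage: %s (corpus=%s, model=%s, data=%s)",
                 stage, args.corpus_path, args.model_path, args.data_path)

if __name__ == "__main__":
    main()
```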

Output Folder Structure

The output folder structure, specified by the --data-path parameter when executing the code, has the following format:

├── model
│   ├── term-index.txt
│   ├── term-topic-matrix.txt
│   └── topic-index.txt
├── saliency
│   ├── term-info.json
│   ├── term-info.txt
│   ├── topic-info.json
│   └── topic-info.txt
├── similarity
│   └── combined-g2.txt
├── tokens
│   └── tokens.txt
└── topic-model
    ├── output-topic-keys.txt
    ├── output.model
    ├── text.vectors
    ├── topic-word-weights.txt
    └── word-topic-counts.txt