ExportModule_v2 - Strategic-Futures-Lab/Topic_Mapping_Pipeline GitHub Wiki

Topic Mapping Pipeline


Export Model Module

The Export Model module gathers the data generated by the Topic Model module and produces concise model data that can be used by other applications. These data are saved as topic JSON file(s) and, optionally, as document CSV file(s).

The Export Model module is contained in the P3_TopicModelling package.

Specifications

The Export Model module entry in the project file should have the following structure:

{...
  "exportTopicModel": {
    "topics" | "mainTopics": "path",
    "subTopics": "path",
    "documents": "path",
    "docFields": ["key", ... ],
    "output" | "mainOutput": "path",
    "subOutput": "path",
    "mainOutputCSV": "path",
    "subOutputCSV": "path",
    "outputCSV": "path",
    "numWordId": 3
  },
...}
| Name | Description | Optional | Default |
| --- | --- | --- | --- |
| topics or mainTopics (if the model is hierarchical) | Path to the (main) topics JSON file * | No | |
| subTopics | Path to the sub topics JSON file * | Required if the model is hierarchical | "" ** |
| documents | Path to the documents JSON file * | No | |
| docFields | List of keys, in documents' docData, to export on file (JSON and CSV) *** | Yes | [] |
| output or mainOutput (if the model is hierarchical) | Path to the exported (main) topics JSON file **** | Yes | "" (no export) |
| subOutput | Path to the exported sub topics JSON file **** | Yes | "" (no export) |
| mainOutputCSV | Path to the document CSV file listing documents and their weights in main topics **** | Yes | "" (no export) |
| subOutputCSV | Path to the document CSV file listing documents and their weights in sub topics **** | Yes | "" (no export) |
| outputCSV | Path to the document CSV file listing documents and their weights in both main and sub topics; note that if the model is non-hierarchical this is equivalent to mainOutputCSV **** | Yes | "" (no export) |
| numWordId | Number of labels used to identify topics in document CSV files | Yes | 3 |
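
As an illustration of the specification above, the following Python sketch assembles an `exportTopicModel` entry for a non-hierarchical model and checks it against the "Optional" column. Only the key names come from the specification; the file paths, the `docFields` keys (`title`, `date`) and the helper `check_export_entry` are hypothetical and not part of the pipeline. The sketch also assumes that exactly one of `topics` or `mainTopics` is given, and that using `mainTopics` signals a hierarchical model.

```python
import json

# Key names are taken from the specification; everything else is illustrative.
REQUIRED_KEYS = ("documents",)
TOPIC_KEYS = ("topics", "mainTopics")  # assumption: exactly one of these is set


def check_export_entry(entry):
    """Return a list of problems found in an exportTopicModel entry (hypothetical helper)."""
    problems = []
    for key in REQUIRED_KEYS:
        if not entry.get(key):
            problems.append(f"missing required key: {key}")
    present = [k for k in TOPIC_KEYS if entry.get(k)]
    if len(present) != 1:
        problems.append("expected exactly one of 'topics' or 'mainTopics'")
    # subTopics is required for hierarchical models; this sketch assumes
    # that 'mainTopics' is what marks a model as hierarchical.
    if "mainTopics" in present and not entry.get("subTopics"):
        problems.append("hierarchical model requires 'subTopics'")
    return problems


# Hypothetical paths for a non-hierarchical model.
entry = {
    "topics": "data/topics.json",
    "documents": "data/documents.json",
    "docFields": ["title", "date"],
    "output": "out/topics_export.json",
    "outputCSV": "out/documents_export.csv",
    "numWordId": 3,
}

problems = check_export_entry(entry)
print(json.dumps({"exportTopicModel": entry}, indent=2))
print("problems:", problems)
```

Output keys left out of the entry fall back to their default of "" (no export), so this entry would produce one topic JSON file and one document CSV file.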

Output

The Export Model module outputs multiple files.

First, the topic JSON file, which follows a similar structure to the topic files generated by the Topic Model Modules:

{
  "metadata": { ... },
  "topics": [
    {
      "topicId": "0",
      "topicIndex": 0,
      "topDocs": [{
        "docId": "id", 
        "weight": 0.7778, 
        "docData": {
          "wordCount": 100,
          "key1": "value1",
          "key2": "value2",
          ...
        }
      }, ... ],
      "topWords": [{"label": "risk", "weight": 85.0}, ... ],
      "subTopicIds": [ ... ],
      "mainTopicIds": [ ... ]
    }, ...
  ]
}

Note that docData has been added to each top document, containing a list of key-value pairs, following the docFields specification, as well as the wordCount for that document.
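
A consuming application might read the exported topic JSON roughly as follows. This is a minimal sketch: the inline sample only mirrors the structure shown above with made-up values, the `title` field stands in for a key listed in `docFields`, and `summarise_topics` is a hypothetical helper, not part of the pipeline. It takes the labels in the order they appear in `topWords`.

```python
import json

# Small inline sample following the structure shown above; values are made up.
sample = """
{
  "metadata": {},
  "topics": [
    {
      "topicId": "0",
      "topicIndex": 0,
      "topDocs": [
        {"docId": "d1", "weight": 0.7778,
         "docData": {"wordCount": 100, "title": "Example"}}
      ],
      "topWords": [{"label": "risk", "weight": 85.0},
                   {"label": "policy", "weight": 40.0}]
    }
  ]
}
"""


def summarise_topics(model, num_words=3):
    """Map each topicId to its first num_words labels and its top documents."""
    summary = {}
    for topic in model["topics"]:
        labels = [w["label"] for w in topic.get("topWords", [])[:num_words]]
        # Each top document carries its docData (wordCount plus docFields keys).
        docs = [(d["docId"], d["weight"], d.get("docData", {}))
                for d in topic.get("topDocs", [])]
        summary[topic["topicId"]] = {"labels": labels, "docs": docs}
    return summary


summary = summarise_topics(json.loads(sample))
for topic_id, info in summary.items():
    print(topic_id, "-".join(info["labels"]), info["docs"])
```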

Then, the document CSV file, if set in the specifications, following this structure:

"_docId", "key1",   "key2",   ..., "_wordCount", "_inModel", "_inferred", "_mainTopic_topic-1-labels", "_mainTopic_topic-2-labels", ...
"0",      "value1", "value2", ..., "107",        "true",     "false",     "0.0197",                    "0.0099",                    ...

Each row represents a document, with key1, key2, etc. being the keys set in docFields. The CSV also includes the word count of each document, whether the document was included in the model, and whether the document was inferred (this is only set if at least one document was inferred). Finally, for each topic, identified by a list of its top labels (the number of labels is set by numWordId), the row gives the weight of that topic in the document. Each topic identifier is also prefixed with either _mainTopic_ or _subTopic_ to indicate which model it comes from.
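
The document CSV can be processed with any CSV reader. The sketch below finds, for each document, the main topic with the largest weight. It is only an illustration: the sample rows are made up, the topic column names (`_mainTopic_risk-policy-market`, etc.) are placeholders for whatever label-based identifiers the module actually writes, and `dominant_topics` is a hypothetical helper.

```python
import csv
import io

# Made-up sample; topic column names are placeholders for real identifiers.
sample = (
    '"_docId","title","_wordCount","_inModel","_inferred",'
    '"_mainTopic_risk-policy-market","_mainTopic_health-care-patient"\n'
    '"0","Example A","107","true","false","0.0197","0.0099"\n'
    '"1","Example B","85","true","false","0.0021","0.4310"\n'
)


def dominant_topics(rows, prefix="_mainTopic_"):
    """For each document row, return the topic column with the largest weight."""
    result = {}
    for row in rows:
        # Keep only the topic weight columns for the requested model.
        weights = {name: float(value) for name, value in row.items()
                   if name.startswith(prefix) and value not in (None, "")}
        if weights:
            best_name = max(weights, key=weights.get)
            result[row["_docId"]] = (best_name[len(prefix):], weights[best_name])
    return result


reader = csv.DictReader(io.StringIO(sample), skipinitialspace=True)
best = dominant_topics(reader)
for doc_id, (topic, weight) in best.items():
    print(doc_id, topic, weight)
```

Passing `prefix="_subTopic_"` would apply the same logic to the sub topic columns of a hierarchical model.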
