ExportModule_v2 - Strategic-Futures-Lab/Topic_Mapping_Pipeline GitHub Wiki
The Export Model module gathers the data generated by the Topic Model module and produces concise model data that can be used by other applications. These data are saved as Topic JSON file(s) and, optionally, as Document CSV file(s).
The Export Model module is contained in the P3_TopicModelling package.
The Export Model module entry in the project file should have the following structure:
{...
"exportTopicModel": {
"topics" | "mainTopics": "path",
"subTopics": "path",
"documents": "path",
"docFields": ["key", ... ],
"output" | "mainOutput": "path",
"subOutput": "path",
"mainOutputCSV": "path",
"subOutputCSV": "path",
"outputCSV": "path",
"numWordId": 3
},
...}
Name | Description | Optional | Default |
---|---|---|---|
topics or mainTopics (if the model is hierarchical) | Path to the (main) topics JSON file * | No | |
subTopics | Path to the sub topics JSON file * | Required if the model is hierarchical | "" ** |
documents | Path to the documents JSON file * | No | |
docFields | List of keys, in documents' docData, to export to file (JSON and CSV) *** | Yes | [] |
output or mainOutput (if the model is hierarchical) | Path to the exported (main) topics JSON file **** | Yes | "" (no export) |
subOutput | Path to the exported sub topics JSON file **** | Yes | "" (no export) |
mainOutputCSV | Path to the document CSV file listing documents and their weights in main topics **** | Yes | "" (no export) |
subOutputCSV | Path to the document CSV file listing documents and their weights in sub topics **** | Yes | "" (no export) |
outputCSV | Path to the document CSV file listing documents and their weights in both main and sub topics; if the model is non-hierarchical, this is equivalent to mainOutputCSV **** | Yes | "" (no export) |
numWordId | Number of labels used to identify topics in document CSV files | Yes | 3 |
- * These paths are relative to the data directory;
- ** This default value implies a non-hierarchical model, if the model type meta-parameter is set to hierarchical, a path must be provided;
- *** This gets overwritten by the document fields meta-parameter (if set);
- **** These paths are relative to the output directory.
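The defaults and requirements listed in the table can be illustrated with a short Python sketch. It is not part of the pipeline: the helper name `resolve_export_entry`, its `hierarchical` argument, and the file names in the usage lines are assumptions made for illustration, and the pipeline's own handling of the entry may differ.

```python
import copy

# Defaults as documented in the parameter table.
DEFAULTS = {
    "subTopics": "",
    "docFields": [],
    "mainOutput": "",
    "subOutput": "",
    "mainOutputCSV": "",
    "subOutputCSV": "",
    "outputCSV": "",
    "numWordId": 3,
}

# Alternative key names from the table, mapped to one canonical key.
ALIASES = {"topics": "mainTopics", "output": "mainOutput"}


def resolve_export_entry(entry, hierarchical=False):
    """Hypothetical helper: normalise aliases, apply defaults, check requirements."""
    normalised = {ALIASES.get(key, key): value for key, value in entry.items()}
    # deepcopy so the shared default list for docFields is never mutated
    resolved = {**copy.deepcopy(DEFAULTS), **normalised}
    for required in ("mainTopics", "documents"):
        if not resolved.get(required):
            raise ValueError(f"missing required key: {required}")
    if hierarchical and not resolved["subTopics"]:
        raise ValueError("subTopics is required for a hierarchical model")
    return resolved


# Made-up file names, for illustration only.
entry = {"topics": "topics.json", "documents": "docs.json", "output": "out.json"}
resolved = resolve_export_entry(entry)
print(resolved["mainTopics"], resolved["numWordId"], repr(resolved["outputCSV"]))
```

Calling the same helper with `hierarchical=True` on this entry raises a `ValueError`, mirroring the note that a sub topics path must be provided for a hierarchical model.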
The Export Model module outputs multiple files.
First, the topic JSON file, which follows a similar structure to the topic files generated by the Topic Model Modules:
{
"metadata": { ... },
"topics": [
{
"topicId": "0",
"topicIndex": 0,
"topDocs": [{
"docId": "id",
"weight": 0.7778,
"docData": {
"wordCount": 100,
"key1": "value1",
"key2": "value2",
...
}
}, ... ],
"topWords": [{"label": "risk", "weight": 85.0}, ... ],
"subTopicIds": [ ... ],
"mainTopicIds": [ ... ]
}, ...
]
}
Note that docData has been added to each top document, containing a list of key-value pairs, following the docFields specification, as well as the wordCount for that document.
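As a rough sketch of how another application might consume the exported topic JSON, the Python snippet below walks the topics and collects each topic's top labels and top documents, including their docData. The sample data, the function name `summarise_topics`, and its `num_labels` parameter are assumptions for illustration; only the key names come from the structure shown above.

```python
import json

# Made-up sample mirroring the documented structure.
sample = """
{
  "metadata": {},
  "topics": [
    {
      "topicId": "0",
      "topicIndex": 0,
      "topDocs": [
        {"docId": "a", "weight": 0.7778,
         "docData": {"wordCount": 100, "key1": "value1"}}
      ],
      "topWords": [{"label": "risk", "weight": 85.0}]
    }
  ]
}
"""


def summarise_topics(model, num_labels=3):
    """Return (topicId, top labels, [(docId, weight, docData), ...]) per topic."""
    rows = []
    for topic in model["topics"]:
        labels = [w["label"] for w in topic.get("topWords", [])[:num_labels]]
        docs = [
            (d["docId"], d["weight"], d.get("docData", {}))
            for d in topic.get("topDocs", [])
        ]
        rows.append((topic["topicId"], labels, docs))
    return rows


summary = summarise_topics(json.loads(sample))
for topic_id, labels, docs in summary:
    print(topic_id, "-".join(labels), docs)
```

In practice the sample string would be replaced by the contents of the file written to the output or mainOutput path.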
Second, the document CSV file, if set in the specification, which follows this structure:
"_docId", "key1", "key2", ..., "_wordCount", "_inModel", "_inferred", "_mainTopic_topic-1-labels", "_mainTopic_topic-2-labels", ...
"0", "value1", "value2", ..., "107", "true", "false", "0.0197", "0.0099", ...
Each row represents a document, with key1, key2, etc. being the keys set in docFields. The CSV also includes the wordCount per document, whether the document was included in the model or not, and whether the document was inferred or not (this is only set if at least one document was inferred). Finally, for each topic, identified by a list of its top labels, there is the weight of that topic in the document. Each topic identifier is also annotated with either _mainTopic_ or _subTopic_ to help identify which model it is from.
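The column layout described above lends itself to simple post-processing. The following Python sketch, under the assumption that topic columns can be recognised by their _mainTopic_ or _subTopic_ prefix, finds the highest-weighted topic column for each document. The sample rows, the topic labels in the header, and the function name `dominant_topics` are made up for illustration.

```python
import csv
import io

# Made-up sample following the documented header layout.
sample = (
    '"_docId","key1","_wordCount","_inModel","_inferred",'
    '"_mainTopic_risk-market-price","_mainTopic_health-care-policy"\n'
    '"0","value1","107","true","false","0.0197","0.0099"\n'
)

TOPIC_PREFIXES = ("_mainTopic_", "_subTopic_")


def dominant_topics(csv_text):
    """Map each _docId to the topic column carrying its largest weight."""
    result = {}
    # skipinitialspace tolerates a space after each comma, if present
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    for row in reader:
        weights = {
            name: float(value)
            for name, value in row.items()
            if name and name.startswith(TOPIC_PREFIXES) and value
        }
        if weights:
            result[row["_docId"]] = max(weights, key=weights.get)
    return result


best = dominant_topics(sample)
print(best)
```

Filtering on the prefix rather than on column position keeps the sketch independent of how many docFields keys were exported and of whether the _inferred column is present.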