Topic Mapping Pipeline

Infer Documents Module

The Infer Documents module generates the topic distribution of new documents from an existing model. For that, it reads a model previously built by the Topic Model module and new documents lemmatised by the Lemmatise Module. The new distributions are saved in a Document JSON file and optionally in Document CSV file(s).

The Infer Documents module is contained in the P3_TopicModelling package.

Specifications

mainTopicsOutput: merged main topics output; optional, defaults to "" (relative to the output directory).
mainTopics: main topics JSON file; required if mainTopicsOutput is set (relative to modelDir).
subTopicsOutput: merged sub topics output; optional, defaults to "" (relative to the output directory).
subTopics: sub topics JSON file; required if subTopicsOutput is set (relative to modelDir).

The Infer Documents module entry in the project file should have the following structure:

{...
  "inferDocuments": {
    "lemmas": "path",
    "modelDir": "path",
    "model" | "mainModel": "path",
    "subModel": "path",
    "iterations": 1000,
    "outputDir": "path",
    "csvOutput": "path",
    "docFields": ["key", ... ],
    "numWordId": 3,
    "documentsOutput": "path",
    "documents": "path",
    "topicsOutput" | "mainTopicsOutput": "path",
    "topics" | "mainTopics": "path",
    "subTopicsOutput": "path",
    "subTopics": "path"
  },
...}

Name	Description	Optional	Default
`lemmas`	Path to the lemmas JSON file with documents to infer *	No
`modelDir`	Path to the directory containing all the original model data files **	Yes	`""`
`model` or `mainModel` (if model is hierarchical)	Path to the serialised (main) topic model ***	No
`subModel`	Path to the serialised (sub) topic model ***	Required if the model is hierarchical	`""` ****
`iterations`	Number of iterations for the inference	Yes	`100`
`outputDir`	Path to the directory that will contain all the data generated *	Yes	`""`
`csvOutput`	Path to the document CSV file exporting the inference data *****	Yes	`""` (no export)
`docFields`	List of keys, in documents' `docData`, to export on the document CSV file ******	Yes	`[]`
`numWordId`	Number of labels used to identify topics in the document CSV file	Yes	`3`
`documentsOutput`	Path to the documents JSON file exporting the merged list of original model documents and inferred documents *****	Yes	`""` (no export)
`documents`	Path to the original documents JSON file ***	Required if exporting `documentsOutput`
`topicsOutput` or `mainTopicsOutput` (if model is hierarchical)	Path to the topics JSON file exporting the list of (main) topics, with data from the inferred documents included *****	Yes	`""` (no export)
`topics` or `mainTopics` (if model is hierarchical)	Path to the original (main) topics JSON file ***	Required if exporting `topicsOutput` (or `mainTopicsOutput`)
`subTopicsOutput`	Path to the topics JSON file exporting the list of sub topics, with data from the inferred documents included *****	Yes	`""` (no export)
`subTopics`	Path to the original sub topics JSON file ***	Required if exporting `subTopicsOutput`

* These paths are relative to the data directory;
** This path is relative to the source directory;
*** These paths are relative to modelDir;
**** This default value implies a non-hierarchical model; if the model type meta-parameter is set to hierarchical, a path must be provided;
***** These paths are relative to outputDir;
****** This gets overwritten by the document fields meta-parameter (if set).
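
The requirement and path rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline's own code: the helper names `check_required` and `resolve_paths`, the `hierarchical` flag, and the example file names are all hypothetical, and the sketch assumes that either spelling of an aliased option (`model` or `mainModel`, `topics` or `mainTopics`) is accepted.

```python
import posixpath

# Options resolved against modelDir and outputDir, per the footnotes above.
MODEL_DIR_KEYS = ("model", "mainModel", "subModel", "documents",
                  "topics", "mainTopics", "subTopics")
OUTPUT_DIR_KEYS = ("csvOutput", "documentsOutput", "topicsOutput",
                   "mainTopicsOutput", "subTopicsOutput")

# Each export option and the input(s) it needs, per the table above.
REQUIRES = {
    "documentsOutput": ("documents",),
    "topicsOutput": ("topics", "mainTopics"),
    "mainTopicsOutput": ("topics", "mainTopics"),
    "subTopicsOutput": ("subTopics",),
}


def check_required(cfg, hierarchical=False):
    """Return a list of messages for missing options (empty if none)."""
    problems = []
    if not cfg.get("lemmas"):
        problems.append("lemmas is required")
    if not (cfg.get("model") or cfg.get("mainModel")):
        problems.append("model (or mainModel) is required")
    if hierarchical and not cfg.get("subModel"):
        problems.append("subModel is required for a hierarchical model")
    for output_key, input_keys in REQUIRES.items():
        if cfg.get(output_key) and not any(cfg.get(k) for k in input_keys):
            problems.append(" or ".join(input_keys)
                            + f" is required when {output_key} is set")
    return problems


def _join(base, rel):
    # normpath drops the trailing slash left by joining with "".
    return posixpath.normpath(posixpath.join(base, rel))


def resolve_paths(cfg, data_dir, source_dir):
    """Return option name -> resolved path for every option that is set."""
    model_dir = _join(source_dir, cfg.get("modelDir", ""))
    output_dir = _join(data_dir, cfg.get("outputDir", ""))
    resolved = {"modelDir": model_dir, "outputDir": output_dir}
    groups = [(("lemmas",), data_dir),
              (MODEL_DIR_KEYS, model_dir),
              (OUTPUT_DIR_KEYS, output_dir)]
    for keys, base in groups:
        for key in keys:
            if cfg.get(key):  # "" means the option is not set (no export)
                resolved[key] = _join(base, cfg[key])
    return resolved


# Hypothetical entry: file and directory names are made up for illustration.
example = {
    "lemmas": "lemmas.json",
    "modelDir": "model",
    "model": "topicModel.ser",
    "outputDir": "inferred",
    "csvOutput": "docs.csv",
    "documentsOutput": "merged.json",
}
print(check_required(example))
print(resolve_paths(example, data_dir="data", source_dir="src"))
```

In this example the check reports that `documents` is missing, because `documentsOutput` is set without it.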

Output

The Infer Documents module can output multiple files.

First, the document JSON file, which follows a similar structure to the document file generated by the Topic Model Modules:

{
  "metadata":{...
    "nTopicsMain":20,
    "nTopicsSub":30
  },
  "documents":[
    {
      "docId":"0",
      "docIndex":0,
      "numLemmas":107,
      "docData":{"key": "value", ...},
      "mainTopicDistribution":[ ... ],
      "subTopicDistribution":[ ... ]
    },{
      "docId": "1",
      "docIndex": 1,
      "tooShort": true,
      "numLemmas": 2,
      "docData": {"key": "value2", ...}
    },...
  ]
}

As with the Topic Model Modules, this file builds on top of the lemma JSON file. To the metadata, it adds:

the number of topics in the main model, nTopicsMain;
the number of topics in the sub model, nTopicsSub, if a sub model is also used for inference.

To each document it adds (if the document was not removed):

mainTopicDistribution, the list of topic weights from the main model;
subTopicDistribution, the list of topic weights from the sub model, if used for inference.

Note that the docData will also be adjusted to fit with the docFields specification.
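
As a rough illustration of how this file might be consumed, the Python sketch below picks the heaviest main topic for each document and skips documents without a distribution (such as those flagged tooShort). The function name `dominant_topics` and the sample weights are hypothetical, and the sketch assumes that a position in the distribution list corresponds to a topic index.

```python
import json


def dominant_topics(doc_json, key="mainTopicDistribution"):
    """Map docId -> (topic index, weight) for documents with a distribution."""
    result = {}
    for doc in doc_json["documents"]:
        weights = doc.get(key)
        if not weights:  # removed documents (e.g. "tooShort") have no weights
            continue
        best = max(range(len(weights)), key=weights.__getitem__)
        result[doc["docId"]] = (best, weights[best])
    return result


# Made-up sample following the structure shown above (3 topics only).
sample = json.loads("""
{
  "metadata": {"nTopicsMain": 3},
  "documents": [
    {"docId": "0", "docIndex": 0, "numLemmas": 107,
     "docData": {"key": "value"},
     "mainTopicDistribution": [0.2, 0.7, 0.1]},
    {"docId": "1", "docIndex": 1, "tooShort": true, "numLemmas": 2,
     "docData": {"key": "value2"}}
  ]
}
""")

print(dominant_topics(sample))
```

Passing `key="subTopicDistribution"` would do the same for the sub model, when one was used for inference.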

Then, the document CSV file(s), if set in the specifications, which follow this structure:

"_docId", "key1",   "key2",   ..., "_wordCount", "_inModel", "_inferred", "_mainTopic_topic-1-labels", "_mainTopic_topic-2-labels", ...
"0",      "value1", "value2", ..., "107",        "true",     "true",      "0.0197",                    "0.0099",                    ...

Each row represents a document, with key1, key2, etc. being the keys set in docFields. The CSV also includes the word count of each document (_wordCount), whether the document was included in the model (_inModel), and whether its topic distribution was inferred (_inferred). Finally, for each topic, identified by a list of its top labels, a column gives the weight of that topic in the document. Each topic identifier is prefixed with either _mainTopic_ or _subTopic_ to indicate which model it comes from.
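
A minimal Python sketch of reading such a file is given below. It reuses the placeholder header names from the structure above rather than real topic labels, the helper name `topic_weights` is hypothetical, and `skipinitialspace=True` is only needed if the file really contains the padding spaces shown in the layout.

```python
import csv
import io


def topic_weights(row, prefix="_mainTopic_"):
    """Return topic identifier -> weight for the columns with this prefix."""
    return {
        name[len(prefix):]: float(value)
        for name, value in row.items()
        if name.startswith(prefix) and value != ""
    }


# Sample using the placeholder column names from the structure above.
sample_csv = (
    '"_docId", "key1",   "_wordCount", "_inModel", "_inferred", '
    '"_mainTopic_topic-1-labels", "_mainTopic_topic-2-labels"\n'
    '"0",      "value1", "107",        "true",     "true",      '
    '"0.0197",                    "0.0099"\n'
)

# skipinitialspace drops the alignment spaces that follow each comma.
rows = list(csv.DictReader(io.StringIO(sample_csv), skipinitialspace=True))
for row in rows:
    print(row["_docId"], topic_weights(row))
```

Calling `topic_weights(row, prefix="_subTopic_")` would select the sub model columns instead, when they are present.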

InferenceModule_v2 - Strategic-Futures-Lab/Topic_Mapping_Pipeline GitHub Wiki
