InferenceModule_v2 - Strategic-Futures-Lab/Topic_Mapping_Pipeline GitHub Wiki

Topic Mapping Pipeline


Infer Documents Module

The Infer Documents module generates the topic distribution of new documents from an existing model. For that, it reads a model previously built by the Topic Model module and new documents lemmatised by the Lemmatise Module. The new distributions are saved in a Document JSON file and optionally in Document CSV file(s).

The Infer Documents module is contained in the P3_TopicModelling package.

Specifications

mainTopicsOutput // merged main topics output, optional, defaults to "" (output dir)
mainTopics // main topics JSON file, required if mainTopicsOutput is set (modelDir)
subTopicsOutput // merged sub topics output, optional, defaults to "" (output dir)
subTopics // sub topics JSON file, required if subTopicsOutput is set (modelDir)

The Infer Documents module entry in the project file should have the following structure:

{...
  "inferDocuments": {
    "lemmas": "path",
    "modelDir": "path",
    "model" | "mainModel": "path",
    "subModel": "path",
    "iterations": 1000,
    "outputDir": "path",
    "csvOutput": "path",
    "docFields": ["key", ... ],
    "numWordId": 3,
    "documentsOutput": "path",
    "documents": "path",
    "topicsOutput" | "mainTopicsOutput": "path",
    "topics" | "mainTopics": "path",
    "subTopicsOutput": "path",
    "subTopics": "path"
  },
...}
| Name | Description | Optional | Default |
| --- | --- | --- | --- |
| lemmas | Path to the lemmas JSON file with documents to infer * | No | |
| modelDir | Path to the directory containing all the original model data files ** | Yes | "" |
| model or mainModel (if model is hierarchical) | Path to the serialised (main) topic model *** | No | |
| subModel | Path to the serialised (sub) topic model *** | Required if the model is hierarchical | "" **** |
| iterations | Number of iterations for the inference | Yes | 100 |
| outputDir | Path to the directory that will contain all the data generated * | Yes | "" |
| csvOutput | Path to the document CSV file exporting the inference data ***** | Yes | "" (no export) |
| docFields | List of keys, in documents' docData, to export on the document CSV file ****** | Yes | [] |
| numWordId | Number of labels used to identify topics in the document CSV file | Yes | 3 |
| documentsOutput | Path to the documents JSON file exporting the merged list of original model documents and inferred documents ***** | Yes | "" (no export) |
| documents | Path to the original documents JSON file *** | Required if exporting documentsOutput | |
| topicsOutput or mainTopicsOutput (if model is hierarchical) | Path to the topics JSON file exporting the list of (main) topics, with data from the inferred documents included ***** | Yes | "" (no export) |
| topics or mainTopics (if model is hierarchical) | Path to the original (main) topics JSON file *** | Required if exporting topicsOutput (or mainTopicsOutput) | |
| subTopicsOutput | Path to the topics JSON file exporting the list of sub topics, with data from the inferred documents included ***** | Yes | "" (no export) |
| subTopics | Path to the original sub topics JSON file *** | Required if exporting subTopicsOutput | |
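
The dependencies between these options (an original file is required whenever the matching merged output is requested) can be checked before running the pipeline. The following is a minimal Python sketch, not part of the pipeline itself: the helper name validate_infer_config and the example file names are hypothetical, and it assumes that using the mainModel key is what signals a hierarchical model.

```python
def validate_infer_config(cfg):
    """Return a list of problems found in an inferDocuments entry (empty if none)."""
    problems = []
    if not cfg.get("lemmas"):
        problems.append("lemmas is required")
    has_main = bool(cfg.get("mainModel"))
    if not (cfg.get("model") or has_main):
        problems.append("model (or mainModel) is required")
    # Assumption: the mainModel key is only used for hierarchical models.
    if has_main and not cfg.get("subModel"):
        problems.append("subModel is required for a hierarchical model")
    # Each merged output needs the matching original file as input.
    pairs = [
        (("documentsOutput",), ("documents",)),
        (("topicsOutput", "mainTopicsOutput"), ("topics", "mainTopics")),
        (("subTopicsOutput",), ("subTopics",)),
    ]
    for outputs, inputs in pairs:
        wants_output = any(cfg.get(key) for key in outputs)
        has_input = any(cfg.get(key) for key in inputs)
        if wants_output and not has_input:
            problems.append(
                f"{' or '.join(inputs)} is required when {' or '.join(outputs)} is set"
            )
    return problems


# Hypothetical entry: a merged documents file is requested,
# but the original documents file is not given.
example = {
    "lemmas": "lemmas.json",
    "model": "model.ser",
    "documentsOutput": "merged.json",
}
for problem in validate_infer_config(example):
    print(problem)
```

An empty list means the entry satisfies the requirements listed in the table; it does not check that the paths exist.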

Output

The Infer Documents module can output multiple files.

First, the document JSON file, which follows a similar structure to the document file generated by the Topic Model Modules:

{
  "metadata":{...
    "nTopicsMain":20,
    "nTopicsSub":30
  },
  "documents":[
    {
      "docId":"0",
      "docIndex":0,
      "numLemmas":107,
      "docData":{"key": "value", ...},
      "mainTopicDistribution":[ ... ],
      "subTopicDistribution":[ ... ]
    },{
      "docId": "1",
      "docIndex": 1,
      "tooShort": true,
      "numLemmas": 2,
      "docData": {"key": "value2", ...}
    },...
  ]
}

As with the Topic Model Modules, this file builds on top of the lemma JSON file. To the metadata, it adds:

  • the number of topics in the main model, nTopicsMain;
  • the number of topics in the sub model, nTopicsSub, if a sub model is also used for inference.

To each document it adds (if the document was not removed):

  • mainTopicDistribution, the list of topic weights from the main model;
  • subTopicDistribution, the list of topic weights from the sub model, if used for inference.

Note that the docData will also be adjusted to fit with the docFields specification.
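
As an illustration of how this file might be consumed, the sketch below finds the heaviest main topic of each inferred document and skips removed documents, which carry no distribution. It is a hedged example only: the function name dominant_topics is hypothetical, the sample data is made up, and it assumes each distribution is a flat list of numeric weights indexed by topic.

```python
import json


def dominant_topics(doc_file):
    """Map docId -> (topic index, weight) of the heaviest main topic."""
    result = {}
    for doc in doc_file["documents"]:
        weights = doc.get("mainTopicDistribution")
        if not weights:
            # Removed documents (e.g. tooShort) have no distribution.
            continue
        best = max(range(len(weights)), key=weights.__getitem__)
        result[doc["docId"]] = (best, weights[best])
    return result


# Made-up sample following the structure above
# (assumption: weights are plain numbers).
sample = json.loads("""
{
  "metadata": {"nTopicsMain": 3},
  "documents": [
    {"docId": "0", "docIndex": 0, "numLemmas": 107, "docData": {},
     "mainTopicDistribution": [0.2, 0.7, 0.1]},
    {"docId": "1", "docIndex": 1, "tooShort": true, "numLemmas": 2,
     "docData": {}}
  ]
}
""")
print(dominant_topics(sample))  # {'0': (1, 0.7)}
```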

Then, the document CSV file(s), if set in the specifications, which follow this structure:

"_docId", "key1",   "key2",   ..., "_wordCount", "_inModel", "_inferred", "_mainTopic_topic-1-labels", "_mainTopic_topic-2-labels", ...
"0",      "value1", "value2", ..., "107",        "true",     "true",      "0.0197",                    "0.0099",                    ...

Each row represents a document, with key1, key2, etc. being the keys set in docFields. The CSV also includes the word count of each document (_wordCount), whether the document was included in the model (_inModel), and whether its topic distribution was inferred (_inferred). Finally, each topic has a column, identified by the list of its top labels, holding the weight of that topic in the document. Each topic identifier is prefixed with either _mainTopic_ or _subTopic_ to show which model it comes from.
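
The _mainTopic_ and _subTopic_ prefixes make it straightforward to separate topic columns from metadata columns when reading the CSV. The sketch below uses Python's standard csv module; the function name top_main_topics, the title column, and the topic labels are hypothetical, and skipinitialspace=True is set only in case the exported file pads fields with spaces as in the aligned sample above.

```python
import csv
import io


def top_main_topics(csv_text):
    """Return (docId, topic column, weight) for the heaviest main topic per row."""
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    # Topic columns are recognised by their prefix.
    main_cols = [name for name in reader.fieldnames if name.startswith("_mainTopic_")]
    results = []
    for row in reader:
        weights = {col: float(row[col]) for col in main_cols}
        best = max(weights, key=weights.get)
        results.append((row["_docId"], best, weights[best]))
    return results


# Made-up sample with two main topics (labels are placeholders).
sample_csv = (
    '"_docId", "title", "_wordCount", "_inModel", "_inferred", '
    '"_mainTopic_alpha-beta-gamma", "_mainTopic_delta-epsilon-zeta"\n'
    '"0", "Example", "107", "true", "true", "0.0197", "0.0099"\n'
)
print(top_main_topics(sample_csv))
# [('0', '_mainTopic_alpha-beta-gamma', 0.0197)]
```

The same filter with the _subTopic_ prefix would select the sub model columns when a hierarchical model is used.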
