InferenceModule_v2 - Strategic-Futures-Lab/Topic_Mapping_Pipeline GitHub Wiki
The Infer Documents module generates the topic distribution of new documents from an existing model. For that, it reads a model previously built by the Topic Model module and new documents lemmatised by the Lemmatise Module. The new distributions are saved in a Document JSON file and optionally in Document CSV file(s).
The Infer Documents module is contained in the `P3_TopicModelling` package.
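Notes on the topic merge options:
- `mainTopicsOutput`: merged main topics output, optional, defaults to "" (relative to the output directory);
- `mainTopics`: main topics JSON file, required if `mainTopicsOutput` is set (relative to `modelDir`);
- `subTopicsOutput`: merged sub topics output, optional, defaults to "" (relative to the output directory);
- `subTopics`: sub topics JSON file, required if `subTopicsOutput` is set (relative to `modelDir`).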
The Infer Documents module entry in the project file should have the following structure:
{...
"inferDocuments": {
"lemmas": "path",
"modelDir": "path",
"model" | "mainModel": "path",
"subModel": "path",
"iterations": 1000,
"outputDir": "path",
"csvOutput": "path",
"docFields": ["key", ... ],
"numWordId": 3,
"documentsOutput": "path",
"documents": "path",
"topicsOutput" | "mainTopicsOutput": "path",
"topics" | "mainTopics": "path",
"subTopicsOutput": "path",
"subTopics": "path"
},
...}
Name | Description | Optional | Default |
---|---|---|---|
`lemmas` | Path to the lemmas JSON file with documents to infer * | No | |
`modelDir` | Path to the directory containing all the original model data files ** | Yes | "" |
`model` or `mainModel` (if model is hierarchical) | Path to the serialised (main) topic model *** | No | |
`subModel` | Path to the serialised (sub) topic model *** | Required if the model is hierarchical | "" **** |
`iterations` | Number of iterations for the inference | Yes | 100 |
`outputDir` | Path to the directory that will contain all the data generated * | Yes | "" |
`csvOutput` | Path to the document CSV file exporting the inference data ***** | Yes | "" (no export) |
`docFields` | List of keys, in documents' `docData`, to export on the document CSV file ****** | Yes | [] |
`numWordId` | Number of labels used to identify topics in the document CSV file | Yes | 3 |
`documentsOutput` | Path to the documents JSON file exporting the merged list of original model documents and inferred documents ***** | Yes | "" (no export) |
`documents` | Path to the original documents JSON file *** | Required if exporting `documentsOutput` | |
`topicsOutput` or `mainTopicsOutput` (if model is hierarchical) | Path to the topics JSON file exporting the list of (main) topics, with data from the inferred documents included ***** | Yes | "" (no export) |
`topics` or `mainTopics` (if model is hierarchical) | Path to the original (main) topics JSON file *** | Required if exporting `topicsOutput` (or `mainTopicsOutput`) | |
`subTopicsOutput` | Path to the topics JSON file exporting the list of sub topics, with data from the inferred documents included ***** | Yes | "" (no export) |
`subTopics` | Path to the original sub topics JSON file *** | Required if exporting `subTopicsOutput` | |

- * These paths are relative to the data directory;
- ** This path is relative to the source directory;
- *** These paths are relative to `modelDir`;
- **** This default value implies a non-hierarchical model; if the model type meta-parameter is set to `hierarchical`, a path must be provided;
- ***** These paths are relative to `outputDir`;
- ****** This gets overwritten by the document fields meta-parameter (if set).
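The defaults, conditional requirements and path rules in the table can be sketched as a small check. This is an illustration only, not the pipeline's own code: the function names `normalise_entry` and `resolve_paths`, the `hierarchical` flag (standing in for the model type meta-parameter) and the error messages are all assumptions.

```python
import os

# Defaults copied from the parameter table; all names below are hypothetical.
DEFAULTS = {
    "modelDir": "", "subModel": "", "iterations": 100, "outputDir": "",
    "csvOutput": "", "docFields": [], "numWordId": 3,
    "documentsOutput": "", "topicsOutput": "", "subTopicsOutput": "",
}


def normalise_entry(entry, hierarchical=False):
    """Apply the table defaults and check the conditional requirements."""
    cfg = dict(DEFAULTS)
    cfg.update(entry)
    # Fold the hierarchical aliases into the short parameter names.
    for short, alias in (("model", "mainModel"),
                         ("topics", "mainTopics"),
                         ("topicsOutput", "mainTopicsOutput")):
        alias_value = cfg.pop(alias, "")
        if not cfg.get(short):
            cfg[short] = alias_value

    errors = []
    if not cfg.get("lemmas"):
        errors.append("lemmas is required")
    if not cfg["model"]:
        errors.append("model (or mainModel) is required")
    if hierarchical and not cfg["subModel"]:
        errors.append("subModel is required for a hierarchical model")
    # Each merged export needs the matching original file.
    for output, source in (("documentsOutput", "documents"),
                           ("topicsOutput", "topics"),
                           ("subTopicsOutput", "subTopics")):
        if cfg[output] and not cfg.get(source):
            errors.append(f"{source} is required when {output} is set")
    if errors:
        raise ValueError("; ".join(errors))
    return cfg


def resolve_paths(cfg, data_dir, source_dir):
    """Resolve each path against the directory named in the footnotes."""
    model_dir = os.path.join(source_dir, cfg["modelDir"])   # footnote **
    output_dir = os.path.join(data_dir, cfg["outputDir"])   # footnote *
    paths = {"lemmas": os.path.join(data_dir, cfg["lemmas"])}
    for key in ("model", "subModel", "documents", "topics", "subTopics"):
        if cfg.get(key):
            paths[key] = os.path.join(model_dir, cfg[key])   # footnote ***
    for key in ("csvOutput", "documentsOutput", "topicsOutput", "subTopicsOutput"):
        if cfg.get(key):
            paths[key] = os.path.join(output_dir, cfg[key])  # footnote *****
    return paths
```

For example, an entry that sets `documentsOutput` without `documents` raises a `ValueError` here, mirroring the "Required if exporting" rows of the table.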
The Infer Documents module can output several files.
First, the document JSON file, which follows a similar structure to the document file generated by the Topic Model Modules:
{
"metadata":{...
"nTopicsMain":20,
"nTopicsSub":30
},
"documents":[
{
"docId":"0",
"docIndex":0,
"numLemmas":107,
"docData":{"key": "value", ...},
"mainTopicDistribution":[ ... ],
"subTopicDistribution":[ ... ]
},{
"docId": "1",
"docIndex": 1,
"tooShort": true,
"numLemmas": 2,
"docData": {"key": "value2", ...}
},...
]
}
As with the Topic Model Modules, this file builds on top of the lemma JSON file. To the `metadata`, it adds:
- the number of topics in the main model, `nTopicsMain`;
- the number of topics in the sub model, `nTopicsSub`, if a sub model is also used for inference.
To each entry in `documents`, it adds (if the document was not removed):
- `mainTopicDistribution`, the list of topic weights from the main model;
- `subTopicDistribution`, the list of topic weights from the sub model, if used for inference.
Note that the `docData` will also be adjusted to fit the `docFields` specification.
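As a rough illustration of how this file might be consumed, the sketch below finds the heaviest main topic for each inferred document and skips documents that carry no distribution (such as those flagged `tooShort`). The sample values and the helper name `dominant_main_topics` are made up for the example; only the field names come from the structure above.

```python
import json

# Made-up sample following the document JSON structure described above.
SAMPLE = """
{
  "metadata": {"nTopicsMain": 3},
  "documents": [
    {"docId": "0", "docIndex": 0, "numLemmas": 107,
     "docData": {"title": "A"},
     "mainTopicDistribution": [0.2, 0.7, 0.1]},
    {"docId": "1", "docIndex": 1, "tooShort": true, "numLemmas": 2,
     "docData": {"title": "B"}}
  ]
}
"""


def dominant_main_topics(doc_file):
    """Map each docId to the index of its heaviest main topic."""
    result = {}
    for doc in doc_file["documents"]:
        weights = doc.get("mainTopicDistribution")
        if weights is None:
            # Removed documents (e.g. tooShort) have no distribution.
            continue
        result[doc["docId"]] = max(range(len(weights)), key=weights.__getitem__)
    return result


print(dominant_main_topics(json.loads(SAMPLE)))  # {'0': 1}
```

The same loop would work on `subTopicDistribution` when a sub model was used for inference.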
Then, the document CSV file(s), if set in the specifications, following this structure:
"_docId", "key1", "key2", ..., "_wordCount", "_inModel", "_inferred", "_mainTopic_topic-1-labels", "_mainTopic_topic-2-labels", ...
"0", "value1", "value2", ..., "107", "true", "true", "0.0197", "0.0099", ...
Each row represents a document, with `key1`, `key2`, etc. being the keys set in `docFields`. The CSV also includes the word count per document (`_wordCount`), whether the document was included in the model (`_inModel`), and whether its topic distribution has been inferred (`_inferred`). Finally, for each topic, identified by a list of its top labels, there is the weight of that topic in the document. Each topic identifier is prefixed with either `_mainTopic_` or `_subTopic_` to show which model it comes from.
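A minimal sketch of reading such a file with Python's standard `csv` module follows. The topic labels, the `title` field and the helper name `topic_weights` are invented for the example, and `skipinitialspace=True` is used only because the sample row above shows a space after each comma; the real export may not need it.

```python
import csv
import io

# Invented sample following the CSV structure described above.
SAMPLE_CSV = '''"_docId", "title", "_wordCount", "_inModel", "_inferred", "_mainTopic_cat-dog-pet", "_mainTopic_tax-law-court"
"0", "A", "107", "true", "true", "0.0197", "0.0099"
'''


def topic_weights(csv_text):
    """Return (docId, {(model, labels): weight}) for every row."""
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    rows = []
    for row in reader:
        weights = {}
        for column, value in row.items():
            # Topic columns are recognised by their model prefix.
            for prefix in ("_mainTopic_", "_subTopic_"):
                if column.startswith(prefix):
                    model = prefix.strip("_")  # "mainTopic" or "subTopic"
                    weights[(model, column[len(prefix):])] = float(value)
        rows.append((row["_docId"], weights))
    return rows


for doc_id, weights in topic_weights(SAMPLE_CSV):
    print(doc_id, weights)
```

Keying the weights by `(model, labels)` keeps main and sub topics apart even if two topics happen to share the same top labels.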