Topic Mapping Pipeline

Topic Model Modules

The Topic Model modules use the lemmas data previously created to create one or more topic model(s) from it. The lemmas data is then saved in a Document JSON file and the generated topics are saved in Topic JSON file(s).

The Topic Model modules are contained in the P3_TopicModelling package.

List of Topic Model Modules

There are two Topic Model modules:

Topic Modelling (TopicModelling.java class), which samples a single model from the lemmas;
Hierarchical Topic Modelling (HierarchicalTopicModelling.java class), which samples two separate topic models (using the Topic Modelling module), a main model and a sub model, and creates an assignment between the two to create a two-layers hierarchical model.

Specifications

The Topic Model module entry in the project file should has the following structure:

{...
  "model": {
    "lemmas": "path",
    "modelType": "module name",
    "outputDir" | "dataDir": "path",
    "documentOutput": "path",
    "model" | "mainModel": { ... },
    "subModel": { ... },
    "hierarchy": { ... }
  }
...}

Name	Description	Optional	Default
`lemmas`	Path to the lemmatised documents file *	No
`outputDir` or `dataDir`	Path to the directory where all files generated by the module will be saved *	Yes	`""`
`modelType`	Which module to use `simple` or `hierarchical` **	No
`documentOutput`	Path to the output document JSON file ***	No
`model` or `mainModel` (if the model is hierarchical)	Specification object for the (main) topic model	No
`subModel`	Specification object for the sub topic model	Required if `modelType` is `hierarchical`
`hierarchy`	Specification object for the hierarchical assignment between main topic model and sub topic model	Required if `modelType` is `hierarchical`

* These paths are relative to the data directory;
** This gets overwritten by the model type meta-parameter (if set);
*** This path is relative to the outputDir directory.

The specifiations for mainModel (or model) and subModel follow the same structure:

{...
  "model":{...
    "mainModel": {
      "topics": 10,
      "words": 20,
      "docs": 30,
      "iterations": 2000,
      "iterationsMax": 10,
      "topicOuput": "path",
      "serialise": "path",
      // Advanced options
      "topicSimOutput": "path",
      "numWordId": 3,
      "llOutput": "path",
      "topicLogOutput": "path",
      "alphaSum": 1.0,
      "symmetricAlpha": false,
      "beta": 0.01,
      "optimInterval": 50,
      "seed" : 0,
      "wordDistances": false
    },
  ...}
...}

Name	Description	Optional	Default
`topics`	Number of topics to generate	No
`words`	Number of top words to save per topic	Yes	`20`
`documents`	Number of top documents to save per topic	Yes	`20`
`iterations`	Number of sampling iterations to perform	Yes	`1000`
`iterationsMax`	Number of maximisation iterations to perform	Yes	`0`
`topicOutput`	Path to the output topic JSON file *	No
`serialise`	Path to the output serialised model object * **	Yes	`""` (no serialisation)

* These paths are relative to the outputDir directory;
** Serialisation is necessary to later infer documents.

There are also advanced specifications:

Name	Description	Optional	Default
`topicSimOuput`	Path to the CSV file exporting the topic similarity matrix *	Yes	`""` (no export)
`numWordId`	Number of top words to use to identify topics in `topicSimOutput`	Yes	`3`
`llOutput`	Path to the JSON file exporting the model's log-likelihood logs *	Yes	`""` (no export)
`topicLogOutput`	Path to the JSON file exporting the model's topic logs *	Yes	`""` (no export)
`alphaSum`	Sum of topics' alpha values (document to topics distribution Dirichlet prior)	Yes	`1.0`
`symmetricAlpha`	Symmetry of the alpha values during optimisation	Yes	`false` (no symmetry)
`beta`	Words beta values (topic to words distribution Dirichlet prior)	Yes	`0.01`
`optimInterval`	Interval (in number of iterations) between alpha and beta values optimisations	Yes	`50`
`seed`	Index of a random seed to use **	Yes	`0`
`wordDistances`	Computation of the word distribution distances between documents and topics ***	Yes	`false` (no computation)

* These paths are relative to the outputDir directory;
** There are 100 seeds available, seed must therefore be set between 0 and 99 (included);
*** If set to true, the word distances between documents and topics will be saved in the documentOutput JSON file.

The specification for hierarchy should follow this structure:

{...
  "model":{...
    "hierarchy": {
      "assignmentType" :  "Perceptual",
      "maxAssign": 1,
      "modelSimOutput": "path",
      "assignmentOutput": "path"
    },
  ...}
...}

Name	Description	Optional	Default
`assignmentType`	Type of similarity to use for assigning sub topics to main topics: `Perceptual` based on top words overlap, `Document` based on document distributions	Yes	`Perceptual`
`maxAssign`	Maximum number of times a sub topics gets assigned to a main topic	Yes	`1`
`modelSimOutput`	Path to the CSV file exporting the model similarity matrix *	Yes	`""` (no export)
`assignmentOutput`	Path to the CSV file exporting the assignment data *	Yes	`""` (no export)

* These paths are relative to the outputDir directory.

Output

The Topic Model modules output multiple files.

First, the document JSON file, which follows a similar structure to the lemmas and corpus files:

{
  "metadata":{...
    "nTopicsMain":20,
    "nTopicsSub":30
  },
  "documents":[
    {
      "docId":"0",
      "docIndex":0,
      "numLemmas":107,
      "docData":{"key": "value", ...},
      "mainTopicDistribution":[ ... ],
      "subTopicDistribution":[ ... ],
      "mainTopicFullWordDistances":[ ... ],
      "subTopicFullWordDistances":[ ... ],
      "mainTopicCompWordDistances":[ ... ],
      "subTopicCompWordDistances":[ ... ]
    },{
      "docId": "1",
      "docIndex": 1,
      "tooShort": true,
      "numLemmas": 2,
      "docData": {"key": "value2", ...}
    },...
  ]
}

In addition to the information from the lemmas file, the metadata now also contains:

the number of topics nTopicsMain, if the simple Topic Modelling module was used;
the number of main topics nTopicsMain and sub topics nTopicsSub, if the Hierarchical Topic Modelling module was used.

Then the file has a documents list, with one object per document with the following information:

docId the document id;
docIndex the document index;
numLemmas the number of lemmas in that document;
docData the document data that was kept with docFields;
if the document inherited the removed and removeReason attributes from the lemma file, those are kept;
otherwise, the document has been used in the topic model(s) and now has topic weights data:
- mainTopicDistribution the list of (main) topic weights (regardless of which module was uses);
- subTopicDistribution the list of sub topic weights (if the Hierarchical Topic Modelling module was used);
- if wordDistances was set to true ( This is where the Hellinger Scores are saved):
  - mainTopicFullWordDistances the list of word distances between (main) topics and the full document;
  - mainTopicCompWordDistances the list of word distances between (main) topics and their related document's components;
  - subTopicFullWordDistances the list of word distances between sub topics and the full document (if the Hierarchical Topic Modelling module was used);
  - subTopicCompWordDistances the list of word distances between sub topics and their related document's components (if the Hierarchical Topic Modelling module was used).

Second, the topic JSON file(s), either one if using the simple Topic Modelling module, or two if using the Hierarchical Topic Modelling module. They roughly follow the same structure:

{
  "metadata": {...
    "nTopics": 20,
    "nDocs": 20,
    "nWords": 10
  },
  "topics": [
    {
      "topicId": "0",
      "topicIndex": 0,
      "topDocs": [{"docId": "id", "weight": 0.7778}, ... ],
      "topWords": [{"label": "risk", "weight": 85.0}, ... ],
      "subTopicIds": [ ... ],
      "mainTopicIds": [ ... ]
    }, ...
  ],
  "similarities": [ [ ... ], ... ]
}

In addition to the metadata from the lemmas JSON file, the following fields are added:

the number of topics nTopics;
the number of top documents per topic nDocs;
the number of top words per topic nWords.

Then, the file has list of topics, with one object per topic with the following fields:

topicId the topic id;
topicIndex the topic index, used for example in the documents' topic distributions or in the topicSimilarity;
topDocs the top documents for that topic, with their docId and weight;
topWords the top words for that topic, with their label and weight;
if the Hierarchical Topic Modelling module was used, an additional field is added:
- subTopicIds, if this is the main topic JSON file, containing the list of sub topic ids assigned to that main topic;
- mainTopicIds, if this the sub topic JSON file, containing the list of main topic ids assigend to that sub topic.

Finally, the topic JSON file has similarities which contains the similarity matrix between the topics in that file. The matrix is in the form of a list of list of numbers:

{...
  "similarities": [
    [0-0, 0-1, 0-2, ..., 0-n],
    [1-0, 1-1, 1-2, ..., 1-n],
    [2-0, 2-1, 2-2, ..., 2-n],
    ...,
    [n-0, n-1, n-2, ..., n-n]
  ]
}

ModelModule_v2 - Strategic-Futures-Lab/Topic_Mapping_Pipeline GitHub Wiki

Topic Mapping Pipeline

Topic Model Modules

List of Topic Model Modules

Specifications

Output

⚠️ GitHub.com Fallback ⚠️

ModelModule_v2 - Strategic-Futures-Lab/Topic_Mapping_Pipeline GitHub Wiki

Topic Mapping Pipeline

Topic Model Modules

List of Topic Model Modules

Specifications

Output

⚠️ **GitHub.com Fallback** ⚠️

⚠️ GitHub.com Fallback ⚠️