Topic Mapping Pipeline

Topic Distribution Module

The Topic Distribution module reads the topic weights of the documents and computes customised topic distributions across documents or document fields, e.g. authors, organisations or years. It then saves this information either in the topic JSON files or in a separate distribution JSON file.

This module is optional, but it must be run if the BubbleMap Topic Mapping module is to be used later.

The Topic Distribution module is contained in the P4_Analysis.TopicDistribution package, in the TopicDistribution.java class.

Specifications

The Topic Distribution module entry in the project file should have the following structure:

{...
  "distributeTopics": {
    "documents": "path",
    "topics" | "mainTopics" : "path",
    "subTopics": "path",
    "output" | "mainOutput": "path",
    "subOutput": "path",
    "distributions": [ ... ]
  },
...}

Name	Description	Optional	Default
`documents`	Path to the documents JSON file *	No
`topics` or `mainTopics` (if the model is hierarchical)	Path to the input (main) topics JSON file *	No
`subTopics`	Path to the input sub topics JSON file *	Required if the model is hierarchical	`""` **
`output` or `mainOutput` (if the model is hierarchical)	Path to the output distributed (main) topics JSON *	No
`subOutput`	Path to the output distributed sub topics *	Required if the model is hierarchical
`distributions`	List of specifications for the distributions to estimate, see below	No

* These paths are relative to the data directory.
** This default value implies a non-hierarchical model. If the model type meta-parameter is set to hierarchical, a path must be provided.
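The requirements in the table above can be checked before running the pipeline. The following is a minimal sketch, not part of the pipeline itself: the function name `check_distribute_topics`, the `hierarchical` flag (standing in for the model type meta-parameter) and the file names are all hypothetical.

```python
def check_distribute_topics(entry, hierarchical=False):
    """Return a list of problems found in a "distributeTopics" entry.

    `hierarchical` stands in for the model type meta-parameter; how the
    pipeline actually exposes it is an assumption of this sketch.
    """
    problems = []

    def has_path(*keys):
        # True if any of the alternative keys holds a non-empty string.
        return any(isinstance(entry.get(k), str) and entry.get(k) for k in keys)

    if not has_path("documents"):
        problems.append("missing 'documents'")
    if not has_path("topics", "mainTopics"):
        problems.append("missing 'topics' or 'mainTopics'")
    if not has_path("output", "mainOutput"):
        problems.append("missing 'output' or 'mainOutput'")
    if hierarchical:
        # Sub topic paths are only required for hierarchical models.
        if not has_path("subTopics"):
            problems.append("hierarchical model needs 'subTopics'")
        if not has_path("subOutput"):
            problems.append("hierarchical model needs 'subOutput'")
    if not isinstance(entry.get("distributions"), list):
        problems.append("'distributions' must be a list")
    return problems


entry = {
    "documents": "docs.json",  # hypothetical file names
    "mainTopics": "mainTopics.json",
    "mainOutput": "mainTopicsDistributed.json",
    "distributions": [{"fieldName": "author"}],
}
flat_problems = check_distribute_topics(entry)
hier_problems = check_distribute_topics(entry, hierarchical=True)
```

With this entry, `flat_problems` is empty, while `hier_problems` reports the two missing sub topic paths.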

The Topic Distribution module allows for multiple distributions to be calculated simultaneously. Each distribution is specified using an object in the distributions field shown above. A distribution specification has the following structure:

{...
  "distributeTopics": {...
    "distributions": [{
      "fieldName": "key",
      "fieldSeparator": "-",
      "valueField": "key2",
      "topPerTopic": 3,
      "output": "path",
      "domainData": "path",
      "domainDataId": "key",
      "domainDataFields": {"key": "value", ...}
    }, ...]
  },
...}

Name	Description	Optional	Default
`fieldName`	Key in the documents' `docData` that sets the distribution domain, e.g. `"institution"` or `"author"`	Yes	`""` (No domain)
`fieldSeparator`	String used to split the `docData` value into unique domain entries, e.g. an author field containing `Name1 & Name2` is split into `Name1` and `Name2` using `&`	Yes	`""` (No Split)
`valueField`	Key in the documents' `docData` used to weight the distribution values, e.g. money	Yes	`""` (No weighting)
`topPerTopic`	Number of domain entries to keep per topic in the distribution data *	Yes	`0` (Only save total per topic)
`output`	Path to the separate distribution JSON file where the distribution data should be saved **	Yes	`""` (Save in the topics JSON file)
`domainData`	Path to a CSV file containing additional data about the distribution domain ***	Yes	`""` (No additional data added)
`domainDataId`	Column name, in `domainData`, holding the identifier that matches the `fieldName` value of each domain entry	Yes	`"id"`
`domainDataFields`	Columns from `domainData` to include, given as an object mapping output key to column name: `{"a":"A"}` -> include column `A` under key `a`	Yes	Empty object

* Setting topPerTopic to -1 saves all entries in the distribution domain for each topic. Setting it to 0 saves only the totals for each topic.
** This path is relative to the output directory. If unset or empty, the distribution data will be saved with the topics, in the topic JSON file(s) instead.
*** This path is relative to the source directory. Note that this additional domain data is only saved if the distribution is set to be written in a separate distribution JSON file.
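The defaults and footnotes above interact in a few ways that are easy to miss, for example domain data being dropped when no separate output file is set. The sketch below resolves a distribution specification against the documented defaults and summarises the outcome. The helper names `with_defaults` and `describe` and the file name are hypothetical, and this is not how the module itself is implemented.

```python
# Defaults as listed in the table above.
DISTRIBUTION_DEFAULTS = {
    "fieldName": "",
    "fieldSeparator": "",
    "valueField": "",
    "topPerTopic": 0,
    "output": "",
    "domainData": "",
    "domainDataId": "id",
    "domainDataFields": {},
}


def with_defaults(spec):
    """Fill in any option missing from a distribution specification."""
    resolved = {**DISTRIBUTION_DEFAULTS, **spec}
    # Copy the nested object so the shared default is never mutated.
    resolved["domainDataFields"] = dict(resolved["domainDataFields"])
    return resolved


def describe(spec):
    """Summarise what a specification would save, per the footnotes above."""
    s = with_defaults(spec)
    top = s["topPerTopic"]
    if top == -1:
        kept = "all domain entries"
    elif top == 0:
        kept = "totals only"
    else:
        kept = f"top {top} domain entries"
    separate = s["output"] != ""
    return {
        "kept_per_topic": kept,
        "destination": "separate distribution file" if separate else "topics JSON file",
        # Domain data is only written alongside a separate distribution file.
        "domain_data_saved": separate and s["domainData"] != "",
    }


# "authors.csv" is a hypothetical file name.
summary = describe({"fieldName": "author", "topPerTopic": 3, "domainData": "authors.csv"})
```

Here `summary` reports that the top 3 entries are kept in the topics JSON file and that the domain data is not saved, because no `output` path was given.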

The image below illustrates the results of using some of these options.

[Figure: Distribution Options]
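To make the effect of fieldName, fieldSeparator, valueField and topPerTopic more concrete, here is a rough sketch of one way such a distribution could be computed. It rests on several assumptions that the text above does not confirm: each document is represented as a dictionary with a `docData` object and a hypothetical `weights` list holding one weight per topic, a split entry receives the full document weight, and the total is summed over documents. The module's actual logic may differ.

```python
from collections import defaultdict


def distribute(documents, n_topics, field_name="", field_separator="",
               value_field="", top_per_topic=0):
    """Aggregate document topic weights per domain entry (illustrative only)."""
    results = []
    for t in range(n_topics):
        per_entry = defaultdict(float)
        total = 0.0
        for doc in documents:
            data = doc["docData"]
            weight = doc["weights"][t]  # hypothetical document layout
            if value_field:
                # Assumption: the value field scales the topic weight.
                weight *= float(data[value_field])
            total += weight
            if field_name:
                raw = str(data[field_name])
                entries = raw.split(field_separator) if field_separator else [raw]
                for entry in entries:
                    # Assumption: every split entry gets the full weight.
                    per_entry[entry.strip()] += weight
        ranked = sorted(per_entry.items(), key=lambda kv: kv[1], reverse=True)
        if top_per_topic >= 0:
            # 0 keeps only the total; a negative value (-1) keeps every entry.
            ranked = ranked[:top_per_topic]
        results.append({
            "topicId": str(t),
            "total": total,
            "distribution": [{"id": k, "weight": v} for k, v in ranked],
        })
    return results


docs = [
    {"docData": {"author": "Ann & Bob", "money": "100"}, "weights": [0.5, 0.5]},
    {"docData": {"author": "Ann", "money": "50"}, "weights": [1.0, 0.0]},
]
all_entries = distribute(docs, 2, "author", "&", "money", top_per_topic=-1)
totals_only = distribute(docs, 2, "author", "&", "money", top_per_topic=0)
```

In this toy data set, topic 0 gets a total of 100.0 with Ann (100.0) ranked ahead of Bob (50.0), and `totals_only` keeps the totals with empty distribution lists.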

Output

The distributions generated by the Topic Distribution module can be saved in two ways:

in a separate distribution JSON file;
in the topic JSON file.

The distribution JSON file has the following structure:

{
  "distributionField": "fieldName",
  "distributionValue": "valueName",
  "mainTopics": [
    {
      "topicId": "0",
      "total": 45.0,
      "distribution": [ { "id": "fieldValue1", "weight": 10.0}, ... ]
    },
  ...],
  "subTopics":  [ ... ],
  "domainData": {
    "fieldValue1": {  "dataKey": "dataValue", ... },
    ...
  }
}

distributionField and distributionValue record the fieldName and valueField settings of the distribution, respectively, when these are set in the specification (valueName in the example above stands for the valueField setting).

The mainTopics list contains an entry for each of the topics in the mainTopics JSON file:

topicId is the topic identifier;
total is the distribution sum;
distribution lists, for each unique value of fieldName (identified by id), the topic weight; the list is limited to topPerTopic entries unless topPerTopic is set to -1.

The subTopics list is only saved if sub topics were provided; its structure matches that of mainTopics.

The domainData object is only saved if an additional domain data CSV file was provided. It records, for each unique value of fieldName, its associated data, as per the domainDataFields specification.
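A downstream script can join each distribution entry with its domain data using the structure described above. The sketch below assumes only that structure. The function name `top_entries` and the sample values (institution names, the `country` key) are made up for illustration.

```python
import json


def top_entries(distribution, topic_id, level="mainTopics"):
    """Return (id, weight, domain data) tuples for one topic.

    `level` is "mainTopics" or "subTopics". Entries without domain
    data get an empty dictionary.
    """
    domain_data = distribution.get("domainData", {})
    for topic in distribution.get(level, []):
        if topic["topicId"] == topic_id:
            return [
                (e["id"], e["weight"], domain_data.get(e["id"], {}))
                for e in topic.get("distribution", [])
            ]
    return []


# Made-up sample following the documented structure.
sample = json.loads("""
{
  "distributionField": "institution",
  "distributionValue": "money",
  "mainTopics": [
    {
      "topicId": "0",
      "total": 45.0,
      "distribution": [
        {"id": "UniA", "weight": 10.0},
        {"id": "UniB", "weight": 5.0}
      ]
    }
  ],
  "domainData": {
    "UniA": {"country": "UK"}
  }
}
""")
entries = top_entries(sample, "0")
```

For a real file, the `json.loads` call would be replaced by `json.load` on the opened distribution file. The `.get` lookups keep the sketch working when `subTopics` or `domainData` are absent.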

If saved in the topic JSON files, each topic entry gets two additional lists:

{...
"topics": [
   {
     "topicId": "0",
     "topicIndex": 0,
     "subTopicIds": [ ... ],
     "topDocs": [ ... ],
     "topWords": [ ... ],
     "totals": [
       {
         "weight": 247.0,
         "id": "fieldName-valueName"
       }, ...
     ],
     "distributions": [
       {
         "topWeights": [ {"weight": 59.0, "id": "fieldValue1" }, ... ],
         "field": "fieldName",
         "value": "valueName"
       }, ...
     ]
   }, ...
 ]
...}

totals lists the topic total (weight) of each distribution saved. Each total is identified (id) by the concatenation of the fieldName and valueName, if these were specified.

distributions lists all the distributions for that topic, recording the field and value used, and listing, for each unique value of fieldName (identified with id), the topic weight (limited to topPerTopic entries if not set to -1).
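When the distributions are stored in the topic JSON files, a consumer has to pick the right entry out of the totals and distributions lists. The sketch below shows one way to do that for a single topic entry. The helper names and sample values are hypothetical, and treating an unspecified field or value as an empty string is an assumption, not documented behaviour.

```python
def distribution_for(topic, field, value=""):
    """Return the topWeights list recorded for a given field/value pair."""
    for dist in topic.get("distributions", []):
        # Assumption: a missing "field" or "value" key means "not specified".
        if dist.get("field", "") == field and dist.get("value", "") == value:
            return dist["topWeights"]
    return []


def total_for(topic, total_id):
    """Return the total weight stored under the given id, or None."""
    for total in topic.get("totals", []):
        if total["id"] == total_id:
            return total["weight"]
    return None


# Made-up topic entry following the documented structure.
topic = {
    "topicId": "0",
    "totals": [{"weight": 247.0, "id": "institution-money"}],
    "distributions": [
        {
            "topWeights": [{"weight": 59.0, "id": "UniA"}],
            "field": "institution",
            "value": "money",
        }
    ],
}
weights = distribution_for(topic, "institution", "money")
total = total_for(topic, "institution-money")
```

The total is looked up by its full id rather than rebuilt from the field and value names, since the exact concatenation rule when only one of them is set is not spelled out above.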

TopicDistributionModule_v2 - Strategic-Futures-Lab/Topic_Mapping_Pipeline GitHub Wiki
