TopicDistributionModule_v2 - Strategic-Futures-Lab/Topic_Mapping_Pipeline GitHub Wiki

Topic Mapping Pipeline


Topic Distribution Module

The Topic Distribution module reads the topic weights in documents to compute customised topic distributions across documents or document fields (e.g. authors, organisations, years). It then saves this information either in the topic JSON files or in a separate distribution JSON file.

This module is optional, but it is required if the BubbleMap Topic Mapping module is used later.

The Topic Distribution module is contained in the P4_Analysis.TopicDistribution package, in the TopicDistribution.java class.

Specifications

The Topic Distribution module entry in the project file should have the following structure:

{...
  "distributeTopics": {
    "documents": "path",
    "topics" | "mainTopics" : "path",
    "subTopics": "path",
    "output" | "mainOutput": "path",
    "subOutput": "path",
    "distributions": [ ... ]
  },
...}
| Name | Description | Optional | Default |
| --- | --- | --- | --- |
| documents | Path to the documents JSON file * | No | |
| topics or mainTopics (if the model is hierarchical) | Path to the input (main) topics JSON file * | No | |
| subTopics | Path to the input sub topics JSON file * | Required if the model is hierarchical | "" ** |
| output or mainOutput (if the model is hierarchical) | Path to the output distributed (main) topics JSON * | No | |
| subOutput | Path to the output distributed sub topics * | Required if the model is hierarchical | |
| distributions | List of specifications for the distributions to estimate, see below | No | |
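
As an illustration of the rules in the table above, the following Python sketch checks a "distributeTopics" entry before running the pipeline. It is not part of the module (which is written in Java). The function name check_distribute_topics is hypothetical, and the rule that the presence of mainTopics or subTopics marks a hierarchical model is an assumption made for this example.

```python
def check_distribute_topics(entry):
    """Return a list of problems found in a "distributeTopics" entry (empty if none)."""
    problems = []
    if not entry.get("documents"):
        problems.append("missing 'documents'")
    # Assumption: using mainTopics or subTopics signals a hierarchical model.
    hierarchical = "mainTopics" in entry or "subTopics" in entry
    topics_key = "mainTopics" if hierarchical else "topics"
    output_key = "mainOutput" if hierarchical else "output"
    required = [topics_key, output_key]
    if hierarchical:
        # Sub topic input and output are required for hierarchical models.
        required += ["subTopics", "subOutput"]
    for key in required:
        if not entry.get(key):
            problems.append(f"missing '{key}'")
    if not isinstance(entry.get("distributions"), list):
        problems.append("'distributions' must be a list")
    return problems
```

Calling the function on a flat entry with documents, topics, output and distributions returns an empty list; a hierarchical entry lacking subOutput is reported as such.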

The Topic Distribution module allows for multiple distributions to be calculated simultaneously. Each distribution is specified using an object in the distributions field shown above. A distribution specification has the following structure:

{...
  "distributeTopics": {...
    "distributions": [{
      "fieldName": "key",
      "fieldSeparator": "-",
      "valueField": "key2",
      "topPerTopic": 3,
      "output": "path",
      "domainData": "path",
      "domainDataId": "key",
      "domainDataFields": {"key": "value", ...}
    }, ...]
  },
...}
| Name | Description | Optional | Default |
| --- | --- | --- | --- |
| fieldName | Key in each document's docData that sets the distribution domain, e.g. "institution" or "author" | Yes | "" (No domain) |
| fieldSeparator | String used to split the docData value into unique domain entries, e.g. an author field containing Name1 & Name2 is split into Name1 and Name2 using & | Yes | "" (No split) |
| valueField | Key in each document's docData used to weight the distribution values, e.g. money | Yes | "" (No weighting) |
| topPerTopic | Number of domain entries to keep per topic in the distribution data * | Yes | 0 (Only save total per topic) |
| output | Path to the separate distribution JSON file where the distribution data should be saved ** | Yes | "" (Save in the topics JSON file) |
| domainData | Path to a CSV file containing additional data about the distribution domain *** | Yes | "" (No additional data added) |
| domainDataId | Column name, in domainData, containing the same identifier as fieldName for the domain entry | Yes | "id" |
| domainDataFields | List of columns, from domainData, to include: {"a":"A"} -> include column A under key a | Yes | Empty object |
  • * Setting topPerTopic to -1 saves all entries in the distribution domain for each topic; setting it to 0 saves only the totals for each topic.
  • ** This path is relative to the output directory. If unset or empty, the distribution data is saved with the topics, in the topic JSON file(s).
  • *** This path is relative to the source directory. Note that this additional domain data is only saved if the distribution is written to a separate distribution JSON file.
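
To make the interplay of fieldName, fieldSeparator, valueField and topPerTopic more concrete, here is a minimal Python sketch of one way such a distribution could be accumulated. It is not the module's Java implementation. The topicWeights key is a hypothetical stand-in for however topic weights are stored in the documents JSON file, and several behaviours are assumptions: the topic weight is multiplied by the valueField value, each split entry receives the full contribution, the total is accumulated once per document, split entries are trimmed of whitespace, and entries are ranked by descending weight.

```python
from collections import defaultdict

def distribute(documents, n_topics, field_name="", field_separator="",
               value_field="", top_per_topic=0):
    """Sketch: for each topic, a total and a ranked list of domain entries."""
    per_topic = [defaultdict(float) for _ in range(n_topics)]
    totals = [0.0] * n_topics
    for doc in documents:
        data = doc["docData"]
        # valueField weights the contribution; without it each document counts as 1.
        value = float(data[value_field]) if value_field else 1.0
        # fieldName selects the domain; fieldSeparator splits multi-valued fields.
        if not field_name:
            entries = []
        elif field_separator:
            entries = [e.strip() for e in data[field_name].split(field_separator)]
        else:
            entries = [data[field_name]]
        # "topicWeights" is a hypothetical key holding one weight per topic.
        for topic, weight in enumerate(doc["topicWeights"]):
            contribution = weight * value
            totals[topic] += contribution
            for entry in entries:
                per_topic[topic][entry] += contribution
    result = []
    for topic in range(n_topics):
        ranked = sorted(per_topic[topic].items(), key=lambda kv: -kv[1])
        if top_per_topic == 0:
            ranked = []                       # 0: keep only the total
        elif top_per_topic > 0:
            ranked = ranked[:top_per_topic]   # N: keep the N heaviest entries
        # -1 (any negative value here): keep every entry
        result.append({"topicId": str(topic), "total": totals[topic],
                       "distribution": [{"id": k, "weight": w} for k, w in ranked]})
    return result
```

For example, with an author field split on & and weighted by a money field, a document by "Ann & Bob" contributes its weighted topic weight to both Ann and Bob under these assumptions.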

The image below illustrates the results of using some of these options.

[Figure: Distribution Options]

Output

The distributions generated by the Topic Distribution module can be saved in two ways:

  • in a separate distribution JSON file;
  • in the topic JSON file.

The distribution JSON file has the following structure:

{
  "distributionField": "fieldName",
  "distributionValue": "valueName",
  "mainTopics": [
    {
      "topicId": "0",
      "total": 45.0,
      "distribution": [ { "id": "fieldValue1", "weight": 10.0}, ... ]
    },
  ...],
  "subTopics":  [ ... ],
  "domainData": {
    "fieldValue1": {  "dataKey": "dataValue", ... },
    ...
  }
}

distributionField and distributionValue record the fieldName and valueField of the distribution, respectively, if these were set in the specifications.

The mainTopics list contains an entry for each of the topics in the mainTopics JSON file:

  • topicId is the topic identifier;
  • total is the distribution sum;
  • distribution lists the topic weight for each unique value of fieldName (identified with id), limited to topPerTopic entries unless topPerTopic is set to -1.

The subTopics list is only saved if sub topics were provided; its structure is similar to that of mainTopics.
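
A consumer of the distribution JSON file could walk both lists as in the Python sketch below. It relies only on the keys shown in the structure above; the function name top_entry_per_topic is hypothetical, and treating a missing distribution list as empty is an assumption.

```python
import json

def top_entry_per_topic(distribution_json):
    """Map (level, topicId) to the id of the heaviest entry, or None if there is none."""
    data = json.loads(distribution_json)
    result = {}
    # subTopics is absent when no sub topics were provided, so default to an empty list.
    for level in ("mainTopics", "subTopics"):
        for topic in data.get(level, []):
            entries = topic.get("distribution", [])
            best = max(entries, key=lambda e: e["weight"], default=None)
            result[(level, topic["topicId"])] = best["id"] if best else None
    return result
```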

The domainData object is only saved if an additional domain data CSV file was provided. It records, for each unique value of fieldName, its associated data, as specified by domainDataFields.
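
The mapping from the CSV file to the domainData object can be sketched in Python as follows. This is an illustration of the domainDataId and domainDataFields semantics described in the specifications, not the module's own code. The function name load_domain_data and the choice to read the CSV from a string are assumptions.

```python
import csv
import io

def load_domain_data(csv_text, domain_data_id="id", domain_data_fields=None):
    """Build {domain entry: {key: column value}} from CSV text."""
    fields = domain_data_fields or {}
    domain_data = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        # domainDataId names the column holding the same identifier as fieldName.
        entry_id = row[domain_data_id]
        # {"a": "A"} means: store the value of column "A" under key "a".
        domain_data[entry_id] = {key: row[column] for key, column in fields.items()}
    return domain_data
```

With domainDataFields set to {"country": "Country"}, a CSV row for UniA with Country UK would yield {"UniA": {"country": "UK"}}; columns not listed in domainDataFields are ignored.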

If saved in the topic JSON files, each topic entry gets two additional lists:

{...
"topics": [
   {
     "topicId": "0",
     "topicIndex": 0,
     "subTopicIds": [ ... ],
     "topDocs": [ ... ],
     "topWords": [ ... ],
     "totals": [
       {
         "weight": 247.0,
         "id": "fieldName-valueName"
       }, ...
     ],
     "distributions": [
       {
         "topWeights": [ {"weight": 59.0, "id": "fieldValue1" }, ... ],
         "field": "fieldName",
         "value": "valueName"
       }, ...
     ]
   }, ...
 ]
...}

totals lists the topic total (weight) of each distribution saved. Each total is identified (id) by the concatenation of the fieldName and valueName, if these were specified, as in "fieldName-valueName" in the example above.

distributions lists all the distributions for that topic, recording the field and value used, and listing, for each unique value of fieldName (identified with id), the topic weight (limited to topPerTopic entries if not set to -1).
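
When the distributions are stored in the topic JSON files, a total and its matching distribution can be paired up as in the Python sketch below. The function name distribution_for is hypothetical. The hyphen-joined id follows the "fieldName-valueName" example above; how the id is rendered when only one of the two is set is an assumption.

```python
def distribution_for(topic, field, value=""):
    """Return (total, top weights) for one distribution stored on a topic entry."""
    # Assumption: a missing part is simply left out of the hyphen-joined id.
    total_id = "-".join(part for part in (field, value) if part)
    total = next((t["weight"] for t in topic.get("totals", [])
                  if t["id"] == total_id), None)
    weights = next((d["topWeights"] for d in topic.get("distributions", [])
                    if d.get("field") == field and d.get("value", "") == value), [])
    return total, weights
```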
