TopicDistributionModule_v2 - Strategic-Futures-Lab/Topic_Mapping_Pipeline GitHub Wiki
The Topic Distribution module reads the topic weights in documents to compute customised topic distributions across documents or document fields, e.g. authors, organisations or years. It then saves this information either in the topic JSON files or in a separate distribution JSON file.
Using this module is optional, but it is required if the BubbleMap Topic Mapping module is used later.
The Topic Distribution module is contained in the `P4_Analysis.TopicDistribution` package, in the `TopicDistribution.java` class.
The Topic Distribution module entry in the project file should have the following structure:
```
{...
  "distributeTopics": {
    "documents": "path",
    "topics" | "mainTopics" : "path",
    "subTopics": "path",
    "output" | "mainOutput": "path",
    "subOutput": "path",
    "distributions": [ ... ]
  },
...}
```
Name | Description | Optional | Default |
---|---|---|---|
`documents` | Path to the documents JSON file * | No | |
`topics` or `mainTopics` (if the model is hierarchical) | Path to the input (main) topics JSON file * | No | |
`subTopics` | Path to the input sub topics JSON file * | Required if the model is hierarchical | `""` ** |
`output` or `mainOutput` (if the model is hierarchical) | Path to the output distributed (main) topics JSON * | No | |
`subOutput` | Path to the output distributed sub topics * | Required if the model is hierarchical | |
`distributions` | List of specifications for the distributions to estimate, see below | No | |
- * These paths are relative to the data directory;
- ** This default value implies a non-hierarchical model. If the model type meta-parameter is set to `hierarchical`, a path must be provided.
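The requirements in the table above can be checked before running the pipeline. The following is a minimal sketch, not part of the pipeline itself: the function name `check_distribute_topics`, the `hierarchical` flag and the file names are all hypothetical and only illustrate which keys are needed in each case.

```python
import json

def check_distribute_topics(entry, hierarchical):
    """Return a list of problems found in a "distributeTopics" entry (hypothetical helper)."""
    problems = []
    if not entry.get("documents"):
        problems.append("missing 'documents'")
    # The main topics and main output paths may each be given under either key.
    if not (entry.get("topics") or entry.get("mainTopics")):
        problems.append("missing 'topics' / 'mainTopics'")
    if not (entry.get("output") or entry.get("mainOutput")):
        problems.append("missing 'output' / 'mainOutput'")
    # Sub topic paths are only required for hierarchical models.
    if hierarchical:
        for key in ("subTopics", "subOutput"):
            if not entry.get(key):
                problems.append(f"missing '{key}' (required for hierarchical models)")
    if "distributions" not in entry:
        problems.append("missing 'distributions'")
    return problems

# Hypothetical project file fragment; the file names are made up for illustration.
entry = json.loads("""
{
  "documents": "docs.json",
  "mainTopics": "mainTopics.json",
  "mainOutput": "mainTopicsDistrib.json",
  "distributions": []
}
""")

print(check_distribute_topics(entry, hierarchical=False))
print(check_distribute_topics(entry, hierarchical=True))
```

With this entry, the non-hierarchical check reports no problems, while the hierarchical check flags the missing `subTopics` and `subOutput` paths.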
The Topic Distribution module allows for multiple distributions to be calculated simultaneously. Each distribution is specified using an object in the `distributions` field shown above. A distribution specification has the following structure:
```
{...
  "distributeTopics": {...
    "distributions": [{
      "fieldName": "key",
      "fieldSeparator": "-",
      "valueField": "key2",
      "topPerTopic": 3,
      "output": "path",
      "domainData": "path",
      "domainDataId": "key",
      "domainDataFields": {"key": "value", ...}
    }, ...]
  },
...}
```
Name | Description | Optional | Default |
---|---|---|---|
`fieldName` | Key in each document's `docData` that sets the distribution domain, e.g. `"institution"` or `"author"` | Yes | `""` (no domain) |
`fieldSeparator` | String used to split the `docData` value into unique domain entries, e.g. an author field containing `Name1 & Name2` is split into `Name1` and `Name2` using `&` | Yes | `""` (no split) |
`valueField` | Key in each document's `docData` used to weight the distribution values, e.g. `money` | Yes | `""` (no weighting) |
`topPerTopic` | Number of domain entries to keep per topic in the distribution data * | Yes | `0` (only save total per topic) |
`output` | Path to the separate distribution JSON file where the distribution data should be saved ** | Yes | `""` (save in the topics JSON file) |
`domainData` | Path to a CSV file containing additional data about the distribution domain *** | Yes | `""` (no additional data added) |
`domainDataId` | Name of the column in `domainData` that holds the domain entry identifier, matching the values of `fieldName` | Yes | `"id"` |
`domainDataFields` | Columns from `domainData` to include, as a key-to-column mapping: `{"a":"A"}` includes column `A` under key `a` | Yes | Empty object |
- * Setting `topPerTopic` to `-1` will save all entries in the distribution domain for each topic; setting it to `0` will only save the totals for each topic;
- ** This path is relative to the output directory. If unset or empty, the distribution data will be saved with the topics, in the topic JSON file(s) instead;
- *** This path is relative to the source directory. Note that this additional domain data is only saved if the distribution is set to be written in a separate distribution JSON file.
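To make the interplay of `fieldName`, `fieldSeparator`, `valueField` and `topPerTopic` concrete, here is a minimal sketch of how one distribution could be estimated. It is not the module's actual implementation and rests on several assumptions: each document is taken to be a dict with a `docData` object and a hypothetical `topics` map of topic id to weight; a document contributes its topic weight multiplied by its `valueField` value (or by 1 when unweighted); the topic total is summed over documents; each domain entry counts once per document; and ties are broken alphabetically.

```python
from collections import defaultdict

def distribute(docs, spec):
    """Estimate one distribution per topic (illustrative sketch, assumed rules)."""
    field = spec.get("fieldName", "")
    sep = spec.get("fieldSeparator", "")
    value_field = spec.get("valueField", "")
    top = spec.get("topPerTopic", 0)

    totals = defaultdict(float)
    per_entry = defaultdict(lambda: defaultdict(float))
    for doc in docs:
        data = doc["docData"]
        # No valueField: every document counts with a factor of 1.
        factor = float(data[value_field]) if value_field else 1.0
        if not field:
            entries = []  # no domain, only totals are accumulated
        elif sep:
            entries = [e.strip() for e in data[field].split(sep) if e.strip()]
        else:
            entries = [data[field]]  # no split, the whole value is one entry
        for topic_id, weight in doc["topics"].items():
            totals[topic_id] += weight * factor
            for entry in set(entries):
                per_entry[topic_id][entry] += weight * factor

    result = []
    for topic_id in sorted(totals):
        # Heaviest entries first, ties broken by entry name.
        ranked = sorted(per_entry[topic_id].items(), key=lambda kv: (-kv[1], kv[0]))
        if top >= 0:  # -1 keeps every entry, 0 keeps none (totals only)
            ranked = ranked[:top]
        result.append({
            "topicId": topic_id,
            "total": totals[topic_id],
            "distribution": [{"id": k, "weight": w} for k, w in ranked],
        })
    return result

# Made-up documents for illustration only.
docs = [
    {"docData": {"author": "Name1 & Name2", "money": "100"}, "topics": {"0": 0.5, "1": 0.5}},
    {"docData": {"author": "Name2", "money": "50"}, "topics": {"0": 1.0}},
]
spec = {"fieldName": "author", "fieldSeparator": "&", "valueField": "money", "topPerTopic": 1}

for topic in distribute(docs, spec):
    print(topic)
```

Changing `topPerTopic` to `0` in `spec` leaves every `distribution` list empty while the totals are unchanged, and `-1` keeps both authors for each topic.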
The image below illustrates the results of using some of these options.
The distributions generated by the Topic Distribution module can be saved in two ways:
- in a separate distribution JSON file;
- in the topic JSON file.
The distribution JSON file has the following structure:
```
{
  "distributionField": "fieldName",
  "distributionValue": "valueName",
  "mainTopics": [
    {
      "topicId": "0",
      "total": 45.0,
      "distribution": [ { "id": "fieldValue1", "weight": 10.0}, ... ]
    },
  ...],
  "subTopics": [ ... ],
  "domainData": {
    "fieldValue1": { "dataKey": "dataValue", ... },
    ...
  }
}
```
`distributionField` and `distributionValue` record, respectively, the `fieldName` and the value name (`valueName`, i.e. the key given in `valueField`) of the distribution, if these were set in the specification.
The `mainTopics` list contains an entry for each of the topics in the `mainTopics` JSON file:
- `topicId` is the topic identifier;
- `total` is the distribution sum;
- `distribution` lists, for each unique value of `fieldName` (identified with `id`), the topic `weight` (limited to `topPerTopic` entries if not set to `-1`).
The `subTopics` list is only saved if sub topics were provided. Its structure is similar to that of `mainTopics`.
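A downstream script can read this structure with a standard JSON parser. The sketch below is illustrative only: the sample values (`author`, `money`, `Name1`, `Name2`) and the helper name `top_entries` are made up, and the sample omits `subTopics` to show that the key may be absent.

```python
import json

# Made-up sample following the distribution JSON structure described above.
distribution_json = """
{
  "distributionField": "author",
  "distributionValue": "money",
  "mainTopics": [
    {"topicId": "0", "total": 45.0,
     "distribution": [{"id": "Name1", "weight": 10.0}, {"id": "Name2", "weight": 5.0}]},
    {"topicId": "1", "total": 12.0, "distribution": []}
  ]
}
"""

def top_entries(data, level="mainTopics"):
    """Map each topicId to (total, id of the heaviest entry or None)."""
    summary = {}
    # The requested level may be missing, e.g. subTopics in a non-hierarchical model.
    for topic in data.get(level, []):
        entries = topic.get("distribution", [])
        best = max(entries, key=lambda e: e["weight"], default=None)
        summary[topic["topicId"]] = (topic["total"], best["id"] if best else None)
    return summary

data = json.loads(distribution_json)
print(top_entries(data))
print(top_entries(data, "subTopics"))
```

A topic saved with `topPerTopic` set to `0` has an empty `distribution` list, so only its total is available, as for topic `1` in the sample.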
The `domainData` object is only saved if an additional domain data CSV file was provided. It records, for each unique value of `fieldName`, its associated data, as per the `domainDataFields` specification.
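The mapping from CSV columns to `domainData` keys can be sketched as follows. This is an assumption-laden illustration rather than the module's code: the helper `load_domain_data`, the column names and the CSV content are hypothetical, and the CSV is read from a string to keep the example self-contained.

```python
import csv
import io

def load_domain_data(csv_text, id_column="id", fields=None):
    """Build {domain entry id: {key: column value}} from CSV text (hypothetical helper)."""
    fields = fields or {}
    domain_data = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        # {"a": "A"} means: store the value of column "A" under key "a".
        domain_data[row[id_column]] = {key: row[column] for key, column in fields.items()}
    return domain_data

# Made-up domain data: one row per domain entry, identified by the "id" column.
csv_text = "id,Country,Size\nOrg1,UK,large\nOrg2,FR,small\n"

print(load_domain_data(csv_text, "id", {"country": "Country"}))
```

Here only the `Country` column is kept, under the key `country`; the `Size` column is ignored because it is not listed in the mapping.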
If saved in the topic JSON files, each topic entry gets two additional lists:
```
{...
  "topics": [
    {
      "topicId": "0",
      "topicIndex": 0,
      "subTopicIds": [ ... ],
      "topDocs": [ ... ],
      "topWords": [ ... ],
      "totals": [
        {
          "weight": 247.0,
          "id": "fieldName-valueName"
        }, ...
      ],
      "distributions": [
        {
          "topWeights": [ {"weight": 59.0, "id": "fieldValue1" }, ... ],
          "field": "fieldName",
          "value": "valueName"
        }, ...
      ]
    }, ...
  ]
...}
```
`totals` lists the topic total (`weight`) of each distribution saved. Each total is identified (`id`) by the concatenation of the `fieldName` and `valueName`, if these were specified.
`distributions` lists all the distributions for that topic, recording the `field` and `value` used, and listing, for each unique value of `fieldName` (identified with `id`), the topic `weight` (limited to `topPerTopic` entries if not set to `-1`).
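When several distributions are stored in the same topic entry, a consumer has to pick the right one. The following minimal sketch shows one way to do that. The helper `find_distribution` and the sample values are hypothetical, and it assumes that the `totals` identifier joins the field and value names with a hyphen, as the `"fieldName-valueName"` placeholder in the structure above suggests.

```python
def find_distribution(topic, field, value=""):
    """Return (total, topWeights) for one distribution of a topic entry, or None."""
    # Assumption: the totals id joins field and value with "-", as in the placeholder above.
    total_id = f"{field}-{value}"
    total = next((t["weight"] for t in topic.get("totals", []) if t["id"] == total_id), None)
    for dist in topic.get("distributions", []):
        if dist.get("field") == field and dist.get("value", "") == value:
            return total, dist["topWeights"]
    return None

# Made-up topic entry with a single distribution over authors, weighted by money.
topic = {
    "topicId": "0",
    "totals": [{"weight": 247.0, "id": "author-money"}],
    "distributions": [
        {"topWeights": [{"weight": 59.0, "id": "Name1"}], "field": "author", "value": "money"}
    ],
}

print(find_distribution(topic, "author", "money"))
print(find_distribution(topic, "institution"))
```

The second call returns `None` because no distribution over `institution` was saved for this topic.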