Project Parameters - Strategic-Futures-Lab/Topic_Mapping_Pipeline GitHub Wiki
To run the pipeline, you must first configure it in a dedicated YAML file, which we refer to as the project file in this documentation.
This file is divided into three sections:
- The project parameters
- The run sequence for modules
- The modules' configurations
For example:
# Project parameters
project:
  ...
# module sequence
run:
  ...
# modules
first_module:
  ...
second_module:
  ...
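As a rough illustration of this layout, the sketch below splits an already-parsed project file into its three sections. It assumes the YAML has been loaded into a plain Python dictionary by some YAML parser, and that every top-level key other than project and run is a module block. The function name split_project_file is hypothetical and is not part of the pipeline.

```python
def split_project_file(config):
    """Split a parsed project file into (project, run, modules).

    Assumption: any top-level key other than 'project' and 'run'
    is treated as a module configuration block.
    """
    project = config.get("project") or {}
    run = config.get("run") or []
    modules = {
        name: block
        for name, block in config.items()
        if name not in ("project", "run")
    }
    return project, run, modules


# Dictionary mirroring the example layout (contents are placeholders).
example = {
    "project": {"docFields": ["author", "title"]},
    "run": ["first_module", "second_module"],
    "first_module": {"type": "inputCSV"},
    "second_module": {"type": "buildText"},
}
project, run, modules = split_project_file(example)
print(sorted(modules))  # ['first_module', 'second_module']
```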
The project parameters define global settings that apply to all modules. They are written under the project heading.
Under the directories subheading, you can specify optional directories for the data files read and written by modules, so that those paths do not have to be repeated in every module.
Name | Description |
---|---|
project | Top-level directory for all project data files |
sources | Directory for any external input data files |
data | Directory for all intermediary data files generated by the pipeline |
output | Directory for all output files generated by the pipeline |
The sources, data and output directories are all relative to project, which is itself relative to the location of the pipeline's executable.
These parameters are optional. They complete the paths defined by each module:
path_to_pipeline + [project] + [sources | data | output] + filename
By default, each of them is set to "" (empty string).
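The path composition above can be sketched as follows. This is only an illustration of the described rule, not the pipeline's actual code: the function resolve_path, its arguments and the /opt/pipeline location are hypothetical, and the directories dictionary stands in for the parsed directories subheading.

```python
from pathlib import PurePosixPath


def resolve_path(filename, kind, directories=None, pipeline_dir="."):
    """Build pipeline_dir / project / <kind> / filename.

    kind is one of 'sources', 'data' or 'output'. Missing entries
    default to "" (empty string), which adds nothing to the path.
    """
    directories = directories or {}
    project = directories.get("project", "")
    sub = directories.get(kind, "")
    return PurePosixPath(pipeline_dir) / project / sub / filename


directories = {"project": "projects/my_portfolio", "sources": "src"}

# Hypothetical pipeline location, used only for this illustration.
print(resolve_path("input.csv", "sources", directories, "/opt/pipeline"))
# /opt/pipeline/projects/my_portfolio/src/input.csv

# With no directories configured, only the filename remains.
print(resolve_path("input.csv", "sources"))
# input.csv
```

Because empty segments are dropped when the path is joined, leaving a directory unset simply shortens the resulting path, which matches the empty-string default described above.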
Some modules export documents with data recorded in fields. While each module can define its own list of document fields (docFields), you may also set an optional common value to be applied globally.
With the docFields subheading, you can list the field names to export. Note that setting docFields in a module that uses this parameter overrides the list you set globally for the project.
# Project parameters
project:
  directories:
    project: projects/my_portfolio
    sources: src
    data: out/tmp
    output: out
  docFields: [ author, title, date ]
With the above configuration, and with all paths relative to the location of the pipeline's executable:
- all files will be read/saved under projects/my_portfolio/
- input data will be read from projects/my_portfolio/src/
- intermediary data will be saved in projects/my_portfolio/out/tmp/
- output data will be saved in projects/my_portfolio/out/
- when exporting documents, the fields author, title and date will be kept (unless this parameter is overridden at module level)
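The precedence between the project-level and module-level docFields can be sketched as below. The helper effective_doc_fields is hypothetical, and returning an empty list when neither level defines docFields is an assumption made for this illustration, since the fallback is not specified here.

```python
def effective_doc_fields(project, module):
    """Return the module's docFields if it sets them, else the project's.

    Assumption: an empty list is returned when neither level defines
    docFields; the pipeline's real fallback may differ.
    """
    if "docFields" in module:
        return module["docFields"]
    return project.get("docFields", [])


project = {"docFields": ["author", "title", "date"]}

# A module without its own docFields inherits the project-wide list.
print(effective_doc_fields(project, {"type": "inputCSV"}))
# ['author', 'title', 'date']

# A module that sets docFields replaces the project-wide list entirely.
print(effective_doc_fields(project, {"type": "inputCSV", "docFields": ["title"]}))
# ['title']
```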
The run sequence allows you to specify which modules to execute when you launch the pipeline. It simply requires you to list the module names under the run heading.
# module sequence
run:
- first_module
- second_module
# - third_module
- fourth_module
In the example above, when launched, the pipeline will first parse this YAML configuration file, and will then execute the modules named first_module, second_module and fourth_module, in that order.
third_module is skipped, since it is commented out in the run sequence. You can therefore use comments to quickly toggle modules on or off.
You can declare a configuration block for each module you wish to run. The block should be placed under a heading that is the module name (the same name used in the run sequence).
Each configuration block must indicate the module type, followed by the mandatory parameters for that module type. In addition to the module's own optional parameters, you can also add an optional run parameter. It defaults to true (meaning the module will run). When set to false, the module will not run, even if it is listed in the run sequence.
# modules
first_module:
  type: inputCSV
  ...
second_module:
  type: buildText
  run: false
  ...
In the example above, we have defined two modules, named first_module and second_module respectively. first_module is an Input CSV module, while second_module is a Build Text module. Because second_module has run set to false, it will not execute when the pipeline is launched.
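The interaction between the run sequence and the per-module run parameter can be sketched as follows. The function modules_to_execute is hypothetical, and the inputs are plain Python structures standing in for the parsed YAML. How the real pipeline reacts to a listed module that has no configuration block is not stated here; raising an error is an assumption of this sketch.

```python
def modules_to_execute(run_sequence, modules):
    """Return the module names that would run, in run-sequence order.

    A module runs only if it is listed in the run sequence and its
    optional 'run' parameter is not false (it defaults to true).
    """
    selected = []
    for name in run_sequence:
        if name not in modules:
            # Assumption: a listed module without a block is an error.
            raise KeyError(f"no configuration block for module '{name}'")
        if modules[name].get("run", True):
            selected.append(name)
    return selected


run_sequence = ["first_module", "second_module"]
modules = {
    "first_module": {"type": "inputCSV"},
    "second_module": {"type": "buildText", "run": False},
}
print(modules_to_execute(run_sequence, modules))  # ['first_module']
```

A module that is commented out of the run sequence never appears in run_sequence after parsing, so it is skipped without needing run set to false.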