Project Parameters - Strategic-Futures-Lab/Topic_Mapping_Pipeline GitHub Wiki

To run the pipeline, you must first configure it in a dedicated YAML file, which we refer to as the project file in this documentation.

This file is divided into three sections:

  1. The project parameters
  2. The run sequence for modules
  3. The modules' configurations

For example:

# Project parameters
project:
  ...

# module sequence
run:
  ...

# modules
first_module:
  ...

second_module:
  ...

Project Parameters

These settings define global configuration that applies to all modules. They are written under the project heading.

Directories

Under the directories subheading, you can specify optional directories for data files read and written by modules (to avoid re-writing those paths many times).

Name      Description
project   Top-level directory for all project data files
sources   Directory for any external input data files
data      Directory for all intermediary data files generated by the pipeline
output    Directory for all output files generated by the pipeline

The sources, data and output directories are all relative to project, which is itself relative to the location of the pipeline's executable.

These parameters are optional. When set, they are prepended to the file paths defined by each module:

path_to_pipeline + [project] + [ sources | data | output ] + filename

By default, each of these directories is set to "" (the empty string), so it adds nothing to the path.
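The composition rule above can be sketched in a few lines of Python. This is only an illustration of the documented formula, not the pipeline's actual code: the function name resolve_path, its arguments, and the /opt/pipeline location are all hypothetical.

```python
from pathlib import PurePosixPath


def resolve_path(pipeline_dir, directories, kind, filename):
    """Build pipeline_dir + [project] + [sources | data | output] + filename.

    `directories` mirrors the optional `directories` block of the project
    file. Missing entries fall back to "" and contribute nothing.
    (Hypothetical helper, for illustration only.)
    """
    project = directories.get("project", "")
    sub_dir = directories.get(kind, "")
    # Empty segments are ignored when the path is joined.
    return str(PurePosixPath(pipeline_dir, project, sub_dir, filename))


# Assumed pipeline location, chosen only for the example.
dirs = {"project": "projects/my_portfolio", "sources": "src"}
with_dirs = resolve_path("/opt/pipeline", dirs, "sources", "docs.csv")
without_dirs = resolve_path("/opt/pipeline", {}, "sources", "docs.csv")
```

With the directories set, the file resolves under the project's sources directory. With none set, it resolves directly next to the pipeline's executable.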

Document Fields

Some modules export documents with data recorded in fields. While each module can define its own list of document fields (docFields), you may also set an optional common value to be applied globally.

Under the docFields subheading, you can list the names of the fields to export. Note that setting docFields on a module that uses this parameter overrides the list you set globally for the project.

Example

# Project parameters
project:
  directories:
    project: projects/my_portfolio
    sources: src
    data: out/tmp
    output: out
  docFields: [ author, title, date ]

With the above configuration (all paths are relative to the location of the pipeline's executable):

  • all files will be read from or saved under projects/my_portfolio/
  • input data will be read from projects/my_portfolio/src/
  • temporary data will be saved in projects/my_portfolio/out/tmp/
  • output data will be saved in projects/my_portfolio/out/
  • when exporting documents, the fields author, title and date will be kept (unless this parameter is overridden at module level)
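The precedence between project-level and module-level docFields can be sketched as follows. This is a minimal Python illustration of the rule described in this section, not the pipeline's implementation. The function name effective_doc_fields is hypothetical, and falling back to an empty list when neither level sets docFields is an assumption.

```python
def effective_doc_fields(project_cfg, module_cfg):
    """Return the docFields that apply to one module.

    A module-level list replaces the project-level list entirely;
    the two lists are not merged. (Hypothetical helper.)
    """
    if "docFields" in module_cfg:
        return module_cfg["docFields"]
    # Assumption: no docFields anywhere means no fields are exported.
    return project_cfg.get("docFields", [])


project_cfg = {"docFields": ["author", "title", "date"]}
inherited = effective_doc_fields(project_cfg, {"type": "inputCSV"})
overridden = effective_doc_fields(project_cfg, {"docFields": ["title"]})
```

Here inherited keeps the three project-wide fields, while overridden contains only title.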

Run Sequence

The run sequence allows you to specify which modules to execute when you launch the pipeline. It simply requires you to list the names of the modules under the run heading.

Example

# module sequence
run:
  - first_module
  - second_module
  # - third_module
  - fourth_module

In the example above, when launched, the pipeline will first parse this YAML configuration file, and will then execute the modules named first_module, second_module and fourth_module.

third_module is skipped, since it is commented out in the run sequence. You can therefore use comments to quickly toggle modules on or off.

Module Parameters

You can declare a configuration block for each module you wish to run. The block should be placed under a heading matching the module name (the same name used in the run sequence).

Along with the module name, each configuration block must indicate the module type, followed by the mandatory parameters for that module type. In addition to the module's own optional parameters, you can add an optional run parameter. It defaults to true (meaning the module will run). When set to false, the module will not run, even if it is listed in the run sequence.

Example

# modules
first_module:
  type: inputCSV
  ... 

second_module:
  type: buildText
  run: false
  ...

In the example above, we have defined two modules, named first_module and second_module respectively. first_module is an Input CSV module, while second_module is a Build Text module. Because second_module has run set to false, it will not execute when the pipeline is launched.
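The interaction between the run sequence and the per-module run parameter can be sketched as follows. This is a minimal Python illustration of the rules described on this page, not the pipeline's own code. The function name modules_to_run is hypothetical, the configuration is assumed to be already parsed into a dictionary, and every name in the run sequence is assumed to have a configuration block.

```python
def modules_to_run(config):
    """Return the module names that would execute, in order.

    A module executes only if it is listed under `run` and its own
    `run` parameter is not false (it defaults to true).
    (Hypothetical helper, for illustration only.)
    """
    selected = []
    for name in config.get("run", []):
        module_cfg = config[name]  # assumes a block exists for each name
        if module_cfg.get("run", True):
            selected.append(name)
    return selected


# Dictionary mirroring the examples on this page; a module commented out
# of the run sequence never reaches the parsed configuration.
config = {
    "run": ["first_module", "second_module"],
    "first_module": {"type": "inputCSV"},
    "second_module": {"type": "buildText", "run": False},
}
executed = modules_to_run(config)
```

With this configuration only first_module is selected, because second_module has run set to false.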
