Development Principles - EXIOBASE/docs GitHub Wiki
Some Thoughts
Splitting of Code / Data; Version Control
All the data in this version of EXIOBASE is held separately from the code.
The code is maintained in a number of GitHub repositories, the most important being EXIOBASE/datamanager, EXIOBASE/macrodb, and EXIOBASE/pipeline. Documentation is available in EXIOBASE/docs. We also use a package, df_file_interchange, which was developed by us but is hosted on the NTNU-IndEcol GitHub.
Data retrieval, storage, and archival are managed solely by the datamanager.
FAIR Data Handling and Reproducibility
As part of a move towards more open science, aspects such as open data and reproducibility are coming to the fore, e.g. see NTNU's Policy for Open Science and the guidance on NTNU's research data pages.
TODO: note on data management plan?
Commonly used and sensible data handling principles are described in FAIR.
Reproducibility is a more onerous requirement and is achieved through a combination of mechanisms.
- A typical run of the pipeline comprises a series of operations. These are associated with a common runid. This allows the user to retrieve the results, including intermediate calculations, of an old run at a later date. It will also make it possible to compare two different runs and suchlike.
- The datamanager has an archive facility which handles permanent storage of external data and intermediate results for each run. This means that even several years after the initial processing of a run, it would be possible to look back and retrieve the original files (assuming sufficient storage and backup facilities are available).
- The actual code that processes a local relative file / oid may change over time as the format of the file changes, e.g. a spreadsheet obtained from Taiwan's statistical authorities is likely to change frequently. So old code has to be retained. The processing code used for an instruction within a given run has to be recorded by noting its version, which is in the format YYYYMMDDXXX. This is handled by specifications looking at the version in the resource definitions it has.
- Finally, the package versions in a conda environment will change over time. This could, in turn, change the output, e.g. in how a file such as Parquet is encoded. So there is a mechanism to record this and check if the environment is as it was when the run was created. See the conda section in the datamanager page for more info.
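To make the YYYYMMDDXXX version format from the third point more concrete, here is a minimal Python sketch of how such a version string could be validated and compared. It is an illustration only, not the datamanager's implementation: the names ResourceVersion, parse_version and latest_version are hypothetical, and it assumes that XXX is a three-digit counter within the day, which the text above does not state.

```python
from datetime import date, datetime
from typing import NamedTuple


class ResourceVersion(NamedTuple):
    """Hypothetical parsed form of a YYYYMMDDXXX version string."""

    day: date  # the YYYYMMDD part
    seq: int   # ASSUMPTION: XXX is a three-digit counter within the day


def parse_version(text: str) -> ResourceVersion:
    """Validate a YYYYMMDDXXX string and split it into date and counter."""
    if len(text) != 11 or not (text.isascii() and text.isdigit()):
        raise ValueError(f"expected 11 digits (YYYYMMDDXXX), got {text!r}")
    # strptime raises ValueError for impossible dates such as 20240230
    day = datetime.strptime(text[:8], "%Y%m%d").date()
    return ResourceVersion(day, int(text[8:]))


def latest_version(versions: list[str]) -> str:
    """Return the newest version string, ordered by date and then counter."""
    return max(versions, key=parse_version)
```

With a scheme like this, a run only needs to store the version string next to each instruction, and the matching (possibly old) processing code can be looked up by that string at a later date.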
There is more comprehensive technical information on both these aspects in the datamanager page.