# General and Development Notes
This page describes general aspects of an EXIOBASE installation: the directory layout, design patterns, etc.
## Directory Layout
At the root of the installation there are two files:

- `./exiobase_install_notice.txt`: The install notice file created by bootstrap. This is primarily used by bootstrap to know that a directory is an Exiobase installation.
- `./exiobase_install_hmac.txt`: An HMAC of the install notice. A plain hash would have kept things simpler but, meh, it's fine and no point changing now.
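For orientation, the check presumably amounts to something like the sketch below. This is not bootstrap's actual code: the key handling and the digest algorithm (sha256) are assumptions, since neither is described on this page.

```python
import hashlib
import hmac
from pathlib import Path

def looks_like_exiobase_install(root: Path, key: bytes) -> bool:
    # Assumption: key provenance and sha256 digest are illustrative only.
    notice = (root / "exiobase_install_notice.txt").read_bytes()
    stored = (root / "exiobase_install_hmac.txt").read_text().strip()
    expected = hmac.new(key, notice, hashlib.sha256).hexdigest()
    # Constant-time comparison of the stored and recomputed HMACs.
    return hmac.compare_digest(expected, stored)
```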
The directory layout for an Exiobase installation is as follows:

- `./repo/`: The code repositories.
- `./var/`: All the data files, db, logs, etc., for the installation.
Under `repo`:

- `./repo/bootstrap/`: The installer/updater. This may not be the copy you originally used to install Exiobase, but it will update the installation (in this case, it'll discover the installed path automatically). Bootstrap must run under the `exio-bootstrap` conda environment and you must install this using the `environment.yaml` file in the bootstrap repo.
- `./repo/pipeline/`: The Snakemake pipeline. This should run under the `exio-snakemake` environment, which can be set up with the `environment.yaml` file in the repo.
- `./repo/datamanager/`: This does all the file handling, i.e. archive, activeset with fileregistry, downloading of files over HTTP, and provides the code framework on which `macrodb`, etc., is based. It runs under the `exio-datamanager` environment. At time of writing, you must install this yourself from the `environment.yaml` file in the repo, but `bootstrap` can update the environment for you.
- `./repo/macrodb/`: The macrodb stuff. At the moment, this is mainly converting some Excel spreadsheets we downloaded to Parquet format.
Under `var`:

- `./var/archive/`: Where the archive is stored. Do not try to modify this yourself; do it through datamanager.
- `./var/log/`: Where the logs are.
- `./var/log/dm/`: Where datamanager will dump logs if `setup_logging()` is executed. The files are split into syslog-style levels, with both human-readable `.log` files and `.json` files.
- `./var/meta/`: Where the sqlite3 database lives, in `datamanager_meta.db`.
- `./var/settings/`: Where "settings" are stored, in `datamanager_settings.yaml`. This isn't really being used; might deprecate.
- `./var/work/`: Where the working files are kept.
- `./var/work/activeset/`: The main working files. These need to be tracked by the file registry. The directory structure is also of a specific format; see the idea of a local relative filename and oid.
- `./var/work/ephemeral/`: Where "throw-away" files can be stored, for example if you're doing some processing you don't want or need to be tracked. The HTTP downloads originally end up here before being archived.
- `./var/work/log/`: This probably shouldn't be here; it was maybe an artefact of an earlier setup with Snakemake. Meh, leave it for a while to see if anything gets written there.
- `./var/work/queryres/`: The default output space for query results being written to file.
## Note re Annoyances When Deactivating a Run
Datamanager won't directly let you deactivate a run when either there are files in the file registry or there are files in activeset. Usually these should be the same entries but, if something has gone wrong, this might not be so. Anyway, this is designed to protect you! Datamanager would otherwise have to decide for itself what to do, i.e. usually delete files, which might not be what you want.

More typically, you've probably got some files in activeset that are in the registry but aren't archived, or some files that are already archived.
1. Archive-and-delete route:
   - i. If you are happy to archive all the files in activeset and delete them automatically, then you can run the `DeactivateRun` instruction with the parameter `run_archive_all_on_deactivate=True`. This won't deal with files in activeset that aren't in the file registry, but it will handle most other situations (assuming you want to archive those files, that is). For example, using the quick function: `dm.cmd.deactivate_run(run_archive_all_on_deactivate=True)` (see the sketch after this list).
   - ii. The same effect can be achieved by using the `ArchivePutAllFromActivesetInstruction` with the `archive_delete_original_after_archive=True` command_str parameter, then deactivating the run. With the quick functions, this is `dm.cmd.archive_put_all_from_activeset(archive_delete_original_after_archive=True)`.
2. Force-clean route:
   - i. If you want to just delete everything in the activeset and file registry without archiving, then you can run `DeactivateRun` with the `run_force_clean_on_deactivate=True` parameter set. This is typically a bad idea. With the quick commands this is `dm.cmd.deactivate_run(run_force_clean_on_deactivate=True)`.
   - ii. You can get the same effect by using `ForceCleanActivesetAndFileregistryInstruction`, ensuring the `force_clean_sure_flag=True` parameter is set (it's a safeguard), then using `DeactivateRun` as usual. With the quick commands this is just `dm.cmd.force_clean_activeset_and_fileregistry(force_clean_sure_flag=True)`.
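Putting the archive route together, the quick-function calls look like this. Note that `dm` is assumed to be an already-initialised datamanager handle; its construction isn't covered on this page.

```python
# dm: an initialised datamanager instance (assumed; construction not shown here).

# Route 1(i): archive everything in activeset, delete the originals, and
# deactivate the run in a single instruction.
dm.cmd.deactivate_run(run_archive_all_on_deactivate=True)

# Route 1(ii): the same effect in two explicit steps.
dm.cmd.archive_put_all_from_activeset(archive_delete_original_after_archive=True)
dm.cmd.deactivate_run()
```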
For more complicated situations (although the force clean will work with these anyway), you should really:

- If you have files in activeset that aren't registered, then you must either remove them or register them before being able to deactivate the run.
- If you have files in the file registry that aren't in activeset, then you need to delete them from the file registry.
## Note on Parallelism
Typically, you can supply a list of output files to many datamanager (or macrodb) commands; these are processed serially in one request. This is helpful when executing commands manually, but it's not ideal when running from Snakemake. In that case, Snakemake's options and directives should be used to spawn multiple instances of datamanager, macrodb, etc., so that the processing runs in parallel.
TODO: write up a proper example of parallelism in Snakemake. The sketch below gives the rough shape.
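As a hedged illustration only (the rule name, paths, and CLI entry point here are hypothetical, not the pipeline's actual ones), the idea is one job per output file so that Snakemake's scheduler provides the parallelism:

```
rule xlsx_to_parquet:
    input:
        "var/work/activeset/httpresource/{name}.xlsx"
    output:
        "var/work/activeset/parquetraw/{name}.parq"
    shell:
        # One macrodb/datamanager invocation per output file; hypothetical CLI.
        "python -m macrodb.convert --outfile {output} {input}"
```

Running `snakemake --cores 8` would then let up to eight such jobs execute concurrently, rather than handing the whole file list to a single serial datamanager call.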
This parallel use of datamanager does, however, come with some consequences. The main one is that sqlite has locking semantics for db writes: the first process to open a transaction with one or more writes will lock the db, and the lock is released when the transaction is either committed or rolled back. When opening a connection, a timeout option can be specified: if a process tries to write to a locked db, it'll wait for up to `timeout` seconds, and if the timeout is reached, an exception is raised. So, the timeout is set very long at 5 minutes (compared to the default of around 5 seconds). Unless a writing process actually stalls, this should be long enough for the pipeline to continue with processes waiting on each other's writes. It is NOT a good thing if one of these reaches the timeout: it'll cause the pipeline to fail at that point.
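The behaviour being relied on here is just sqlite3's connection timeout; roughly (datamanager's actual connection code may differ):

```python
import sqlite3

# A writer that cannot obtain the lock within `timeout` seconds raises
# sqlite3.OperationalError ("database is locked"). The stdlib default is
# 5 seconds; datamanager uses something like 5 minutes instead.
conn = sqlite3.connect("var/meta/datamanager_meta.db", timeout=300.0)
```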
Finally, there is one and only one aspect of datamanager that is parallelised at the thread level: the HTTP downloads. Datamanager takes advantage of curl via pycurl, which has a very high-performance threaded implementation, so the parallelism happens there. Only use one process in Snakemake for this and throw all the output files at it in one go!
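In Snakefile terms, that means aggregating every download target into a single rule rather than one rule instance per file. A hedged sketch (the rule name, the `HTTP_TARGETS` list, and the entry point are all hypothetical):

```
rule fetch_http_resources:
    output:
        # HTTP_TARGETS is an assumed list of download targets defined elsewhere.
        expand("var/work/activeset/httpresource/{name}", name=HTTP_TARGETS)
    shell:
        # A single datamanager process gets all the files; pycurl's threaded
        # implementation then parallelises the downloads internally.
        "python -m datamanager.download {output}"
```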
## Use of `df_file_interchange`

To "round-trip" files to disc, we use `df_file_interchange`. This can write and read files in either CSV or Parquet format and maintains an accompanying YAML meta file. Unlike the bare write/read functionality provided in Pandas, the interchange code ensures that dtypes and indices are faithfully reproduced (rather than merely enumerated).

See the df_file_interchange GitHub page for more information.
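A round trip along these lines looks roughly like the following; the function names are recalled from the df_file_interchange README and should be checked against the repo:

```python
import pandas as pd
import df_file_interchange as fi

df = pd.DataFrame({"a": [1, 2, 3], "b": [10.0, 20.0, 30.0]})

# Writing produces the data file plus an accompanying YAML metafile; reading
# back via the metafile restores dtypes and indices faithfully.
metafile = fi.write_df_to_file(df, "./example.parq")
df_reloaded, metainfo = fi.read_df(metafile)
```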
## Use of Pydantic

### Pydantic v2 Syntax and Our Usage
## `Oid`s / `local_rel_filename`s

### Notes re Particular Enums and Arrangement

In `datamanager.common.enums` we define most of the enumerated types.
The `ClassificationEnum` is used directly in the oid format and says what that file is for/doing, e.g. `httpresource`, `parquetraw`, `auxresource`. You'll need to add entries here as you go.
The `SectionSpecEnum` is used to denote roughly which "block" in the paper we're dealing with, e.g. `macrodb`. By convention, we also use this in the oid format as the first item in the "elements".
For example:

- `oid='file:httpresource:macrodb/un/sna-mainagg/gdp-usd-current-countries'` / `local_rel_filename='httpresource/macrodb/un/sna-mainagg/gdp-usd-current-countries.xlsx'`
- `oid='file:parquetraw:macrodb/un/sna-mainagg/gdp-usd-current-countries'` / `local_rel_filename='parquetraw/macrodb/un/sna-mainagg/gdp-usd-current-countries.parq'`
You can see the `ClassificationEnum` of `httpresource` in the second part of the first oid, and the `SectionSpecEnum` of `macrodb` as the first of the "elements".
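A sketch of the shape of this (the member names are inferred from the examples above; the real definitions live in `datamanager.common.enums` and will differ):

```python
from enum import Enum

class ClassificationEnum(Enum):
    HTTPRESOURCE = "httpresource"
    PARQUETRAW = "parquetraw"
    AUXRESOURCE = "auxresource"

class SectionSpecEnum(Enum):
    MACRODB = "macrodb"

# Assembling an oid in the format described above: classification as the
# second part, section spec as the first of the "elements".
oid = (
    f"file:{ClassificationEnum.HTTPRESOURCE.value}:"
    f"{SectionSpecEnum.MACRODB.value}/un/sna-mainagg/gdp-usd-current-countries"
)
print(oid)  # file:httpresource:macrodb/un/sna-mainagg/gdp-usd-current-countries
```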
For information regarding `Oid`s and use of `local_rel_filename`, see the relevant section in the main datamanager docs.
## Datamanager Design Patterns and Extending Functionality
### Number Types: Storage, Processing, and Calculations
There are several issues surrounding the processing and storage of numbers that need forethought and care.

- In some of the data sources, we're faced with real numbers measuring the same quantity but which are appreciably different in scale, e.g. the UN macro data. In 2021, the United States has a "Total Value Added" value of `2.3e13`, whereas there are per-sector entries for other countries around `1e7`.
- The accuracy of stated values is not obvious. In some cases, it's clear that the number of "significant digits" is less than the number of digits used or available in the decimal representation. Using the previous example, the US figures from the UN macro data appear to be rounded so as to lose the last six or even nine decimal places. This makes sense, but it does raise a few questions when some other countries' values are, by comparison, a rounding error.
- Most of the data sources available were produced by other agencies after processing using statistical packages or Excel. These typically use extended-precision arithmetic (80 bits) and double-precision storage (64 bits), e.g. see the Excel doc page. So we cannot expect our data sources to have any better than 15 significant decimal digits of accuracy. This is usually not an issue, but it's worth noting, especially in the context of the disparity in scales.
When using Frictionless to import data, it will typically store numeric values as a [`Decimal`](https://docs.python.org/3/library/decimal.html) or possibly an `int`. The `Decimal` type has the advantage that it can represent a real value and perform calculations to arbitrary (but specified) precision until possible rounding at the end of the calculations, albeit with an associated performance penalty. It is likely that the default rounding semantics will not match standard IEEE 754-1985 double-precision arithmetic.
It may be sensible to perform preliminary calculations using `Decimal`, with suitably high precision, until a necessary cast to `float` is imposed. This is almost unavoidable if any numerical linear algebra or optimisation has to be performed, since all the usual software operates with double-precision floats (this is the point at which one must think carefully about the implications of disparate scales).
**Warning**: Converting a Frictionless resource to Pandas will cast the `Decimal`s to `float`s.
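A minimal illustration of the difference (the values here are made up, not taken from the UN data):

```python
from decimal import Decimal, getcontext

getcontext().prec = 50  # suitably high working precision

total = Decimal("23000000000000.05")   # ~2.3e13, carrying low-order digits
sector = Decimal("0.01")

print(total + sector)                # exact: 23000000000000.06
print(float(total) + float(sector))  # float64 only carries ~15-16 significant
                                     # decimal digits, so the low-order digits
                                     # are subject to binary rounding
```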
## Logging Strategy
For each of the rules in Snakemake, the Python (or other) code must do its own logging.
Loguru is used to do the logging in Python. À la syslog, the primary sinks are `error`, `warning`, `info`, and `debug`, with standard meanings. In addition, the `error` and `warning` levels are logged to console.
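A hedged sketch of what a `setup_logging()` along these lines might look like; this is not the actual datamanager implementation:

```python
import sys
from loguru import logger

def setup_logging(log_dir: str = "var/log/dm") -> None:
    logger.remove()  # drop loguru's default stderr sink
    for level in ("ERROR", "WARNING", "INFO", "DEBUG"):
        name = level.lower()
        # One .log and one .json (serialize=True) file per level; the filter
        # keeps each sink to exactly its own level, syslog style.
        per_level = lambda record, lvl=level: record["level"].name == lvl
        logger.add(f"{log_dir}/{name}.log", level=level, filter=per_level)
        logger.add(f"{log_dir}/{name}.json", level=level, filter=per_level,
                   serialize=True)
    # error and warning also go to console.
    logger.add(sys.stderr, level="WARNING")
```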
## Naming Convention(s)
XXXX TODO