
Datamanager

Interfacing with Datamanager

The canonical way to interface with datamanager is the ExtArgs ("external arguments") paradigm: one creates an ExtArgs object, which specifies what command to execute and with which parameters, and passes it to the entry_point() function. Snakemake and the CLI also implicitly use this interface. The script: syntax in Snakemake serializes the parameter objects and passes them to the target script; there's dedicated code in datamanager that intercepts this and creates the relevant ExtArgs object. Similarly, there's code in datamanager that parses any commandline options into an ExtArgs object. This keeps everything consistent. However, for ease of programmatic use, there are wrapper functions that make executing instructions easy: they set up the required ExtArgs and call the corresponding instruction.

The ExtArgs class is defined in ./local/ext_args.py and is a Pydantic model. This means the fields are type-checked upon instantiation. Pydantic can also coerce types and instantiate models from dictionaries or JSON. Two other classes are instantiated as fields within ExtArgs: ExtArgsCommandStr ("command string") and ExtArgsQueryStr ("query string").

After importing datamanager, a caller can send multiple ExtArgs to entry_point() in sequence. Notably, entry_point() opens a unique sqlite connection and closes it before returning. This makes it less likely that other processes, possibly also executing datamanager calls, get hit with a locked database. Technical detail: the connection timeout is currently set to 5 minutes (see sqlite_context.py); this is perhaps excessive but is intended to avoid potential failure scenarios in an unattended pipeline.
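
A minimal sketch of this paradigm is shown below. The exact field names (runid, runtype, run_owner, simple_command, command_str) are assumptions taken from the CLI examples later on this page; check ./local/ext_args.py for the authoritative model definition.

```python
# Sketch only: field names are assumptions, not copied from ext_args.py.
import datamanager as dm

# Build the command-specific sub-options ("command string").
cmd_str = dm.ExtArgsCommandStr(simple_command="create-run")

# Build the top-level ExtArgs; Pydantic type-checks the fields on instantiation.
ext_args = dm.ExtArgs(
    runid="dev-1",
    runtype="dev",
    run_owner="test",
    command_str=cmd_str,
)

# entry_point() opens its own sqlite connection, executes the instruction,
# and closes the connection again before returning.
dm.entry_point(ext_args)
```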

For each ExtArgs there has to be a "simple-command" specified, which tells datamanager what we want done. A list of these commands can be obtained from datamanager itself; see the tutorial for how to do this. There is a special command, "query", that allows the user to query the internal state of datamanager, e.g. get information on runs, files in the archive, etc. Again, a list of all possible queries can be obtained from datamanager and there's an example usage of each one in the reference in the tutorials directory, see below.

There are examples of how to call datamanager from Python using ExtArgs in repo/datamanager/tutorials/tutorial_datamanager_by_extargs.ipynb, and an example of each of the possible queries in repo/datamanager/tutorials/reference_all_queries.ipynb.

Wrapper Functions

If importing datamanager directly in Python, one can optionally use the wrapper functions instead. These are one-liners: the wrapper functions set up the ExtArgs and only expose those arguments which are necessary. The functions are aliased under dm.cmd and dm.qry. See tutorials/tutorial_short_wrapper_fns.ipynb for a short example of usage.
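
A minimal sketch of what wrapper usage might look like follows. The specific function names and signatures here (create_run, get_auxresources_filenames) are hypothetical illustrations derived from the command names shown elsewhere on this page; see the tutorial notebook for the real calls.

```python
# Hypothetical wrapper names -- the actual functions live under dm.cmd and dm.qry.
import datamanager as dm

# A command wrapper builds the ExtArgs internally and calls entry_point().
dm.cmd.create_run(runid="dev-1", runtype="dev", run_owner="test")

# A query wrapper does the same for a query-command and returns the result.
filenames = dm.qry.get_auxresources_filenames()
```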

Commandline

The CLI works in a similar manner to git. The format is roughly:

python -m datamanager [main switches] {cmd, qry} [switches for cmd or qry]

Note that the main switches have to come before cmd or qry. This seems to be by design in Python's argparse.

For example:

python -m datamanager --runid="dev-1" --runtype=dev --run-owner=test cmd --simple-command=create-run

A query can be done similarly:

python -m datamanager qry --query-command="get-auxresources-filenames"

Overview of Concepts

Runid

When working with EXIOBASE, each operation must be performed with an associated runid. This allows precise control and isolation over different release runs and development work. For example, older runs can be recovered in their entirety to check details. In principle at least, it would also be possible to compare runs in a stepwise manner. There are two runtypes.

  • Release: These are the formal releases and should be properly versioned.

  • Development: These are for use in development.

Release runids and associated data must be retained. Development runids and the associated data may be intermittently deleted. Obviously, if some data is attached to different runids then retention is based on the strongest requirement.

At any one time, at most one runid can be active; this is managed by a call to datamanager. Runs can be "activated" or "deactivated". Before activating or deactivating a run, datamanager will do various checks on activeset, the file registry information, etc., to ensure a clean environment and that no data is inadvertently lost. One can lose data deliberately but, if I've done things correctly, not so easily by accident.

A runid may be finalised, in which case it cannot be modified.

File Spaces

The datamanager operates with three file spaces.

  • Active Set: This is considered the authoritative file set for the currently active run. Datamanager can pull files from the archive into the activeset upon request, put files into the archive from the activeset, and other processes can create files in the activeset. The files in the activeset are tracked, and additional information is associated with each file. So, if another process creates files in the activeset then they must also be added to the file registry.

  • Ephemeral: Files in here are to be considered completely transient and are typically not tracked. It's used as a space for work that may create files we never want to archive but need to work with temporarily. Files in this area should generally be deleted soon after use.

  • Archive: This is the primary long-term storage. No work gets done here. Multiple revisions of the same file may be stored in the archive; datamanager keeps track of the revisions and which runids they belong to. The user never accesses these directly, only via requests to datamanager.

The general idea is that files in the activeset should be mirrored in the archive but that the activeset contains the "working files".

Sometimes extra processing needs to be done with files and not all of that needs to end up in the archive. These should generally go into ephemeral, and then have datamanager archive the subset of files that need to be retained.

A suitable example of this arrangement is downloading files via HTTP. Each run of EXIOBASE may use the same "file" but it could have changed on the remote server. To handle this consistently, datamanager downloads files into ephemeral first. After that, an archive operation is initiated which calculates the file hash and then updates the archive accordingly; the file is then deleted from ephemeral. If the user wants to work with these files, they request datamanager pull them from the archive into the active set.

The datamanager archive code is efficient in the sense that it only stores one copy of each revision of a file. A revision is identified by the file's hash: a different hash means a different revision. However, this alone is incomplete: the same file may be archived many times, even if the revision changes only a few times. These "savepoints" are also tracked. Therefore, there is metadata to keep track of the file descriptions, revisions, and savepoints.

There is the concept of an oid and a local_rel_filename, described below. These identify one file or object in datamanager, but there can be several revisions of a file with the same oid. The local relative filename is used to identify files: one just prefixes the path for activeset or ephemeral to get the absolute filename; a similar, albeit slightly different, approach is used in the archive.

oid, local_rel_filename, LocalFileRelaxed, and LocalFileComplete

Oids

Each file or object in EXIOBASE has an associated oid ("object identifier"). This is to keep track of files/objects internally. For a given oid, there may be many revisions of that file/object: the oid refers to the file/object, not its contents. For example, if a spreadsheet is obtained from Taiwan's National Statistics then the formatting and contents may change with time but we need a consistent way to identify that file. The only way the oid changes is if the description of the data stored in the file changes, e.g. if a sheet in a spreadsheet is moved to a different file; this last circumstance would also change the graph dependencies in the pipeline (this is quite important).

Note: If a new file is being created by datamanager or otherwise then it needs its own oid/local_rel_filename. Without this, the Snakemake pipeline will fail because it uses the filenames to match vertices in its DAG. This is also the reason the local_rel_filename is formalised in the way it is: so that we have a consistent naming scheme to ensure Snakemake works as intended.

The format of an oid is: oid_type:classification:elements. The oid_type is either file or object. The classification is what that file or object is doing, e.g. httpresource or parquetraw, and must be of the ClassificationEnum enumerated type. The elements depend on the oid_type and (potentially) the classification but should identify the file or object with forward-slash separated ids (an id is lowercase alphanumeric and hyphens). The elements for a file are typically the representation of the local filename without the eventual file extension (this was a design decision to gloss over minor file extension changes). For example, file:httpresource:macrodb/un/sna-mainagg/gdp-usd-current-countries is a file, obtained from an HTTP resource, in the macro section, with the UN as provider.

There is an Oid class which, when instantiated, represents an oid and does the necessary format checking. An Oid object exposes some methods which make working with oids easier.
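
A short sketch of working with the Oid class, using the example oid above; the constructor signature and the helper methods are assumptions here, so see common/definitions.py for the actual interface.

```python
# Sketch only: constructor/helper details are assumptions.
import datamanager as dm

oid = dm.Oid("file:httpresource:macrodb/un/sna-mainagg/gdp-usd-current-countries")

# Instantiation performs the format checks: oid_type must be "file" or "object",
# the classification must be a member of ClassificationEnum, and the elements
# must be forward-slash separated lowercase alphanumeric/hyphen ids.
```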

For further discussion on convention for the format of the oid, examples, and also of ClassificationEnum and SectionEnum, please see the section in the general development notes.

local_rel_filename(s), LocalFileRelaxed, and LocalFileComplete

For a file oid, one can associate a local_rel_filename (local relative filename). Continuing the previous example: httpresource/macrodb/un/sna-mainagg/gdp-usd-current-countries.xlsx. This relative path can then be prefixed by either the activeset or ephemeral base path. Similar semantics are used when storing in the archive, but there the filename is made unique.

To create an association between the local_rel_filename and an oid, one must know the file extension; the LocalFileRelaxed class will guess if it's not supplied (greedily, so it assumes everything after the first period is the extension). Again, we don't store the extension in the oid because the file supplier may change the extension while the file remains the same entity (yes, this has already happened and was the reason for this approach). One must also know some other things about each file; for example, we need to know if a file has an associated YAML metainfo file. This is formalised in LocalFileRelaxed (some optional fields) and LocalFileComplete (the same but with all fields specified); LocalFileComplete inherits from LocalFileRelaxed. This allows datamanager to associate the required additional information with each file. The fields in a LocalFile object should be accessed via the exposed methods.

There are two convenience functions in definitions.py: get_local_rel_filename_from_oid() and get_oid_from_local_rel_filename(). The latter will guess the extension if it's not supplied; this can be useful in certain edge cases but is probably best avoided. These functions are aliased into dm.
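
A sketch of the two convenience functions follows. The exact signatures (in particular, how the extension is supplied) are assumptions; see common/definitions.py.

```python
# Sketch only: argument names are assumptions.
import datamanager as dm

oid = dm.Oid("file:httpresource:macrodb/un/sna-mainagg/gdp-usd-current-countries")

# Going oid -> local relative filename requires knowing the extension,
# since the extension is deliberately not stored in the oid.
rel_filename = dm.get_local_rel_filename_from_oid(oid, extension="xlsx")
# e.g. httpresource/macrodb/un/sna-mainagg/gdp-usd-current-countries.xlsx

# Going the other way, the extension is guessed (greedily) if not supplied.
oid_again = dm.get_oid_from_local_rel_filename(rel_filename)
```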

A LocalFile contains the following attributes (a construction sketch follows the list):

  • oid (an Oid object), the oid for the file;

  • local_rel_filename (Path object), the filename with path relative to activeset/ephemeral;

  • extension (str), the file extension;

  • file_area (FileAreaEnum), the file area (activeset or ephemeral) the file resides in;

  • hash (str), the sha256 hash of the file;

  • meta_filename (Path or None), if the file has a metainfo file then the name, without any path prefix, of that file;

  • comment (str), description of the file;

  • version (int), specific integer representation of the file version; and,

  • origin (str), what created the file; this should be auto-defined, cf. the resource definitions for Http and ExtRawToParquet according to the required format (I'm not quite sure how I did this but it should be defined by the resource definitions).
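
The sketch below constructs a LocalFileComplete directly from the fields listed above. Whether the constructor accepts exactly these keyword arguments, and the FileAreaEnum member name, are assumptions; in practice these objects are usually built by datamanager itself.

```python
# Sketch only: keyword names and enum member are assumptions.
import datamanager as dm
from pathlib import Path

local_file = dm.LocalFileComplete(
    oid=dm.Oid("file:httpresource:macrodb/un/sna-mainagg/gdp-usd-current-countries"),
    local_rel_filename=Path("httpresource/macrodb/un/sna-mainagg/gdp-usd-current-countries.xlsx"),
    extension="xlsx",
    file_area=dm.en.FileAreaEnum.ACTIVESET,   # enum member name is an assumption
    hash="<sha256 hex digest of the file>",
    meta_filename=None,                       # no associated YAML metainfo file
    comment="GDP in current USD, per country",
    version=20240705001,
    origin="HttpDownloadResource",            # normally auto-defined by the resource definition
)
```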

Version

The version field of the LocalFile is an integer but it must be in the following format: YYYYMMDDXXX, where YYYY is the year, MM is the month, DD is the day, and XXX is a straightforward counter that starts from 001 (a maximum of 999 different versions is possible for the same file on the same day). The reason for this formatting choice is that it allows easy inequality comparisons. An example version might be 20240705006, which means the resource definition was created on the 5th July 2024 and is the 6th increment of that day. You can have multiple resource definitions for the same oid but with different versions; there's logic in the specifications that can pick either the latest version or a user-specified version when requesting resources.
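
A minimal sketch of building and comparing version integers in the YYYYMMDDXXX format described above (the helper function is illustrative, not part of datamanager):

```python
from datetime import date

def make_version(d: date, counter: int) -> int:
    # counter runs from 1 to 999 within a single day
    return int(f"{d:%Y%m%d}{counter:03d}")

v_a = make_version(date(2024, 7, 5), 6)   # 20240705006
v_b = make_version(date(2024, 7, 5), 7)   # 20240705007

# Because of the fixed-width layout, plain integer comparison orders versions
# chronologically and then by the within-day counter.
assert v_b > v_a
```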

The version indicates the version of the associated resource definition and NOT the version of the file per se. The latter would be almost impossible to formalise and we don't need it anyway. So, again, the version is the version of the specification/resource that created the file, e.g. the version of code+spec that processed the file from a raw HTTP download into a Parquet file. This may seem weird but it makes more sense with some thought (consider each processing step that converts one file to another as an edge in the graph; then it's in effect each edge that we're versioning).

Archive

The archive is where files are stored. They can be retrieved into the activeset with a request to datamanager. For each oid, there may be several versions of that file. The archiver is smart enough to only store one copy of each version, which is determined by the sha256 hash, even if it's used by several runs. Associated metadata allows the archiver to store information about savepoints in connection with different runids.

For more information, see the archive implementation details further down this page.

To recover local file space, an export-import mechanism will likely be implemented in the future so that old runs can be backed up to remote storage and then removed locally.

File Registry

TODO expand this. Mention validation and that it'll check hashes of files also.

As noted in the LocalFile discussion, datamanager needs some additional information for each file. So, if an external process creates files in the activeset then we need to be able to inform datamanager of this information. Similarly, if datamanager itself retrieves files, e.g. from aux_resources, then it has to know what this extra information is.

This is managed via the FileregistryInfoCtrl class. Any file in activeset must have a fileregistry_info entry, which translates to a record in the database. Similarly, any file being removed from activeset must then remove the corresponding entry from fileregistry_info.

There are commands/instructions that do this properly (a sketch of calling one follows the list):

  • RegisterFileregistryInfoFilesInstruction (simple-command register-fileregistry-files);

  • DeregisterFileregistryInfoFilesInstruction (simple-command deregister-fileregistry-files); and,

  • DeregisterAllFileregistryInfoFilesInstruction (simple-command deregister-fileregistry-files-all).
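
A hedged sketch of registering externally-created activeset files via ExtArgs. The simple-command name comes from the list above, but the field used to pass the file list (local_rel_filenames) and the omission of any run-related fields are assumptions, so treat this as illustrative only.

```python
# Sketch only: field names other than simple_command are assumptions.
import datamanager as dm

ext_args = dm.ExtArgs(
    command_str=dm.ExtArgsCommandStr(
        simple_command="register-fileregistry-files",
        local_rel_filenames=[
            "httpresource/macrodb/un/sna-mainagg/gdp-usd-current-countries.xlsx",
        ],
    ),
)
dm.entry_point(ext_args)
```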

ExtArgs

As mentioned earlier, the ExtArgs class, when instantiated, determines what instruction should be executed and all relevant options. ExtArgs is a Pydantic model and has two fields, of types ExtArgsCommandStr and ExtArgsQueryStr, which store the sub-options relevant to commands and queries respectively.

These can be constructed using the usual object construction syntax in Python, or be constructed from JSON, etc. See XXXXXXXX TODO for practical examples.

In exec_control/external.py, there is code that can construct an ExtArgs object from commandline arguments or from the parameters provided by Snakemake. These are called from __main__.py to handle CLI or Snakemake requests, so you don't have to worry about this. What you do need to know is how to actually use this functionality. XXXX TODO.

Transactions

Each command/instruction is recorded in the transactions table(s) in the database along with the files it modifies. External instructions should use the corresponding command in datamanager to record their transactions too (TODO!).

This allows later analysis of how a run has proceeded, which files were modified, etc. Through the database, it can also be linked to the other information stored on runs, archive, etc. TODO: need to write queries for this.

Auxiliary Resources

There are some files or data that we do not obtain from third parties but that shouldn't be stored with code, e.g. the "concordance / correspondence matrices". These are stored in the aux_resources repository along with a schema, which is read by datamanager.

Use of Parquet (df_file_interchange)

We don't use Parquet in datamanager directly but datamanager DOES have functionality to track an associated YAML meta file for each file it handles. The other repos, like macrodb, make use of this functionality and do store data in Parquet format. This is, however, handled by the df_file_interchange package. See the notes on usage here and the df_file_interchange GitHub page.

Note that df_file_interchange is installed as a pip package from PyPi in the exio-datamanager conda environment.

Datamanager Architecture

The design is slightly convoluted but it is nevertheless systematic and is a reasonable compromise to handle the requirements, e.g. accommodating Snakemake's script serialization while also presenting a CLI and a reasonable function interface. The idea of an oid and local_rel_filename is also intended to ensure internal consistency so that Snakemake works as expected.

An overview is provided in the diagrams and explanations that follow.

Diagrams

Interface and Dispatch

https://github.com/EXIOBASE/docs/blob/main/diagrams_net/images/dm_overview_interface_and_dispatch.png

Instructions and Specifications

https://github.com/EXIOBASE/docs/blob/main/diagrams_net/images/dm_overview_instructions_specifications.png

Meta (internal sqlite db)

https://github.com/EXIOBASE/docs/blob/main/diagrams_net/images/dm_overview_meta.png

Archiver and File Registry

https://github.com/EXIOBASE/docs/blob/main/diagrams_net/images/dm_overview_archiver_and_registry.png

Meta ./meta/

The code in this directory is low-level code that handles sqlite database queries. This is organised into separate files with each containing a class corresponding to a certain theme, e.g. ./meta/archive.py handles queries related to the archive.

These classes derive from the BaseMeta class in ./meta/base.py. This defines private methods to perform DQL (SELECT) and DML (INSERT/UPDATE/DELETE) queries. It also has code to control transactions, check and set up tables, and interact with the sqlite_context. DO NOT execute queries directly; use _simple_select_query(), _perform_dml_query(), and the other methods in BaseMeta, because this is how we ensure consistency and control over transactions. The exception is DDL (create tables or indexes) used in setup_tables().

In ./meta/sqlite_context.py, the connection handler is to be found. This is a singleton instantiation of the _MetaSqliteContext class. Notably, entry_point() in ./__init__.py performs explicit open and close on the sqlite_context: this is so that the db handle is only open when queries are being performed and not kept open for the whole lifetime of the module import.

Finally, ./meta/setup_meta_db.py calls .setup_table() on each of the Meta classes. This is called when there's a fresh db and the tables, indexes, etc., need to be set up.

Common ./common/

This is where most of the code that is used module-wide is kept.

  • common_functions.py: As it says on the tin. Includes some custom type-checking, calculating hash of a file, etc.
  • conda_functions.py: Functions for interacting with conda environments (as the name suggests).
  • definitions.py: This is where the Oid, LocalFileRelaxed, and LocalFileComplete are defined.
  • enums.py: All enums go in here.
  • query_output.py: Code to dump query output in various formats.

Instructions ./instructions/

This is high-level code that defines an "instruction". Each "instruction" is in a one-to-one relation with the "simple-command" that the user sees.

All instructions ultimately derive from BaseInstruction in ./instructions/base.py. However, there are specialisations in the base classes. For example, BaseFilesInstruction derives from BaseInstruction and is designed to accept Snakemake-style input and output files. BaseQueryInstruction is designed to handle queries. The base classes implement interaction with transactions.

The instructions, in general, are where certain checks are made, e.g. whether there is an active run.

Special mention should be made of ./instructions/info.py since this is not actually an instruction. It's a class defined to store some basic fields (instruction_id, snakemake_rule, simple_command, runid_ext) that's passed as a parameter to every instruction upon instantiation.

Tasks ./tasks/

This is where a lot of the hard work is done. Almost all of the tasks are designed to take a list of "resources" and process them in sequence, e.g. take a list of HttpDownloadResources and download each HTTP resource in turn.

They derive from BaseTask in ./tasks/base.py.

It's generally better to put code that "does stuff" in tasks because we sometimes want to reuse it from different instructions, and this gets a little tricky otherwise.

Resource Definitions ./resource_definitions

These classes define a "resource" such as a remote HTTP resource we might want to download, a resource in aux_resources, etc. They all derive from BaseResource, which in turn derives from Pydantic BaseModel. These are aliased under dm.res.

The resources are instantiated at runtime by the specifications, usually from a YAML file. This means that all resource definition classes must provide the appropriate code to allow Pydantic to serialise and deserialise them. Use existing resource definition classes as examples of what's required.

Specifications ./specifications

This code creates resources according to predefined specifications.

  • HttpDownloadSpecs reads its specifications from the ./specifications/http_download/http_download_resources.yaml file, which is then parsed into a list of HttpDownloadResources. If you want to add a specification for another remote HTTP resource you should put it in that file, but it must have the correct fields according to HttpDownloadResource.

  • AuxiliarySpecs: reads its specifications from the YAML index in the aux_resources repository.

Archive ./archive

The archive directory contains one file, archiver.py, which handles the low-level archive puts and pulls. This is designed to be a (module) singleton: the class should only be interacted with via its singleton archiver instantiation.

Most interaction goes through the corresponding tasks or instructions though because there is additional support code there.

Execution Control ./exec_control

  • commands.py: This lists the available simple-commands, query-commands, and associated help. It also has a list of commands that need an active run. These lists are used primarily by the argument parser for the CLI.

  • dispatcher.py: This is where simple-commands and queries are sent out for execution via entry_point(). It also implements setup_datamanager() which does as it suggests; this is called in a lazy manner from entry_point() because we need to allow the test system an opportunity to change the var/ directory before setup.

  • external.py: This processes Snakemake's parameters or CLI arguments into an ExtArgs.

  • misc.py: Setup and close DB connection, check DB tables setup, conda env wrt active run, etc.

File Control ./file_control

  • fileregistry_info_ctrl.py: There is code here that allows interaction with the file registry, validation of the registry, checking for unarchived files, etc.

  • transaction_ctrl.py: TODO

Local ./local

This, in principle, should contain code that's only relevant to datamanager such as regex configs, the ExtArgs definition, etc.

Usage Guidance and Notes

Aliasing of Useful Stuff into dm.

  • ExtArgs, ExtArgsCommandStr, ExtArgsQueryStr: the datamanager's version of these classes.
  • __version__
  • en: the contents of datamanager.common.enums.
  • ex: the contents of datamanager.common.exceptions.
  • Oid, LocalFileRelaxed, LocalFileComplete, Tlist_of_str_path, get_local_rel_filename_from_oid, get_oid_from_local_rel_filename: The Oid and LocalFile* classes and convenience functions to convert between local relative filename and oid (you shouldn't really need these but they're there TODO link to explanation).
  • cmd: the contents of datamanager.exec_control.quick_fns_simple_command.
  • qry: the contents of datamanager.exec_control.quick_fns_query.
  • entry_point, setup_datamanager: from dispatcher.
  • setup_logging, setup_meta_db: from datamanager.exec_control.misc.
  • res: selected exports from resource definitions.

Implementation Details

Archive Details

The archive uses several database tables to store metadata along with the actual files on disc. The tables are:

  • archive_files_oids: This stores only the oids. It exists essentially for canonicalisation, to ensure oids are unique in the archive.
  • archive_files_localfiles: The fields reflect what's in the LocalFile class, without hash (because this is related to the revision of a file) and file_area (which doesn't make sense for an archived file).
  • archive_files_revisions: This keeps track of the different revisions/versions of each file. It stores the unique hash and the unique filename (unique_rel_filename). Remember, there can be several revision records for the same oid.
  • archive_files_metafiles: This associates any YAML meta file with its data file. It links to the revisions record and stores the meta_filename, etc.
  • archive_files_savepoints: This is the record of all savepoints. For example, the same file with the same hash may be archived several times, each with a savepoint record, but there'd be only one revision record. The savepoints table allows us to keep a record of "put" operations into the archive, properly link files with runids, and record which revision of a file is the "active" one, i.e. which revision is supplied in a "pull" operation.

The files in the archive are stored under their unique_rel_filename which, as noted above, is the local_rel_filename with the first 32 hex digits of the file's sha256 hash appended, e.g. httpresource/macrodb/un/sna-mainagg/gdp-usd-constant-countries.xlsx-f23266e51deba007149f8ce335f78f4a.
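
A sketch of how such a unique_rel_filename could be derived: the first 32 hex digits of the file's sha256 hash appended to the local_rel_filename. The helper function is illustrative; the real logic lives in ./archive/archiver.py.

```python
import hashlib
from pathlib import Path

def unique_rel_filename(local_rel_filename: Path, file_path: Path) -> str:
    # Hash the file contents and append the first 32 hex digits of the digest.
    sha256 = hashlib.sha256(file_path.read_bytes()).hexdigest()
    return f"{local_rel_filename.as_posix()}-{sha256[:32]}"

# e.g. httpresource/macrodb/un/sna-mainagg/gdp-usd-constant-countries.xlsx-f23266e51deba007...
```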

Transactions

Transactions are stored across three SQL tables.

  • transactions: This stores records of the transactions with fields including the runid, snakemake_rule, simple_command, and a flag to indicate whether the transaction has been closed.
  • transaction_files: This stores records of all files that may be associated with a transaction. Each record only contains fields pertinent to the file, i.e. oid, local_rel_filename, hash, etc.
  • transaction_relations: This is what links the files to a transaction. It has fields trans_pkid (the primary key in the transactions table), trans_file_id (the primary key in the transaction_files table), and trans_type (indicates type of transaction, see TransTypeEnum).

The reason for doing it this way is that a single file may be associated with more than one transaction and one transaction may have more than one file associated with it.
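
An illustrative (not authoritative) rendering of this three-table layout as sqlite DDL, executed via the standard Python DB-API. The column names are taken from the description above; the column types, the name of the closed flag, and any omitted columns are assumptions.

```python
# Sketch only: types and some column names are assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transactions (
    trans_pkid      INTEGER PRIMARY KEY,
    runid           TEXT NOT NULL,
    snakemake_rule  TEXT,
    simple_command  TEXT,
    closed          INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE transaction_files (
    trans_file_id      INTEGER PRIMARY KEY,
    oid                TEXT NOT NULL,
    local_rel_filename TEXT NOT NULL,
    hash               TEXT
);
-- The link table gives the many-to-many relation between transactions and files.
CREATE TABLE transaction_relations (
    trans_pkid    INTEGER NOT NULL REFERENCES transactions(trans_pkid),
    trans_file_id INTEGER NOT NULL REFERENCES transaction_files(trans_file_id),
    trans_type    TEXT NOT NULL
);
""")
```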

The class TransactionCtrl creates and opens existing transactions, returning a TransactionHandle. The TransactionHandle can be used to add files into a transaction, get information about that transaction, and close the transaction.

Conda Environment State Records

The package versions in a specific conda environment can change over time so there needs to be a way to record what was used for previous processing. We associate the conda environment information with runid(s). This information includes the conda environment name, the full package list (including versions), and the "export" list (what would be in the environment YAML file). Retrieving this information from Python is slightly inefficient so we only do it when required, i.e. when creating a runid or before an operation that would create/change files.

The user can choose, in the settings, how strict the checks should be. There are three options, each of which can be set True or False:

  • check the conda environment name (efficient, and one would expect this always to be True);
  • check the full package list (generally overly prescriptive; this would be a "strict" sort of mode), default False; and,
  • check the export list, which defaults to True.

String Format

At several places in datamanager, a format is imposed upon strings under processing. This is largely to encourage consistency in naming and use. The enforcing of formats is done in two ways: by applying length and regex restrictions to free strings, and by enumeration of tokens.

The formats imposed on free strings are specified in ./local/configuration.py. There, you can see the specifications for runids, simple-commands, etc.

The enumerated types are in ./common/enums.py.
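
A sketch of the kind of regex/length restriction applied to free strings. This particular pattern (ids as lowercase alphanumeric and hyphens, with an arbitrary length bound) is illustrative only; the real patterns are in ./local/configuration.py.

```python
import re

# Illustrative pattern and length bound; not copied from configuration.py.
ID_REGEX = re.compile(r"^[a-z0-9][a-z0-9-]{0,62}$")

def check_id(candidate: str) -> bool:
    return bool(ID_REGEX.match(candidate))

assert check_id("sna-mainagg")
assert not check_id("SNA_MainAgg")
```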

Storage of Internal Metadata

"Internal metadata" here means metadata that datamanager uses to keep track of files and objects, e.g. keeping track of the files in the archive as detailed above. It does not mean metadata relating to the content of the files.

Several options were considered and tried. The internal metadata is essentially transaction orientated: there are lots of small row inserts, updates, and deletes. YAML files aren't great for this. Pandas isn't ideal either as it lacks certain desirable features, and it either has to load an entire file into memory or act as a wrapper layer above another data source. Loading entire files into memory is sensible if doing a lot of random-access numerical calculations, but our use-case is row-wise operations. Using it as a wrapper around an SQL datasource frustrates the fine control sought over transactions, etc. In the end, using sqlite directly via the standard Python DB-API was found to be an excellent option as it affords certain useful properties.

  • Strong foreign key relations and constraints can be imposed. This is useful for enforcing internal data integrity and consistency in the higher-level Python code. For example, forcing that the specification of the runid is consistent across tables (it's normalised) prevents accidental erroneous data entry or programming mistakes.
  • sqlite files are inherently persistent and have some protection against corruption in the case of abnormal program termination by using Write-Ahead Logging (WAL). This is not possible to achieve with, e.g., Pandas, without wasting time coding what is quite complicated ancillary logic to perform the same task.
  • sqlite allows for complete read concurrency and page-wise write concurrency. Although this requirement can in principle be excluded by design choice, if EXIOBASE is to be used by a wider audience then it's difficult to exclude accidental concurrent use and it also affords some flexibility in the future (at no cost).
  • sqlite allows multiple operations to be grouped in an ACID transaction. This, again, can help avoid inconsistencies in the internal metadata: operations are either committed fully, or are rolled-back.
  • If the metadata files become larger, sqlite will be efficient, i.e. instead of loading a large file into memory to change only a few rows then having to write the entire file back to disc, sqlite only modifies the necessary pages.
  • Using SQL means queries are consistent and can rely on the efficient C implementation in sqlite.

There is also no requirement for a database server and the sqlite libraries are ubiquitous.

Not Possible to Update Schema In Place

Due to restrictions in sqlite, it's not possible to update the DB schema in place without some difficulty. You should bear this in mind if you have to alter the schema during development: you'll need some mechanism to update the meta DB in existing installation(s).

The datamanager code will, however, create a new schema for a new DB.

Note on Sqlite Concurrency

Sqlite supports Write-Ahead Logging (WAL) so this issue isn't as bad as it could be. We use deferred mode, see the sqlite docs here.

Read operations don't cause contention problems. However, as soon as a write operation happens within a transaction, the db is locked until that transaction is either committed or rolled back. Therefore, if process A is in the middle of any transaction that includes a write and process B attempts a write, then process B's write will block for up to timeout seconds or until process A commits its transaction. If process A commits within process B's timeout, then process B's write will proceed (opening a write transaction for process B).

The upshot of this is that we use quite a long timeout (see sqlite_context.py) because we'll be spawning datamanager in several separate processes, and these may be attempting concurrent writes. This should, in principle, be ok. At times, though, it appears that sqlite can get confused, which requires closing all connections and/or processes accessing the db; this would also cause a pipeline run to fail.
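
A minimal sketch of the connection settings discussed here: a long busy timeout, deferred transactions, and WAL journalling. The real handling lives in ./meta/sqlite_context.py; this is not a copy of that code, and the filename used below is a placeholder.

```python
import sqlite3

conn = sqlite3.connect(
    "meta.sqlite",                 # placeholder path
    timeout=300,                   # wait up to 5 minutes on a locked db
    isolation_level="DEFERRED",    # transactions start in deferred mode
)
conn.execute("PRAGMA journal_mode=WAL;")  # readers don't block the writer
```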

Appendices

Appendix A: Aliases

  • dm.cmd: Commands, wrapped in functions for easy access.
  • dm.qry: Queries, wrapped in functions for easy access.
  • dm.res: Resources.
  • dm.en: Enumerated types.
  • dm.ex: Custom exceptions.