Dataset versions - cma-open/cmatools GitHub Wiki

Python packages should follow Semantic Versioning (MAJOR.MINOR.PATCH) for software releases.

For scientific datasets, the output dataset product should also be version controlled.

To ensure quality control and replicability, scientific software systems and the resulting data products should be carefully named and versioned. The linkage between a named Python package version and a named dataset version depends on whether the software system has a one-to-one or one-to-many relationship with its dataset outputs.

Python package: single output dataset

  • Maintain a close association between the Python package version and the dataset version
  • Give the dataset the same name and version as the Python package

e.g.

  • Package: geospatial 1.2.0
  • Dataset: geospatial 1.2.0
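
One way to keep the two in step is to derive the dataset version from the installed package rather than typing it by hand. The following is a minimal sketch, not part of cmatools: the function names `dataset_version` and `dataset_label` are hypothetical, and it assumes the package is installed so that its version can be read with the standard library `importlib.metadata`.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def dataset_version(package_name: str, package_version: Optional[str] = None) -> str:
    """Return the dataset version, mirroring the package version.

    If no version is passed in, look up the installed package version.
    """
    if package_version is None:
        try:
            package_version = version(package_name)
        except PackageNotFoundError as err:
            raise RuntimeError(
                f"Package {package_name!r} is not installed, "
                "so the dataset version cannot be derived."
            ) from err
    return package_version


def dataset_label(package_name: str, package_version: Optional[str] = None) -> str:
    """Build a dataset label with the same name and version as the package."""
    return f"{package_name} {dataset_version(package_name, package_version)}"


if __name__ == "__main__":
    # Version passed explicitly here; omit it to read the installed version.
    print(dataset_label("geospatial", "1.2.0"))
```

Passing the version explicitly is useful in tests; in a real processing run the lookup of the installed version would normally be preferred, so the dataset cannot drift from the code that produced it.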

Python packages: several output datasets

  • Naming the package after only one of its output datasets may cause confusion, and linking the software version to each dataset version may be complex
  • It may be preferable to name the package differently, for example to reflect the generic processing method, and to name the output datasets separately
  • Although the datasets could be named separately, they could still take their version from the parent Python package, so that the creation of each dataset is traceable and the dataset can be re-created
  • A version metadata file should be maintained to track the linkage between package version and dataset version

e.g.

  • Package: geospatial 1.2.0
  • Dataset-land: land 1.2.0
  • Dataset-marine: marine 1.2.0
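
The version metadata file mentioned above could take many forms. As a minimal sketch, assuming a JSON file and a hypothetical layout (the keys `package`, `datasets`, `name` and `version`, and the file name `version_metadata.json`, are illustrative and not defined by cmatools), the linkage for this example could be recorded as follows:

```python
import json
from pathlib import Path


def build_version_metadata(package_name, package_version, dataset_names):
    """Link each output dataset to the package version that created it."""
    return {
        "package": {"name": package_name, "version": package_version},
        "datasets": [
            # Each dataset keeps its own name but takes the parent package version.
            {"name": name, "version": package_version}
            for name in dataset_names
        ],
    }


def write_version_metadata(metadata, path):
    """Write the metadata to a JSON file (hypothetical file layout)."""
    Path(path).write_text(json.dumps(metadata, indent=2) + "\n", encoding="utf-8")


if __name__ == "__main__":
    metadata = build_version_metadata("geospatial", "1.2.0", ["land", "marine"])
    write_version_metadata(metadata, "version_metadata.json")
```

Storing this file alongside the datasets means each dataset can be traced back to the package release needed to re-create it.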