Bundle Lifecycle - CivicSpleen/ambry GitHub Wiki

Bundle Lifecycle

This design document describes a change to Ambry that makes it easier (in fact, possible) to manage bundles entirely within the library. The change is important for programmatic creation of bundles -- such as when scraping an entire open data repository -- and for editing bundles online with a web application. These changes will also make it easier to build bundles in a distributed fashion.

As a result, the library will be the definitive, central location for bundle configuration, and bundles will be built from the library rather than from configuration files.

Source bundles are installed into a library, and the build process uses the library configuration to build the bundle, getting all source and configuration from the library. The source bundle in the library is updated when the built bundle is created.

Terms

  • Library Bundle. A bundle with its dataset record in the library.
  • Build Bundle. A SQLite bundle that is built from a source bundle. It holds the metadata for the dataset, but not the data.
  • Edit Bundle. A source bundle, unpacked to files, for editing. Usually a directory with bundle.py and bundle.yaml.

Development Lifecycle

  1. An Edit bundle is cloned from git.
  2. The configuration of the Edit Bundle is loaded into the Library. If the developer changes the files, a watchdog program detects the changes to the Edit Bundle and loads them into the Library. Alternatively, the developer can force a re-synchronization.
  3. The build code uses the Library Bundle to create a Build Bundle
  4. The Build Bundle creates partitions and builds them, possibly on another machine or in another process.
  5. After the partition build completes, any changes to the partition state or table schema values are synchronized back to the Build Bundle.
  6. The Build Bundle is synchronized back to the Library Bundle.
  7. The Library Bundle is synced with the Edit Bundle.
  8. The Edit bundle is committed to git.
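
Step 2 of the list above (detecting changes in the Edit Bundle and loading them into the Library) can be sketched with content hashes. This is a minimal illustration, not Ambry's implementation: `sync_edit_bundle` and `file_digest` are hypothetical names, and a plain dict stands in for the Library Bundle's file records. A watchdog program could call the same function whenever it sees a file change, and a forced re-synchronization would simply call it directly.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def sync_edit_bundle(edit_dir: Path, library_files: dict) -> list:
    """Load new or changed Edit Bundle files into `library_files`.

    `library_files` is a hypothetical stand-in for the Library Bundle:
    it maps a file name to its last loaded digest and content.
    Returns the names of the files that were loaded, in sorted order.
    """
    changed = []
    for path in sorted(edit_dir.iterdir()):
        if not path.is_file():
            continue
        digest = file_digest(path)
        record = library_files.get(path.name)
        # Load the file only if it is new or its content has changed.
        if record is None or record["digest"] != digest:
            library_files[path.name] = {
                "digest": digest,
                "content": path.read_bytes(),
            }
            changed.append(path.name)
    return changed
```

Calling `sync_edit_bundle` twice in a row without touching the files loads nothing the second time, which is the behavior a watchdog needs to avoid redundant loads.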

When the developer is finished with the bundle and ready to release, the developer pushes a prepared but unbuilt Build Bundle to a Source Library. A Source Library is a file store (usually S3) that holds Build Bundles before they are built.

Production Lifecycle

In production, a production build server gets a source Build Bundle from the Source Library, loads it into the Local Library, and builds it. If the build is successful, the build server pushes the built bundles to a Remote Library.

Production builds are inherently multi-threaded or multi-processor. Each build process works only from a Partition database. The partitions are created first, then loaded with all of the configuration they need to be built, primarily source entries (from the source key in the metadata). The partitions also get the code from the build.py file, and a dictionary of data to use when calling a build_partitions() method.

The partition can be built with no other information than what is in the partition database, and what can be downloaded based on information in the partition. As a result, the partitions can be sent to a build farm to be built.

(This implies that the prepare phase will need to be able to identify all of the partitions that will be built, before the build begins.)
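
The idea of self-contained partitions can be sketched as follows. This is only an illustration under stated assumptions: `PartitionSpec`, `prepare`, `build_one`, and `build_all` are hypothetical names, a dataclass stands in for the partition database, the signature given to `build_partitions()` is invented for the example, and the URLs in any usage are placeholders. A thread pool stands in for a build farm.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class PartitionSpec:
    """Hypothetical stand-in for a partition database: everything a
    worker needs to build one partition, and nothing else."""
    name: str
    sources: list
    code: str  # text of the build code, carried with the partition
    build_args: dict = field(default_factory=dict)


def prepare(bundle_sources: dict, code: str) -> list:
    """Prepare phase: enumerate every partition before any build starts."""
    return [
        PartitionSpec(name=name, sources=[url], code=code,
                      build_args={"table": name})
        for name, url in sorted(bundle_sources.items())
    ]


def build_one(spec: PartitionSpec):
    """Build one partition using only what the spec carries."""
    namespace = {}
    exec(spec.code, namespace)  # load the build code shipped with the partition
    rows = namespace["build_partitions"](spec.sources, **spec.build_args)
    return spec.name, rows


def build_all(specs: list, workers: int = 2) -> dict:
    """Build every prepared partition concurrently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(build_one, specs))


# Example build code; the build_partitions signature is an assumption.
BUILD_CODE = '''
def build_partitions(sources, table):
    return [(table, source) for source in sources]
'''
```

Because `build_one` reads nothing except its `PartitionSpec`, the same function could run in another process or on another machine, provided the spec is serialized and shipped there.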

Build Bundles

Current build bundles have an internal directory structure, primarily a meta directory. The new ones will always be flat, with no subdirectories. The files that the bundle uses and tracks are:

  • build.py: code file for building the bundle
  • meta.py: code file for one-time manipulations.
  • bundle.yaml: main configuration
  • meta.yaml: Main configuration, plus generated configuration
  • schema.csv: tables and columns
  • column_map.csv: Maps column names from the source to the schema.
  • documentation.md: Auxiliary documentation
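
A flat layout is easy to check mechanically. The sketch below is illustrative only: `check_flat_bundle` is a hypothetical helper, and the text above does not say which of the tracked files are mandatory, so the "missing" list is informational rather than an error condition.

```python
from pathlib import Path

# File names taken from the list above.
TRACKED_FILES = (
    "build.py",
    "meta.py",
    "bundle.yaml",
    "meta.yaml",
    "schema.csv",
    "column_map.csv",
    "documentation.md",
)


def check_flat_bundle(bundle_dir: Path) -> dict:
    """Report whether a bundle directory is flat and which tracked
    files it contains. Hypothetical helper, not part of Ambry."""
    entries = list(bundle_dir.iterdir())
    subdirs = sorted(p.name for p in entries if p.is_dir())
    present = {p.name for p in entries if p.is_file()}
    return {
        "is_flat": not subdirs,
        "subdirectories": subdirs,
        # Tracked files that are absent, in the order listed above.
        "missing": [n for n in TRACKED_FILES if n not in present],
        # Files in the directory that the bundle does not track.
        "untracked": sorted(present - set(TRACKED_FILES)),
    }
```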

In the build state, there are a few broad categories for bundles:

  • Source Build Bundles are built from external files, referenced in the source section of the metadata.
  • Dependent Build Bundles are built from other bundles, referenced in the dependencies section of the metadata.

Source Build Bundles usually load their files in without modification. Modifications, additional columns, and other transformations are performed by a Dependent Build Bundle.

Source bundles should have a variant name of 'source'. Dependent bundles normally don't have a variant name.
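
The two categories and the variant naming convention can be checked with a few lines of code. This is a sketch under assumptions: the metadata is modeled as a plain dict, the key names `source`, `dependencies`, and `variant` are guesses at how the sections are spelled, and treating a bundle with both sections as dependent is a choice made for the example, not something the design states.

```python
def bundle_category(metadata: dict) -> str:
    """Classify a bundle from its metadata (key names are assumptions)."""
    if metadata.get("dependencies"):
        return "dependent"
    if metadata.get("source"):
        return "source"
    return "unknown"


def check_variant(metadata: dict):
    """Return a warning string if the variant name breaks the
    convention described above, or None if it follows it."""
    category = bundle_category(metadata)
    variant = metadata.get("variant")
    if category == "source" and variant != "source":
        return "source bundles should have the variant name 'source'"
    if category == "dependent" and variant:
        return "dependent bundles normally have no variant name"
    return None
```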

Updates

A major reason for moving to a more library-centric model is to make it easier to have bundles that update automatically. An updatable bundle is one that depends on another bundle or on source files, can detect when a dependency changes, and can be queued for rebuilding.

There is no support for chaining dependencies -- if a bundle that is updated is a dependency for other bundles, the dependent bundles are not notified. Instead, there is a regular process for checking which bundles need to be updated, and the dependent bundles will get updated on the next run of the process.
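
The effect of not chaining can be seen in a small model of the periodic check. The sketch is an assumption-laden illustration: `run_update_check` is a hypothetical name, bundles are plain dicts with an integer `version`, and `built_against` records the version of each dependency seen at the last build. Every dependency is assumed to be present in the same dict. Versions are snapshotted at the start of a run, so a bundle rebuilt during the run does not trigger its own dependents until the next run.

```python
def run_update_check(bundles: dict) -> list:
    """One run of the periodic update process.

    Rebuilds every bundle whose recorded dependency versions differ
    from the versions at the start of the run, and returns the names
    of the rebuilt bundles in sorted order.
    """
    # Snapshot: rebuilds during this run are not visible to the check.
    versions = {name: b["version"] for name, b in bundles.items()}

    stale = sorted(
        name for name, b in bundles.items()
        if any(versions[dep] != seen
               for dep, seen in b["built_against"].items())
    )

    for name in stale:
        bundle = bundles[name]
        # "Rebuild": record the dependency versions used, bump our own.
        bundle["built_against"] = {
            dep: versions[dep] for dep in bundle["built_against"]
        }
        bundle["version"] += 1
    return stale
```

With a chain such as raw → clean → report, a new version of raw causes clean to be rebuilt on the first run and report on the second, which matches the behavior described above.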