Continuous Integration Practice - HigherOrderMethods/SELF GitHub Wiki

Overview

Continuous Integration refers to the practice of committing code to a repository frequently and testing each commit. In SELF, testing of each commit is meant to answer the following questions:

  1. Does the code build?
  2. Does the code run?
  3. Does the code get the right answer?
  4. Does the code perform well?

The first two questions are easily answered by checking exit codes from the stages where SELF is built and run. The third question requires exercising each of the routines in SELF in detail and comparing the estimated numerical errors against spectral element theory. The last question is equally challenging in that it requires integrating profiling tools with CI tests to assess subroutine performance as a function of the input parameters.
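
As a rough illustration of the third question: for smooth solutions, spectral element theory predicts that the error decays roughly exponentially as the polynomial degree N increases. The sketch below fits the slope of log(error) against N and checks that it is sufficiently negative. The function names, the threshold, and the error values are all made up for illustration and are not taken from the SELF test suite.

```python
import math

def log_error_slope(degrees, errors):
    """Least-squares slope of log(error) versus polynomial degree."""
    logs = [math.log(e) for e in errors]
    n = len(degrees)
    mean_d = sum(degrees) / n
    mean_l = sum(logs) / n
    num = sum((d - mean_d) * (l - mean_l) for d, l in zip(degrees, logs))
    den = sum((d - mean_d) ** 2 for d in degrees)
    return num / den

def converges_spectrally(degrees, errors, min_rate=0.5):
    """True if the fitted decay is at least as fast as exp(-min_rate * N).

    min_rate is an arbitrary illustrative threshold, not a SELF setting.
    """
    return log_error_slope(degrees, errors) <= -min_rate

# Synthetic (made-up) errors that decay like exp(-1.2 * N)
degrees = [2, 4, 6, 8]
errors = [math.exp(-1.2 * n) for n in degrees]
print(converges_spectrally(degrees, errors))
```

A single fitted slope is a crude test; it is only meant to show the kind of comparison between measured errors and theory that this question calls for.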

To add to the complexity, we want to ensure that SELF releases answer these questions positively (with a "yes") on a variety of hardware, which currently includes:

  • Serial x86 Platforms
  • Serial x86 + Single Nvidia GPU Platforms
  • Serial x86 + Single AMD GPU Platforms
  • Parallel x86 Platforms (single and multi-node)
  • Parallel x86 + Multi-Nvidia GPU Platforms
  • Parallel x86 + Multi-AMD GPU Platforms

As of June 2021, we only test regularly on x86 and x86+Nvidia platforms due to our team's lack of consistent access to AMD GPU hardware. Additionally, we are currently not set up for "continuous benchmarking" (though it is on our roadmap).

Infrastructure

Continuous Integration infrastructure is maintained on Google Cloud and GitHub.

Google Cloud

On Google Cloud, we maintain the following resources:

  • VPC Network, Subnetwork, and Firewall Rules
  • Service Account (self-cibot)
  • Cloud Build Triggers
  • Identity and Access Management Policies

All of this infrastructure is defined in the ci/build_iac/ directory using Terraform. Note that "iac" stands for "Infrastructure as Code".

Build & Test Pipeline

The Cloud Build Triggers automatically start builds when pushes are made to the develop and main branches of the repository. The integration between Google Cloud and GitHub is mediated by the Google Cloud Build GitHub App. Because of this, each trigger is seen as a GitHub Actions workflow that is checked on each commit.

Our Google Cloud Build triggers launch a build pipeline specified in ci/cloud-build/gce/cloudbuild.yaml. This build pipeline executes the following tasks:

  1. Build a SELF Docker Image for the target architecture (serial-x86, serial-x86-nvidia, serial-x86-amd, openmpi-x86, openmpi-x86-nvidia, openmpi-x86-amd).
  2. Build a SELF Singularity Image from the Docker image.
  3. Create a Google Compute Engine (GCE) test cluster.
  4. Copy the Singularity image file to the test cluster.
  5. Execute the CI Tests and copy back CI artifacts.
  6. Delete the GCE test cluster.
  7. Inspect the CI artifacts (part of the output is a JSON file with runtime exit codes for each unit test). If any unit test fails, this stage returns a non-zero exit code and the build fails.
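
The final inspection step can be sketched roughly as follows. The artifact layout used here (a list of records with "name" and "exit_code" keys) and the test names are assumptions for illustration; the actual SELF artifact schema may differ.

```python
import json
import sys

def failed_tests(results):
    """Return the names of tests whose recorded exit code is non-zero."""
    return [r["name"] for r in results if r["exit_code"] != 0]

def inspect_artifacts(json_text):
    """Return 0 if every test passed and 1 otherwise (usable as an exit code)."""
    results = json.loads(json_text)
    failures = failed_tests(results)
    for name in failures:
        print(f"FAILED: {name}", file=sys.stderr)
    return 1 if failures else 0

# Hypothetical artifact contents; names and layout are invented for this sketch
sample = json.dumps([
    {"name": "example_interp_test", "exit_code": 0},
    {"name": "example_gradient_test", "exit_code": 2},
])
status = inspect_artifacts(sample)
```

In a pipeline step, the returned value would be passed to sys.exit() so that a single failing test fails the whole build.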

We plan to wrap CI tests in the HPCToolkit profiler to produce a hotspot profile database for each unit test. This database will be processed with Hatchet, converted to a flat-file database, and pushed to BigQuery for continuous benchmarking.

SELF Tests

When SELF is built, we create a library and a command line interface (CLI). The CLI is primarily used for exercising routines in the library. During testing, we make calls to the self CLI to run workflows for things like grid interpolation, boundary interpolation, differentiation, divergence, gradient, and curl, for scalars, vectors, and tensors in 1-D, 2-D, and 3-D. For all routines, we have options to select test grids that have a given number of elements, a specific polynomial degree, and a specific quadrature (e.g. Legendre-Gauss or Legendre-Gauss-Lobatto). For routines that run derivative operations, we also have options to try Collocation, DG, and CG formulations. With all of these options, a large number of individual CLI calls need to be made to fully exercise the SELF API.

Test Definition

The tests that are run are defined in the ci/test/ci.json file. This JSON file uses a custom schema for SELF that allows us to compactly specify which CLI commands we want to run and the ranges for each option that we want to test over.
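
To illustrate what a compact specification with option ranges can look like, the sketch below expands one command and its option values into every combination. The spec layout, the subcommand, and the option names are invented for this example; they are not the real ci.json schema or confirmed SELF CLI flags.

```python
import itertools

# Hypothetical compact spec: one command plus the values to sweep per option.
# The subcommand and option names are placeholders, not real SELF CLI flags.
compact = {
    "command": "self example-routine",
    "options": {
        "--nelems": [5, 10],
        "--degree": [2, 4, 7],
        "--quadrature": ["gauss", "gauss-lobatto"],
    },
}

def unroll(spec):
    """Expand a compact spec into one command line per option combination."""
    names = list(spec["options"])
    value_lists = [spec["options"][n] for n in names]
    commands = []
    for combo in itertools.product(*value_lists):
        args = " ".join(f"{n} {v}" for n, v in zip(names, combo))
        commands.append(f"{spec['command']} {args}")
    return commands

commands = unroll(compact)
print(len(commands))  # 2 * 3 * 2 combinations
```

Three short lists already produce twelve commands, which is why a compact description is easier to maintain than a hand-written list of CLI calls.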

Test Execution

During the SELF build, the ci/test/ciGenerator.py script is used to "unroll" the ci.json file to create a test dictionary that contains keys for capturing runtime exit codes, profiles, and other information about how each command was run. This test dictionary is meant to provide enough information to answer our four motivating questions. The ci/test/ciRun.py script is used to read the unrolled test dictionary, run each test (in serial), and capture exit codes, stdout, and stderr.
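
A serial runner of this kind can be sketched in a few lines. This is not the contents of ciRun.py; the dictionary keys ("cmd", "exit_code", "stdout", "stderr") are assumed for illustration, and the commands are stand-ins that call the Python interpreter so the sketch runs without SELF installed.

```python
import subprocess
import sys

def run_tests(tests):
    """Run each test command in order and record exit code, stdout, and stderr."""
    for test in tests:
        proc = subprocess.run(test["cmd"], capture_output=True, text=True)
        test["exit_code"] = proc.returncode
        test["stdout"] = proc.stdout
        test["stderr"] = proc.stderr
    return tests

# Stand-in commands; a real test dictionary would hold self CLI invocations
tests = [
    {"cmd": [sys.executable, "-c", "print('ok')"]},
    {"cmd": [sys.executable, "-c", "import sys; sys.exit(3)"]},
]
results = run_tests(tests)
```

Recording the outcome back into the same dictionary keeps the command and its results together, so a later step can write everything out as a single JSON artifact.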

All of our container images ship with the CI tests used to validate them. This way, end users can verify that their container image works as expected by executing /opt/self/test/ciRun.py within the container.