GitLab Pipeline MultiHost Architecture for Global‐Workflow - TerrenceMcGuinness-NOAA/global-workflow GitHub Wiki

Multi-Host GitLab CI Architecture for Global Workflow

Overview

The Global Workflow project uses a templated GitLab CI architecture that runs tests in parallel across multiple high-performance computing (HPC) platforms. Shared templates hold the common job logic and each host adds only a small host-specific layer, which keeps the configuration reusable, maintainable, and easy to extend to new computing hosts.

Architecture Components

1. Main Configuration File (.gitlab-ci.yml)

This file is the primary orchestrator. It:

  • Defines global pipeline stages (build, setup_tests, run_tests, finalize)
  • Sets pipeline-level variables and defaults
  • Includes specialized configuration files
  • Provides base templates shared across all hosts
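
As a rough illustration of how these four responsibilities might fit together, here is a minimal sketch of a top-level .gitlab-ci.yml. The stage names and the three included file names come from this page. The variable defaults, the assumption that the included files sit at the repository root, and the contents of .base_config are made up for illustration and are not the project's actual settings.

```yaml
# Sketch only: stage and file names are from this page, everything else is assumed.
stages:
  - build
  - setup_tests
  - run_tests
  - finalize

variables:
  PIPELINE_TYPE: "pr_cases"   # assumed default; the other value named on this page is "ctests"
  RUN_ON_MACHINES: "all"      # assumed default; or one or more host names
  PR_NUMBER: "0"              # assumed default; "0" meaning no pull request

include:
  - local: "/.gitlab-ci-cases.yml"
  - local: "/.gitlab-ci-ctests.yml"
  - local: "/.gitlab-ci-hosts.yml"

# Hidden job (leading dot) that other templates extend; contents are hypothetical.
.base_config:
  before_script:
    - echo "Running on ${machine} for PR ${PR_NUMBER}"
```

Because the leading dot makes .base_config a hidden job, GitLab never runs it directly; it only contributes keys to the jobs that extend it.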

2. Specialized Configuration Files

A. .gitlab-ci-hosts.yml - Host-Specific Configurations

  • Purpose: Defines which tests run on which hosts
  • Key Feature: Per-host test case matrices that are easily configurable
  • Extensibility: New hosts can be added by following the established patterns

B. .gitlab-ci-cases.yml - Test Case Templates

  • Purpose: Defines reusable templates for standard experiment test cases
  • Templates: Setup, execution, and finalization logic
  • Scope: End-to-end workflow testing scenarios

C. .gitlab-ci-ctests.yml - CTest Framework

  • Purpose: CMake/CTest-based functional testing
  • Scope: Individual Rocoto job testing with predefined input data
  • Use Case: Quick PR validation via GitHub API

Multi-Host Template Design Pattern

Base Template Structure

All host-specific jobs inherit from shared base templates, ensuring consistency while allowing host-specific customization:

# Base template with common logic
.base_template:
  extends: .base_config
  stage: some_stage
  script:
    - common_logic_here

# Host-specific instantiation
job_name-hostname:
  extends: .base_template
  variables:
    machine: hostname
  tags:
    - hostname
  rules:
    - if: conditions_for_this_host
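
To make the inheritance concrete, the fragment below sketches the job GitLab would effectively evaluate once both extends levels are resolved. It assumes .base_config contributes no keys that collide with the ones shown. GitLab merges mappings such as variables key by key, while lists such as script, tags, and rules are replaced wholesale by the most specific definition.

```yaml
# Effective job after resolving extends (sketch; keys from .base_config omitted)
job_name-hostname:
  stage: some_stage          # inherited from .base_template
  script:                    # inherited from .base_template
    - common_logic_here
  variables:
    machine: hostname        # set by the host-specific job
  tags:
    - hostname               # routes the job to that host's runner
  rules:
    - if: conditions_for_this_host
```

One consequence of the list-replacement behavior is that a host job which defines its own script drops the template's script entirely, so host-specific differences are better passed in through variables.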

Host Matrix Configuration

Each host defines its supported test cases through matrix variables:

# Example: Hera host configuration
.hera_cases_matrix: &hera_cases
  - caseName: ["C48_ATM", "C48_S2SW", "C96_atm3DVar", ...]

# Jobs inherit this matrix
run_experiments-hera:
  extends: .run_experiments_template
  parallel:
    matrix: *hera_cases
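
As a sketch of what the matrix does, the fragment below writes the same job without the YAML anchor, using only the three case names spelled out above (the real list is longer and stays elided here). GitLab creates one job per value and exposes that value to the job as the variable caseName.

```yaml
run_experiments-hera:
  extends: .run_experiments_template
  parallel:
    matrix:
      - caseName: ["C48_ATM", "C48_S2SW", "C96_atm3DVar"]   # subset for illustration

# GitLab expands this into three independent jobs:
#   run_experiments-hera: [C48_ATM]       with caseName=C48_ATM
#   run_experiments-hera: [C48_S2SW]      with caseName=C48_S2SW
#   run_experiments-hera: [C96_atm3DVar]  with caseName=C96_atm3DVar
```

YAML anchors resolve only within a single file, so the anchor definition and every job that references it need to live in the same file.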

Supported Computing Platforms

Current Hosts

| Host     | Type         | Test Cases Supported       | Special Features                    |
|----------|--------------|----------------------------|-------------------------------------|
| Hera     | Research HPC | Full test suite (12 cases) | Complete ocean/wave/aerosol testing |
| GAEAC6   | Research HPC | Full test suite (11 cases) | AWS cloud integration               |
| Orion    | Research HPC | Reduced set (7 cases)      | Resource-optimized testing          |
| Hercules | Research HPC | Standard set (9 cases)     | Balanced testing coverage           |

Host-Specific Features

Test Case Distribution Strategy

  • Full Suite Hosts (Hera, GAEAC6): Run comprehensive testing including complex coupled models
  • Optimized Hosts (Orion): Focus on core atmospheric testing with resource constraints
  • Balanced Hosts (Hercules): Standard testing coverage without the most resource-intensive cases

Job Instantiation Process

1. Template Inheritance Chain

graph TD
    A[.base_config] --> B[.setup_experiment_template]
    A --> C[.run_experiments_template]
    A --> D[.build_template]
    
    B --> E[setup_experiments-hera]
    B --> F[setup_experiments-orion]
    B --> G[setup_experiments-hercules]
    
    C --> H[run_experiments-hera]
    C --> I[run_experiments-orion] 
    C --> J[run_experiments-hercules]
    
    D --> K[build-hera]
    D --> L[build-orion]
    D --> M[build-hercules]

2. Dynamic Job Creation

For each host, the configuration defines the jobs below, and GitLab expands every job that has a parallel matrix into one job per matrix entry:

Standard Test Cases (PR Validation)

  • setup_experiments-{host}: Parallel jobs for each test case in the host's matrix
  • run_experiments-{host}: Parallel execution jobs that depend on setup completion
  • finalize_success-{host}: Success reporting and GitHub label management

CTest Framework (Quick Validation)

  • setup_ctests-{host}: CMake/CTest environment preparation
  • run_ctests-{host}: Parallel CTest execution for specific test labels
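
A CTest job pair for one host might look roughly like the sketch below. The template name .run_ctests_template, the variables CTEST_BUILD_DIR and CTEST_LABEL, and the label values are hypothetical placeholders, not names taken from the repository. The ctest options -L (select tests by label) and --output-on-failure are standard CTest flags.

```yaml
# Hypothetical template: names and variables are illustrative, not from the repository.
.run_ctests_template:
  extends: .base_config
  stage: run_tests
  script:
    - cd "${CTEST_BUILD_DIR}"                          # assumed build directory variable
    - ctest -L "${CTEST_LABEL}" --output-on-failure    # run only tests carrying this label

run_ctests-hera:
  extends: .run_ctests_template
  variables:
    machine: hera
  tags:
    - hera
  parallel:
    matrix:
      - CTEST_LABEL: ["label_a", "label_b"]            # placeholder labels
  needs:
    - setup_ctests-hera
```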

Build Process (Foundation)

  • build-{host}: Compilation and environment setup for the specific platform

3. Dependency Chain

# Example dependency flow for Hera
build-hera → setup_experiments-hera → run_experiments-hera → finalize_success-hera
build-hera → setup_ctests-hera → run_ctests-hera
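
One way to express the first chain in GitLab CI is the needs keyword, which the Step 2 example later on this page also uses. The sketch below shows only the dependency wiring; the template name .finalize_success_template is a hypothetical placeholder.

```yaml
setup_experiments-hera:
  extends: .setup_experiment_template
  needs:
    - build-hera

run_experiments-hera:
  extends: .run_experiments_template
  needs:
    - setup_experiments-hera     # waits for every matrix instance of the setup job

finalize_success-hera:
  extends: .finalize_success_template   # hypothetical template name
  stage: finalize
  needs:
    - run_experiments-hera
```

With needs, a job starts as soon as the jobs it lists have finished, rather than waiting for the whole previous stage, so one slow host does not hold up another host's chain.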

Pipeline Execution Modes

Mode 1: PR Cases (PIPELINE_TYPE=pr_cases)

  • Trigger: GitHub PR events via API
  • Scope: Full end-to-end workflow testing
  • Duration: Several hours per host
  • Purpose: Comprehensive validation before merge

Mode 2: CTests (PIPELINE_TYPE=ctests)

  • Trigger: GitHub API for quick validation
  • Scope: Individual Rocoto job testing
  • Duration: Minutes to hours
  • Purpose: Rapid feedback for code changes

Mode 3: Nightly Runs (GFS_CI_RUN_TYPE=nightly)

  • Trigger: GitLab scheduled pipelines
  • Scope: Full regression testing on develop branch
  • Duration: Extended execution with archival
  • Purpose: Continuous integration monitoring
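
The three modes could be told apart in job rules roughly as sketched below. The values "trigger" and "schedule" are GitLab's predefined CI_PIPELINE_SOURCE values for API-triggered and scheduled pipelines. The hidden job name .mode_rules_sketch is hypothetical, and the sketch assumes GFS_CI_RUN_TYPE is set as a variable on the pipeline schedule.

```yaml
# Hypothetical hidden job collecting one rule per execution mode
.mode_rules_sketch:
  rules:
    # Mode 1: full PR cases, started through the trigger API
    - if: $CI_PIPELINE_SOURCE == "trigger" && $PIPELINE_TYPE == "pr_cases"
    # Mode 2: CTests, started through the trigger API
    - if: $CI_PIPELINE_SOURCE == "trigger" && $PIPELINE_TYPE == "ctests"
    # Mode 3: nightly run, started by a GitLab pipeline schedule
    - if: $CI_PIPELINE_SOURCE == "schedule" && $GFS_CI_RUN_TYPE == "nightly"
```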

Conditional Execution Logic

Host Selection

rules:
  - if: ($RUN_ON_MACHINES =~ /\bhera\b|all/)  # Run on Hera or all hosts
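
To show how this pattern behaves, the sketch below annotates the same rule with a few sample values of RUN_ON_MACHINES. The sample values are illustrative. The word boundaries (\b) keep a host name from matching inside a longer word, while the all alternative is unanchored and therefore matches anywhere in the string.

```yaml
rules:
  # RUN_ON_MACHINES value   result
  # "hera"                  match     (\bhera\b)
  # "hera orion"            match     (\bhera\b)
  # "all"                   match     (all)
  # "orion hercules"        no match
  - if: ($RUN_ON_MACHINES =~ /\bhera\b|all/)
    when: on_success
  # Any other value: leave the job out of the pipeline
  - when: never
```

The explicit when: never entry is optional, since a job whose rules all fail to match is excluded anyway; it is spelled out here only to make the fallback visible.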

Pipeline Type Routing

rules:
  - if: $PIPELINE_TYPE == "pr_cases" && $CI_PIPELINE_SOURCE == "trigger"
  - if: $PIPELINE_TYPE == "ctests" && $CI_PIPELINE_SOURCE == "trigger"

GitHub Integration

rules:
  - if: $PR_NUMBER != "0"  # Only for actual PRs, not develop branch
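
The three kinds of condition can be combined in a single job. Separate entries in a rules list are tried in order and the first match wins, so they behave like OR; conditions that must all hold have to be joined with && inside one if. The sketch below is one possible combination, not the project's actual rule, and .finalize_success_template is a hypothetical name. Rule comparisons are string comparisons, which is why the PR number is quoted.

```yaml
# Sketch: a job that should run only for PR pipelines that include Hera
finalize_success-hera:
  extends: .finalize_success_template   # hypothetical template name
  rules:
    # One rule: every condition joined with && must hold
    - if: $PIPELINE_TYPE == "pr_cases" && $CI_PIPELINE_SOURCE == "trigger" && $RUN_ON_MACHINES =~ /\bhera\b|all/ && $PR_NUMBER != "0"
```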

Adding New Computing Hosts

Step 1: Define Host Configuration

Add to .gitlab-ci-hosts.yml:

# Define test matrix for new host
.newhost_cases_matrix: &newhost_cases
  - caseName: ["C48_ATM", "C48_S2SW", ...]  # Customize based on host capabilities

# Build job
build-newhost:
  extends: .build_template
  variables:
    machine: newhost
  tags:
    - newhost
  rules:
    - if: ($RUN_ON_MACHINES =~ /\bnewhost\b|all/)

Step 2: Add Test Jobs

# Standard cases
setup_experiments-newhost:
  extends: .setup_experiment_template
  variables:
    machine: newhost
  tags:
    - newhost
  parallel:
    matrix: *newhost_cases
  needs:
    - build-newhost

run_experiments-newhost:
  extends: .run_experiments_template
  # ... similar pattern
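
Assuming the run job mirrors the setup job, a filled-in version might look like the sketch below. This is a guess at the "similar pattern", not a copy of the repository's job; in particular the needs target and the rules line are assumptions carried over from the build and setup examples.

```yaml
# Sketch: must live in the same file as the &newhost_cases anchor
run_experiments-newhost:
  extends: .run_experiments_template
  variables:
    machine: newhost
  tags:
    - newhost
  parallel:
    matrix: *newhost_cases
  needs:
    - setup_experiments-newhost     # assumed dependency on the setup job
  rules:
    - if: ($RUN_ON_MACHINES =~ /\bnewhost\b|all/)
```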

Step 3: Configure GitLab Runner

  • Register GitLab runner on the new host
  • Configure runner with appropriate tags
  • Ensure access to required software stack

Step 4: Platform Configuration

Add host-specific configurations in:

  • dev/ci/platforms/config.{newhost}: Environment and module setup
  • env/{NEWHOST}.env: Host-specific environment variables

Error Handling and Reporting

GitHub Integration

  • PR Labels: Automatic labeling based on pipeline state
    • CI-{Host}-Building: During compilation
    • CI-{Host}-Running: During test execution
    • CI-{Host}-Passed: On successful completion
    • CI-{Host}-Failed: On any failure
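
Label transitions like these could be driven from a job script with the GitHub CLI. The sketch below is one possible approach under stated assumptions: that gh is installed and authenticated on the runner, and that a variable GITHUB_REPO (hypothetical, in owner/name form) identifies the repository. It is not the project's actual reporting script.

```yaml
finalize_success-hera:
  extends: .base_config
  stage: finalize
  tags:
    - hera
  script:
    # Swap the "running" label for the "passed" label on the pull request
    - gh pr edit "${PR_NUMBER}" --repo "${GITHUB_REPO}" --remove-label "CI-Hera-Running" --add-label "CI-Hera-Passed"
  rules:
    - if: $PR_NUMBER != "0"
```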

Failure Reporting

  • Error Log Collection: Automated gathering of failed job logs
  • GitHub Gist Publishing: Public sharing of error details via GitHub Gists
  • PR Comments: Automatic failure notifications with diagnostic links
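
A failure-reporting job along these lines might be sketched as follows. The job name finalize_failure-hera and the variables ERROR_LOG and GITHUB_REPO are hypothetical, and the sketch assumes an authenticated GitHub CLI whose gh gist create command prints the new gist's URL on standard output.

```yaml
finalize_failure-hera:
  extends: .base_config
  stage: finalize
  tags:
    - hera
  script:
    # Publish the collected error log and capture the gist URL
    - GIST_URL=$(gh gist create --public "${ERROR_LOG}")
    # Point the pull request at the published log
    - gh pr comment "${PR_NUMBER}" --repo "${GITHUB_REPO}" --body "CI failed on Hera, error log at ${GIST_URL}"
  rules:
    - if: $PR_NUMBER != "0"
      when: on_failure        # run only if an earlier job in the pipeline failed
```

The comment body deliberately avoids a colon followed by a space, because an unquoted script line containing that sequence is parsed by YAML as a mapping rather than a string.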

Cleanup Actions

  • Resource Management: Automatic cleanup of failed experiments
  • State Tracking: Proper handling of experiment lifecycle states
  • Retry Logic: Built-in retry mechanisms for transient failures
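
This page does not say where retries and cleanup are implemented. If they were expressed at the GitLab level, the built-in retry and after_script keywords would be one option, as in this hypothetical sketch; the echo line is a placeholder for real cleanup logic.

```yaml
# Hypothetical additions to the shared base configuration
.base_config:
  retry:
    max: 2                          # GitLab allows 0, 1, or 2 retries
    when:
      - runner_system_failure       # retry only on infrastructure-type failures
      - stuck_or_timeout_failure
  after_script:
    - echo "cleanup for ${machine} would go here"   # placeholder for experiment cleanup
```

Commands in after_script run even when the main script fails, which makes it a natural place for cleanup that must not be skipped.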

Benefits of This Architecture

1. Scalability

  • Easy addition of new hosts without code duplication
  • Parallel execution across multiple platforms
  • Resource-aware test case distribution

2. Maintainability

  • Single source of truth for test logic in templates
  • Host-specific customization through variables and matrices
  • Clear separation of concerns between components

3. Flexibility

  • Different testing modes for different use cases
  • Conditional execution based on trigger type and host availability
  • Configurable test case selection per host

4. Reliability

  • Comprehensive error handling and reporting
  • Integration with GitHub for developer feedback
  • Automated cleanup and resource management

By keeping test logic in shared templates and host differences in small per-host blocks, this architecture lets the Global Workflow project test comprehensively across diverse HPC environments while staying maintainable and extensible to future computing platforms.