Monitoring your HPC CI/CD infrastructure with GitLab

Prometheus metrics from GitLab Runners are a practical way to monitor your HPC CI/CD infrastructure. This page breaks down how the setup works and how it applies to HPC systems.

GitLab Runner Prometheus Metrics

Basic Setup

In your config.toml:

listen_address = ":9252"  # Enables metrics endpoint

[runners](/TerrenceMcGuinness-NOAA/global-workflow/wiki/runners)
  name = "gaea68"
  # ... your existing config

This exposes metrics at http://gaea68:9252/metrics
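If a Prometheus server inside (or reachable from) the HPC network can see that address, a minimal scrape configuration is enough to start collecting; the job name and target below are placeholders for this setup:

# prometheus.yml (sketch): scrape the runner's metrics endpoint
scrape_configs:
  - job_name: "gitlab-runner-gaea"
    static_configs:
      - targets: ["gaea68:9252"]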

Key Metrics for HPC Monitoring

# Job execution metrics
gitlab_runner_jobs_total{runner="gaea68",state="success"}
gitlab_runner_jobs_total{runner="gaea68",state="failed"}
gitlab_runner_job_duration_seconds{runner="gaea68"}

# Resource utilization
gitlab_runner_concurrent_requests{runner="gaea68"}
gitlab_runner_limit{runner="gaea68"}

# System health
gitlab_runner_version_info{runner="gaea68",version="16.5.0"}
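With these being scraped, simple PromQL gives a quick health signal; for example, the failure fraction per runner over the last hour (assuming the counters above and their state label are available in your runner version):

# Fraction of jobs that failed in the last hour, per runner
sum by (runner) (increase(gitlab_runner_jobs_total{state="failed"}[1h]))
  / sum by (runner) (increase(gitlab_runner_jobs_total[1h]))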

HPC-Specific Considerations

Networking Challenges

Most HPC systems have restricted networking:

Gaea likely has limited outbound access, so you would need to either:

1. Run a Prometheus server within the HPC network
2. Use a jump host/proxy for external monitoring (see the tunnel sketch below)
3. Export metrics to a shared filesystem
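For the jump host option, one sketch (hostnames and the user are placeholders, and it assumes SSH access is permitted) is to tunnel the metrics port out to the monitoring host:

# Run on the external monitoring host: forward a local port to the runner's
# metrics endpoint via an HPC login node
ssh -N -L 9252:gaea68:9252 user@gaea-login-node
# Prometheus can then scrape http://localhost:9252/metrics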

Better HPC Monitoring Approach

Since external Prometheus servers often can't scrape endpoints inside the HPC network (and compute nodes frequently can't push out either), consider:

1. File-based Metrics Export

GitLab Runner only exposes its Prometheus metrics over the HTTP listener (listen_address); it does not write them to files. A file-based workaround is to have CI jobs write Prometheus text-format (.prom) files to a shared directory such as /gpfs/f6/drsa-precip3/world-shared/global/CI/metrics and let an agent that Prometheus can reach serve them.

2. Custom Metrics Collection

# In your .gitlab-ci.yml
after_script:
  - |
    # Export custom HPC metrics (GitLab sets CI_JOB_STARTED_AT, CI_JOB_NAME, CI_JOB_STATUS)
    job_duration=$(( $(date +%s) - $(date -d "$CI_JOB_STARTED_AT" +%s) ))
    echo "gfs_job_duration_seconds{job=\"$CI_JOB_NAME\"} $job_duration" >> /shared/metrics/gfs_metrics.prom
    echo "gfs_job_status{job=\"$CI_JOB_NAME\",status=\"$CI_JOB_STATUS\"} 1" >> /shared/metrics/gfs_metrics.prom

    # SLURM-specific metrics
    if [ -n "$SLURM_JOB_ID" ]; then
      sacct -j "$SLURM_JOB_ID" --format=JobID,Elapsed,MaxRSS,MaxVMSize --parsable2 >> /shared/metrics/slurm_metrics.csv
    fi
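To get those SLURM numbers into Prometheus as well, one option is a small post-processing step; a rough sketch (paths and metric names are illustrative, and D-HH:MM:SS elapsed values for day-long jobs are not handled):

# Turn "JobID|Elapsed|MaxRSS|MaxVMSize" rows into Prometheus text format
awk -F'|' '$1 != "JobID" && $2 ~ /^[0-9]+:[0-9]+:[0-9]+$/ {
  split($2, t, ":")
  printf "slurm_job_elapsed_seconds{jobid=\"%s\"} %d\n", $1, t[1]*3600 + t[2]*60 + t[3]
}' /shared/metrics/slurm_metrics.csv > /shared/metrics/slurm_metrics.prom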

3. HPC-Native Monitoring Integration

Many HPC centers use tools like:

  • Ganglia - Often already deployed on HPC systems
  • Nagios/Icinga - For infrastructure monitoring
  • SLURM accounting - Job resource tracking

You could integrate GitLab metrics with these:

# Export to SLURM accounting
scontrol update JobId=$SLURM_JOB_ID Comment="GitLab_Pipeline_ID=$CI_PIPELINE_ID"

# Export to Ganglia (JOB_DURATION computed by your job script)
gmetric --name="gitlab_job_duration" --value=$JOB_DURATION --type=uint32 --units=seconds

4. Custom Dashboard for HPC

Create HPC-specific dashboards showing:

# Queries relevant to weather modeling
rate(gitlab_runner_jobs_total{runner=~"hera|gaea.*"}[5m])
histogram_quantile(0.95, rate(gitlab_runner_job_duration_seconds_bucket{job=~".*forecast.*"}[5m]))

# SLURM integration metrics (these require a separate SLURM Prometheus exporter)
slurm_job_queue_time_seconds{partition="batch"}
slurm_node_utilization{cluster="gaea"}

Practical Implementation for Your Setup

Phase 1: Local Metrics

# config.toml addition
listen_address = "127.0.0.1:9252"  # Localhost only for security

Phase 2: Custom Weather Model Metrics

# .gitlab-ci.yml
variables:
  METRICS_DIR: "/gpfs/f6/drsa-precip3/world-shared/global/CI/metrics"

gfs_forecast:
  script:
    - start_time=$(date +%s)
    - ./run_gfs_forecast.sh
    - end_time=$(date +%s)
    - duration=$((end_time - start_time))
    - echo "gfs_forecast_duration_seconds $duration" >> $METRICS_DIR/gfs.prom

Phase 3: Integration with HPC Monitoring

Work with your HPC system administrators to:

  • Expose metrics through existing monitoring infrastructure
  • Set up alerting for failed CI jobs (see the example rule after this list)
  • Create dashboards showing CI health alongside system health
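For the alerting item, a hedged starting point is a Prometheus rule like the sketch below (group name, threshold, and window are placeholders; it assumes the runner job counters from earlier are being scraped):

# Alert rule (sketch): fire when a runner reports failed jobs in the last 15 minutes
groups:
  - name: gitlab-ci-hpc
    rules:
      - alert: GitLabRunnerJobFailures
        expr: increase(gitlab_runner_jobs_total{state="failed"}[15m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CI job failures on runner {{ $labels.runner }}"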

This gives you powerful insights into your CI/CD performance on HPC systems while working within the constraints of those environments.