# Monitoring your HPC CI/CD infrastructure with GitLab
Prometheus metrics for GitLab Runners can be incredibly useful for monitoring your HPC CI/CD infrastructure! Let me break down how this works and its applicability to HPC systems.
## GitLab Runner Prometheus Metrics

### Basic Setup

In your `config.toml`:
```toml
listen_address = ":9252"  # enables the Prometheus metrics endpoint

[[runners]]
  name = "gaea68"
  # ... your existing config
```
This exposes metrics at `http://gaea68:9252/metrics`.
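If a Prometheus server inside the network can reach that port, a minimal scrape job is enough to start collecting these metrics. A sketch, assuming the server can resolve and reach `gaea68:9252` (the job name is arbitrary):

```yaml
# prometheus.yml (sketch)
scrape_configs:
  - job_name: hpc-gitlab-runners      # illustrative name
    scrape_interval: 30s
    static_configs:
      - targets: ["gaea68:9252"]      # runner metrics endpoint from config.toml
```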
### Key Metrics for HPC Monitoring

```
# Job execution metrics
gitlab_runner_jobs_total{runner="gaea68",state="success"}
gitlab_runner_jobs_total{runner="gaea68",state="failed"}
gitlab_runner_job_duration_seconds{runner="gaea68"}

# Resource utilization
gitlab_runner_concurrent_requests{runner="gaea68"}
gitlab_runner_limit{runner="gaea68"}

# System health
gitlab_runner_version_info{runner="gaea68",version="16.5.0"}
```
## HPC-Specific Considerations

### Networking Challenges

Most HPC systems have restricted networking:
Gaea likely has limited outbound access, so you would need to do one of the following:

1. Run a Prometheus server within the HPC network
2. Use a jump host/proxy for external monitoring (see the tunnel sketch after this list)
3. Export metrics to a shared filesystem
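For option 2, a simple pattern is an SSH tunnel from a host the Prometheus server can reach, through a login node, to the runner's metrics port. A sketch with hypothetical host names (`monitoring-host`, `gaea-login`):

```bash
# Run on monitoring-host: forward its local port 9252 to gaea68:9252 via the login node.
ssh -N -L 9252:gaea68:9252 user@gaea-login &

# Prometheus on monitoring-host can then scrape http://localhost:9252/metrics
curl -s http://localhost:9252/metrics | head
```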
### Better HPC Monitoring Approach

Since HPC systems often can't reach external Prometheus servers, consider:
#### 1. File-based Metrics Export

GitLab Runner itself does not offer a file-based metrics target in `config.toml`, so the practical equivalent is a small wrapper that scrapes the runner's local metrics endpoint on a schedule and writes the output to the shared filesystem, e.g. `/gpfs/f6/drsa-precip3/world-shared/global/CI/metrics` (see the sketch below).
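A minimal sketch of that wrapper, assuming the `:9252` endpoint from the config above and run periodically from cron on the runner host (the script name and schedule are placeholders):

```bash
#!/bin/bash
# export_runner_metrics.sh (hypothetical name)
# Copies the runner's Prometheus metrics snapshot to the shared filesystem.
METRICS_DIR=/gpfs/f6/drsa-precip3/world-shared/global/CI/metrics
mkdir -p "$METRICS_DIR"

# Write to a temp file first so readers never see a partially written file.
curl -s http://localhost:9252/metrics > "$METRICS_DIR/gitlab_runner.prom.tmp" &&
  mv "$METRICS_DIR/gitlab_runner.prom.tmp" "$METRICS_DIR/gitlab_runner.prom"
```

A cron entry such as `*/5 * * * * /path/to/export_runner_metrics.sh` would keep the snapshot reasonably fresh.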
#### 2. Custom Metrics Collection

```yaml
# In your .gitlab-ci.yml
after_script:
  - |
    # Export custom HPC metrics ($(date +%s) is the epoch time when the job finishes)
    echo "gfs_job_completed_timestamp_seconds $(date +%s)" >> /shared/metrics/gfs_metrics.prom
    echo "gfs_job_status{job=\"$CI_JOB_NAME\",status=\"$CI_JOB_STATUS\"} 1" >> /shared/metrics/gfs_metrics.prom

    # SLURM-specific metrics
    if [ -n "$SLURM_JOB_ID" ]; then
      sacct -j $SLURM_JOB_ID --format=JobID,Elapsed,MaxRSS,MaxVMSize --parsable2 >> /shared/metrics/slurm_metrics.csv
    fi
```
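If `node_exporter` already runs on the system, its textfile collector can expose the `.prom` files written above; the directory flag just has to match the path the jobs write to (the CSV from `sacct` would still need separate processing):

```bash
# Point node_exporter's textfile collector at the shared metrics directory
node_exporter --collector.textfile.directory=/shared/metrics
```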
#### 3. HPC-Native Monitoring Integration

Many HPC centers use tools like:
- Ganglia - Often already deployed on HPC systems
- Nagios/Icinga - For infrastructure monitoring
- SLURM accounting - Job resource tracking
You could integrate GitLab metrics with these:
```bash
# Export to SLURM accounting
scontrol update JobId=$SLURM_JOB_ID Comment="GitLab_Pipeline_ID=$CI_PIPELINE_ID"

# Export to Ganglia
gmetric --name="gitlab_job_duration" --value=$JOB_DURATION --type=uint32 --units=seconds
```
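Once the comment is set, the pipeline ID can be read back from SLURM when correlating jobs with pipelines:

```bash
# Show the GitLab pipeline ID stored in the SLURM job's comment field
scontrol show job $SLURM_JOB_ID | grep -i comment
```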
#### 4. Custom Dashboard for HPC

Create HPC-specific dashboards showing:
```
# Queries relevant to weather modeling
rate(gitlab_runner_jobs_total{runner=~"hera|gaea.*"}[5m])
histogram_quantile(0.95, rate(gitlab_runner_job_duration_seconds_bucket{job=~".*forecast.*"}[5m]))

# SLURM integration metrics
slurm_job_queue_time_seconds{partition="batch"}
slurm_node_utilization{cluster="gaea"}
```
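To keep such dashboards fast, the heavier expressions can be precomputed with Prometheus recording rules. A sketch (group and rule names are illustrative):

```yaml
# recording_rules.yml (sketch)
groups:
  - name: hpc_ci_dashboards
    rules:
      - record: hpc_ci:runner_jobs:rate5m
        expr: rate(gitlab_runner_jobs_total{runner=~"hera|gaea.*"}[5m])
      - record: hpc_ci:forecast_duration_seconds:p95
        expr: histogram_quantile(0.95, rate(gitlab_runner_job_duration_seconds_bucket{job=~".*forecast.*"}[5m]))
```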
## Practical Implementation for Your Setup

### Phase 1: Local Metrics

```toml
# config.toml addition
listen_address = "127.0.0.1:9252"  # localhost only for security
```
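With the endpoint bound to localhost, a quick check from the runner host confirms it is serving metrics:

```bash
# Run on the runner host itself
curl -s http://127.0.0.1:9252/metrics | grep gitlab_runner_version_info
```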
### Phase 2: Custom Weather Model Metrics

```yaml
# .gitlab-ci.yml
variables:
  METRICS_DIR: "/gpfs/f6/drsa-precip3/world-shared/global/CI/metrics"

gfs_forecast:
  script:
    - start_time=$(date +%s)
    - ./run_gfs_forecast.sh
    - end_time=$(date +%s)
    - duration=$((end_time - start_time))
    - echo "gfs_forecast_duration_seconds $duration" >> $METRICS_DIR/gfs.prom
```
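One caveat with appending to `gfs.prom` across many pipeline runs: most scrapers expect a single current value per series, so it is safer to overwrite the file atomically. A sketch using the same `METRICS_DIR`:

```bash
# Write the metric to a temp file, then move it into place atomically
tmpfile="$METRICS_DIR/gfs.prom.$$"
echo "gfs_forecast_duration_seconds $duration" > "$tmpfile"
mv "$tmpfile" "$METRICS_DIR/gfs.prom"
```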
### Phase 3: Integration with HPC Monitoring

Work with your HPC system administrators to:
- Expose metrics through existing monitoring infrastructure
- Set up alerting for failed CI jobs (see the example rule below)
- Create dashboards showing CI health alongside system health
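For the alerting item above, a Prometheus alert rule along these lines is a reasonable starting point (the rule name and window are assumptions, and the exact labels on the counter depend on your runner version):

```yaml
# alerts.yml (sketch)
groups:
  - name: hpc_ci_alerts
    rules:
      - alert: GitLabCIJobFailures        # hypothetical alert name
        expr: increase(gitlab_runner_jobs_total{state="failed"}[30m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GitLab CI jobs failing on {{ $labels.runner }}"
```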
This gives you powerful insights into your CI/CD performance on HPC systems while working within the constraints of those environments.