MPMD & MPI Runtime Infrastructure in Global-Workflow
Document Version: 1.0
Date: January 30, 2026
Authors: MCP/RAG Analysis System
Repository: NOAA-EMC/global-workflow
Table of Contents
- Executive Summary
- Architecture Overview
- Core MPMD Script
- HPC Platform Configurations
- MPI Runtime Environment Details
- Job Resource Configuration Chain
- MPMD Use Cases
- Network Fabric and Interconnect
- MPI Tuning Parameters
- Command File Format Transformation
- Troubleshooting Guide
Executive Summary
The Global Workflow implements a sophisticated Multiple-Program, Multiple-Data (MPMD) execution framework that enables parallel execution of heterogeneous tasks across diverse HPC platforms. This infrastructure abstracts away platform-specific MPI implementations, allowing identical workflow scripts to run on Slurm-based R&D systems (Hera, Hercules, Orion, Gaea) and PBS-based production systems (WCOSS2) without modification.
Key Capabilities
- Platform Abstraction: Single codebase supports 11+ HPC platforms
- Dual Launcher Support: Slurm `srun --multi-prog` and PBS `mpiexec cfp`
- Dynamic Resource Binding: Automatic thread/task configuration per job step
- Fault Tolerance: Per-rank output capture for debugging failed tasks
Architecture Overview
                       Global Workflow MPI Architecture
───────────────────────────────────────────────────────────────────────────────

  Job Script (ex*.sh)
       │
       ▼
  ┌──────────────────┐     ┌───────────────────┐     ┌──────────────────────┐
  │ config.resources │────►│ ${MACHINE}.env    │────►│ run_mpmd.sh          │
  │ (ntasks, nodes)  │     │ (launcher, opts)  │     │ (CFP orchestration)  │
  └──────────────────┘     └───────────────────┘     └──────────────────────┘
                                     │                          │
                                     ▼                          ▼
                             ┌────────────────────────────────────────┐
                             │            MPI Runtime Layer           │
                             ├────────────────────────────────────────┤
                             │  Slurm │ srun --multi-prog             │
                             │  PBS   │ mpiexec cfp                   │
                             └────────────────────────────────────────┘
Component Hierarchy
global-workflow/
├── ush/
│   └── run_mpmd.sh                      # Core MPMD orchestration script
├── env/
│   ├── HERA.env                         # Hera MPI configuration
│   ├── HERCULES.env                     # Hercules MPI configuration
│   ├── ORION.env                        # Orion MPI configuration
│   ├── URSA.env                         # Ursa MPI configuration
│   ├── WCOSS2.env                       # WCOSS2 MPI configuration (production)
│   ├── GAEAC5.env                       # Gaea C5 MPI configuration
│   ├── GAEAC6.env                       # Gaea C6 MPI configuration
│   ├── AWSPW.env                        # AWS ParallelWorks configuration
│   ├── AZUREPW.env                      # Azure ParallelWorks configuration
│   ├── GOOGLEPW.env                     # Google Cloud ParallelWorks configuration
│   └── CONTAINER.env                    # Container environment
├── dev/
│   ├── parm/config/
│   │   └── gfs/
│   │       ├── config.resources             # Base resource definitions
│   │       ├── config.resources.HERA        # Hera overrides
│   │       ├── config.resources.WCOSS2      # WCOSS2 overrides
│   │       └── ...
│   └── workflow/
│       ├── hosts.py                     # Python Host class
│       └── hosts/
│           ├── hera.yaml                # Hera host configuration
│           ├── wcoss2.yaml              # WCOSS2 host configuration
│           └── ...
└── modulefiles/
    ├── gw_run.hera.lua                  # Hera runtime modules
    ├── gw_run.wcoss2.lua                # WCOSS2 runtime modules
    └── ...
Core MPMD Script: ush/run_mpmd.sh
The run_mpmd.sh script is the central orchestrator for MPMD execution. It provides a unified interface that adapts to the underlying MPI implementation.
Script Location
ush/run_mpmd.sh
Author
Rahul Mahajan (NCEP/EMC)
Environment Variables
| Variable | Description | Default |
|---|---|---|
| `USE_CFP` | Enable MPMD mode (`YES`) or serial execution (`NO`) | `NO` |
| `launcher` | MPI launcher command | Empty (serial) |
| `mpmd_opt` | Launcher-specific MPMD options | Empty |
| `DATA` | Working directory for temporary files | Required |
Execution Flow
#!/usr/bin/env bash
# Simplified logic flow
cmdfile=$1

# Serial fallback if CFP disabled
if [[ "${USE_CFP:-}" != "YES" ]]; then
  bash "${cmdfile}"
  exit $?
fi

# Force single-threaded execution within MPMD ranks
export OMP_NUM_THREADS=1

nprocs=$(wc -l < "${cmdfile}")
mpmd_cmdfile="${DATA}/mpmd_cmdfile"

if [[ "${launcher:-}" =~ ^srun.* ]]; then
  # SLURM: prepend rank numbers for --multi-prog
  nm=0
  while IFS= read -r line; do
    echo "${nm} ${line}" >> "${mpmd_cmdfile}"
    ((nm++))
  done < "${cmdfile}"
  ${launcher} ${mpmd_opt} -n "${nprocs}" "${mpmd_cmdfile}"
elif [[ "${launcher:-}" =~ ^mpiexec.* ]]; then
  # PBS: wrap each command with per-rank output redirection
  echo "#!/bin/bash" > "${mpmd_cmdfile}"
  nm=0
  while IFS= read -r line; do
    echo "${line} > mpmd.${nm}.out 2>&1" >> "${mpmd_cmdfile}"
    ((nm++))
  done < "${cmdfile}"
  chmod 755 "${mpmd_cmdfile}"
  ${launcher} -np "${nprocs}" ${mpmd_opt} "${mpmd_cmdfile}"
else
  # Unsupported launcher: fall back to serial execution
  bash "${cmdfile}"
fi
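A minimal sketch of the calling pattern used by the ex-scripts listed later in this document; `USE_CFP`, `DATA`, and `USHgfs` follow the conventions shown above, while the worker script name is hypothetical:

```bash
# Build a command file (one command per line) and hand it to run_mpmd.sh
export USE_CFP="YES"                 # request MPMD execution
cmdfile="${DATA}/cmdfile"
rm -f "${cmdfile}"

for grid in 0p25 0p50 1p00; do
  # do_something_per_grid.sh is a hypothetical worker script
  echo "${USHgfs}/do_something_per_grid.sh ${grid}" >> "${cmdfile}"
done

"${USHgfs}/run_mpmd.sh" "${cmdfile}"
export err=$?
```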
HPC Platform Configurations
Platform Matrix
| Platform | Scheduler | MPI Library | Launcher Command | MPMD Options | CFP Support |
|---|---|---|---|---|---|
| WCOSS2 | PBS Pro | Cray MPICH | `mpiexec -l` | `--cpu-bind verbose,core cfp` | ✅ Native |
| Hera | Slurm | Intel MPI | `srun -l --export=ALL --hint=nomultithread` | `--multi-prog --output=mpmd.%j.%t.out` | ✅ |
| Hercules | Slurm | Intel MPI | `srun -l --export=ALL --hint=nomultithread` | `--multi-prog --output=mpmd.%j.%t.out` | ✅ |
| Orion | Slurm | Intel MPI | `srun -l --export=ALL --hint=nomultithread` | `--multi-prog --output=mpmd.%j.%t.out` | ✅ |
| Ursa | Slurm | Intel MPI | `srun -l --export=ALL --hint=nomultithread` | `--multi-prog --output=mpmd.%j.%t.out` | ✅ |
| Gaea C5 | Slurm | Cray MPICH | `srun -l --export=ALL --distribution=block:block` | `--multi-prog --output=mpmd.%j.%t.out` | ✅ |
| Gaea C6 | Slurm | Cray MPICH | `srun -l --export=ALL --distribution=block:block` | `--multi-prog --output=mpmd.%j.%t.out` | ✅ |
| AWS PW | Slurm | PMI2 | `srun -l --export=ALL --hint=nomultithread --mpi=pmi2` | `--multi-prog --output=mpmd.%j.%t.out` | ✅ |
| Azure PW | Slurm | PMI2 | `srun -l --export=ALL --hint=nomultithread` | `--multi-prog --output=mpmd.%j.%t.out` | ✅ |
| Google PW | Slurm | PMI2 | `srun -l --export=ALL --hint=nomultithread` | `--multi-prog --output=mpmd.%j.%t.out` | ✅ |
Supported Resolutions by Platform
All Tier 1 and Tier 2 platforms support:
- C1152 (~9 km)
- C768 (~13 km)
- C384 (~25 km)
- C192 (~50 km)
- C96 (~100 km)
- C48 (~200 km)
MPI Runtime Environment Details
WCOSS2 (NCO Production)
File: env/WCOSS2.env
export launcher="mpiexec -l"
export mpmd_opt="--cpu-bind verbose,core cfp"
Module Stack: (modulefiles/gw_run.wcoss2.lua)
load("PrgEnv-intel") -- Intel programming environment
load("craype") -- Cray programming environment
load("intel") -- Intel compilers
load("cray-mpich") -- Cray MPI implementation
load("cray-pals") -- Parallel Application Launch Service
load("cfp") -- Coupled Framework Parallelism
setenv("USE_CFP","YES") -- Enable CFP by default
Key MPI Tuning:
export OMP_PLACES=cores
export OMP_STACKSIZE=1G
export FI_OFI_RXM_SAR_LIMIT=3145728
export MPICH_MPIIO_HINTS="*:romio_cb_write=enable"
export FI_OFI_RXM_RX_SIZE=40000
export FI_OFI_RXM_TX_SIZE=40000
Hera (NOAA RDHPCS)
File: env/HERA.env
export launcher="srun -l --export=ALL --hint=nomultithread"
export mpmd_opt="--multi-prog --output=mpmd.%j.%t.out"
Key MPI Tuning:
export MPI_BUFS_PER_PROC=2048
export MPI_BUFS_PER_HOST=2048
export MPI_GROUP_MAX=256
export MPI_MEMMAP_OFF=1
export MP_STDOUTMODE="ORDERED"
export KMP_AFFINITY=scatter
export OMP_STACKSIZE=2048000
export I_MPI_EXTRA_FILESYSTEM=1
export I_MPI_EXTRA_FILESYSTEM_LIST=lustre
Hercules (MSU HPC)
File: env/HERCULES.env
export launcher="srun -l --export=ALL --hint=nomultithread"
export mpmd_opt="--multi-prog --output=mpmd.%j.%t.out"
Hercules uses the same Intel MPI tuning as Hera, including the Lustre filesystem optimizations.
Gaea C6 (RDHPCS Cray)
File: env/GAEAC6.env
export launcher="srun -l --export=ALL --distribution=block:block"
export mpmd_opt="--multi-prog --output=mpmd.%j.%t.out"
# Cray Slingshot tuning
export FI_CXI_RX_MATCH_MODE=hybrid
# For large jobs with collective issues:
# export FI_CXI_RX_MATCH_MODE=software
AWS ParallelWorks (Cloud)
File: env/AWSPW.env
export launcher="srun -l --export=ALL --hint=nomultithread --mpi=pmi2"
export mpmd_opt="--multi-prog --output=mpmd.%j.%t.out"
Note the --mpi=pmi2 flag for PMI2 process management on cloud Slurm clusters.
Job Resource Configuration Chain
Three-Level Configuration
1. Base Resources (`dev/parm/config/gfs/config.resources`)
2. Platform Overrides (`dev/parm/config/gfs/config.resources.${MACHINE}`)
3. Environment Binding (`env/${MACHINE}.env`)
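Conceptually, the chain resolves as in the sketch below (an illustrative, hand-run approximation for the `anal` step on Hera; in practice the workflow's job setup sources these files with the proper arguments):

```bash
# Hypothetical illustration of the three-level chain for the "anal" step on Hera
source dev/parm/config/gfs/config.resources anal    # base: ntasks, threads_per_task, ...
source dev/parm/config/gfs/config.resources.HERA    # platform overrides keyed on ${step}
source env/HERA.env anal                            # binds launcher, mpmd_opt, APRUN_* variables

echo "${APRUN_GSI}"
# srun -l --export=ALL --hint=nomultithread -n 270 --cpus-per-task=8
```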
Resource Variables
| Variable | Description | Example |
|---|---|---|
| `ntasks` | Total MPI tasks | 480 |
| `tasks_per_node` | Tasks per compute node | 40 |
| `max_tasks_per_node` | Maximum tasks per node (hardware) | 40 |
| `threads_per_task` | OpenMP threads per MPI task | 1 |
| `memory` | Memory requirement | "96GB" |
APRUN Variables by Job Step
| Step | APRUN Variable | Typical Configuration |
|---|---|---|
| `fcst` | `APRUN_UFS` | `${launcher} -n ${ufs_ntasks}` |
| `anal` | `APRUN_GSI` | `${APRUN_default} --cpus-per-task=${NTHREADS_GSI}` |
| `eupd` | `APRUN_ENKF` | `${launcher} -n ${ntasks_enkf} --cpus-per-task=${NTHREADS_ENKF}` |
| `upp` | `APRUN_UPP` | `${APRUN_default} --cpus-per-task=${NTHREADS_UPP}` |
| CFP jobs | `APRUNCFP` | `${launcher} -n $ncmd ${mpmd_opt}` |
Example: Hera C384 Analysis Configuration
# config.resources.HERA
case ${step} in
"anal")
export ntasks=270
export threads_per_task=8
export tasks_per_node=$(( max_tasks_per_node / threads_per_task ))
;;
esac
# HERA.env
export NTHREADS_GSI=${NTHREADSmax}
export APRUN_GSI="${APRUN_default} --cpus-per-task=${NTHREADS_GSI}"
# Result: srun -l --export=ALL --hint=nomultithread -n 270 --cpus-per-task=8
MPMD Use Cases in Global-Workflow
Scripts Using MPMD
| Script | MPMD Purpose | Parallel Tasks |
|---|---|---|
| `exglobal_atmos_analysis.sh` | GSI observation processing | Multiple obs types |
| `exglobal_diag.sh` | Diagnostic file generation | Parallel file I/O |
| `exgdas_atmos_chgres_forenkf.sh` | EnKF regridding | Per-member regrid |
| `exglobal_atmos_ensstat.sh` | Ensemble statistics | Parallel stat calc |
| `exgfs_wave_init.sh` | Wave grid definition | Multiple wave grids |
| `exgfs_wave_prep.sh` | Wave boundary prep | Current generation |
| `exgfs_wave_post_gridded_sbs.sh` | Wave post-processing | Grid output |
| `exgfs_wave_post_pnt.sh` | Point output processing | Spectral extraction |
| `exglobal_atmos_products.sh` | GRIB2 products | Parallel wgrib2 |
Example: Wave Initialization MPMD
# exgfs_wave_init.sh
for grdID in ${wavegrids}; do
echo "${USHgfs}/wave_grid_moddef.sh ${grdID}" >> mpmd_script
done
"${USHgfs}/run_mpmd.sh" "${DATA}/mpmd_script"
Example: Atmospheric Products MPMD
# exglobal_atmos_products.sh
export USE_CFP="YES"
# Break tmpfile into processor-specific chunks
split -n "l/${nprocs}" -d "${tmpfile}" "${DATA}/cmdfile_"
# Run with MPMD
"${USHgfs}/run_mpmd.sh" "${DATA}/cmdfile"
Network Fabric and Interconnect
| Platform | Network Fabric | Bandwidth | MPI Transport |
|---|---|---|---|
| WCOSS2 | HPE Slingshot | 200 Gb/s | Cray MPICH/libfabric |
| Hera | InfiniBand HDR | 200 Gb/s | Intel MPI/Verbs |
| Hercules | InfiniBand | 100 Gb/s | Intel MPI/Verbs |
| Orion | InfiniBand | 100 Gb/s | Intel MPI/Verbs |
| Gaea C5/C6 | Slingshot 11 | 200 Gb/s | Cray MPICH/CXI |
| AWS PW | EFA | 100-400 Gb/s | PMI2/libfabric |
Filesystem Integration
| Platform | Parallel Filesystem | MPI I/O Tuning |
|---|---|---|
| WCOSS2 | GPFS | romio_cb_write=enable |
| Hera | Lustre | I_MPI_EXTRA_FILESYSTEM_LIST=lustre |
| Hercules | Lustre | I_MPI_EXTRA_FILESYSTEM_LIST=lustre |
| Gaea | Lustre | FI_CXI_RX_MATCH_MODE=hybrid |
MPI Tuning Parameters
Intel MPI (Hera, Hercules, Orion)
# GSI-specific bug workaround
export I_MPI_ADJUST_ALLREDUCE=5
# MKL threading
export MKL_NUM_THREADS=4
export MKL_CBWR=AUTO
# Buffer tuning
export MPI_BUFS_PER_PROC=2048
export MPI_BUFS_PER_HOST=2048
export MPI_GROUP_MAX=256
Cray MPICH (WCOSS2, Gaea)
# Thread placement
export OMP_PLACES=cores
export OMP_STACKSIZE=1G
# libfabric tuning
export FI_OFI_RXM_SAR_LIMIT=3145728
export FI_OFI_RXM_RX_SIZE=40000
export FI_OFI_RXM_TX_SIZE=40000
# Collective I/O
export MPICH_MPIIO_HINTS="*:romio_cb_write=enable"
Stack Size Configuration
# All platforms
ulimit -s unlimited
# OpenMP stack
export OMP_STACKSIZE=2048000 # Hera/Hercules
export OMP_STACKSIZE=1G # WCOSS2
export OMP_STACKSIZE=2048M # Forecast jobs
Command File Format Transformation
Input Format (Generic)
/path/to/command1 arg1 arg2
/path/to/command2 arg1 arg2
/path/to/command3 arg1 arg2
Slurm --multi-prog Format
0 /path/to/command1 arg1 arg2
1 /path/to/command2 arg1 arg2
2 /path/to/command3 arg1 arg2
Each line prefix indicates the MPI rank that executes that command.
PBS cfp Format
#!/bin/bash
/path/to/command1 arg1 arg2 > mpmd.0.out 2>&1
/path/to/command2 arg1 arg2 > mpmd.1.out 2>&1
/path/to/command3 arg1 arg2 > mpmd.2.out 2>&1
Output is redirected per-rank for debugging.
Troubleshooting Guide
Common Issues
1. MPMD Tasks Not Running in Parallel
Symptom: Tasks execute serially despite MPMD configuration
Check:
echo "USE_CFP=${USE_CFP}"
echo "launcher=${launcher}"
Solution: Ensure USE_CFP=YES is set and launcher is properly configured.
2. Task Failures with No Output
Symptom: MPMD job fails but no error messages
Check: Look for per-rank output files:
ls -la mpmd.*.out
cat mpmd.0.out
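A quick scan of the per-rank files can surface the failing ranks (an illustrative sketch; adjust the error pattern to the application):

```bash
# Print the tail of any per-rank output that mentions a common failure string
for f in mpmd.*.out; do
  if grep -Eiq "error|fail|abort|segfault" "${f}"; then
    echo "=== ${f} ==="
    tail -n 20 "${f}"
  fi
done
```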
3. Memory Errors on MPMD Jobs
Symptom: OOM killer or memory allocation failures
Solution: Reduce tasks per node:
export tasks_per_node=20 # Instead of 40
4. Slurm Multi-Prog Syntax Errors
Symptom: Invalid job configuration
Check: Ensure command file has no blank lines and correct format:
cat ${DATA}/mpmd_cmdfile
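Two quick sanity checks on the generated file (illustrative; the path follows the `run_mpmd.sh` sketch earlier in this document):

```bash
# Blank lines are rejected by srun --multi-prog
grep -n '^[[:space:]]*$' "${DATA}/mpmd_cmdfile" && echo "ERROR: blank line(s) found"

# Every line must begin with an integer rank number
awk '$1 !~ /^[0-9]+$/ {print "Bad rank prefix on line " NR ": " $0}' "${DATA}/mpmd_cmdfile"
```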
5. CFP Not Found on WCOSS2
Symptom: cfp: command not found
Solution: Load the CFP module:
module load cfp
Debug Mode
Enable verbose output:
export MPMD_DEBUG=1
set -x
"${USHgfs}/run_mpmd.sh" "${DATA}/cmdfile"
Appendix A: Complete Environment File Reference
HERA.env Key Settings
export launcher="srun -l --export=ALL --hint=nomultithread"
export mpmd_opt="--multi-prog --output=mpmd.%j.%t.out"
export OMP_STACKSIZE=2048000
export NTHSTACK=1024000000
WCOSS2.env Key Settings
export launcher="mpiexec -l"
export mpmd_opt="--cpu-bind verbose,core cfp"
GAEAC6.env Key Settings
export launcher="srun -l --export=ALL --distribution=block:block"
export mpmd_opt="--multi-prog --output=mpmd.%j.%t.out"
export FI_CXI_RX_MATCH_MODE=hybrid
Appendix B: Host YAML Configuration
hera.yaml
SCHEDULER: slurm
QUEUE: batch
PARTITION_BATCH: hera
PARTITION_SERVICE: service
SUPPORTED_RESOLUTIONS: ['C1152', 'C768', 'C384', 'C192', 'C96', 'C48']
wcoss2.yaml
SCHEDULER: pbspro
QUEUE: 'dev'
QUEUE_SERVICE: 'dev_transfer'
SUPPORTED_RESOLUTIONS: ['C1152', 'C768', 'C384', 'C192', 'C96', 'C48']
Document generated by MCP/RAG Analysis System
Last updated: January 30, 2026