
MPMD & MPI Runtime Infrastructure in Global-Workflow

Document Version: 1.0
Date: January 30, 2026
Authors: MCP/RAG Analysis System
Repository: NOAA-EMC/global-workflow


Table of Contents

  1. Executive Summary
  2. Architecture Overview
  3. Core MPMD Script
  4. HPC Platform Configurations
  5. MPI Runtime Environment Details
  6. Job Resource Configuration Chain
  7. MPMD Use Cases
  8. Network Fabric and Interconnect
  9. MPI Tuning Parameters
  10. Command File Format Transformation
  11. Troubleshooting Guide

Executive Summary

The Global Workflow implements a Multiple Program, Multiple Data (MPMD) execution framework that enables parallel execution of heterogeneous tasks across diverse HPC platforms. This infrastructure abstracts away platform-specific MPI implementations, allowing identical workflow scripts to run on Slurm-based R&D systems (Hera, Hercules, Orion, Gaea) and the PBS-based production systems (WCOSS2) without modification.

Key Capabilities

  • Platform Abstraction: Single codebase supports 11+ HPC platforms
  • Dual Launcher Support: Slurm srun --multi-prog and PBS mpiexec cfp
  • Dynamic Resource Binding: Automatic thread/task configuration per job step
  • Fault Tolerance: Per-rank output capture for debugging failed tasks

Architecture Overview

┌──────────────────────────────────────────────────────────────────────────────┐
│                       Global Workflow MPI Architecture                       │
├──────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   Job Script (ex*.sh)                                                        │
│        │                                                                     │
│        ▼                                                                     │
│   ┌─────────────────┐     ┌──────────────────┐     ┌──────────────────────┐  │
│   │ config.resources│────►│   ${MACHINE}.env │────►│ run_mpmd.sh          │  │
│   │ (ntasks, nodes) │     │ (launcher, opts) │     │ (CFP orchestration)  │  │
│   └─────────────────┘     └──────────────────┘     └──────────────────────┘  │
│                                    │                         │               │
│                                    ▼                         ▼               │
│                           ┌───────────────────────────────────────────────┐  │
│                           │        MPI Runtime Layer                      │  │
│                           ├───────────────────────────────────────────────┤  │
│                           │  Slurm → srun --multi-prog                    │  │
│                           │  PBS   → mpiexec cfp                          │  │
│                           └───────────────────────────────────────────────┘  │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘

Component Hierarchy

global-workflow/
├── ush/
│   └── run_mpmd.sh              # Core MPMD orchestration script
├── env/
│   ├── HERA.env                 # Hera MPI configuration
│   ├── HERCULES.env             # Hercules MPI configuration
│   ├── ORION.env                # Orion MPI configuration
│   ├── URSA.env                 # Ursa MPI configuration
│   ├── WCOSS2.env               # WCOSS2 MPI configuration (production)
│   ├── GAEAC5.env               # Gaea C5 MPI configuration
│   ├── GAEAC6.env               # Gaea C6 MPI configuration
│   ├── AWSPW.env                # AWS ParallelWorks configuration
│   ├── AZUREPW.env              # Azure ParallelWorks configuration
│   ├── GOOGLEPW.env             # Google Cloud ParallelWorks configuration
│   └── CONTAINER.env            # Container environment
├── dev/
│   ├── parm/config/
│   │   └── gfs/
│   │       ├── config.resources          # Base resource definitions
│   │       ├── config.resources.HERA     # Hera overrides
│   │       ├── config.resources.WCOSS2   # WCOSS2 overrides
│   │       └── ...
│   └── workflow/
│       ├── hosts.py                      # Python Host class
│       └── hosts/
│           ├── hera.yaml                 # Hera host configuration
│           ├── wcoss2.yaml               # WCOSS2 host configuration
│           └── ...
└── modulefiles/
    ├── gw_run.hera.lua           # Hera runtime modules
    ├── gw_run.wcoss2.lua         # WCOSS2 runtime modules
    └── ...

Core MPMD Script: ush/run_mpmd.sh

The run_mpmd.sh script is the central orchestrator for MPMD execution. It provides a unified interface that adapts to the underlying MPI implementation.

Script Location

ush/run_mpmd.sh

Author

Rahul Mahajan (NCEP/EMC)

Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| USE_CFP | Enable MPMD mode (YES) or serial execution (NO) | NO |
| launcher | MPI launcher command | Empty (serial) |
| mpmd_opt | Launcher-specific MPMD options | Empty |
| DATA | Working directory for temporary files | Required |

Execution Flow

#!/usr/bin/env bash
# Simplified logic flow

cmdfile=$1

# Serial fallback if CFP disabled
if [[ "${USE_CFP:-}" != "YES" ]]; then
    bash "${cmdfile}"
    exit $?
fi

# Force single-threaded execution for MPMD
export OMP_NUM_THREADS=1
nprocs=$(wc -l < "${cmdfile}")
mpmd_cmdfile="${DATA}/mpmd_cmdfile"   # transformed command file in the working directory

if [[ "${launcher:-}" =~ ^srun.* ]]; then
    # Slurm: prepend rank numbers (--multi-prog format)
    nm=0
    while IFS= read -r line; do
        echo "${nm} ${line}" >> "${mpmd_cmdfile}"
        ((nm++))
    done < "${cmdfile}"
    ${launcher} ${mpmd_opt} -n "${nprocs}" "${mpmd_cmdfile}"

elif [[ "${launcher:-}" =~ ^mpiexec.* ]]; then
    # PBS: wrap commands in a script with per-rank output redirection
    echo "#!/bin/bash" > "${mpmd_cmdfile}"
    nm=0
    while IFS= read -r line; do
        echo "${line} > mpmd.${nm}.out 2>&1" >> "${mpmd_cmdfile}"
        ((nm++))
    done < "${cmdfile}"
    chmod 755 "${mpmd_cmdfile}"
    ${launcher} -np "${nprocs}" ${mpmd_opt} "${mpmd_cmdfile}"

else
    # Unsupported launcher: fall back to serial execution
    bash "${cmdfile}"
fi
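
A minimal usage sketch follows. It is not taken from any particular job script: the launcher and mpmd_opt values mirror the Hera settings documented below, and the command file contents (process_obs.sh and its arguments) are hypothetical placeholders for independent serial commands.

export USE_CFP="YES"
export launcher="srun -l --export=ALL --hint=nomultithread"
export mpmd_opt="--multi-prog --output=mpmd.%j.%t.out"
export DATA="${PWD}"   # working directory for the transformed command file

# Hypothetical command file: one independent command per line
cat > "${DATA}/cmdfile" << 'EOF'
./process_obs.sh type1
./process_obs.sh type2
./process_obs.sh type3
EOF

"${USHgfs}/run_mpmd.sh" "${DATA}/cmdfile"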

HPC Platform Configurations

Platform Matrix

| Platform | Scheduler | MPI Library | Launcher Command | MPMD Options | CFP Support |
|----------|-----------|-------------|------------------|--------------|-------------|
| WCOSS2 | PBS Pro | Cray MPICH | mpiexec -l | --cpu-bind verbose,core cfp | ✅ Native |
| Hera | Slurm | Intel MPI | srun -l --export=ALL --hint=nomultithread | --multi-prog --output=mpmd.%j.%t.out | ❌ |
| Hercules | Slurm | Intel MPI | srun -l --export=ALL --hint=nomultithread | --multi-prog --output=mpmd.%j.%t.out | ❌ |
| Orion | Slurm | Intel MPI | srun -l --export=ALL --hint=nomultithread | --multi-prog --output=mpmd.%j.%t.out | ❌ |
| Ursa | Slurm | Intel MPI | srun -l --export=ALL --hint=nomultithread | --multi-prog --output=mpmd.%j.%t.out | ❌ |
| Gaea C5 | Slurm | Cray MPICH | srun -l --export=ALL --distribution=block:block | --multi-prog --output=mpmd.%j.%t.out | ❌ |
| Gaea C6 | Slurm | Cray MPICH | srun -l --export=ALL --distribution=block:block | --multi-prog --output=mpmd.%j.%t.out | ❌ |
| AWS PW | Slurm | PMI2 | srun -l --export=ALL --hint=nomultithread --mpi=pmi2 | --multi-prog --output=mpmd.%j.%t.out | ❌ |
| Azure PW | Slurm | PMI2 | srun -l --export=ALL --hint=nomultithread | --multi-prog --output=mpmd.%j.%t.out | ❌ |
| Google PW | Slurm | PMI2 | srun -l --export=ALL --hint=nomultithread | --multi-prog --output=mpmd.%j.%t.out | ❌ |

Supported Resolutions by Platform

All Tier 1 and Tier 2 platforms support:

  • C1152 (13 km)
  • C768 (25 km)
  • C384 (50 km)
  • C192 (100 km)
  • C96 (200 km)
  • C48 (400 km)

MPI Runtime Environment Details

WCOSS2 (NCO Production)

File: env/WCOSS2.env

export launcher="mpiexec -l"
export mpmd_opt="--cpu-bind verbose,core cfp"

Module Stack: (modulefiles/gw_run.wcoss2.lua)

load("PrgEnv-intel")      -- Intel programming environment
load("craype")            -- Cray programming environment
load("intel")             -- Intel compilers
load("cray-mpich")        -- Cray MPI implementation
load("cray-pals")         -- Parallel Application Launch Service
load("cfp")               -- Coupled Framework Parallelism
setenv("USE_CFP","YES")   -- Enable CFP by default

Key MPI Tuning:

export OMP_PLACES=cores
export OMP_STACKSIZE=1G
export FI_OFI_RXM_SAR_LIMIT=3145728
export MPICH_MPIIO_HINTS="*:romio_cb_write=enable"
export FI_OFI_RXM_RX_SIZE=40000
export FI_OFI_RXM_TX_SIZE=40000

Hera (NOAA RDHPCS)

File: env/HERA.env

export launcher="srun -l --export=ALL --hint=nomultithread"
export mpmd_opt="--multi-prog --output=mpmd.%j.%t.out"

Key MPI Tuning:

export MPI_BUFS_PER_PROC=2048
export MPI_BUFS_PER_HOST=2048
export MPI_GROUP_MAX=256
export MPI_MEMMAP_OFF=1
export MP_STDOUTMODE="ORDERED"
export KMP_AFFINITY=scatter
export OMP_STACKSIZE=2048000
export I_MPI_EXTRA_FILESYSTEM=1
export I_MPI_EXTRA_FILESYSTEM_LIST=lustre

Hercules (MSU HPC)

File: env/HERCULES.env

export launcher="srun -l --export=ALL --hint=nomultithread"
export mpmd_opt="--multi-prog --output=mpmd.%j.%t.out"

MPI tuning is identical to Hera, including the Intel MPI optimizations for the Lustre filesystem.

Gaea C6 (RDHPCS Cray)

File: env/GAEAC6.env

export launcher="srun -l --export=ALL --distribution=block:block"
export mpmd_opt="--multi-prog --output=mpmd.%j.%t.out"

# Cray Slingshot tuning
export FI_CXI_RX_MATCH_MODE=hybrid
# For large jobs with collective issues:
# export FI_CXI_RX_MATCH_MODE=software

AWS ParallelWorks (Cloud)

File: env/AWSPW.env

export launcher="srun -l --export=ALL --hint=nomultithread --mpi=pmi2"
export mpmd_opt="--multi-prog --output=mpmd.%j.%t.out"

Note the --mpi=pmi2 flag for PMI2 process management on cloud Slurm clusters.


Job Resource Configuration Chain

Three-Level Configuration

  1. Base Resources (dev/parm/config/gfs/config.resources)
  2. Platform Overrides (dev/parm/config/gfs/config.resources.${MACHINE})
  3. Environment Binding (env/${MACHINE}.env)
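
A rough sketch of how the three levels combine for a single job step is shown below. The real workflow loads these files through its own job-setup machinery, so treat this as an illustration of ordering rather than the actual driver code; the step-argument handling in particular is simplified.

# Illustration only: ordering of the three configuration levels
machine="HERA"
step="anal"

source dev/parm/config/gfs/config.resources "${step}"            # base ntasks, threads_per_task, ...
[[ -f "dev/parm/config/gfs/config.resources.${machine}" ]] && \
    source "dev/parm/config/gfs/config.resources.${machine}"     # platform overrides
source "env/${machine}.env" "${step}"                            # launcher, mpmd_opt, APRUN_* bindings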

Resource Variables

| Variable | Description | Example |
|----------|-------------|---------|
| ntasks | Total MPI tasks | 480 |
| tasks_per_node | Tasks per compute node | 40 |
| max_tasks_per_node | Maximum tasks per node (hardware limit) | 40 |
| threads_per_task | OpenMP threads per MPI task | 1 |
| memory | Memory requirement | "96GB" |
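
One consequence of these variables: the node count for a job step is a ceiling division of total tasks by tasks per node. A small bash illustration using the example values from the table:

ntasks=480
tasks_per_node=40
nodes=$(( (ntasks + tasks_per_node - 1) / tasks_per_node ))   # ceiling division
echo "nodes=${nodes}"                                         # -> nodes=12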

APRUN Variables by Job Step

| Step | APRUN Variable | Typical Configuration |
|------|----------------|-----------------------|
| fcst | APRUN_UFS | ${launcher} -n ${ufs_ntasks} |
| anal | APRUN_GSI | ${APRUN_default} --cpus-per-task=${NTHREADS_GSI} |
| eupd | APRUN_ENKF | ${launcher} -n ${ntasks_enkf} --cpus-per-task=${NTHREADS_ENKF} |
| upp | APRUN_UPP | ${APRUN_default} --cpus-per-task=${NTHREADS_UPP} |
| CFP jobs | APRUNCFP | ${launcher} -n $ncmd ${mpmd_opt} |

Example: Hera C384 Analysis Configuration

# config.resources.HERA
case ${step} in
  "anal")
    export ntasks=270
    export threads_per_task=8
    export tasks_per_node=$(( max_tasks_per_node / threads_per_task ))
    ;;
esac

# HERA.env
export NTHREADS_GSI=${NTHREADSmax}
export APRUN_GSI="${APRUN_default} --cpus-per-task=${NTHREADS_GSI}"
# Result: srun -l --export=ALL --hint=nomultithread -n 270 --cpus-per-task=8
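
Assuming max_tasks_per_node=40 on Hera (the example value from the resource-variable table above), the resulting layout works out as follows:

max_tasks_per_node=40
threads_per_task=8
ntasks=270
tasks_per_node=$(( max_tasks_per_node / threads_per_task ))    # 40 / 8 = 5
nodes=$(( (ntasks + tasks_per_node - 1) / tasks_per_node ))    # ceil(270 / 5) = 54
echo "tasks_per_node=${tasks_per_node} nodes=${nodes}"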

MPMD Use Cases in Global-Workflow

Scripts Using MPMD

| Script | MPMD Purpose | Parallel Tasks |
|--------|--------------|----------------|
| exglobal_atmos_analysis.sh | GSI observation processing | Multiple obs types |
| exglobal_diag.sh | Diagnostic file generation | Parallel file I/O |
| exgdas_atmos_chgres_forenkf.sh | EnKF regridding | Per-member regrid |
| exglobal_atmos_ensstat.sh | Ensemble statistics | Parallel stat calc |
| exgfs_wave_init.sh | Wave grid definition | Multiple wave grids |
| exgfs_wave_prep.sh | Wave boundary prep | Current generation |
| exgfs_wave_post_gridded_sbs.sh | Wave post-processing | Grid output |
| exgfs_wave_post_pnt.sh | Point output processing | Spectral extraction |
| exglobal_atmos_products.sh | GRIB2 products | Parallel wgrib2 |

Example: Wave Initialization MPMD

# exgfs_wave_init.sh
for grdID in ${wavegrids}; do
    echo "${USHgfs}/wave_grid_moddef.sh ${grdID}" >> "${DATA}/mpmd_script"
done

"${USHgfs}/run_mpmd.sh" "${DATA}/mpmd_script"

Example: Atmospheric Products MPMD

# exglobal_atmos_products.sh
export USE_CFP="YES"

# Break tmpfile into processor-specific chunks
split -n "l/${nprocs}" -d "${tmpfile}" "${DATA}/cmdfile_"

# Run with MPMD
"${USHgfs}/run_mpmd.sh" "${DATA}/cmdfile"

Network Fabric and Interconnect

| Platform | Network Fabric | Bandwidth | MPI Transport |
|----------|----------------|-----------|---------------|
| WCOSS2 | HPE Slingshot | 200 Gb/s | Cray MPICH/libfabric |
| Hera | InfiniBand HDR | 200 Gb/s | Intel MPI/Verbs |
| Hercules | InfiniBand | 100 Gb/s | Intel MPI/Verbs |
| Orion | InfiniBand | 100 Gb/s | Intel MPI/Verbs |
| Gaea C5/C6 | Slingshot 11 | 200 Gb/s | Cray MPICH/CXI |
| AWS PW | EFA | 100-400 Gb/s | PMI2/libfabric |

Filesystem Integration

| Platform | Parallel Filesystem | MPI I/O Tuning |
|----------|---------------------|----------------|
| WCOSS2 | GPFS | romio_cb_write=enable |
| Hera | Lustre | I_MPI_EXTRA_FILESYSTEM_LIST=lustre |
| Hercules | Lustre | I_MPI_EXTRA_FILESYSTEM_LIST=lustre |
| Gaea | Lustre | FI_CXI_RX_MATCH_MODE=hybrid |

MPI Tuning Parameters

Intel MPI (Hera, Hercules, Orion)

# GSI-specific bug workaround
export I_MPI_ADJUST_ALLREDUCE=5

# MKL threading
export MKL_NUM_THREADS=4
export MKL_CBWR=AUTO

# Buffer tuning
export MPI_BUFS_PER_PROC=2048
export MPI_BUFS_PER_HOST=2048
export MPI_GROUP_MAX=256

Cray MPICH (WCOSS2, Gaea)

# Thread placement
export OMP_PLACES=cores
export OMP_STACKSIZE=1G

# libfabric tuning
export FI_OFI_RXM_SAR_LIMIT=3145728
export FI_OFI_RXM_RX_SIZE=40000
export FI_OFI_RXM_TX_SIZE=40000

# Collective I/O
export MPICH_MPIIO_HINTS="*:romio_cb_write=enable"

Stack Size Configuration

# All platforms
ulimit -s unlimited

# OpenMP stack
export OMP_STACKSIZE=2048000   # Hera/Hercules
export OMP_STACKSIZE=1G        # WCOSS2
export OMP_STACKSIZE=2048M     # Forecast jobs

Command File Format Transformation

Input Format (Generic)

/path/to/command1 arg1 arg2
/path/to/command2 arg1 arg2
/path/to/command3 arg1 arg2

Slurm --multi-prog Format

0 /path/to/command1 arg1 arg2
1 /path/to/command2 arg1 arg2
2 /path/to/command3 arg1 arg2

Each line prefix indicates the MPI rank that executes that command.
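
The rank prefix is simply the zero-based line number, so the transformation is equivalent to the following one-liner (shown for illustration only; run_mpmd.sh itself uses the while-read loop shown earlier):

awk '{print NR-1, $0}' cmdfile > mpmd_cmdfile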

PBS cfp Format

#!/bin/bash
/path/to/command1 arg1 arg2 > mpmd.0.out 2>&1
/path/to/command2 arg1 arg2 > mpmd.1.out 2>&1
/path/to/command3 arg1 arg2 > mpmd.2.out 2>&1

Output is redirected per-rank for debugging.
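
Equivalently, the PBS wrapper script can be produced with a short awk pipeline (again an illustration of the transformation, not the code in run_mpmd.sh):

{
  echo '#!/bin/bash'
  awk '{printf "%s > mpmd.%d.out 2>&1\n", $0, NR-1}' cmdfile
} > mpmd_cmdfile
chmod 755 mpmd_cmdfile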


Troubleshooting Guide

Common Issues

1. MPMD Tasks Not Running in Parallel

Symptom: Tasks execute serially despite MPMD configuration

Check:

echo "USE_CFP=${USE_CFP}"
echo "launcher=${launcher}"

Solution: Ensure USE_CFP=YES is set and launcher is properly configured.

2. Task Failures with No Output

Symptom: MPMD job fails but produces no error messages

Check: Look for per-rank output files:

ls -la mpmd.*.out
cat mpmd.0.out
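
A quick way to narrow down which ranks failed is to scan all per-rank files at once (illustrative commands, not part of the workflow):

grep -ilE 'error|fail|abort' mpmd.*.out    # list rank outputs that mention an error
grep -c '' mpmd.*.out                      # per-file line counts; empty files often mark the failed rank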

3. Memory Errors on MPMD Jobs

Symptom: OOM killer or memory allocation failures

Solution: Reduce tasks per node:

export tasks_per_node=20  # Instead of 40

4. Slurm Multi-Prog Syntax Errors

Symptom: Invalid job configuration

Check: Ensure the command file contains no blank lines and follows the correct format:

cat ${DATA}/mpmd_cmdfile
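
Blank lines are the most common culprit; a quick check (illustrative):

grep -nE '^[[:space:]]*$' "${DATA}/mpmd_cmdfile" && echo "blank lines found"
wc -l "${DATA}/mpmd_cmdfile"    # entry count must match the task count passed to the launcher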

5. CFP Not Found on WCOSS2

Symptom: cfp: command not found

Solution: Load the CFP module:

module load cfp

Debug Mode

Enable verbose output:

export MPMD_DEBUG=1
set -x
"${USHgfs}/run_mpmd.sh" "${DATA}/cmdfile"


Appendix A: Complete Environment File Reference

HERA.env Key Settings

export launcher="srun -l --export=ALL --hint=nomultithread"
export mpmd_opt="--multi-prog --output=mpmd.%j.%t.out"
export OMP_STACKSIZE=2048000
export NTHSTACK=1024000000

WCOSS2.env Key Settings

export launcher="mpiexec -l"
export mpmd_opt="--cpu-bind verbose,core cfp"

GAEAC6.env Key Settings

export launcher="srun -l --export=ALL --distribution=block:block"
export mpmd_opt="--multi-prog --output=mpmd.%j.%t.out"
export FI_CXI_RX_MATCH_MODE=hybrid

Appendix B: Host YAML Configuration

hera.yaml

SCHEDULER: slurm
QUEUE: batch
PARTITION_BATCH: hera
PARTITION_SERVICE: service
SUPPORTED_RESOLUTIONS: ['C1152', 'C768', 'C384', 'C192', 'C96', 'C48']

wcoss2.yaml

SCHEDULER: pbspro
QUEUE: 'dev'
QUEUE_SERVICE: 'dev_transfer'
SUPPORTED_RESOLUTIONS: ['C1152', 'C768', 'C384', 'C192', 'C96', 'C48']

Document generated by MCP/RAG Analysis System
Last updated: January 30, 2026