C48_ATM_fail - TerrenceMcGuinness-NOAA/global-workflow GitHub Wiki
Date: 2026-02-18
Nightly Build: nightly_0_24f56a5c_9211
Platform: Gaea C6 (gaeac6)
Affected Case: C48_ATM (and 3 other cases)
module-setup.sh: line 69: /opt/cray/pe/lmod/lmod/init/bash: No such file or directory
module-setup.sh: line 71: module: command not found
load_modules.sh: line 168: module: command not found
load_modules.sh: line 190: module: command not found
FATAL ERROR: Failed to load gw_run.gaeac6
Error Log:
/gpfs/f6/drsa-precip3/world-shared/global/CI/GITLAB/nightly_0_24f56a5c_9211/RUNTESTS/COMROOT/C48_ATM_24f56a5c-9211/logs/2021032312/gfs_arch_tar_gfs_flux.log
The arch_tar (archive tarball) jobs run on the es cluster, dtn_f5_f6 partition (Data Transfer Nodes: dtn01–dtn58), not on the Cray c6 compute nodes. They are launched with --export=NONE --clusters=es, which strips all inherited environment variables — including the module shell function.
- Rocoto launches arch_tars.sh on a DTN node with --export=NONE
- arch_tars.sh (line 7) sources load_modules.sh run
- load_modules.sh (lines 62–63) sources detect_machine.sh → module-setup.sh
- module-setup.sh (line 68): the module help test fails (no module function due to --export=NONE)
- module-setup.sh (line 69): tries source /opt/cray/pe/lmod/lmod/init/bash; file not found
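The first steps of this chain can be reproduced away from Gaea. The sketch below is only an approximation: it assumes `env -i` (an empty environment) is a fair stand-in for Slurm's `--export=NONE`, and the `module` function defined here is a dummy, not lmod. It shows that a shell function exported by a parent is visible to a child shell only when the environment is inherited.

```shell
#!/usr/bin/env bash
# Dummy stand-in for the lmod "module" shell function (illustration only).
module() { echo "module called with: $*"; }
export -f module

# Resolve bash up front, because PATH is not passed to the stripped child.
bash_bin=$(command -v bash)

# Child shell that inherits the environment: the exported function is visible.
inherited=$("${bash_bin}" --noprofile --norc -c 'type -t module')

# Child shell started with an empty environment (rough analogue of
# --export=NONE): the function is not defined there.
stripped=$(env -i "${bash_bin}" --noprofile --norc -c 'type -t module' || true)

echo "inherited: ${inherited:-<none>}"
echo "stripped:  ${stripped:-<none>}"
```

In the stripped child, `module` has to be re-created by sourcing the lmod init file from disk, which is the step that failed in the logs above.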
The path /opt/cray/pe/lmod/lmod/init/bash is an NFS-mounted Cray PE component. Testing on 2026-02-18 confirms it does exist on DTN nodes currently — the failure was transient, caused by a temporary NFS mount unavailability on the specific DTN node(s) assigned during the nightly run.
| File | Role |
|---|---|
| dev/job_cards/rocoto/arch_tars.sh | Entry point: sources load_modules.sh run |
| dev/ush/load_modules.sh | Sources detect_machine.sh and module-setup.sh |
| ush/detect_machine.sh | Identifies platform as gaeac6 (via /gpfs/f6 path check) |
| ush/module-setup.sh | Lines 66–71: Gaea C6 lmod init with guard check |

    elif [[ ${MACHINE_ID} = gaeac6 ]]; then
      # We are on GAEA C6.
      if (! eval module help > /dev/null 2>&1); then
        source /opt/cray/pe/lmod/lmod/init/bash # <-- FAILS when NFS is unavailable
      fi
      module reset

| Case | Log | Retried? | Final Status |
|---|---|---|---|
| C48_ATM | gfs_arch_tar_gfs_flux.log | Yes (2 attempts) | FAILED: both attempts hit lmod error |
| C48_ATM | gfs_arch_tar_gfsa.log | Yes | RECOVERED: retry succeeded on a different DTN node |
| C48mx500_3DVarAOWCDA | gfs_arch_tar_gfsa.log | - | FAILED |
| C48mx500_3DVarAOWCDA | gfs_arch_tar_gfs_pgrb2b.log | - | FAILED |
| C48_S2SW | gfs_arch_tar_ice_6hravg.log | Yes (2 attempts) | FAILED |
| C96C48_hybatmsnowDA | gdas_arch_tar_gdas_restartb.log | - | FAILED |
- All jobs on Cray c6 compute nodes (--clusters=c6, batch partition) loaded modules successfully
- C48_ATM had 35 total jobs in the cycle; only arch_tar jobs (DTN tasks) were affected
- Some arch_tar jobs succeeded even in the same nightly run (hit healthy DTN nodes)

    <!-- DTN tasks (arch_tar, fetch, globus) -->
    <native>--export=NONE --clusters=es</native>
    <partition>dtn_f5_f6</partition>
    <!-- Compute tasks (fcst, anal, post, etc.) -->
    <native>--export=NONE --clusters=c6</native>
    <partition>batch</partition>

- DTN nodes are es cluster nodes, not Cray PE compute nodes
- /opt/cray/pe is NFS-mounted from the Cray PE infrastructure onto DTN nodes
- When NFS is temporarily unavailable (stale mount, network flap, etc.), the lmod init file appears missing
- Different DTN nodes (dtn01–dtn58) may have different mount states at any given time
- --export=NONE forces every job to re-source lmod from disk (no inherited module function)
- Retries may land on a different DTN node where NFS is healthy, explaining partial recovery

    # DTN node confirms path exists NOW (transient issue has resolved)
    $ srun --clusters=es -p dtn_f5_f6 -n1 bash -c 'ls -la /opt/cray/pe/lmod/lmod/init/bash'
    -rwxr-xr-x 1 root root 5636 Jan 17 2025 /opt/cray/pe/lmod/lmod/init/bash

    # DTN nodes are SLES 15 SP6, same as login nodes
    $ srun --clusters=es -p dtn_f5_f6 -n1 bash -c 'hostname; cat /etc/os-release | head -2'
    dtn04
    NAME="SLES"
    VERSION="15-SP6"

Report the transient NFS mount issue to NCRC/Gaea system administrators. DTN nodes losing access to /opt/cray/pe is an infrastructure problem that affects any job requiring modules on those nodes.
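A report to the administrators is easier to act on when it names the node and the time at which the path was missing. The following is a minimal sketch of a probe that emits one evidence line per check. The helper name `probe_path` and the output format are assumptions made for illustration and are not part of global-workflow.

```shell
#!/usr/bin/env bash
# probe_path (hypothetical helper): print one evidence line for a path.
# Output format: <UTC timestamp> <node name> <path> <OK|MISSING>
probe_path() {
  local path="$1" status="MISSING"
  if [[ -e "${path}" ]]; then
    status="OK"
  fi
  printf '%s %s %s %s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$(uname -n)" "${path}" "${status}"
}

# Path taken from the error log above.
probe_path /opt/cray/pe/lmod/lmod/init/bash
```

Placed at the top of a DTN job, or run through the same srun invocation used for the manual check, a line like this would record which node saw the path missing and when.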
Add a retry-with-delay around the lmod init source for the gaeac6 case to handle transient NFS issues:

    elif [[ ${MACHINE_ID} = gaeac6 ]]; then
      # We are on GAEA C6.
      if (! eval module help > /dev/null 2>&1); then
        # Plain variables: "local" is only valid inside a function, and this
        # file may be sourced at top level.
        _lmod_init="/opt/cray/pe/lmod/lmod/init/bash"
        _retries=3
        for (( _i=1; _i<=_retries; _i++ )); do
          if [[ -f "${_lmod_init}" ]]; then
            source "${_lmod_init}" && break
          fi
          echo "WARNING: lmod init not found (attempt ${_i}/${_retries}), waiting 5s..." >&2
          sleep 5
        done
        unset _lmod_init _retries _i
      fi
      module reset

The current Rocoto retry count may not be sufficient for transient infrastructure issues: several of the failed tasks above exhausted two attempts. Consider setting a higher retry count specifically for arch_tar metatasks in the Rocoto XML, since these are the only tasks running on DTN nodes.
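Independent of Rocoto-level retries, the in-script retry can be factored into a small function that reports failure through its exit status, so the caller can stop with a clear message instead of reaching `module reset` without lmod. This is a sketch only: `source_with_retry` is a hypothetical name, and the defaults of 3 attempts and 5 seconds simply mirror the values proposed above. One caveat is that sourcing from inside a function scopes any `declare` or `local` statements in the sourced file to that function; whether that matters for the lmod init file has not been verified here.

```shell
#!/usr/bin/env bash
# source_with_retry (hypothetical helper): source FILE, retrying while it is
# missing or fails to load.
# Usage: source_with_retry FILE [RETRIES] [DELAY_SECONDS]
# Returns 0 once FILE has been sourced successfully, 1 if every attempt fails.
source_with_retry() {
  local file="$1" retries="${2:-3}" delay="${3:-5}" attempt
  for (( attempt = 1; attempt <= retries; attempt++ )); do
    if [[ -f "${file}" ]]; then
      # shellcheck disable=SC1090
      source "${file}" && return 0
    fi
    echo "WARNING: cannot source ${file} (attempt ${attempt}/${retries})" >&2
    # Skip the pause after the final attempt; nothing is left to wait for.
    if (( attempt < retries )); then
      sleep "${delay}"
    fi
  done
  return 1
}
```

A caller could then write `source_with_retry /opt/cray/pe/lmod/lmod/init/bash || echo "lmod init unavailable" >&2` and decide explicitly how to fail.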
Before loading modules, verify the filesystem mount is healthy:

    # Wait for /opt/cray/pe to be available (transient NFS issue workaround)
    for i in 1 2 3; do
      [[ -d /opt/cray/pe ]] && break
      echo "WARNING: /opt/cray/pe not mounted, retrying in 10s (attempt $i/3)..." >&2
      sleep 10
    done

Consider whether DTN arch_tar jobs truly need a full module environment. If the jobs only perform file operations (tar, htar, hsi), a minimal environment without lmod could suffice, bypassing the issue entirely.
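One way to evaluate the minimal-environment option is to check, on a DTN node and without loading any modules, whether the archive tools already resolve on the default PATH. The sketch below assumes the tool list from the paragraph above (tar, htar, hsi); `missing_tools` is a hypothetical helper name. It does not establish whether the arch_tar job scripts depend on anything else that modules provide.

```shell
#!/usr/bin/env bash
# missing_tools (hypothetical helper): print each named command that does not
# resolve on the current PATH. Returns 0 only if none are missing.
missing_tools() {
  local tool rc=0
  for tool in "$@"; do
    if ! command -v "${tool}" > /dev/null 2>&1; then
      echo "${tool}"
      rc=1
    fi
  done
  return "${rc}"
}

# Tool names taken from the text above; run without loading any modules.
if missing=$(missing_tools tar htar hsi); then
  echo "all archive tools found without lmod"
else
  echo "not on PATH without modules: ${missing//$'\n'/ }"
fi
```

If any tool is reported missing, the jobs still need lmod (or an explicit PATH), and the retry-based mitigations remain the more practical route.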
This is not a code bug — it is a transient Gaea infrastructure issue where NFS-mounted Cray PE paths become temporarily unavailable on DTN nodes. The failure is non-deterministic, affecting only arch_tar jobs that run on the dtn_f5_f6 partition. Code hardening (retry logic) and infrastructure reporting are the recommended mitigations.