Root Cause Analysis: C48_ATM `gfs_arch_tar_gfs_flux` Nightly CI Failure

Date: 2026-02-18
Nightly Build: nightly_0_24f56a5c_9211
Platform: Gaea C6 (gaeac6)
Affected Case: C48_ATM (and 3 other cases)

The Error

module-setup.sh: line 69: /opt/cray/pe/lmod/lmod/init/bash: No such file or directory
module-setup.sh: line 71: module: command not found
load_modules.sh: line 168: module: command not found
load_modules.sh: line 190: module: command not found
FATAL ERROR: Failed to load gw_run.gaeac6

Error Log: /gpfs/f6/drsa-precip3/world-shared/global/CI/GITLAB/nightly_0_24f56a5c_9211/RUNTESTS/COMROOT/C48_ATM_24f56a5c-9211/logs/2021032312/gfs_arch_tar_gfs_flux.log

Root Cause: Transient NFS Mount Failure on DTN Nodes

The arch_tar (archive tarball) jobs run on the es cluster, dtn_f5_f6 partition (Data Transfer Nodes: dtn01–dtn58), not on the Cray c6 compute nodes. They are launched with --export=NONE --clusters=es, which strips all inherited environment variables — including the module shell function.
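
The effect of a scrubbed environment on an exported shell function can be reproduced locally without Slurm. The sketch below uses `env -i` as a rough stand-in for `--export=NONE` and a dummy `module` function in place of the one lmod defines. Both are illustrative assumptions and do not reflect how Slurm or lmod work internally.

```shell
#!/usr/bin/env bash
# Dummy stand-in for the lmod "module" function (assumption for illustration).
module() { echo "module called with: $*"; }
export -f module

# Child shell that inherits the environment: the exported function is visible.
inherited=$(bash -c 'type -t module || echo missing')

# Child shell started with an empty environment, analogous to --export=NONE:
# the function is gone and would have to be re-created from an init file.
scrubbed=$(env -i "$(command -v bash)" -c 'type -t module || echo missing')

echo "inherited=${inherited}"   # prints: inherited=function
echo "scrubbed=${scrubbed}"     # prints: scrubbed=missing
```

In the scrubbed case the only way to get `module` back is to source an init script from disk, which is why the availability of that file matters on every job start.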

Execution Chain

1. Rocoto launches arch_tars.sh on a DTN node with --export=NONE.
2. arch_tars.sh (line 7) sources load_modules.sh run.
3. load_modules.sh (lines 62–63) sources detect_machine.sh and then module-setup.sh.
4. module-setup.sh (line 68): the module help test fails because --export=NONE leaves no module function defined.
5. module-setup.sh (line 69): tries to source /opt/cray/pe/lmod/lmod/init/bash, and the file is not found.

The path /opt/cray/pe/lmod/lmod/init/bash is an NFS-mounted Cray PE component. A check on 2026-02-18 confirmed that the file exists on a DTN node at that time, so the failure was transient, most likely caused by a temporary NFS mount outage on the specific DTN node(s) assigned during the nightly run.

Key Files in the Call Chain

| File | Role |
|------|------|
| `dev/job_cards/rocoto/arch_tars.sh` | Entry point: sources `load_modules.sh run` |
| `dev/ush/load_modules.sh` | Sources `detect_machine.sh` and `module-setup.sh` |
| `ush/detect_machine.sh` | Identifies platform as `gaeac6` (via `/gpfs/f6` path check) |
| `ush/module-setup.sh` | Lines 66–71: Gaea C6 lmod init with guard check |

Relevant `module-setup.sh` Code (lines 66–71)

elif [[ ${MACHINE_ID} = gaeac6 ]]; then
    # We are on GAEA C6.
    if (! eval module help > /dev/null 2>&1); then
        source /opt/cray/pe/lmod/lmod/init/bash   # <-- FAILS when NFS is unavailable
    fi
    module reset

Scope of Impact

Affected Logs (6 total across 4 cases; 5 ended in failure)

| Case | Log | Retried? | Final Status |
|------|-----|----------|--------------|
| C48_ATM | `gfs_arch_tar_gfs_flux.log` | Yes (2 attempts) | FAILED (both attempts hit lmod error) |
| C48_ATM | `gfs_arch_tar_gfsa.log` | Yes | RECOVERED (retry succeeded on different DTN) |
| C48mx500_3DVarAOWCDA | `gfs_arch_tar_gfsa.log` | - | FAILED |
| C48mx500_3DVarAOWCDA | `gfs_arch_tar_gfs_pgrb2b.log` | - | FAILED |
| C48_S2SW | `gfs_arch_tar_ice_6hravg.log` | Yes (2 attempts) | FAILED |
| C96C48_hybatmsnowDA | `gdas_arch_tar_gdas_restartb.log` | - | FAILED |
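
A list like the one above can be rebuilt by searching the run's log tree for the lmod error string. The following is a minimal sketch; `find_lmod_failures` is a hypothetical helper name, and it assumes GNU grep (for `--include`) and that task logs end in `.log`.

```shell
#!/usr/bin/env bash
# Error text taken from the failing log shown in "The Error".
LMOD_ERROR='lmod/lmod/init/bash: No such file or directory'

# find_lmod_failures DIR
# Hypothetical helper: print every *.log file under DIR that contains the
# lmod init error, one path per line, sorted for stable output.
find_lmod_failures() {
    local dir="$1"
    # -r recurse, -l print file names only, -F match the string literally.
    grep -rlF --include='*.log' -- "${LMOD_ERROR}" "${dir}" | sort
}
```

Pointing it at the nightly's `RUNTESTS/COMROOT` directory would list the affected logs per case; an empty result means no log in that tree hit the error.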

Unaffected Jobs

- All jobs on Cray c6 compute nodes (--clusters=c6, batch partition) loaded modules successfully.
- C48_ATM had 35 jobs in the cycle; only arch_tar jobs (DTN tasks) were affected.
- Some arch_tar jobs succeeded in the same nightly run, presumably because they landed on healthy DTN nodes.

Partition Assignment from Rocoto XML

<!-- DTN tasks (arch_tar, fetch, globus) -->
<native>--export=NONE --clusters=es</native>
<partition>dtn_f5_f6</partition>

<!-- Compute tasks (fcst, anal, post, etc.) -->
<native>--export=NONE --clusters=c6</native>
<partition>batch</partition>

Why It's Intermittent

- DTN nodes are es cluster nodes, not Cray PE compute nodes.
- /opt/cray/pe is NFS-mounted from the Cray PE infrastructure onto DTN nodes.
- When NFS is temporarily unavailable (stale mount, network flap, etc.), the lmod init file appears missing.
- Different DTN nodes (dtn01–dtn58) may have different mount states at any given time.
- --export=NONE forces every job to re-source lmod from disk (no inherited module function).
- Retries may land on a different DTN node where NFS is healthy, explaining partial recovery.
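
Because mount state can differ from node to node, a per-node sweep is one way to find which DTNs are unhealthy at a given moment. The sketch below is an assumption-laden illustration: `probe_node` and `RUNNER` are hypothetical names, it assumes users may target individual nodes with `srun -w` (`--nodelist`) on the dtn_f5_f6 partition, and it cannot distinguish a missing file from a node that is simply unreachable.

```shell
#!/usr/bin/env bash
LMOD_INIT="/opt/cray/pe/lmod/lmod/init/bash"

# Command prefix that runs a command on a named node. It is overridable so
# the helper can be exercised without Slurm; -w (--nodelist) pins the node.
RUNNER="${RUNNER:-srun --clusters=es -p dtn_f5_f6 -n1 -w}"

# probe_node NODE
# Hypothetical helper: report whether the lmod init file is readable on NODE.
probe_node() {
    local node="$1"
    # RUNNER is intentionally unquoted so its words split into a command.
    if ${RUNNER} "${node}" test -r "${LMOD_INIT}" >/dev/null 2>&1; then
        echo "${node} ok"
    else
        echo "${node} MISSING"
    fi
}
```

A full sweep over the node range named above could then be written as `for n in $(seq -w 1 58); do probe_node "dtn${n}"; done`, ideally run close to the time of a failure, since the condition is transient.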

Verification (2026-02-18)

# DTN node confirms path exists NOW (transient issue has resolved)
$ srun --clusters=es -p dtn_f5_f6 -n1 bash -c 'ls -la /opt/cray/pe/lmod/lmod/init/bash'
-rwxr-xr-x 1 root root 5636 Jan 17  2025 /opt/cray/pe/lmod/lmod/init/bash

# DTN nodes are SLES 15 SP6, same as login nodes
$ srun --clusters=es -p dtn_f5_f6 -n1 bash -c 'hostname; cat /etc/os-release | head -2'
dtn04
NAME="SLES"
VERSION="15-SP6"

Recommendations

1. Infrastructure: Report to NCRC Admins (Short-term)

Report the transient NFS mount issue to NCRC/Gaea system administrators. DTN nodes losing access to /opt/cray/pe is an infrastructure problem that affects any job requiring modules on those nodes.

2. Code Hardening: Retry Logic in `module-setup.sh`

Add a retry-with-delay around the lmod init source for the gaeac6 case to handle transient NFS issues:

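elif [[ ${MACHINE_ID} = gaeac6 ]]; then
    # We are on GAEA C6.
    if (! eval module help > /dev/null 2>&1); then
        # Plain variables, not "local": "local" is only valid inside a
        # function, and this file may be sourced outside of one.
        _lmod_init="/opt/cray/pe/lmod/lmod/init/bash"
        _retries=3
        for (( _i=1; _i<=_retries; _i++ )); do
            if [[ -r "${_lmod_init}" ]]; then
                source "${_lmod_init}" && break
            fi
            echo "WARNING: lmod init not readable (attempt ${_i}/${_retries})" >&2
            # Wait before the next attempt, but not after the final one.
            if (( _i < _retries )); then sleep 5; fi
        done
        unset _lmod_init _retries _i
    fi
    module reset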

3. Increase `maxtries` for DTN Tasks

The tasks with recorded retries in this run made two attempts, which may not be enough to ride out a transient infrastructure issue. Consider setting a higher maxtries specifically for the DTN tasks in the Rocoto XML (the arch_tar metatasks and, per the XML comment, fetch and globus), since these are the tasks that run on DTN nodes.

4. Pre-flight Mount Check in `arch_tars.sh`

Before loading modules, verify the filesystem mount is healthy:

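# Wait for the lmod init file under /opt/cray/pe to become readable
# (transient NFS issue workaround). Checking the file itself is stricter than
# checking the directory, which can exist as an empty mount point.
lmod_init="/opt/cray/pe/lmod/lmod/init/bash"
for i in 1 2 3; do
    [[ -r "${lmod_init}" ]] && break
    echo "WARNING: ${lmod_init} not readable (attempt ${i}/3)" >&2
    # Wait before the next attempt, but not after the final one.
    if (( i < 3 )); then sleep 10; fi
done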
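
The retry loops in recommendations 2 and 4 share one pattern, and as written both carry on to the module commands even when every attempt fails. A possible refinement is a small shared helper that reports exhaustion through its exit status so the caller can stop with a clear message. This is a sketch only; `wait_for_readable` is a hypothetical name and does not exist in the workflow.

```shell
#!/usr/bin/env bash
# wait_for_readable PATH [TRIES] [DELAY_SECONDS]
# Hypothetical helper: return 0 as soon as PATH is readable, or 1 after TRIES
# failed checks. Sleeps DELAY_SECONDS between checks, not after the last one.
wait_for_readable() {
    local path="$1" tries="${2:-3}" delay="${3:-10}" attempt
    for (( attempt = 1; attempt <= tries; attempt++ )); do
        [[ -r "${path}" ]] && return 0
        echo "WARNING: ${path} not readable (attempt ${attempt}/${tries})" >&2
        if (( attempt < tries )); then
            sleep "${delay}"
        fi
    done
    return 1
}
```

A caller could use it as `wait_for_readable /opt/cray/pe/lmod/lmod/init/bash 3 10 || { echo "FATAL ERROR: lmod init unavailable" >&2; exit 1; }`, which turns the later "module: command not found" cascade into a single explicit failure.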

5. Long-term: Evaluate Module Necessity on DTN

Consider whether DTN arch_tar jobs truly need a full module environment. If the jobs only perform file operations (tar, htar, hsi), a minimal environment without lmod could suffice, bypassing the issue entirely.
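
One way to start that evaluation is to check, inside a DTN job with no modules loaded, whether the required tools already resolve on the default PATH. The sketch below assumes the tool list named above (tar, htar, hsi); `check_tools` is a hypothetical helper, and whether the archive jobs need anything beyond these tools is not established here.

```shell
#!/usr/bin/env bash
# check_tools TOOL...
# Hypothetical helper: print where each tool resolves on PATH and return the
# number of tools that could not be found (0 means all are available).
check_tools() {
    local tool missing=0
    for tool in "$@"; do
        if command -v "${tool}" >/dev/null 2>&1; then
            echo "${tool}: $(command -v "${tool}")"
        else
            echo "${tool}: not found on PATH"
            missing=$(( missing + 1 ))
        fi
    done
    return "${missing}"
}
```

Running `check_tools tar htar hsi` in a job submitted with `--export=NONE --clusters=es` would show which of them depend on the module environment.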

Conclusion

This is not a code bug — it is a transient Gaea infrastructure issue where NFS-mounted Cray PE paths become temporarily unavailable on DTN nodes. The failure is non-deterministic, affecting only arch_tar jobs that run on the dtn_f5_f6 partition. Code hardening (retry logic) and infrastructure reporting are the recommended mitigations.

C48_ATM_fail - TerrenceMcGuinness-NOAA/global-workflow GitHub Wiki
