C48_ATM_fail - TerrenceMcGuinness-NOAA/global-workflow GitHub Wiki

Root Cause Analysis: C48_ATM gfs_arch_tar_gfs_flux Nightly CI Failure

Date: 2026-02-18
Nightly Build: nightly_0_24f56a5c_9211
Platform: Gaea C6 (gaeac6)
Affected Case: C48_ATM (and 3 other cases)


The Error

module-setup.sh: line 69: /opt/cray/pe/lmod/lmod/init/bash: No such file or directory
module-setup.sh: line 71: module: command not found
load_modules.sh: line 168: module: command not found
load_modules.sh: line 190: module: command not found
FATAL ERROR: Failed to load gw_run.gaeac6

Error Log: /gpfs/f6/drsa-precip3/world-shared/global/CI/GITLAB/nightly_0_24f56a5c_9211/RUNTESTS/COMROOT/C48_ATM_24f56a5c-9211/logs/2021032312/gfs_arch_tar_gfs_flux.log


Root Cause: Transient NFS Mount Failure on DTN Nodes

The arch_tar (archive tarball) jobs run on the es cluster, in the dtn_f5_f6 partition (Data Transfer Nodes dtn01 through dtn58), not on the Cray c6 compute nodes. They are launched with --export=NONE --clusters=es, which strips all inherited environment variables, including the module shell function.

Execution Chain

  1. Rocoto launches arch_tars.sh on a DTN node with --export=NONE
  2. arch_tars.sh (line 7) sources load_modules.sh run
  3. load_modules.sh (lines 62–63) sources detect_machine.sh, then module-setup.sh
  4. module-setup.sh (line 68): the module help test fails because --export=NONE leaves no module function defined
  5. module-setup.sh (line 69): tries to source /opt/cray/pe/lmod/lmod/init/bash and fails with "No such file or directory"

The path /opt/cray/pe/lmod/lmod/init/bash is an NFS-mounted Cray PE component. Testing on 2026-02-18 confirms it does exist on DTN nodes currently — the failure was transient, caused by a temporary NFS mount unavailability on the specific DTN node(s) assigned during the nightly run.
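The effect of --export=NONE in steps 1 and 4 can be reproduced without a cluster. The sketch below is an approximation, not the Slurm mechanism itself. It uses env -i to scrub the environment and a stand-in module function (its body is hypothetical) to show that an exported shell function reaches a normal child shell but not a scrubbed one.

```shell
#!/usr/bin/env bash
# Stand-in for the real lmod "module" function (hypothetical body).
module() { echo "module called with: $*"; }
export -f module

# Inherited environment: the exported function is visible in the child shell.
inherited=$(bash --noprofile --norc -c 'type -t module' 2>/dev/null || true)

# Scrubbed environment (an approximation of --export=NONE): only PATH is
# passed through, so the child shell has no "module" function.
scrubbed=$(env -i PATH="${PATH}" bash --noprofile --norc -c 'type -t module' 2>/dev/null || true)

echo "inherited: ${inherited:-missing}"
echo "scrubbed:  ${scrubbed:-missing}"
```

On a typical Linux host the first line reports `function` and the second reports `missing`. That second state is the one module-setup.sh starts from on a DTN node, so it has to re-source the lmod init file from disk.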

Key Files in the Call Chain

| File | Role |
| --- | --- |
| dev/job_cards/rocoto/arch_tars.sh | Entry point: sources load_modules.sh run |
| dev/ush/load_modules.sh | Sources detect_machine.sh and module-setup.sh |
| ush/detect_machine.sh | Identifies platform as gaeac6 (via /gpfs/f6 path check) |
| ush/module-setup.sh | Lines 66–71: Gaea C6 lmod init with guard check |

Relevant module-setup.sh Code (lines 66–71)

elif [[ ${MACHINE_ID} = gaeac6 ]]; then
    # We are on GAEA C6.
    if (! eval module help > /dev/null 2>&1); then
        source /opt/cray/pe/lmod/lmod/init/bash   # <-- FAILS when NFS is unavailable
    fi
    module reset

Scope of Impact

Affected Logs (6 listed across 4 cases: 5 failed, 1 recovered on retry)

| Case | Log | Retried? | Final Status |
| --- | --- | --- | --- |
| C48_ATM | gfs_arch_tar_gfs_flux.log | Yes (2 attempts) | FAILED (both attempts hit lmod error) |
| C48_ATM | gfs_arch_tar_gfsa.log | Yes | RECOVERED (retry succeeded on different DTN) |
| C48mx500_3DVarAOWCDA | gfs_arch_tar_gfsa.log | | FAILED |
| C48mx500_3DVarAOWCDA | gfs_arch_tar_gfs_pgrb2b.log | | FAILED |
| C48_S2SW | gfs_arch_tar_ice_6hravg.log | Yes (2 attempts) | FAILED |
| C96C48_hybatmsnowDA | gdas_arch_tar_gdas_restartb.log | | FAILED |
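A list like this can be rebuilt by searching the job logs for the lmod error signature. The following is a minimal sketch, assuming GNU grep and that all job logs carry a .log suffix somewhere under a single root directory. The function name find_lmod_failures is hypothetical and is not part of global-workflow.

```shell
#!/usr/bin/env bash
# Print every *.log file under the given root that contains the lmod init
# error seen in this failure, one path per line, sorted.
find_lmod_failures() {
    local root="$1"
    # -r recurse, -l list file names only, -F treat the pattern as a fixed string
    grep -rlF --include='*.log' \
        'lmod/lmod/init/bash: No such file or directory' "${root}" | sort
}
```

Pointing it at the COMROOT directory of a nightly run (for example `find_lmod_failures "${COMROOT_DIR}"`, where COMROOT_DIR is a placeholder) would list the logs that hit the error, including first attempts that later recovered if their logs were kept.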

Unaffected Jobs

  • All jobs on Cray c6 compute nodes (--clusters=c6, batch partition) loaded modules successfully
  • C48_ATM had 35 total jobs in the cycle — only arch_tar jobs (DTN tasks) were affected
  • Some arch_tar jobs succeeded even in the same nightly run (hit healthy DTN nodes)

Partition Assignment from Rocoto XML

<!-- DTN tasks (arch_tar, fetch, globus) -->
<native>--export=NONE --clusters=es</native>
<partition>dtn_f5_f6</partition>

<!-- Compute tasks (fcst, anal, post, etc.) -->
<native>--export=NONE --clusters=c6</native>
<partition>batch</partition>

Why It's Intermittent

  1. DTN nodes are es cluster nodes, not Cray PE compute nodes
  2. /opt/cray/pe is NFS-mounted from the Cray PE infrastructure onto DTN nodes
  3. When NFS is temporarily unavailable (stale mount, network flap, etc.), the lmod init file appears missing
  4. Different DTN nodes (dtn01 through dtn58) may have different mount states at any given time
  5. --export=NONE forces every job to re-source lmod from disk (no inherited module function)
  6. Retries may land on a different DTN node where NFS is healthy, explaining partial recovery

Verification (2026-02-18)

# DTN node confirms path exists NOW (transient issue has resolved)
$ srun --clusters=es -p dtn_f5_f6 -n1 bash -c 'ls -la /opt/cray/pe/lmod/lmod/init/bash'
-rwxr-xr-x 1 root root 5636 Jan 17  2025 /opt/cray/pe/lmod/lmod/init/bash

# DTN nodes are SLES 15 SP6, same as login nodes
$ srun --clusters=es -p dtn_f5_f6 -n1 bash -c 'hostname; cat /etc/os-release | head -2'
dtn04
NAME="SLES"
VERSION="15-SP6"

Recommendations

1. Infrastructure: Report to NCRC Admins (Short-term)

Report the transient NFS mount issue to NCRC/Gaea system administrators. DTN nodes losing access to /opt/cray/pe is an infrastructure problem that affects any job requiring modules on those nodes.

2. Code Hardening: Retry Logic in module-setup.sh

Add a retry-with-delay around the lmod init source for the gaeac6 case to handle transient NFS issues:

elif [[ ${MACHINE_ID} = gaeac6 ]]; then
    # We are on GAEA C6.
    if (! eval module help > /dev/null 2>&1); then
        # Plain variables, not "local": "local" is an error when this file
        # is sourced outside a function.
        _lmod_init="/opt/cray/pe/lmod/lmod/init/bash"
        _retries=3
        for (( _i=1; _i<=_retries; _i++ )); do
            # Source the init file as soon as it is visible, then stop retrying.
            if [[ -f "${_lmod_init}" ]]; then
                source "${_lmod_init}" && break
            fi
            echo "WARNING: lmod init unavailable (attempt ${_i}/${_retries}), waiting 5s..." >&2
            sleep 5
        done
        unset _lmod_init _retries _i
    fi
    module reset
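The same retry idea can be packaged as a reusable helper that fails loudly once the attempts are exhausted, instead of falling through to a module command that is still undefined. This is a sketch under stated assumptions, not the project's implementation: the function name source_with_retry and its argument order (file, retries, delay in seconds) are hypothetical.

```shell
#!/usr/bin/env bash
# Try to source a file up to N times, pausing between attempts.
# Returns 0 once the file has been sourced, 1 if every attempt fails.
source_with_retry() {
    local file="$1" retries="${2:-3}" delay="${3:-5}" attempt
    for (( attempt = 1; attempt <= retries; attempt++ )); do
        if [[ -r "${file}" ]]; then
            # shellcheck disable=SC1090
            source "${file}" && return 0
        fi
        echo "WARNING: could not source ${file} (attempt ${attempt}/${retries})" >&2
        # Skip the pause after the final attempt.
        if (( attempt < retries )); then
            sleep "${delay}"
        fi
    done
    echo "FATAL ERROR: giving up on ${file} after ${retries} attempts" >&2
    return 1
}
```

A caller could write `source_with_retry /opt/cray/pe/lmod/lmod/init/bash 3 5 || exit 1`. One caveat: sourcing from inside a function changes the scope of any `declare` or `local` statements in the sourced file, so this would need checking against the real lmod init script before adoption.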

3. Increase maxtries for DTN Tasks

The current retry count may not be sufficient for transient infrastructure issues. Consider setting a higher retry count specifically for arch_tar metatasks in the Rocoto XML, since these are the only tasks running on DTN nodes.

4. Pre-flight Mount Check in arch_tars.sh

Before loading modules, verify the filesystem mount is healthy:

# Wait for /opt/cray/pe to be available (transient NFS issue workaround)
for i in 1 2 3; do
    [[ -d /opt/cray/pe ]] && break
    echo "WARNING: /opt/cray/pe not mounted, retrying in 10s (attempt $i/3)..." >&2
    sleep 10
done

5. Long-term: Evaluate Module Necessity on DTN

Consider whether DTN arch_tar jobs truly need a full module environment. If the jobs only perform file operations (tar, htar, hsi), a minimal environment without lmod could suffice, bypassing the issue entirely.
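One way to start that evaluation is to check which of the required tools are already on PATH before any module is loaded. The helper below is a minimal sketch. The name missing_tools is hypothetical, and the tool list a caller passes in (tar, htar, hsi) is taken from the paragraph above rather than from an audit of the arch_tar scripts.

```shell
#!/usr/bin/env bash
# Print each named tool that is not reachable via PATH.
# Returns 0 if every tool was found, 1 if at least one is missing.
missing_tools() {
    local tool status=0
    for tool in "$@"; do
        if ! command -v "${tool}" > /dev/null 2>&1; then
            echo "${tool}"
            status=1
        fi
    done
    return "${status}"
}
```

Calling `missing_tools tar htar hsi` at the top of a job submitted with --export=NONE, before load_modules.sh is sourced, would show whether those tools are reachable without lmod on the DTN nodes.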


Conclusion

This is not a code bug — it is a transient Gaea infrastructure issue where NFS-mounted Cray PE paths become temporarily unavailable on DTN nodes. The failure is non-deterministic, affecting only arch_tar jobs that run on the dtn_f5_f6 partition. Code hardening (retry logic) and infrastructure reporting are the recommended mitigations.
