C96C48mx500_S2SW_cyc_gfs atmos_prod_f102_WRITE_ERROR - TerrenceMcGuinness-NOAA/global-workflow GitHub Wiki

C96C48mx500_S2SW_cyc_gfs - JGLOBAL_ATMOS_PRODUCTS f102 Write Error Analysis

Test Case: C96C48mx500_S2SW_cyc_gfs
Job: JGLOBAL_ATMOS_PRODUCTS (forecast hour f102)
Platform: Hera (RDHPCS)
Node: h9c53
Slurm Job ID: 21785517
Build: nightly_0_4a26a69c_8508
Failure Time: Tue Feb 10 06:43:36 UTC 2026
Analysis Date: February 10, 2026
Status: FATAL ERROR - wgrib2 write failure
Analyzed By: EIB MCP-RAG Toolset (42-tool analysis)


Executive Summary

The JGLOBAL_ATMOS_PRODUCTS job for forecast hour f102 in the C96C48mx500_S2SW_cyc_gfs nightly CI test failed with a *** FATAL ERROR: write error *** emitted by wgrib2 during GRIB2 record extraction from the master file. The failure occurred at approximately pressure level 15 mb (record 161 of ~1300+) while filtering the master GRIB2 file through paramlist "a". A secondary concern is a wgrib2 spack-stack version mismatch between the loaded modules (spack-stack-1.9.2) and the wgrib2 binary actually invoked (spack-stack-1.6.0). Additionally, the setpdy.sh utility failed to find the COMROOT date file, though this was handled as non-fatal.

Root Cause Assessment: Filesystem write failure on /scratch3 — most likely caused by disk quota exhaustion, filesystem capacity limits, or a transient I/O error on the compute node. The wgrib2 version mismatch is a contributing risk factor.


Failure Chain

JGLOBAL_ATMOS_PRODUCTS
  └── jjob_header.sh
        ├── setpdy.sh → WARN: COMROOT/date/t00z not found (non-fatal)
        ├── source config.base → OK
        ├── source config.atmos_products → OK
        └── source HERA.env → OK (USE_CFP=YES)
  └── exglobal_atmos_products.sh
        ├── ${WGRIB2} -s MASTER_FILE > idx → OK
        ├── nset=1 (paramlist "a")
        │     └── ${WGRIB2} MASTER_FILE | grep -F -f paramlist.a | ${WGRIB2} -i -grib tmpfilea_f102 MASTER_FILE
        │           ├── Records 1-160: OK (reached 15 mb level)
        │           ├── Record 161 (UGRD:15 mb): *** FATAL ERROR: write error ***
        │           └── err=8
        └── err_exit "FATAL ERROR: wgrib2 failed to create intermediate grib2 file..."
              └── ABNORMAL EXIT → Job CANCELLED (SIGNAL Terminated)

Detailed Analysis

1. Primary Failure: wgrib2 Write Error

What happened:
The exglobal_atmos_products.sh script (line 64) constructs a pipeline that:

  1. Lists all records in the master GRIB2 file with ${WGRIB2} "${MASTER_FILE}"
  2. Filters records matching patterns in gfs.fFFF.paramlist.a.txt with grep -F -f
  3. Extracts matched records with ${WGRIB2} -i -grib "${tmpfile}" "${MASTER_FILE}"

At record 161 (UGRD at 15 mb, 102-hour forecast), the output-stage wgrib2 process hit a write error while writing to tmpfilea_f102. The output file had reached only 5,521,408 bytes (~5.3 MiB) when the job exited, as shown in the directory listing at exit.

Error output (line 1086 of log):

*** FATAL ERROR: write error ***

Exit code: err=8 (wgrib2 non-zero return code)

Master file:

.../COMROOT/C96C48mx500_S2SW_cyc_gfs_4a26a69c-8508/gfs.20211221/00/model/atmos/master/gfs.t00z.master.f102.grib2

Paramlist file:

.../global-workflow/parm/product/gfs.fFFF.paramlist.a.txt
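
Because the failure surfaced in the last stage of a three-stage pipeline, it can help to record the exit status of every stage rather than only the final one. The following is a minimal bash sketch of that idea, not the logic used in exglobal_atmos_products.sh. The names run_stages and stub_* are hypothetical, and the stubs stand in for the two wgrib2 invocations so the example runs without wgrib2.

```shell
#!/usr/bin/env bash
# Sketch: capture the exit status of each stage in a three-stage pipeline.
# run_stages and the stub_* functions are hypothetical names for illustration.

run_stages() {
    local lister=$1 filter=$2 writer=$3
    local -a rc
    "${lister}" | "${filter}" | "${writer}"
    rc=("${PIPESTATUS[@]}")   # must be read immediately after the pipeline
    echo "list=${rc[0]} filter=${rc[1]} write=${rc[2]}"
    # Succeed only if every stage succeeded, not just the last one.
    (( rc[0] == 0 && rc[1] == 0 && rc[2] == 0 ))
}

# Stubs standing in for the real stages.
stub_list()       { printf '1:0:UGRD:10 mb\n2:100:UGRD:15 mb\n'; }
stub_filter()     { grep -F 'UGRD'; }
stub_write_ok()   { cat > /dev/null; }
stub_write_fail() { cat > /dev/null; return 8; }   # mimics err=8 from the writer
```

Calling run_stages stub_list stub_filter stub_write_fail reports a non-zero status for the write stage only, which is the pattern seen in this log: the listing and filtering stages ran, and the writer failed.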

2. Secondary Issue: setpdy.sh File Not Found

What happened (log lines 110, 113):

sed: can't read .../RUNTESTS/COMROOT/date/t00z: No such file or directory
.../jjob_header.sh: line 96: ./PDY: No such file or directory

The setpdy.sh utility expects a date reference file at ${COMROOT}/date/t00z, which was missing. The jjob_header.sh script uses || true to swallow these errors (lines 95-96), so the job continued. The PDY variable was already set from the environment (PDY=20211221), so this did not contribute to the failure — but it indicates the CI COMROOT directory structure may be incomplete for this test case.

3. wgrib2 Spack-Stack Version Mismatch

Loaded modules: spack-stack 1.9.2 (intel-oneapi 2024.2.1)
Actual wgrib2 binary invoked:

/contrib/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/intel/2021.5.0/wgrib2-2.0.8-nauzcdx/bin/wgrib2

The module environment loaded spack-stack-1.9.2 modules (81 modules visible in the log), but the wgrib2 binary resolved to spack-stack 1.6.0 (compiled with intel/2021.5.0). This cross-version linkage could cause:

  • ABI incompatibility with other loaded libraries
  • Unexpected behavior with newer GRIB2 features
  • Library path conflicts

How this can happen: The load_modules.sh script (line 88) runs module load "wgrib2/${wgrib2_ver}", where wgrib2_ver comes from versions/run.ver. On Hera, the gw_run.hera module should load wgrib2 from the correct spack-stack, but the actual resolution depends on MODULEPATH ordering, so the 1.6.0 path may be taking precedence.
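
A mismatch like this can be detected at run time by comparing the resolved binary path against the expected stack version. The sketch below assumes the stack version appears in the path as a spack-stack-X.Y.Z directory, as it does in the path quoted above. The helper names stack_of and check_stack are hypothetical and are not part of global-workflow.

```shell
#!/usr/bin/env bash
# Sketch: flag a tool whose resolved path does not belong to the expected spack-stack.
# stack_of and check_stack are hypothetical helper names.

# Print the spack-stack version embedded in a path, or nothing if none is found.
stack_of() {
    printf '%s\n' "$1" | sed -n 's|.*/spack-stack-\([0-9][0-9.]*\)/.*|\1|p'
}

# Return 0 when the path's stack version equals the expected one, 1 otherwise.
check_stack() {
    local path=$1 expected=$2 found
    found=$(stack_of "${path}")
    if [ "${found}" != "${expected}" ]; then
        echo "WARNING: ${path} is from spack-stack '${found:-unknown}', expected '${expected}'"
        return 1
    fi
}
```

A possible use after module loading would be check_stack "$(command -v wgrib2)" 1.9.2, which would have printed a warning for the binary invoked in this job.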

4. Configuration Context

From the log's config.base trace:

| Parameter | Value | Relevance |
|---|---|---|
| APP | S2SW | Coupled seasonal-to-subseasonal application with waves (atmosphere/ocean/ice/wave) |
| CASE | C96 | Atmospheric resolution |
| CASE_ENS | C48 | Ensemble resolution |
| OCNRES | 500 | Ocean resolution (mx500) |
| MODE | cycled | Cycling data assimilation |
| FHMAX_GFS | 120 | Max forecast hour (f102 is within range) |
| FCST_SEGMENTS | 0,120 | Single segment, no breakpoints |
| FHOUT_GFS | 3 | GFS output frequency (hours) |
| FHMAX_HF_GFS | 48 | High-frequency output max hour |
| downset | 2 | Number of paramlist sets (a, b) |
| MAX_TASKS | 25 | Max parallel tasks for products |
| ntasks | 24 | Allocated task count |
| KEEPDATA | NO | Working directory cleaned on success |
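
The claim that f102 is a valid product hour can be checked with simple arithmetic: 102 is at most FHMAX_GFS (120) and is a multiple of FHOUT_GFS (3). The sketch below encodes that check. It assumes output hours beyond the high-frequency window are multiples of the output interval, and is_output_hour is a hypothetical helper name. Hours must be passed as plain decimal numbers (102, not f102, and without leading zeros).

```shell
#!/usr/bin/env bash
# Sketch: is a forecast hour a scheduled output hour?
# is_output_hour is a hypothetical helper; arguments are plain decimal integers.

# Return 0 when fhr lies in [0, fhmax] and is a multiple of fhout.
is_output_hour() {
    local fhr=$1 fhout=$2 fhmax=$3
    [ "${fhr}" -ge 0 ] && [ "${fhr}" -le "${fhmax}" ] && [ $(( fhr % fhout )) -eq 0 ]
}
```

With the values from this run, is_output_hour 102 3 120 succeeds, so the forecast hour itself is not the source of the failure.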

5. MCP Tool Analysis Summary

| MCP Tool | Finding |
|---|---|
| get_job_details | JGLOBAL_ATMOS_PRODUCTS sources jjob_header.sh -e atmos_products -c "base atmos_products", calls exglobal_atmos_products.sh |
| describe_component | exglobal_atmos_products.sh is 224 lines; uses wgrib2 pipeline to filter master GRIB2 through paramlist, splits into MPMD chunks, regrids via interp_atmos_master.sh |
| trace_execution_path | JGLOBAL_ATMOS_PRODUCTS depends on: PDY, cyc, HOMEgfs, SCRgfs, DATAROOT, DATA, RUN, KEEPDATA, pgmout, err, grid, prod_dir |
| find_env_dependencies(WGRIB2) | 19 scripts depend on WGRIB2; exported by load_modules.sh as wgrib2 command name |
| find_env_dependencies(FHMAX_GFS) | 7 scripts depend on FHMAX_GFS; f102 < 120 so within valid range |
| search_documentation | EE2 standards require FATAL ERROR prefix (compliant), descriptive messages (compliant), recovery capability for jobs >15min |
| get_operational_guidance | EE2 mandates: working dirs with unique job IDs (met: .../atmos_products_f102.1435126), MPMD child processes in subdirs |
| search_issues | Issue #4348: "Transient instability on Hercules": similar C96C48mx500 nightly failures, but NaN-related, not write errors |
| search_issues | Issue #3630: "Long runtimes in gfs_gempak_f jobs on WCOSS2": walltime issues in post-processing, separate root cause |
| get_code_context | exglobal_atmos_products.sh sets FLUX_FILE, depends on err variable; 2-hop graph shows connections to wave processing, radmon scripts |
| find_related_files | interp_atmos_master.sh uses product_functions.sh for RH trimming and sea-ice fixes; depends on same WGRIB2 env var |

Root Cause Assessment

Most Likely: Filesystem Space/Quota Exhaustion on /scratch3

wgrib2 emits *** FATAL ERROR: write error *** when a write to its output file fails. On HPC systems, the usual causes are:

  1. Disk quota exceeded: The /scratch3/NCEPDEV/stmp/role.glopara area may have hit its allocation limit. The CI pipeline writes forecast output, restart files, and post-processed products for multiple test cases simultaneously. The C96C48mx500_S2SW_cyc_gfs case generates substantial data due to coupled ocean/ice/wave components.

  2. Filesystem capacity: Lustre /scratch3 on Hera may have been close to full, either overall or on individual OSTs, leading to temporary ENOSPC conditions while other users and jobs were writing concurrently.

  3. Transient I/O error: Node h9c53 may have experienced a Lustre client issue (OBD timeout, network hiccup to OSTs). The Slurm Terminated signal (line 1296) carries the same timestamp as the err_exit abnormal exit (06:43:36), so it is consistent with err_exit ending the job and is not by itself evidence of a node fault.

Less Likely but Contributing:

  1. wgrib2 version mismatch — The spack-stack-1.6.0 wgrib2 (compiled with intel/2021.5.0) running in a spack-stack-1.9.2 environment (intel-oneapi/2024.2.1) could have library conflicts causing unexpected behavior, though a clean "write error" is more indicative of filesystem issues.

  2. Pipe buffer overflow — The three-stage pipeline (wgrib2 | grep | wgrib2 -i -grib) could theoretically deadlock if the output wgrib2 stalls and fills the pipe buffer, but this would not produce a "write error" on the file output.
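
To separate a persistent filesystem problem (quota or capacity) from a transient one, a small write probe in the affected directory can be useful. The following is a sketch under assumptions: probe_write is a hypothetical helper, the probe file name is arbitrary, and conv=fsync assumes GNU dd. It only shows whether the directory accepts writes now; it cannot confirm what happened at the time of the failure.

```shell
#!/usr/bin/env bash
# Sketch: test whether a directory currently accepts a flushed write.
# probe_write is a hypothetical helper; conv=fsync assumes GNU dd.

# Write size_kb KiB of zeros to a scratch file in dir, flush it, then remove it.
# Returns dd's exit status and prints dd's message on failure.
probe_write() {
    local dir=$1 size_kb=${2:-1024}
    local probe="${dir}/.write_probe.$$" msg rc=0
    msg=$(dd if=/dev/zero of="${probe}" bs=1024 count="${size_kb}" conv=fsync 2>&1) || rc=$?
    rm -f "${probe}"
    if [ "${rc}" -ne 0 ]; then
        echo "write probe failed in ${dir}: ${msg}"
    fi
    return "${rc}"
}
```

If probe_write "${DATA}" fails repeatedly with a quota or no-space message, the first two causes above are the better explanation. If it succeeds, a transient error becomes more plausible.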


Recommendations

Immediate (Fix This Failure)

  1. Check /scratch3 quota usage:

    lfs quota -u role.glopara /scratch3
    df -h /scratch3
    

    If quota is near limits, purge old RUNDIRS and RUNTESTS output from previous nightly runs.

  2. Re-run the failing job: If this was a transient I/O issue, a retry should succeed. The CI framework should be configured to automatically retry atmos_products failures (EE2 standard: recovery for post-processing jobs).

  3. Verify node h9c53 health:

    sacctmgr show event nodes=h9c53
    scontrol show node h9c53
    

    Check for Lustre client errors in /var/log/messages on the node.

Short-Term (Prevent Recurrence)

  1. Add pre-flight disk space check in exglobal_atmos_products.sh before the wgrib2 pipeline:

    # Check available space in the DATA directory (df -P keeps each filesystem on one line)
    avail_kb=$(df -Pk "${DATA}" | awk 'NR==2 {print $4}')
    if [[ "${avail_kb}" -lt 10485760 ]]; then  # 10 GiB minimum, expressed in KiB
        echo "FATAL ERROR: Insufficient disk space (${avail_kb} KB available) in ${DATA}"
        export err=99
        err_exit "Disk space check failed"
    fi
    
  2. Fix wgrib2 version alignment — Ensure the gw_run.hera.lua module file resolves wgrib2 from spack-stack-1.9.2, not 1.6.0. Check MODULEPATH ordering and the wgrib2_ver variable in versions/spack.ver (currently 3.6.0 — but the binary on Hera resolved to 2.0.8).

  3. Add retry logic for wgrib2 write failures — Since this is a post-processing job (not forecast), a transient write error should trigger a retry:

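    max_attempts=3
    for attempt in $(seq 1 "${max_attempts}"); do
        rm -f "${tmpfile}"  # discard partial output left by a failed attempt
        ${WGRIB2} "${MASTER_FILE}" | grep -F -f "${parmfile}" | ${WGRIB2} -i -grib "${tmpfile}" "${MASTER_FILE}" && break
        if [[ "${attempt}" -eq "${max_attempts}" ]]; then
            err_exit "FATAL ERROR: wgrib2 failed after ${max_attempts} attempts"
        fi
        echo "WARNING: wgrib2 attempt ${attempt} failed, retrying in 10s..."
        sleep 10
    done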
    

Long-Term (Systemic Improvements)

  1. Implement CI disk usage monitoring — Add a pre-pipeline step to the nightly CI that checks /scratch3 usage and sends alerts or skips low-priority test cases when space is limited.

  2. Separate forecast hour products into independent jobs — Per EE2 operational guidance: "submit a separate post-processing job for each forecast hour, so any failure for one forecast hour does not impact others." The current design already does this (each forecast hour gets its own atmos_products_fNNN job), which is compliant.

  3. Add tmpfile size validation — After the wgrib2 pipeline, verify the output file is non-empty and has a reasonable record count before proceeding:

    if [[ ! -s "${tmpfile}" ]]; then
        err_exit "FATAL ERROR: wgrib2 output file '${tmpfile}' is empty or missing"
    fi
    actual_records=$(${WGRIB2} "${tmpfile}" | wc -l)
    if [[ "${actual_records}" -lt 100 ]]; then
        echo "WARNING: Only ${actual_records} records extracted (expected 300+)"
    fi
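
The CI disk usage monitoring step recommended in item 1 of this list could look roughly like the sketch below. It is illustrative only: usage_pct and ci_space_action are hypothetical helpers, and the 80% and 95% thresholds are placeholder values, not site policy. Note that df reports filesystem-wide usage, so a per-user or per-project Lustre quota would still need a separate lfs quota check.

```shell
#!/usr/bin/env bash
# Sketch: pre-pipeline disk usage check for a nightly CI run.
# usage_pct and ci_space_action are hypothetical helpers; thresholds are placeholders.

# Print the usage percentage (integer, no % sign) of the filesystem holding path.
usage_pct() {
    df -Pk "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Map a usage percentage to a CI action: proceed, alert, or skip-low-priority.
ci_space_action() {
    local pct=$1 warn_at=${2:-80} skip_at=${3:-95}
    if [ "${pct}" -ge "${skip_at}" ]; then
        echo "skip-low-priority"
    elif [ "${pct}" -ge "${warn_at}" ]; then
        echo "alert"
    else
        echo "proceed"
    fi
}
```

A nightly driver could then call ci_space_action "$(usage_pct /scratch3)" and branch on the printed action before launching any test cases.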
    

Affected Files (Execution Trace)

| File | Role |
|---|---|
| jobs/JGLOBAL_ATMOS_PRODUCTS | J-Job wrapper: sources configs, calls ex-script |
| ush/jjob_header.sh | Universal j-job header: sets DATA, sources PDY, configs, env |
| scripts/exglobal_atmos_products.sh | Main ex-script: wgrib2 pipeline, MPMD dispatch, product generation |
| ush/interp_atmos_master.sh | Per-chunk interpolation: regrid to 0p25/0p50/1p00 (NOT REACHED) |
| ush/product_functions.sh | Helper functions for RH trimming, sea-ice fixes |
| ush/load_modules.sh | Module loading: sets WGRIB2 env var |
| parm/product/gfs.fFFF.paramlist.a.txt | GRIB2 variable filter list (group "a") |
| parm/config/gfs/config.atmos_products | Job configuration (downset=2, MAX_TASKS=25) |
| parm/config/gfs/config.resources | Resource allocation (ntasks=24, walltime=00:15:00) |
| env/HERA.env | Hera-specific settings (USE_CFP=YES, launcher=srun) |

Related Issues

  • #4348 — "Transient instability on Hercules" — C96C48mx500 nightly failures with NaN errors in ensemble forecast jobs (different root cause, same test case family)
  • #3630 — "Long runtimes in gfs_gempak_f jobs on WCOSS2" — Post-processing walltime issues (different platform, similar product generation pipeline)

Log Evidence Summary

| Log Line | Content | Significance |
|---|---|---|
| 110 | sed: can't read .../COMROOT/date/t00z: No such file or directory | setpdy.sh cannot find date file (non-fatal) |
| 113 | ./PDY: No such file or directory | PDY source file missing (non-fatal, PDY from env) |
| 912 | wgrib2 -s .../gfs.t00z.master.f102.grib2 | Index creation succeeds; master file is readable |
| 922-924 | `wgrib2 MASTER \| grep -F -f paramlist.a \| wgrib2 -i -grib tmpfilea_f102 MASTER` | Extraction pipeline for paramlist "a" starts |
| 1085 | 161:3078964:d=202 (truncated) | Record 161 output truncated; write failure in progress |
| 1086 | *** FATAL ERROR: write error *** | Primary failure: wgrib2 cannot write to tmpfile |
| 1232 | err=8 | Exit code captured |
| 1239 | FATAL ERROR: wgrib2 failed to create intermediate grib2 file... | err_exit message (EE2 compliant) |
| 1240 | ABNORMAL EXIT at Tue Feb 10 06:43:36 UTC 2026 on h9c53 | Job termination |
| 1294 | tmpfilea_f102 size: 5,521,408 bytes | Partial output file; extraction stopped at master record 161 of ~1300+ |
| 1296 | JOB 21785517 ON h9c53 CANCELLED AT 2026-02-10T06:43:36 DUE to SIGNAL Terminated | Slurm terminated the job |
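
Evidence like the entries above can be pulled from a job log with a single grep that prints line numbers. The sketch below is one way to do it. scan_log is a hypothetical helper, and the marker list is an assumption drawn from the messages seen in this log, not an exhaustive set.

```shell
#!/usr/bin/env bash
# Sketch: list log lines that carry a failure marker, prefixed with their line numbers.
# scan_log is a hypothetical helper; the marker list is drawn from this log only.

scan_log() {
    grep -n -E 'FATAL ERROR|ABNORMAL EXIT|CANCELLED|No such file or directory|err=[1-9]' "$1"
}
```

Running scan_log on the job's log file prints entries in line_number:text form, which makes it quick to confirm the order of events (missing date file, write error, err_exit, Slurm cancellation).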

Analysis generated using EIB MCP-RAG Server v3.6.2 (42 tools) with ChromaDB semantic search (60,404 documents) and Neo4j graph analysis (484,901 relationships). Tools used: get_job_details, describe_component, search_documentation, get_operational_guidance, trace_execution_path, find_env_dependencies, get_code_context, search_issues, explain_with_context, find_callers_callees, find_related_files.