C96C48mx500_S2SW_cyc_gfs - JGLOBAL_ATMOS_PRODUCTS f102 Write Error Analysis

Test Case: C96C48mx500_S2SW_cyc_gfs
Job: JGLOBAL_ATMOS_PRODUCTS (forecast hour f102)
Platform: Hera (RDHPCS)
Node: h9c53
Slurm Job ID: 21785517
Build: nightly_0_4a26a69c_8508
Failure Time: Tue Feb 10 06:43:36 UTC 2026
Analysis Date: February 10, 2026
Status: FATAL ERROR - wgrib2 write failure
Analyzed By: EIB MCP-RAG Toolset (42-tool analysis)

Executive Summary

The JGLOBAL_ATMOS_PRODUCTS job for forecast hour f102 in the C96C48mx500_S2SW_cyc_gfs nightly CI test failed with a *** FATAL ERROR: write error *** emitted by wgrib2 during GRIB2 record extraction from the master file. The failure occurred at approximately pressure level 15 mb (record 161 of ~1300+) while filtering the master GRIB2 file through paramlist "a". A secondary concern is a wgrib2 spack-stack version mismatch between the loaded modules (spack-stack-1.9.2) and the wgrib2 binary actually invoked (spack-stack-1.6.0). Additionally, the setpdy.sh utility failed to find the COMROOT date file, though this was handled as non-fatal.

Root Cause Assessment: Filesystem write failure on /scratch3 — most likely caused by disk quota exhaustion, filesystem capacity limits, or a transient I/O error on the compute node. The wgrib2 version mismatch is a contributing risk factor.

Failure Chain

JGLOBAL_ATMOS_PRODUCTS
  └── jjob_header.sh
        ├── setpdy.sh → WARN: COMROOT/date/t00z not found (non-fatal)
        ├── source config.base → OK
        ├── source config.atmos_products → OK
        └── source HERA.env → OK (USE_CFP=YES)
  └── exglobal_atmos_products.sh
        ├── ${WGRIB2} -s MASTER_FILE > idx → OK
        ├── nset=1 (paramlist "a")
        │     └── ${WGRIB2} MASTER_FILE | grep -F -f paramlist.a | ${WGRIB2} -i -grib tmpfilea_f102 MASTER_FILE
        │           ├── Records 1-160: OK (reached 15 mb level)
        │           ├── Record 161 (UGRD:15 mb): *** FATAL ERROR: write error ***
        │           └── err=8
        └── err_exit "FATAL ERROR: wgrib2 failed to create intermediate grib2 file..."
              └── ABNORMAL EXIT → Job CANCELLED (SIGNAL Terminated)

Detailed Analysis

1. Primary Failure: wgrib2 Write Error

What happened:
The exglobal_atmos_products.sh script (line 64) constructs a pipeline that:

Lists all records in the master GRIB2 file with ${WGRIB2} "${MASTER_FILE}"
Filters records matching patterns in gfs.fFFF.paramlist.a.txt with grep -F -f
Extracts matched records with ${WGRIB2} -i -grib "${tmpfile}" "${MASTER_FILE}"

At record 161 (UGRD at 15 mb, 102-hour forecast), the output-stage wgrib2 (the third command in the pipeline) hit a write error while writing tmpfilea_f102. The output file had reached only 5,521,408 bytes (~5.3 MB) when the job exited, as shown in the directory listing at exit.

Error output (line 1086 of log):

*** FATAL ERROR: write error ***

Exit code: err=8 (wgrib2 non-zero return code)

Master file:

.../COMROOT/C96C48mx500_S2SW_cyc_gfs_4a26a69c-8508/gfs.20211221/00/model/atmos/master/gfs.t00z.master.f102.grib2

Paramlist file:

.../global-workflow/parm/product/gfs.fFFF.paramlist.a.txt
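
A bare err=8 does not say which of the three pipeline stages returned the non-zero status. As a minimal sketch, assuming bash, the PIPESTATUS array can record each stage separately. The helper name run_three_stage and its command-string arguments are hypothetical and not part of exglobal_atmos_products.sh:

```shell
#!/usr/bin/env bash
# Hypothetical helper: run "list | filter | write" and report each stage's status.
# The three arguments are command strings; in the real script they would be the
# wgrib2 listing, the grep filter, and the wgrib2 -i -grib writer.
run_three_stage() {
    local list_cmd=$1 filter_cmd=$2 write_cmd=$3
    local -a rc
    local status
    eval "${list_cmd}" | eval "${filter_cmd}" | eval "${write_cmd}"
    rc=("${PIPESTATUS[@]}")   # copy immediately; the next command overwrites it
    echo "list=${rc[0]} filter=${rc[1]} write=${rc[2]}"
    # Return the first non-zero stage status, or 0 if every stage succeeded
    for status in "${rc[@]}"; do
        if [[ ${status} -ne 0 ]]; then
            return "${status}"
        fi
    done
    return 0
}
```

If the writer fails with status 8 while the other two stages succeed, the helper prints list=0 filter=0 write=8, which would confirm that the failure sits in the output stage and not in the listing or the filter.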

2. Secondary Issue: setpdy.sh File Not Found

What happened (log lines 110, 113):

sed: can't read .../RUNTESTS/COMROOT/date/t00z: No such file or directory
.../jjob_header.sh: line 96: ./PDY: No such file or directory

The setpdy.sh utility expects a date reference file at ${COMROOT}/date/t00z, which was missing. The jjob_header.sh script uses || true to swallow these errors (lines 95-96), so the job continued. The PDY variable was already set from the environment (PDY=20211221), so this did not contribute to the failure — but it indicates the CI COMROOT directory structure may be incomplete for this test case.
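
The fallback described above (date file missing, PDY taken from the environment) could be made explicit instead of being hidden behind || true. The following is only a sketch: check_pdy_source is a hypothetical helper, and it does not attempt to parse the date file, whose format is not shown in the log.

```shell
#!/usr/bin/env bash
# Hypothetical guard: confirm the date file exists, otherwise fall back to a
# PDY value already present in the environment (as happened in this run).
check_pdy_source() {
    local date_file=$1
    if [[ -r "${date_file}" ]]; then
        echo "OK: date file ${date_file} is readable"
        return 0
    fi
    # Accept an 8-digit YYYYMMDD value from the environment
    if [[ "${PDY:-}" =~ ^[0-9]{8}$ ]]; then
        echo "WARNING: ${date_file} not found; using PDY=${PDY} from environment"
        return 0
    fi
    echo "FATAL ERROR: ${date_file} not found and PDY is unset or malformed"
    return 1
}
```

With PDY=20211221 exported, a call such as check_pdy_source "${COMROOT}/date/t00z" would log a single warning and continue, while an unset PDY would stop the job before any product generation starts.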

3. wgrib2 Spack-Stack Version Mismatch

Loaded modules: spack-stack 1.9.2 (intel-oneapi 2024.2.1)
Actual wgrib2 binary invoked:

/contrib/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/intel/2021.5.0/wgrib2-2.0.8-nauzcdx/bin/wgrib2

The module environment loaded spack-stack-1.9.2 modules (81 modules visible in the log), but the wgrib2 binary resolved to spack-stack 1.6.0 (compiled with intel/2021.5.0). This cross-version linkage could cause:

ABI incompatibility with other loaded libraries
Unexpected behavior with newer GRIB2 features
Library path conflicts

How this happens: The load_modules.sh script (line 88) does module load "wgrib2/${wgrib2_ver}" where wgrib2_ver comes from versions/run.ver. On Hera, the gw_run.hera module should load wgrib2 from the correct spack-stack. The actual resolution depends on MODULEPATH ordering — the 1.6.0 path may be taking precedence.
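
A mismatch of this kind can be detected at run time by comparing the spack-stack version embedded in the resolved binary path with the version the job expects. This is a sketch under the assumption that the install path contains a spack-stack-X.Y.Z component, as it does in the path logged above. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Extract the spack-stack version embedded in an install path, if any.
stack_version_of() {
    local path=$1
    local re='spack-stack-([0-9]+\.[0-9]+\.[0-9]+)'
    if [[ ${path} =~ ${re} ]]; then
        echo "${BASH_REMATCH[1]}"
    else
        return 1
    fi
}

# Hypothetical check: compare the stack version of the resolved wgrib2 binary
# against the version the job expects (1.9.2 in this run).
check_wgrib2_stack() {
    local expected=$1
    local binary=${2:-$(command -v wgrib2)}
    local found
    if ! found=$(stack_version_of "${binary}"); then
        echo "WARNING: no spack-stack version found in '${binary}'"
        return 2
    fi
    if [[ ${found} != "${expected}" ]]; then
        echo "WARNING: wgrib2 resolves to spack-stack-${found}, expected ${expected}"
        return 1
    fi
    echo "OK: wgrib2 resolves to spack-stack-${expected}"
}
```

Calling check_wgrib2_stack 1.9.2 with the binary path from this log would report that wgrib2 resolves to spack-stack-1.6.0.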

4. Configuration Context

From the log's config.base trace:

Parameter	Value	Relevance
`APP`	S2SW	Subseasonal-to-seasonal coupled application with waves (atmosphere/ocean/ice/wave)
`CASE`	C96	Atmospheric resolution
`CASE_ENS`	C48	Ensemble resolution
`OCNRES`	500	Ocean resolution (mx500)
`MODE`	cycled	Cycling data assimilation
`FHMAX_GFS`	120	Max forecast hour (f102 is within range)
`FCST_SEGMENTS`	0,120	Single segment, no breakpoints
`FHOUT_GFS`	3	GFS output frequency
`FHMAX_HF_GFS`	48	High-frequency output max hour
`downset`	2	Number of paramlist sets (a, b)
`MAX_TASKS`	25	Max parallel tasks for products
`ntasks`	24	Allocated task count
`KEEPDATA`	NO	Working directory cleaned on success
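
The statement that f102 is within range can be checked mechanically from the values in the table. The sketch below is illustrative only: is_valid_fhr is a hypothetical helper, and it deliberately ignores the high-frequency window (FHMAX_HF_GFS), whose output interval is not given here.

```shell
#!/usr/bin/env bash
# Hypothetical check: is a forecast-hour label such as "f102" within FHMAX_GFS
# and on the FHOUT_GFS output cadence? High-frequency output is not modelled.
is_valid_fhr() {
    local label=$1 fhmax=$2 fhout=$3
    # Strip the leading "f" and force base 10 so "006" is not read as octal
    local fhr=$((10#${label#f}))
    (( fhr >= 0 && fhr <= fhmax && fhr % fhout == 0 ))
}
```

For this test case, is_valid_fhr f102 120 3 succeeds because 102 is at most 120 and is a multiple of 3.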

5. MCP Tool Analysis Summary

MCP Tool	Finding
`get_job_details`	JGLOBAL_ATMOS_PRODUCTS sources `jjob_header.sh -e atmos_products -c "base atmos_products"`, calls `exglobal_atmos_products.sh`
`describe_component`	`exglobal_atmos_products.sh` is 224 lines; uses wgrib2 pipeline to filter master GRIB2 through paramlist, splits into MPMD chunks, regrids via `interp_atmos_master.sh`
`trace_execution_path`	JGLOBAL_ATMOS_PRODUCTS depends on: PDY, cyc, HOMEgfs, SCRgfs, DATAROOT, DATA, RUN, KEEPDATA, pgmout, err, grid, prod_dir
`find_env_dependencies(WGRIB2)`	19 scripts depend on WGRIB2; exported by `load_modules.sh` as `wgrib2` command name
`find_env_dependencies(FHMAX_GFS)`	7 scripts depend on FHMAX_GFS; f102 < 120 so within valid range
`search_documentation`	EE2 standards require FATAL ERROR prefix (compliant), descriptive messages (compliant), recovery capability for jobs >15min
`get_operational_guidance`	EE2 mandates: working dirs with unique job IDs (met: `.../atmos_products_f102.1435126`), MPMD child processes in subdirs
`search_issues`	Issue #4348: "Transient instability on Hercules" — similar C96C48mx500 nightly failures, but NaN-related not write errors
`search_issues`	Issue #3630: "Long runtimes in gfs_gempak_f jobs on WCOSS2" — walltime issues in post-processing, separate root cause
`get_code_context`	`exglobal_atmos_products.sh` sets FLUX_FILE, depends on err variable; 2-hop graph shows connections to wave processing, radmon scripts
`find_related_files`	`interp_atmos_master.sh` uses `product_functions.sh` for RH trimming and sea-ice fixes; depends on same WGRIB2 env var

Root Cause Assessment

Most Likely: Filesystem Space/Quota Exhaustion on /scratch3

The *** FATAL ERROR: write error *** message is what wgrib2 reports when a write to its output file fails. On HPC systems, this typically points to one of the following:

Disk quota exceeded: The /scratch3/NCEPDEV/stmp/role.glopara allocation may have hit its limit. The CI pipeline writes forecast output, restart files, and post-processed products for multiple test cases simultaneously, and the C96C48mx500_S2SW_cyc_gfs case generates substantial data because of its coupled ocean/ice/wave components.
Filesystem capacity: Lustre /scratch3 on Hera may have been close to full (overall or on individual OSTs) because of concurrent users and jobs, leading to temporary ENOSPC conditions.
Transient I/O error: Node h9c53 may have experienced a Lustre client issue (OBD timeout, network hiccup to OSTs). Note that the Slurm Terminated signal (line 1296) carries the same 06:43:36 timestamp as the ABNORMAL EXIT raised through err_exit (line 1240), so it is consistent with the error handler cancelling the job and is not, by itself, evidence of a node fault.

Less Likely but Contributing:

wgrib2 version mismatch: The spack-stack-1.6.0 wgrib2 (compiled with intel/2021.5.0) running in a spack-stack-1.9.2 environment (intel-oneapi/2024.2.1) could have library conflicts causing unexpected behavior, though a clean "write error" is more indicative of filesystem issues.
Pipe buffer stall: In the three-stage pipeline (wgrib2 | grep | wgrib2 -i -grib), a stalled output wgrib2 would cause the upstream stages to block once the pipe buffer fills. That would appear as a hang, not as a "write error" on the output file, so it does not explain this failure.
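
One way to separate the filesystem hypotheses above from a software fault is to attempt a small write in the same directory and capture the error text. The sketch below assumes mktemp and dd are available on the node. probe_write is a hypothetical helper, and because buffered writes can delay error reporting, a successful probe is only weak evidence that the filesystem is healthy.

```shell
#!/usr/bin/env bash
# Hypothetical probe: write a small scratch file into a directory and report
# the error text if the write fails (for example "No space left on device"
# or "Disk quota exceeded").
probe_write() {
    local dir=$1 size_kb=${2:-1024}
    local probe msg
    probe=$(mktemp "${dir}/.write_probe.XXXXXX") || {
        echo "FAIL: cannot create a file in ${dir}"
        return 1
    }
    if msg=$(dd if=/dev/zero of="${probe}" bs=1024 count="${size_kb}" 2>&1); then
        rm -f "${probe}"
        echo "OK: wrote ${size_kb} KB to ${dir}"
        return 0
    fi
    rm -f "${probe}"
    echo "FAIL: write to ${dir} failed: ${msg}"
    return 1
}
```

Running probe_write "${DATA}" 10240 from the same account on the same node would exercise the quota and the filesystem in roughly the way the failing wgrib2 write did.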

Recommendations

Immediate (Fix This Failure)

Check /scratch3 quota usage:
```
lfs quota -u role.glopara /scratch3
df -h /scratch3
```
If quota is near limits, purge old RUNDIRS and RUNTESTS output from previous nightly runs.
Re-run the failing job: If this was a transient I/O issue, a retry should succeed. The CI framework should be configured to automatically retry atmos_products failures (EE2 standard: recovery for post-processing jobs).
Verify node h9c53 health:
```
sinfo -N -l -n h9c53
scontrol show node h9c53
```
Check for Lustre client errors in /var/log/messages on the node.

Short-Term (Prevent Recurrence)

Add pre-flight disk space check in exglobal_atmos_products.sh before the wgrib2 pipeline:

# Check available space in the DATA directory (-P keeps each filesystem on one output line)
avail_kb=$(df -Pk "${DATA}" | awk 'NR==2 {print $4}')
if [[ ${avail_kb} -lt 10485760 ]]; then  # 10 GB minimum
    echo "FATAL ERROR: Insufficient disk space (${avail_kb} KB available) in ${DATA}"
    export err=99
    err_exit "Disk space check failed"
fi

Fix wgrib2 version alignment — Ensure the gw_run.hera.lua module file resolves wgrib2 from spack-stack-1.9.2, not 1.6.0. Check MODULEPATH ordering and the wgrib2_ver variable in versions/spack.ver (currently 3.6.0 — but the binary on Hera resolved to 2.0.8).

Add retry logic for wgrib2 write failures — Since this is a post-processing job (not forecast), a transient write error should trigger a retry:

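# Retry up to 3 times; err keeps the last status so a final failure is not dropped
for attempt in 1 2 3; do
    if ${WGRIB2} "${MASTER_FILE}" | grep -F -f "${parmfile}" | ${WGRIB2} -i -grib "${tmpfile}" "${MASTER_FILE}"; then
        export err=0; break
    else
        export err=$?
    fi
    echo "WARNING: wgrib2 attempt ${attempt} of 3 failed (err=${err})"
    if [[ ${attempt} -lt 3 ]]; then sleep 10; fi
done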

Long-Term (Systemic Improvements)

Implement CI disk usage monitoring — Add a pre-pipeline step to the nightly CI that checks /scratch3 usage and sends alerts or skips low-priority test cases when space is limited.
Separate forecast hour products into independent jobs — Per EE2 operational guidance: "submit a separate post-processing job for each forecast hour, so any failure for one forecast hour does not impact others." The current design already does this (each forecast hour gets its own atmos_products_fNNN job), which is compliant.
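
The disk usage monitoring step proposed above could start as a simple gate run before the nightly pipeline. This is a sketch only: usage_pct and ci_space_gate are hypothetical names, the 85% and 95% thresholds are placeholders, not site policy, and df reports filesystem-wide usage, so a per-user quota would still need a separate check such as lfs quota.

```shell
#!/usr/bin/env bash
# Print the percentage of space used on the filesystem holding a directory.
usage_pct() {
    # -P keeps each filesystem on one line even when the device name is long
    df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Hypothetical gate: decide whether the nightly run should proceed.
ci_space_gate() {
    local dir=$1 warn_pct=${2:-85} stop_pct=${3:-95}
    local used
    used=$(usage_pct "${dir}") || return 2
    if (( used >= stop_pct )); then
        echo "STOP: ${dir} is ${used}% full"
        return 1
    elif (( used >= warn_pct )); then
        echo "WARN: ${dir} is ${used}% full"
    else
        echo "OK: ${dir} is ${used}% full"
    fi
    return 0
}
```

A non-zero return from ci_space_gate /scratch3 could be used to skip low-priority test cases or to send an alert before any forecast output is written.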

Add tmpfile size validation — After the wgrib2 pipeline, verify the output file is non-empty and has a reasonable record count before proceeding:

if [[ ! -s "${tmpfile}" ]]; then
    err_exit "FATAL ERROR: wgrib2 output file '${tmpfile}' is empty or missing"
fi
actual_records=$(${WGRIB2} "${tmpfile}" | wc -l)
if [[ ${actual_records} -lt 100 ]]; then
    echo "WARNING: Only ${actual_records} records extracted (expected 300+)"
fi
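
A fixed threshold can miss a partially written file. As an alternative sketch, the expected record count can be derived from the inventory itself by counting the lines that match the paramlist. check_record_count is a hypothetical helper that works on text inventories (one line per record) and assumes they use the same inventory format the paramlist patterns are matched against in the real pipeline.

```shell
#!/usr/bin/env bash
# Hypothetical check: the number of records in the extracted file should equal
# the number of master inventory lines that match the paramlist.
check_record_count() {
    local idx_file=$1 parm_file=$2 out_inventory=$3
    local expected actual
    # grep -c exits non-zero when the count is 0, so tolerate that status
    expected=$(grep -c -F -f "${parm_file}" "${idx_file}" || true)
    actual=$(wc -l < "${out_inventory}")
    actual=$((actual))   # strip padding that some wc implementations add
    if (( actual != expected )); then
        echo "MISMATCH: extracted ${actual} of ${expected} expected records"
        return 1
    fi
    echo "OK: ${actual} records"
}
```

In this failure the check would have reported a mismatch, because the output file held only the records written before the write error.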

Affected Files (Execution Trace)

File	Role
`jobs/JGLOBAL_ATMOS_PRODUCTS`	J-Job wrapper — sources configs, calls ex-script
`ush/jjob_header.sh`	Universal j-job header — sets DATA, sources PDY, configs, env
`scripts/exglobal_atmos_products.sh`	Main ex-script — wgrib2 pipeline, MPMD dispatch, product generation
`ush/interp_atmos_master.sh`	Per-chunk interpolation — regrid to 0p25/0p50/1p00 (NOT REACHED)
`ush/product_functions.sh`	Helper functions for RH trimming, sea-ice fixes
`ush/load_modules.sh`	Module loading — sets WGRIB2 env var
`parm/product/gfs.fFFF.paramlist.a.txt`	GRIB2 variable filter list (group "a")
`parm/config/gfs/config.atmos_products`	Job configuration (downset=2, MAX_TASKS=25)
`parm/config/gfs/config.resources`	Resource allocation (ntasks=24, walltime=00:15:00)
`env/HERA.env`	Hera-specific settings (USE_CFP=YES, launcher=srun)

Related Issues

#4348 — "Transient instability on Hercules" — C96C48mx500 nightly failures with NaN errors in ensemble forecast jobs (different root cause, same test case family)
#3630 — "Long runtimes in gfs_gempak_f jobs on WCOSS2" — Post-processing walltime issues (different platform, similar product generation pipeline)

Log Evidence Summary

Log Line	Content	Significance
110	`sed: can't read .../COMROOT/date/t00z: No such file or directory`	setpdy.sh cannot find date file (non-fatal)
113	`./PDY: No such file or directory`	PDY source file missing (non-fatal, PDY from env)
912	`wgrib2 -s .../gfs.t00z.master.f102.grib2`	Index creation succeeds — master file is readable
922-924	`wgrib2 MASTER | grep -F -f paramlist.a | wgrib2 -i -grib tmpfilea_f102 MASTER`	Three-stage extraction pipeline for paramlist "a" starts
1085	`161:3078964:d=202` (truncated)	Record 161 output truncated — write failure in progress
1086	`*** FATAL ERROR: write error ***`	Primary failure — wgrib2 cannot write to tmpfile
1232	`err=8`	Exit code captured
1239	`FATAL ERROR: wgrib2 failed to create intermediate grib2 file...`	err_exit message (EE2 compliant)
1240	`ABNORMAL EXIT at Tue Feb 10 06:43:36 UTC 2026 on h9c53`	Job termination
1294	`tmpfilea_f102` size: 5,521,408 bytes	Partial output file; the write failed at record 161 of ~1300+
1296	`JOB 21785517 ON h9c53 CANCELLED AT 2026-02-10T06:43:36 DUE to SIGNAL Terminated`	Slurm terminated the job
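
The key lines in the table above can be located in a job log with a single numbered grep. This is a minimal sketch. scan_job_log is a hypothetical helper, and the pattern list covers only the markers seen in this failure.

```shell
#!/usr/bin/env bash
# Hypothetical triage helper: print the numbered log lines that match the
# failure markers cited in the log evidence table.
scan_job_log() {
    local logfile=$1
    grep -n -E 'FATAL ERROR|ABNORMAL EXIT|CANCELLED AT|No such file or directory' "${logfile}"
}
```

Each output line is prefixed with its log line number, so the result can be compared directly with the Log Line column.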

Analysis generated using EIB MCP-RAG Server v3.6.2 (42 tools) with ChromaDB semantic search (60,404 documents) and Neo4j graph analysis (484,901 relationships). Tools used: get_job_details, describe_component, search_documentation, get_operational_guidance, trace_execution_path, find_env_dependencies, get_code_context, search_issues, explain_with_context, find_callers_callees, find_related_files.