C96C48mx500_S2SW_cyc_gfs atmos_prod_f102_WRITE_ERROR - TerrenceMcGuinness-NOAA/global-workflow GitHub Wiki
C96C48mx500_S2SW_cyc_gfs - JGLOBAL_ATMOS_PRODUCTS f102 Write Error Analysis
Test Case: C96C48mx500_S2SW_cyc_gfs
Job: JGLOBAL_ATMOS_PRODUCTS (forecast hour f102)
Platform: Hera (RDHPCS)
Node: h9c53
Slurm Job ID: 21785517
Build: nightly_0_4a26a69c_8508
Failure Time: Tue Feb 10 06:43:36 UTC 2026
Analysis Date: February 10, 2026
Status: FATAL ERROR - wgrib2 write failure
Analyzed By: EIB MCP-RAG Toolset (42-tool analysis)
Executive Summary
The JGLOBAL_ATMOS_PRODUCTS job for forecast hour f102 in the C96C48mx500_S2SW_cyc_gfs nightly CI test failed with a *** FATAL ERROR: write error *** emitted by wgrib2 during GRIB2 record extraction from the master file. The failure occurred at approximately pressure level 15 mb (record 161 of ~1300+) while filtering the master GRIB2 file through paramlist "a". A secondary concern is a wgrib2 spack-stack version mismatch between the loaded modules (spack-stack-1.9.2) and the wgrib2 binary actually invoked (spack-stack-1.6.0). Additionally, the setpdy.sh utility failed to find the COMROOT date file, though this was handled as non-fatal.
Root Cause Assessment: Filesystem write failure on /scratch3 — most likely caused by disk quota exhaustion, filesystem capacity limits, or a transient I/O error on the compute node. The wgrib2 version mismatch is a contributing risk factor.
Failure Chain
    JGLOBAL_ATMOS_PRODUCTS
    ├── jjob_header.sh
    │   ├── setpdy.sh → WARN: COMROOT/date/t00z not found (non-fatal)
    │   ├── source config.base → OK
    │   ├── source config.atmos_products → OK
    │   └── source HERA.env → OK (USE_CFP=YES)
    └── exglobal_atmos_products.sh
        ├── ${WGRIB2} -s MASTER_FILE > idx → OK
        ├── nset=1 (paramlist "a")
        │   └── ${WGRIB2} MASTER_FILE | grep -F -f paramlist.a | ${WGRIB2} -i -grib tmpfilea_f102 MASTER_FILE
        │       ├── Records 1-160: OK (reached 15 mb level)
        │       ├── Record 161 (UGRD:15 mb): *** FATAL ERROR: write error ***
        │       └── err=8
        └── err_exit "FATAL ERROR: wgrib2 failed to create intermediate grib2 file..."
            └── ABNORMAL EXIT → Job CANCELLED (SIGNAL Terminated)
Detailed Analysis
1. Primary Failure: wgrib2 Write Error
What happened:
The exglobal_atmos_products.sh script (line 64) constructs a pipeline that:
- Lists all records in the master GRIB2 file with ${WGRIB2} "${MASTER_FILE}"
- Filters records matching patterns in gfs.fFFF.paramlist.a.txt with grep -F -f
- Extracts matched records with ${WGRIB2} -i -grib "${tmpfile}" "${MASTER_FILE}"
At record 161 (UGRD at 15 mb, 102-hour forecast), the second wgrib2 process (the extraction stage of the pipeline) hit a write error while writing to tmpfilea_f102. The output file had reached only 5,521,408 bytes (~5.3 MB) when the failure occurred, as shown in the directory listing at exit.
Error output (line 1086 of log):
*** FATAL ERROR: write error ***
Exit code: err=8 (wgrib2 non-zero return code)
Master file:
.../COMROOT/C96C48mx500_S2SW_cyc_gfs_4a26a69c-8508/gfs.20211221/00/model/atmos/master/gfs.t00z.master.f102.grib2
Paramlist file:
.../global-workflow/parm/product/gfs.fFFF.paramlist.a.txt
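Because the failure happened inside a three-stage pipeline, it can help to record the exit code of each stage separately. The following is a minimal bash sketch under stated assumptions, not the code in exglobal_atmos_products.sh: extract_with_status is a hypothetical helper, and the variable names simply mirror those used in this analysis.

```bash
#!/usr/bin/env bash
# Sketch only: extract_with_status is a hypothetical helper, not part of
# global-workflow. It runs the list | filter | extract pipeline described
# above and reports the exit code of every stage via bash's PIPESTATUS.

extract_with_status() {
  local master="$1" parmfile="$2" tmpfile="$3"
  local -a rc

  "${WGRIB2:-wgrib2}" "${master}" \
    | grep -F -f "${parmfile}" \
    | "${WGRIB2:-wgrib2}" -i -grib "${tmpfile}" "${master}"
  rc=("${PIPESTATUS[@]}")   # must be captured right after the pipeline

  echo "list=${rc[0]} filter=${rc[1]} extract=${rc[2]}"

  # Succeed only if all three stages returned zero.
  (( rc[0] == 0 && rc[1] == 0 && rc[2] == 0 ))
}
```

With this shape, a failure like the one in this log would show up as a non-zero code on the extract stage only, which points at the output side (the write to the temporary file) rather than at reading the master file. Note that grep returns 1 when nothing matches, so an empty filter result is also reported as a failure here.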
2. Secondary Issue: setpdy.sh File Not Found
What happened (log lines 110, 113):
sed: can't read .../RUNTESTS/COMROOT/date/t00z: No such file or directory
.../jjob_header.sh: line 96: ./PDY: No such file or directory
The setpdy.sh utility expects a date reference file at ${COMROOT}/date/t00z, which was missing. The jjob_header.sh script uses || true to swallow these errors (lines 95-96), so the job continued. The PDY variable was already set from the environment (PDY=20211221), so this did not contribute to the failure — but it indicates the CI COMROOT directory structure may be incomplete for this test case.
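The difference between silently swallowing the missing date file and handling it explicitly can be sketched as follows. This is an illustration only: resolve_pdy is a hypothetical helper, the ${COMROOT}/date/t${cyc}z layout is taken from the path in the log, and the real jjob_header.sh logic may differ.

```bash
#!/usr/bin/env bash
# Sketch only: resolve_pdy is a hypothetical helper. It makes the fallback
# explicit instead of hiding the missing date file behind "|| true".

resolve_pdy() {
  local comroot="$1" cyc="$2" env_pdy="${3:-}"
  local datefile="${comroot}/date/t${cyc}z"

  if [ -r "${datefile}" ]; then
    echo "source=datefile"
    return 0
  fi

  if [ -n "${env_pdy}" ]; then
    # Same outcome as in this job: PDY from the environment is used.
    echo "WARNING: ${datefile} not found; using PDY=${env_pdy} from the environment" >&2
    echo "source=env"
    return 0
  fi

  echo "FATAL ERROR: ${datefile} not found and PDY is not set" >&2
  return 1
}
```

For this job, `resolve_pdy "${COMROOT}" 00 "${PDY}"` would report `source=env` with a single warning, rather than two unexplained "No such file or directory" messages.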
3. wgrib2 Spack-Stack Version Mismatch
Loaded modules: spack-stack 1.9.2 (intel-oneapi 2024.2.1)
Actual wgrib2 binary invoked:
/contrib/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/intel/2021.5.0/wgrib2-2.0.8-nauzcdx/bin/wgrib2
The module environment loaded spack-stack-1.9.2 modules (81 modules visible in the log), but the wgrib2 binary resolved to spack-stack 1.6.0 (compiled with intel/2021.5.0). This cross-version linkage could cause:
- ABI incompatibility with other loaded libraries
- Unexpected behavior with newer GRIB2 features
- Library path conflicts
How this happens: The load_modules.sh script (line 88) does module load "wgrib2/${wgrib2_ver}" where wgrib2_ver comes from versions/run.ver. On Hera, the gw_run.hera module should load wgrib2 from the correct spack-stack. The actual resolution depends on MODULEPATH ordering — the 1.6.0 path may be taking precedence.
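A quick way to confirm this kind of mismatch is to compare the resolved binary path against the expected stack and to list the spack-stack entries of MODULEPATH in search order. The sketch below is an assumption-laden illustration: both function names are hypothetical, the check is a plain string comparison, and it assumes that earlier MODULEPATH entries take precedence.

```bash
#!/usr/bin/env bash
# Sketch only: both helpers are hypothetical and compare path strings;
# they do not query the module system itself.

check_stack_match() {
  local binary_path="$1" expected_stack="$2"
  case "${binary_path}" in
    */"${expected_stack}"/*)
      echo "OK: ${binary_path} is under ${expected_stack}" ;;
    *)
      echo "MISMATCH: ${binary_path} is not under ${expected_stack}"
      return 1 ;;
  esac
}

list_modulepath_stacks() {
  # Print MODULEPATH entries that mention spack-stack, prefixed with their
  # position in the search order (1 = searched first).
  printf '%s\n' "${1:-${MODULEPATH:-}}" | tr ':' '\n' | grep -n 'spack-stack' || true
}
```

On the failing node, `check_stack_match "$(command -v wgrib2)" spack-stack-1.9.2` would be expected to print a MISMATCH line for the path shown above, and `list_modulepath_stacks` would show whether a 1.6.0 directory is searched before the 1.9.2 one.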
4. Configuration Context
From the log's config.base trace:
| Parameter | Value | Relevance |
|---|---|---|
| APP | S2SW | Coupled subseasonal-to-seasonal application with waves (atmosphere/ocean/ice/wave) |
| CASE | C96 | Atmospheric resolution |
| CASE_ENS | C48 | Ensemble resolution |
| OCNRES | 500 | Ocean resolution (mx500) |
| MODE | cycled | Cycling data assimilation |
| FHMAX_GFS | 120 | Max forecast hour (f102 is within range) |
| FCST_SEGMENTS | 0,120 | Single segment, no breakpoints |
| FHOUT_GFS | 3 | GFS output frequency |
| FHMAX_HF_GFS | 48 | High-frequency output max hour |
| downset | 2 | Number of paramlist sets (a, b) |
| MAX_TASKS | 25 | Max parallel tasks for products |
| ntasks | 24 | Allocated task count |
| KEEPDATA | NO | Working directory cleaned on success |
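The statement that f102 is a valid product hour follows from the values above: 102 is no greater than FHMAX_GFS=120 and is a multiple of FHOUT_GFS=3. A small sketch of that check is shown below; is_output_hour is a hypothetical helper, and it deliberately ignores the high-frequency window (hours up to FHMAX_HF_GFS=48), which does not apply to f102.

```bash
#!/usr/bin/env bash
# Sketch only: is_output_hour is a hypothetical helper. It checks whether a
# forecast hour lies in [0, fhmax] and falls on the regular output interval.

is_output_hour() {
  local fhr="$1" fhmax="$2" fhout="$3"
  # 10# forces base 10 so zero-padded hours such as 009 are not read as octal.
  (( 10#${fhr} >= 0 && 10#${fhr} <= 10#${fhmax} && 10#${fhr} % 10#${fhout} == 0 ))
}
```

For example, `is_output_hour 102 120 3` succeeds, while `is_output_hour 100 120 3` and `is_output_hour 123 120 3` both fail.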
5. MCP Tool Analysis Summary
| MCP Tool | Finding |
|---|---|
| get_job_details | JGLOBAL_ATMOS_PRODUCTS sources jjob_header.sh -e atmos_products -c "base atmos_products", calls exglobal_atmos_products.sh |
| describe_component | exglobal_atmos_products.sh is 224 lines; uses wgrib2 pipeline to filter master GRIB2 through paramlist, splits into MPMD chunks, regrids via interp_atmos_master.sh |
| trace_execution_path | JGLOBAL_ATMOS_PRODUCTS depends on: PDY, cyc, HOMEgfs, SCRgfs, DATAROOT, DATA, RUN, KEEPDATA, pgmout, err, grid, prod_dir |
| find_env_dependencies(WGRIB2) | 19 scripts depend on WGRIB2; exported by load_modules.sh as wgrib2 command name |
| find_env_dependencies(FHMAX_GFS) | 7 scripts depend on FHMAX_GFS; f102 < 120 so within valid range |
| search_documentation | EE2 standards require FATAL ERROR prefix (compliant), descriptive messages (compliant), recovery capability for jobs >15min |
| get_operational_guidance | EE2 mandates: working dirs with unique job IDs (met: .../atmos_products_f102.1435126), MPMD child processes in subdirs |
| search_issues | Issue #4348: "Transient instability on Hercules"; similar C96C48mx500 nightly failures, but NaN-related, not write errors |
| search_issues | Issue #3630: "Long runtimes in gfs_gempak_f jobs on WCOSS2"; walltime issues in post-processing, separate root cause |
| get_code_context | exglobal_atmos_products.sh sets FLUX_FILE, depends on err variable; 2-hop graph shows connections to wave processing, radmon scripts |
| find_related_files | interp_atmos_master.sh uses product_functions.sh for RH trimming and sea-ice fixes; depends on same WGRIB2 env var |
Root Cause Assessment
Most Likely: Filesystem Space/Quota Exhaustion on /scratch3
The *** FATAL ERROR: write error *** message is what wgrib2 reports when a write to its output file fails. On HPC systems, this most often points to one of the following:
- Disk quota exceeded: the /scratch3/NCEPDEV/stmp/role.glopara filesystem may have hit its allocation limit. The CI pipeline writes forecast output, restart files, and post-processed products for multiple test cases simultaneously. The C96C48mx500_S2SW_cyc_gfs case generates substantial data due to coupled ocean/ice/wave components.
- Filesystem capacity: Lustre /scratch3 on Hera may have been under heavy load from concurrent users/jobs, leading to temporary ENOSPC conditions.
- Transient I/O error: node h9c53 may have experienced a Lustre client issue (OBD timeout, network hiccup to OSTs). The Slurm signal Terminated (line 1296) suggests the job was killed rather than exiting cleanly.
Less Likely but Contributing:
- wgrib2 version mismatch: the spack-stack-1.6.0 wgrib2 (compiled with intel/2021.5.0) running in a spack-stack-1.9.2 environment (intel-oneapi/2024.2.1) could have library conflicts causing unexpected behavior, though a clean "write error" is more indicative of filesystem issues.
- Pipe buffer overflow: the three-stage pipeline (wgrib2 | grep | wgrib2 -i -grib) could theoretically deadlock if the output wgrib2 stalls and fills the pipe buffer, but this would not produce a "write error" on the file output.
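One way to tell the filesystem hypotheses apart from the wgrib2-specific ones is to attempt a plain write in the same directory, with no wgrib2 involved. The sketch below is only an illustration: probe_write is a hypothetical helper, the default probe size is arbitrary, and conv=fsync is a GNU dd option used here so that errors surfacing at flush time are not missed.

```bash
#!/usr/bin/env bash
# Sketch only: probe_write is a hypothetical helper. It writes size_kb KiB
# of zeros into a directory, flushes them, and checks the resulting size.

probe_write() {
  local dir="$1" size_kb="${2:-1024}"
  local probe="${dir}/.write_probe.$$" rc=0 written=0

  dd if=/dev/zero of="${probe}" bs=1024 count="${size_kb}" conv=fsync 2>/dev/null || rc=$?
  if [ -f "${probe}" ]; then
    written=$(( $(wc -c < "${probe}") ))
  fi
  rm -f "${probe}"

  if [ "${rc}" -ne 0 ] || [ "${written}" -ne $(( size_kb * 1024 )) ]; then
    echo "WRITE PROBE FAILED in ${dir}: rc=${rc}, wrote ${written} bytes"
    return 1
  fi
  echo "write probe OK in ${dir}: ${written} bytes"
}
```

If `probe_write "${DATA}" 10240` fails on the same node and directory, the quota, capacity, or I/O explanations gain weight; if it succeeds repeatedly, the transient-error and version-mismatch explanations deserve a closer look.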
Recommendations
Immediate (Fix This Failure)
- Check /scratch3 quota usage:

      lfs quota -u role.glopara /scratch3
      df -h /scratch3

  If quota is near limits, purge old RUNDIRS and RUNTESTS output from previous nightly runs.
- Re-run the failing job: if this was a transient I/O issue, a retry should succeed. The CI framework should be configured to automatically retry atmos_products failures (EE2 standard: recovery for post-processing jobs).
- Verify node h9c53 health:

      sacctmgr show event nodes=h9c53
      scontrol show node h9c53

  Check for Lustre client errors in /var/log/messages on the node.
Short-Term (Prevent Recurrence)
- Add pre-flight disk space check in exglobal_atmos_products.sh before the wgrib2 pipeline:

      # Check available space in the DATA directory (-P keeps each filesystem on one line)
      avail_kb=$(df -Pk "${DATA}" | awk 'NR==2 {print $4}')
      if [[ ${avail_kb} -lt 10485760 ]]; then  # 10 GB minimum
        echo "FATAL ERROR: Insufficient disk space (${avail_kb} KB available) in ${DATA}"
        export err=99
        err_exit "Disk space check failed"
      fi

- Fix wgrib2 version alignment: ensure the gw_run.hera.lua module file resolves wgrib2 from spack-stack-1.9.2, not 1.6.0. Check MODULEPATH ordering and the wgrib2_ver variable in versions/spack.ver (currently 3.6.0, but the binary on Hera resolved to 2.0.8).
- Add retry logic for wgrib2 write failures: since this is a post-processing job (not forecast), a transient write error should trigger a retry:

      for attempt in 1 2 3; do
        ${WGRIB2} "${MASTER_FILE}" | grep -F -f "${parmfile}" | ${WGRIB2} -i -grib "${tmpfile}" "${MASTER_FILE}" && break
        echo "WARNING: wgrib2 attempt ${attempt} failed, retrying in 10s..."
        sleep 10
      done
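The retry loop above does not report a failure if every attempt fails, and it can leave a partial output file behind. A slightly more defensive variant is sketched here under stated assumptions: retry_cmd is a hypothetical helper, and the pipeline is assumed to be wrapped in a function so that it can be passed as a command.

```bash
#!/usr/bin/env bash
# Sketch only: retry_cmd is a hypothetical helper, not part of
# global-workflow. It retries a command, removes the partial output file
# after each failed attempt, and returns non-zero if all attempts fail.

retry_cmd() {
  local max_attempts="$1" delay="$2" output="$3"
  shift 3
  local attempt

  for (( attempt = 1; attempt <= max_attempts; attempt++ )); do
    if "$@"; then
      return 0
    fi
    echo "WARNING: attempt ${attempt}/${max_attempts} failed" >&2
    rm -f "${output}"   # do not leave a truncated file for later steps
    if (( attempt < max_attempts )); then
      sleep "${delay}"
    fi
  done
  return 1
}
```

A caller could wrap the pipeline in a function (here the hypothetical name extract_records), enable `set -o pipefail` so that a failure in any stage is visible, and then run `retry_cmd 3 10 "${tmpfile}" extract_records || err_exit "..."`. Retrying only helps with transient errors; a quota or capacity problem will fail on every attempt.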
Long-Term (Systemic Improvements)
- Implement CI disk usage monitoring: add a pre-pipeline step to the nightly CI that checks /scratch3 usage and sends alerts or skips low-priority test cases when space is limited.
- Separate forecast hour products into independent jobs: per EE2 operational guidance, "submit a separate post-processing job for each forecast hour, so any failure for one forecast hour does not impact others." The current design already does this (each forecast hour gets its own atmos_products_fNNN job), so it is compliant.
- Add tmpfile size validation: after the wgrib2 pipeline, verify the output file is non-empty and has a reasonable record count before proceeding:

      if [[ ! -s "${tmpfile}" ]]; then
        err_exit "FATAL ERROR: wgrib2 output file '${tmpfile}' is empty or missing"
      fi
      actual_records=$(${WGRIB2} "${tmpfile}" | wc -l)
      if [[ ${actual_records} -lt 100 ]]; then
        echo "WARNING: Only ${actual_records} records extracted (expected 300+)"
      fi
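The CI disk usage monitoring suggested in the first item of this list could start from something as small as the sketch below. Everything here is an assumption for illustration: usage_pct and ci_gate are hypothetical helpers, and the 85% and 95% thresholds are placeholders rather than values taken from the CI configuration.

```bash
#!/usr/bin/env bash
# Sketch only: hypothetical helpers with placeholder thresholds.

usage_pct() {
  # Percent of capacity used on the filesystem holding $1 (POSIX df format).
  df -Pk "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

ci_gate() {
  # Map a usage percentage to an action: run, alert, or skip.
  local pct="$1" warn="${2:-85}" stop="${3:-95}"
  if [ "${pct}" -ge "${stop}" ]; then
    echo "skip"
  elif [ "${pct}" -ge "${warn}" ]; then
    echo "alert"
  else
    echo "run"
  fi
}
```

A nightly driver might call `ci_gate "$(usage_pct /scratch3)"` before launching test cases. Note that df reports filesystem-wide capacity only; a per-user or per-project quota limit would still need the lfs quota check listed under the immediate actions.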
Affected Files (Execution Trace)
| File | Role |
|---|---|
| jobs/JGLOBAL_ATMOS_PRODUCTS | J-Job wrapper: sources configs, calls ex-script |
| ush/jjob_header.sh | Universal j-job header: sets DATA, sources PDY, configs, env |
| scripts/exglobal_atmos_products.sh | Main ex-script: wgrib2 pipeline, MPMD dispatch, product generation |
| ush/interp_atmos_master.sh | Per-chunk interpolation: regrid to 0p25/0p50/1p00 (NOT REACHED) |
| ush/product_functions.sh | Helper functions for RH trimming, sea-ice fixes |
| ush/load_modules.sh | Module loading: sets WGRIB2 env var |
| parm/product/gfs.fFFF.paramlist.a.txt | GRIB2 variable filter list (group "a") |
| parm/config/gfs/config.atmos_products | Job configuration (downset=2, MAX_TASKS=25) |
| parm/config/gfs/config.resources | Resource allocation (ntasks=24, walltime=00:15:00) |
| env/HERA.env | Hera-specific settings (USE_CFP=YES, launcher=srun) |
Related Issues
- #4348 — "Transient instability on Hercules" — C96C48mx500 nightly failures with NaN errors in ensemble forecast jobs (different root cause, same test case family)
- #3630 — "Long runtimes in gfs_gempak_f jobs on WCOSS2" — Post-processing walltime issues (different platform, similar product generation pipeline)
Log Evidence Summary
| Log Line | Content | Significance |
|---|---|---|
| 110 | sed: can't read .../COMROOT/date/t00z: No such file or directory | setpdy.sh cannot find date file (non-fatal) |
| 113 | ./PDY: No such file or directory | PDY source file missing (non-fatal, PDY from env) |
| 912 | wgrib2 -s .../gfs.t00z.master.f102.grib2 | Index creation succeeds: master file is readable |
| 922-924 | `wgrib2 MASTER \| grep -F -f paramlist.a \| ...` | |
| 1085 | 161:3078964:d=202 (truncated) | Record 161 output truncated: write failure in progress |
| 1086 | *** FATAL ERROR: write error *** | Primary failure: wgrib2 cannot write to tmpfile |
| 1232 | err=8 | Exit code captured |
| 1239 | FATAL ERROR: wgrib2 failed to create intermediate grib2 file... | err_exit message (EE2 compliant) |
| 1240 | ABNORMAL EXIT at Tue Feb 10 06:43:36 UTC 2026 on h9c53 | Job termination |
| 1294 | tmpfilea_f102 size: 5,521,408 bytes | Partial output file: only 161 of ~1300+ records written |
| 1296 | JOB 21785517 ON h9c53 CANCELLED AT 2026-02-10T06:43:36 DUE to SIGNAL Terminated | Slurm terminated the job |
Analysis generated using EIB MCP-RAG Server v3.6.2 (42 tools) with ChromaDB semantic search (60,404 documents) and Neo4j graph analysis (484,901 relationships). Tools used: get_job_details, describe_component, search_documentation, get_operational_guidance, trace_execution_path, find_env_dependencies, get_code_context, search_issues, explain_with_context, find_callers_callees, find_related_files.