Nightly_Failure_Analysis_enkfgdas_fcst_mem001_Hercules_2026 02 18 - TerrenceMcGuinness-NOAA/global-workflow GitHub Wiki

Nightly Failure Analysis: enkfgdas_fcst_mem001 on Hercules

Date of Failure: 2026-02-18 22:26–22:27 CST
Date of Analysis: 2026-02-19
Analyst: EIB MCP-RAG Automated Analysis
Source Log: emcbot/enkfgdas_fcst_mem001.log


Executive Summary

The nightly CI/CD run on Hercules failed in the enkfgdas ensemble forecast (member 001) due to a MOM6 NaN numerical instability. The ocean model's mpp_reproducing_sum detected NaN values in a parallel summation field and issued a FATAL abort, killing all 80 MPI ranks. This is an instance of the known transient instability on Hercules tracked in issue #4348.


Root Cause

MOM6 NaN numerical instability. The ocean model detected NaN (Not a Number) values in a summation field, which triggered a fatal abort:

FATAL from PE 2: NaN in input field of mpp_reproducing_sum(_2d), this indicates numerical instability

The message originated from mpp_reproducing_sum in MOM6's FMS infrastructure. The function performs bit-reproducible parallel summation and includes NaN guards: when it encounters NaN values in the input array, it aborts fatally rather than propagating corrupted values.
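
The guard-then-sum pattern described above can be illustrated with a minimal Python sketch. This is not the FMS implementation: `guarded_sum` is a hypothetical name, and `math.fsum` stands in for FMS's own bit-reproducible algorithm only because its exactly rounded result does not depend on the order of the inputs.

```python
import math


def guarded_sum(values):
    """Sum floats order-independently, refusing NaN input.

    Hypothetical helper for illustration; not the FMS algorithm.
    """
    for index, value in enumerate(values):
        if math.isnan(value):
            # Fail fast instead of returning a NaN result.
            raise FloatingPointError(f"NaN in input field at index {index}")
    # math.fsum tracks exact partial sums, so the result does not depend
    # on the order in which the values arrive.
    return math.fsum(values)


field = [0.1, 1e16, -1e16, 0.2]
assert guarded_sum(field) == guarded_sum(list(reversed(field)))
```

Failing at the summation step localizes the problem to the first reduction that sees bad data, which is why the abort appears in a summation routine even though the NaN was produced earlier in the timestep.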


Failure Chain

| Step | Detail |
|------|--------|
| NaN source | PE 2 detected NaN during MOM6 ocean model timestepping |
| FATAL abort | mpp_reproducing_sum(_2d) killed the application |
| srun kill | tasks 0-79: Killed on hercules-05-08 (all 80 MPI ranks) |
| Exit code | err=137 (SIGKILL, signal 9) |
| err_exit | The forecast failed to run to completion! RETURN CODE 137 |
| SLURM | JOB 7968733 CANCELLED DUE to SIGNAL Terminated |
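
The exit code above follows the common shell convention that a process terminated by signal N is reported as 128 + N, so 137 decodes to signal 9. A minimal Python sketch of that arithmetic follows; the function name `decode_exit_status` is an assumption for illustration, and signal names are those defined on Linux.

```python
import signal


def decode_exit_status(status):
    """Return (signal_number, signal_name) for a shell-style status above 128.

    Returns None when the status does not encode a signal.
    """
    if status <= 128:
        return None
    number = status - 128
    try:
        name = signal.Signals(number).name
    except ValueError:
        # The number does not map to a signal known on this platform.
        name = "UNKNOWN"
    return number, name


print(decode_exit_status(137))  # (9, 'SIGKILL') on Linux
```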

Key Log Lines

Line 8695:  FATAL from PE 2: NaN in input field of mpp_reproducing_sum(_2d), this indicates numerical instability
Line 8697:  srun: error: hercules-05-08: tasks 0-79: Killed
Line 8698:  srun: Terminating StepId=7968733.0
Line 8700:  err=137
Line 8706:  -- FATAL ERROR: The forecast failed to run to completion! RETURN CODE 137
Line 8707:  -- ABNORMAL EXIT at Wed Feb 18 22:27:44 CST 2026 on hercules-05-08
Line 8854:  [2026-02-18T22:27:45.103] error: *** JOB 7968733 ON hercules-05-08 CANCELLED AT 2026-02-18T22:27:45 DUE to SIGNAL Terminated ***
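
Lines like these can be pulled from a job log with a small scan. The sketch below is only an assumption about how such a scan might look, not the analysis system's actual code; the pattern list and the name `find_key_lines` are hypothetical.

```python
import re

# Hypothetical patterns, chosen to match the messages quoted above.
KEY_PATTERNS = re.compile(
    r"FATAL from PE|srun: error|FATAL ERROR|ABNORMAL EXIT|CANCELLED AT"
)


def find_key_lines(log_lines):
    """Return (line_number, text) pairs for lines matching KEY_PATTERNS."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(log_lines, start=1)
        if KEY_PATTERNS.search(line)
    ]


sample = [
    "some normal model output\n",
    "FATAL from PE 2: NaN in input field of mpp_reproducing_sum(_2d), "
    "this indicates numerical instability\n",
    "srun: error: hercules-05-08: tasks 0-79: Killed\n",
]
for number, text in find_key_lines(sample):
    print(f"Line {number}: {text}")
```

Any iterable of lines works as input, so an open file handle can be passed in place of the sample list.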

Run Configuration

| Parameter | Value |
|-----------|-------|
| Job | enkfgdas_fcst_mem001 (EnKF GDAS ensemble forecast, member 001) |
| Experiment | C48mx500_hybAOWCDA_3b291f69-9296 |
| Nightly Build | nightly_0_3b291f69_9296 |
| Commit | 3b291f69 |
| Cycle | 2021032500 (test case: 2021-03-25 00Z) |
| Coupled model | S2S: FV3ATM (C48) + MOM6 (mx500) + CICE6 (mx500) |
| UFS configure | ufs.configure.s2s.IN |
| Node | hercules-05-08 |
| SLURM Job ID | 7968733 (sub-step 3283753) |
| Wall time used | ~97 seconds (22:26:07 to 22:27:44 CST) |
| MPI tasks | 80 ranks, 1 thread per task |
| Launcher | srun -l --export=ALL --hint=nomultithread -n 30 |
| Machine | HERCULES |
| Compiler | Intel oneAPI 2024.2.1 |
| ESMF | 8.8.0 |
| RUN | enkfgdas |
| NMEM_ENS_GFS | 30 |
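
The wall time figure can be checked from the two timestamps listed above. A minimal Python sketch of the arithmetic (the function name `elapsed_seconds` is hypothetical):

```python
from datetime import datetime


def elapsed_seconds(start, end, fmt="%H:%M:%S"):
    """Seconds between two same-day clock times given as strings."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


print(elapsed_seconds("22:26:07", "22:27:44"))  # 97
```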

Key Paths

HOMEgfs:  /work2/noaa/global/role-global/GFS_CI_CD/HERCULES/BUILDS/GITLAB/nightly_0_3b291f69_9296/global-workflow
EXPDIR:   .../RUNTESTS/EXPDIR/C48mx500_hybAOWCDA_3b291f69-9296
ROTDIR:   .../RUNTESTS/COMROOT/C48mx500_hybAOWCDA_3b291f69-9296
DATA:     /work2/noaa/global/role-global/stmp/HERCULES/RUNDIRS/C48mx500_hybAOWCDA_3b291f69-9296/enkfgdas.2021032500/enkfgdasefcs001.2021032500/fcst.3283753

Config Sources Loaded

The forecast job sourced the following configuration chain (all loaded successfully):

  1. config.base - Machine: HERCULES, PSLOT: C48mx500_hybAOWCDA_3b291f69-9296
  2. config.com - COM path templates
  3. config.fcst - Forecast settings, FHMAX_HF=0, FHOUT_HF=0
  4. config.ufs - UFS coupling settings (S2S: FV3+MOM6+CICE6)
  5. config.ocn - Ocean: MESH_OCN=mesh.mx500.nc, ODA_INCUPD=True
  6. config.ice - Ice: min_seaice=1.0e-6
  7. config.efcs - EnKF forecast: CASE=C48
  8. config.nsst - NSST: NST_MODEL=2
  9. config.resources - Resources: walltime=00:20:00
  10. config.resources.HERCULES - Hercules-specific resource overrides
  11. HERCULES.env - Environment: OMP_STACKSIZE=512M, FI_MLX_INJECT_LIMIT=0

Non-Fatal Warnings (Not Causal)

| Line | Warning | Impact |
|------|---------|--------|
| 103 | sed: can't read .../COMROOT/date/t00z: No such file or directory | Handled by true; CI setup pattern |
| 106 | ./PDY: No such file or directory | Handled by true; CI setup pattern |
| 7385 | unable to interpolate. filled with nearest point value at 42 points | Normal FV3 surface field interpolation behavior |
| 8852 | cat: OUTPUT.3283811: No such file or directory | Model output not written before crash |

These warnings are not causal to the failure.


Known Issue Match

This failure matches the open tracking issue #4348:

  • Filed by: @DavidHuber-NOAA on 2025-12-17
  • Status: Open (reopened)
  • Assignee: @DavidHuber-NOAA

From the issue description:

"The nightly runs are periodically reporting NaNs when running ensemble forecast jobs. Last night's (12/17/2025) nightly showed the failure on the first full cycle of the C96C48mx500_S2SW_cyc_gfs test and the night before on the C96C48_hybatmDA test. Since the pattern seems to be the ensemble forecasts in different jobs, it may suggest an issue with the eupd job."

Pattern Comparison

| Attribute | Issue #4348 (Dec 2025) | This Failure (Feb 2026) |
|-----------|------------------------|-------------------------|
| Platform | Hercules only | Hercules |
| Error | NaN in ensemble forecasts | NaN in mpp_reproducing_sum |
| Job type | enkfgdas ensemble fcst | enkfgdas_fcst_mem001 |
| Tests affected | C96C48mx500_S2SW_cyc_gfs, C96C48_hybatmDA | C48mx500_hybAOWCDA |
| Transient | Yes; does not reproduce consistently | TBD |
| Suspected source | eupd job upstream | Same hypothesis applies |

This confirms the issue is still active as of February 2026.


Possible Root Causes

  1. EnKF update (eupd) introduced bad increments: the issue author suspects the eupd job may be the upstream source, since different ensemble members in different tests fail
  2. MOM6 ocean state corruption: NaN propagation from an unstable ocean grid cell at mx500 resolution
  3. Hercules-specific hardware/compiler issue: only observed on Hercules, not Hera or other platforms
  4. Intel oneAPI 2024.2.1 floating-point behavior: the loaded compiler stack might handle denormals or edge cases differently
  5. Memory/interconnect issue on specific nodes: hercules-05-08 may have intermittent hardware issues

Recommended Next Steps

  1. Update issue #4348 with this data point (commit 3b291f69, Feb 18 nightly, enkfgdas_fcst_mem001 in C48mx500_hybAOWCDA)
  2. Check whether this nightly was retried: transient failures often pass on retry, which would confirm non-determinism
  3. Inspect the eupd output for this cycle: look for anomalous increment values in member 001
  4. Check other members: did mem002–mem030 also fail, or just mem001?
  5. Cross-reference with the Hera nightly: if the same commit passed on Hera, that supports Hercules-specific behavior
  6. Node health check: request MSS/admin review of hercules-05-08 for ECC memory errors or interconnect issues
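
For step 3, a first pass over increment values could simply count non-finite entries and report the largest finite magnitude. The sketch below assumes the values have already been read into a flat Python list; reading the actual eupd output files is not shown, and `summarize_increments` is a hypothetical name.

```python
import math


def summarize_increments(values):
    """Count non-finite entries and find the largest finite magnitude."""
    finite = [v for v in values if math.isfinite(v)]
    return {
        "total": len(values),
        "non_finite": len(values) - len(finite),
        "max_abs": max((abs(v) for v in finite), default=0.0),
    }


report = summarize_increments([0.02, -0.5, float("nan"), 1.3, float("inf")])
print(report)
```

A non-zero `non_finite` count would point at the update step directly, while an unusually large `max_abs` would only be a hint that needs comparison against typical increment magnitudes for that field.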

Software Environment

Loaded Modules at Time of Failure
 1) contrib/0.1                    40) antlr/2.7.7
 2) intel-oneapi-compilers/2024.2.1 41) gsl/2.8
 ...
39) esmf/8.8.0                     78) gw_run.hercules

Key modules:

  • intel-oneapi-compilers/2024.2.1
  • esmf/8.8.0
  • spack-stack-1.9.2 (environment: ue-oneapi-2024.1.0)

This report was generated by the EIB MCP-RAG analysis system from the emcbot gist log.
