Nightly_Failure_Analysis_enkfgdas_fcst_mem001_Hercules_2026 02 18 - TerrenceMcGuinness-NOAA/global-workflow GitHub Wiki
Date of Failure: 2026-02-18 22:26-22:27 CST
Date of Analysis: 2026-02-19
Analyst: EIB MCP-RAG Automated Analysis
Source Log: emcbot/enkfgdas_fcst_mem001.log
The nightly CI/CD run on Hercules failed in the enkfgdas ensemble forecast (member 001) due to a MOM6 NaN numerical instability. The ocean model's mpp_reproducing_sum detected NaN values in a parallel summation field and issued a FATAL abort, killing all 80 MPI ranks. This is an instance of the known transient instability on Hercules tracked in issue #4348.
MOM6 NaN numerical instability: the ocean model detected NaN (Not a Number) values in a summation field, triggering a fatal abort:
FATAL from PE 2: NaN in input field of mpp_reproducing_sum(_2d), this indicates numerical instability
This originated from mpp_reproducing_sum in MOM6's FMS infrastructure. The function performs bit-reproducible parallel summation and includes NaN guards: when it encounters NaN values in the input array, it aborts with a FATAL error rather than propagating corrupted values into the global sum.
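The guard behaviour described above can be illustrated with a minimal Python sketch. It is not the FMS implementation: the function name `guarded_reproducible_sum`, the `SCALE` factor, and the use of a single scaled integer are assumptions made for illustration only. The idea shown is that converting each contribution to an integer makes the sum independent of summation order, and that a NaN check up front turns silent corruption into an immediate failure.

```python
import math
import random

SCALE = 2 ** 40  # fixed-point scale factor, chosen for this sketch only


def guarded_reproducible_sum(values):
    """Sum floats in an order-independent way, aborting if any input is NaN."""
    total = 0
    for i, v in enumerate(values):
        if math.isnan(v):
            # Fail loudly instead of letting NaN contaminate the global sum.
            raise FloatingPointError(f"NaN in input field at index {i}")
        # Integer addition is associative, so the result does not depend on
        # the order in which contributions arrive.
        total += round(v * SCALE)
    return total / SCALE


field = [0.1 * k for k in range(1000)]
shuffled = field[:]
random.Random(0).shuffle(shuffled)

# The same values in a different order give a bit-identical result.
same = guarded_reproducible_sum(field) == guarded_reproducible_sum(shuffled)
```

Passing a list that contains `float("nan")` raises `FloatingPointError`, which plays the role of the FATAL abort seen in the log.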
| Step | Detail |
|---|---|
| NaN source | PE 2 detected NaN during MOM6 ocean model timestepping |
| FATAL abort | mpp_reproducing_sum(_2d) killed the application |
| srun kill | tasks 0-79: Killed on hercules-05-08 (all 80 MPI ranks) |
| Exit code | err=137 (SIGKILL, signal 9) |
| err_exit | The forecast failed to run to completion! RETURN CODE 137 |
| SLURM | JOB 7968733 CANCELLED DUE to SIGNAL Terminated |
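The mapping from return code 137 to SIGKILL follows the common shell convention that a status above 128 means "terminated by signal (status - 128)". A short sketch of that arithmetic, assuming a POSIX platform where signal 9 is defined; `describe_exit_status` is a hypothetical helper name:

```python
import signal


def describe_exit_status(status):
    """Interpret a shell exit status, treating values above 128 as signals."""
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"terminated by signal {signum} ({name})"
    return f"exited with status {status}"


print(describe_exit_status(137))
```

This is consistent with the table: the tasks were killed by srun after the FATAL abort, so 137 reflects the kill rather than the model's own error code.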
Line 8695: FATAL from PE 2: NaN in input field of mpp_reproducing_sum(_2d), this indicates numerical instability
Line 8697: srun: error: hercules-05-08: tasks 0-79: Killed
Line 8698: srun: Terminating StepId=7968733.0
Line 8700: err=137
Line 8706: -- FATAL ERROR: The forecast failed to run to completion! RETURN CODE 137
Line 8707: -- ABNORMAL EXIT at Wed Feb 18 22:27:44 CST 2026 on hercules-05-08
Line 8854: [2026-02-18T22:27:45.103] error: *** JOB 7968733 ON hercules-05-08 CANCELLED AT 2026-02-18T22:27:45 DUE to SIGNAL Terminated ***
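The key facts in the excerpt above (failing PE, routine name, killed task range, return code) can be pulled out mechanically. The following is a rough sketch, not the tooling that produced this report; the regular expressions and the `summarize` helper are assumptions written against the log lines quoted here.

```python
import re

LOG_EXCERPT = """\
FATAL from PE 2: NaN in input field of mpp_reproducing_sum(_2d), this indicates numerical instability
srun: error: hercules-05-08: tasks 0-79: Killed
srun: Terminating StepId=7968733.0
err=137
-- FATAL ERROR: The forecast failed to run to completion! RETURN CODE 137
"""

PATTERNS = {
    "fatal_pe": re.compile(r"FATAL from PE\s+(\d+): NaN in input field of (\S+?),"),
    "killed_tasks": re.compile(r"srun: error: (\S+): tasks (\d+)-(\d+): Killed"),
    "return_code": re.compile(r"RETURN CODE (\d+)"),
}


def summarize(log_text):
    """Return the first match of each pattern found in the log text."""
    summary = {}
    for line in log_text.splitlines():
        for key, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match and key not in summary:
                summary[key] = match.groups()
    return summary


summary = summarize(LOG_EXCERPT)
node, first_task, last_task = summary["killed_tasks"]
rank_count = int(last_task) - int(first_task) + 1  # tasks 0-79 inclusive
```

Here `rank_count` reproduces the 80 MPI ranks reported in the failure chain.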
| Parameter | Value |
|---|---|
| Job | enkfgdas_fcst_mem001 (EnKF GDAS ensemble forecast, member 001) |
| Experiment | C48mx500_hybAOWCDA_3b291f69-9296 |
| Nightly Build | nightly_0_3b291f69_9296 |
| Commit | 3b291f69 |
| Cycle | 2021032500 (test case: 2021-03-25 00Z) |
| Coupled model | S2S: FV3ATM (C48) + MOM6 (mx500) + CICE6 (mx500) |
| UFS configure | ufs.configure.s2s.IN |
| Node | hercules-05-08 |
| SLURM Job ID | 7968733 (sub-step 3283753) |
| Wall time used | ~97 seconds (22:26:07 to 22:27:44 CST) |
| MPI tasks | 80 ranks, 1 thread per task |
| Launcher | srun -l --export=ALL --hint=nomultithread -n 30 |
| Machine | HERCULES |
| Compiler | Intel oneAPI 2024.2.1 |
| ESMF | 8.8.0 |
| RUN | enkfgdas |
| NMEM_ENS_GFS | 30 |
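As a quick check on the timing figures in the table, the elapsed wall time and the cycle date can be recomputed from the quoted timestamps. This sketch assumes both timestamps are in the same time zone (CST) and compares against the 00:20:00 walltime listed under config.resources; the variable names are illustrative.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"
start = datetime.strptime("2026-02-18 22:26:07", FMT)
end = datetime.strptime("2026-02-18 22:27:44", FMT)

elapsed_seconds = int((end - start).total_seconds())
walltime_limit_seconds = 20 * 60  # walltime=00:20:00
fraction_used = elapsed_seconds / walltime_limit_seconds

# The cycle string uses the YYYYMMDDHH convention.
cycle = datetime.strptime("2021032500", "%Y%m%d%H")
```

The job ran for 97 seconds, well under a tenth of the walltime limit, so the termination was not a scheduler timeout.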
HOMEgfs: /work2/noaa/global/role-global/GFS_CI_CD/HERCULES/BUILDS/GITLAB/nightly_0_3b291f69_9296/global-workflow
EXPDIR: .../RUNTESTS/EXPDIR/C48mx500_hybAOWCDA_3b291f69-9296
ROTDIR: .../RUNTESTS/COMROOT/C48mx500_hybAOWCDA_3b291f69-9296
DATA: /work2/noaa/global/role-global/stmp/HERCULES/RUNDIRS/C48mx500_hybAOWCDA_3b291f69-9296/enkfgdas.2021032500/enkfgdasefcs001.2021032500/fcst.3283753
The forecast job sourced the following configuration chain (all loaded successfully):
- config.base - Machine: HERCULES, PSLOT: C48mx500_hybAOWCDA_3b291f69-9296
- config.com - COM path templates
- config.fcst - Forecast settings, FHMAX_HF=0, FHOUT_HF=0
- config.ufs - UFS coupling settings (S2S: FV3+MOM6+CICE6)
- config.ocn - Ocean: MESH_OCN=mesh.mx500.nc, ODA_INCUPD=True
- config.ice - Ice: min_seaice=1.0e-6
- config.efcs - EnKF forecast: CASE=C48
- config.nsst - NSST: NST_MODEL=2
- config.resources - Resources: walltime=00:20:00
- config.resources.HERCULES - Hercules-specific resource overrides
- HERCULES.env - Environment: OMP_STACKSIZE=512M, FI_MLX_INJECT_LIMIT=0
| Line | Warning | Impact |
|---|---|---|
| 103 | sed: can't read .../COMROOT/date/t00z: No such file or directory | Handled by true (CI setup pattern) |
| 106 | ./PDY: No such file or directory | Handled by true (CI setup pattern) |
| 7385 | unable to interpolate. filled with nearest point value at 42 points | Normal FV3 surface field interpolation behavior |
| 8852 | cat: OUTPUT.3283811: No such file or directory | Model output not written before crash |
These warnings are not causal to the failure.
This failure matches the open tracking issue #4348:
- Filed by: @DavidHuber-NOAA on 2025-12-17
- Status: Open (reopened)
- Assignee: @DavidHuber-NOAA
From the issue description:
"The nightly runs are periodically reporting NaNs when running ensemble forecast jobs. Last night's (12/17/2025) nightly showed the failure on the first full cycle of the C96C48mx500_S2SW_cyc_gfs test and the night before on the C96C48_hybatmDA test. Since the pattern seems to be the ensemble forecasts in different jobs, it may suggest an issue with the eupd job."
| Attribute | Issue #4348 (Dec 2025) | This Failure (Feb 2026) |
|---|---|---|
| Platform | Hercules only | Hercules |
| Error | NaN in ensemble forecasts | NaN in mpp_reproducing_sum |
| Job type | enkfgdas ensemble fcst | enkfgdas_fcst_mem001 |
| Tests affected | C96C48mx500_S2SW_cyc_gfs, C96C48_hybatmDA | C48mx500_hybAOWCDA |
| Transient | Yes; does not reproduce consistently | TBD |
| Suspected source | eupd job upstream | Same hypothesis applies |
This confirms the issue is still active as of February 2026.
- EnKF update (eupd) introduced bad increments: the issue author suspects the eupd job may be the upstream source, since different ensemble members in different tests fail
- MOM6 ocean state corruption: NaN propagation from an unstable ocean grid cell at mx500 resolution
- Hercules-specific hardware/compiler issue: only observed on Hercules, not Hera or other platforms
- Intel oneAPI 2024.2.1 floating-point behavior: the loaded compiler stack might handle denormals or edge cases differently
- Memory/interconnect issue on specific nodes: hercules-05-08 may have intermittent hardware issues
- Update issue #4348 with this data point (commit 3b291f69, Feb 18 nightly, enkfgdas_fcst_mem001 in C48mx500_hybAOWCDA)
- Check if this nightly was retried: transient failures often pass on retry, confirming non-determinism
- Inspect the eupd output for this cycle: look for anomalous increment values in member 001
- Check other members: did mem002-mem030 also fail, or just mem001?
- Cross-reference with Hera nightly: if the same commit passed on Hera, it confirms Hercules-specific behavior
- Node health check: request MSS/admin review of hercules-05-08 for ECC memory errors or interconnect issues
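The "check other members" step could be scripted along the following lines. This is a sketch under assumptions: it supposes that all member logs sit in one directory and follow the enkfgdas_fcst_memNNN.log naming seen in the source log name, and the helper `members_with_nan` is hypothetical. The demonstration runs on synthetic files, not on real nightly output.

```python
import re
import tempfile
from pathlib import Path

NAN_SIGNATURE = "NaN in input field of mpp_reproducing_sum"
MEMBER_LOG = re.compile(r"enkfgdas_fcst_mem(\d{3})\.log$")


def members_with_nan(log_dir):
    """Return sorted member numbers whose forecast log contains the NaN signature."""
    hits = []
    for path in Path(log_dir).iterdir():
        match = MEMBER_LOG.search(path.name)
        if match and NAN_SIGNATURE in path.read_text(errors="replace"):
            hits.append(int(match.group(1)))
    return sorted(hits)


# Demonstration on synthetic logs; real logs would come from the
# experiment's log directory (location assumed, not taken from the report).
with tempfile.TemporaryDirectory() as tmp:
    for member in range(1, 4):
        text = "forecast completed\n"
        if member == 1:
            text = (
                f"FATAL from PE 2: {NAN_SIGNATURE}(_2d), "
                "this indicates numerical instability\n"
            )
        Path(tmp, f"enkfgdas_fcst_mem{member:03d}.log").write_text(text)
    failed = members_with_nan(tmp)
```

If only member 001 is returned for the real cycle, that would point toward a member-specific input (for example its eupd increment) rather than a node-wide or build-wide problem.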
Loaded Modules at Time of Failure
1) contrib/0.1 40) antlr/2.7.7
2) intel-oneapi-compilers/2024.2.1 41) gsl/2.8
...
39) esmf/8.8.0 78) gw_run.hercules
Key modules:
- intel-oneapi-compilers/2024.2.1
- esmf/8.8.0
- spack-stack-1.9.2 (environment: ue-oneapi-2024.1.0)
This report was generated by the EIB MCP-RAG analysis system from the emcbot gist log.