C96_atm3DVar gdas_atmos_prod_f000 Error Analysis PR4359 - TerrenceMcGuinness-NOAA/global-workflow GitHub Wiki

Error Analysis: C96_atm3DVar - gdas_atmos_prod_f000 Failure (PR #4359)

Date: 2025-12-23
Platform: Hercules (MSU)
CI Pipeline: PR #4359 (C96_atm3DVar test case)
PR: #4359 - SFS GLORe ICs
Commit: f4f1d1e0
Error Log: gdas_atmos_prod_f000.log
Slurm Job: 7464840 on hercules-04-28
Analysis Method: EIB MCP-RAG GraphRAG Toolset (v3.6.2, 44 tools, 826 LLM community summaries)


Executive Summary

The JGLOBAL_ATMOS_PRODUCTS job (atmosphere GRIB2 post-processing at forecast hour 000) failed with "paramlistb_f000: unbound variable" at line 20 of exglobal_atmos_products.sh during the C96_atm3DVar CI test case on Hercules.

Root cause: PR #4347 ("GCAFSv1 atmospheric products") modified the shared script exglobal_atmos_products.sh to reference a new variable paramlistb_f000, but only added the export in the GCAFS-specific config.atmos_products. The GFS/GDAS config.atmos_products was not updated. Because set -u (nounset) is active, bash aborted immediately when encountering the undefined variable.

PR #4359 is innocent: it only adds SFS/GCAFS CI test configs and MOM6_INTERP_ICS support. It makes zero changes to exglobal_atmos_products.sh or config.atmos_products.

The bug had already been discovered and reverted via PR #4360 on the same day PR #4347 was merged (Dec 18, 2025), but the CI pipeline for PR #4359 was built against a develop snapshot that still contained the broken code.


1. Error Chain (Chronological)

Step 1 - setpdy.sh fails silently (line 57)

setpdy.sh[57] sed 's/[0-9]\{8\}/20211220/' .../RUNTESTS/COMROOT/date/t18z
sed: can't read .../RUNTESTS/COMROOT/date/t18z: No such file or directory

The COMROOT/date/t18z file was missing. This is a known CI infrastructure issue: the setpdy.sh failure is suppressed by || true in jjob_header.sh (lines 95–96). The PDY variables were already set by the CI driver, so this did not cause the fatal error.

Step 2 - ./PDY source fails silently (line 96)

jjob_header.sh: line 96: ./PDY: No such file or directory

Because setpdy.sh failed, ./PDY was never created. The || true pattern suppresses this error. Extended date range variables (PDYm7…PDYp7) were undefined, but the critical $PDY was already exported.
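The suppression seen in Steps 1 and 2 can be illustrated with a minimal sketch. This is not the content of jjob_header.sh. The temporary path and the sed_suppressed marker are hypothetical, and the PDY value is borrowed from the sed expression in the log purely for illustration. It shows a failing command followed by || true not aborting a script that runs under set -e, while a variable exported earlier stays intact.

```shell
#!/usr/bin/env bash
set -e

# Assumption: PDY was exported by the driver before this point.
export PDY=20211220

# A path that is guaranteed not to exist (stand-in for COMROOT/date/t18z).
missing="$(mktemp -d)/t18z"

# sed exits non-zero on the missing file; "|| true" keeps set -e from aborting.
sed 's/[0-9]\{8\}/20211220/' "${missing}" 2>/dev/null || true
sed_suppressed="yes"

echo "still running, PDY=${PDY}"
```

The cost of this pattern is that a genuinely missing input produces only a log message, which is why the first two errors in the chain are noise rather than the cause of the failure.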

Step 3 - Config files loaded successfully

jjob_header.sh successfully sources:

  • config.base: sets machine=HERCULES, RUN=gdas, CASE=C96, MODE=cycled, APP=ATM
  • config.atmos_products: sets MAX_TASKS=25, FHOUT_PGBS=3, FLXGF=NO, WGNE=NO, downset=2

Critical observation: config.atmos_products exports these paramlist variables:

export paramlista="${PARMgfs}/product/gfs.fFFF.paramlist.a.txt"
export paramlista_anl="${PARMgfs}/product/gfs.anl.paramlist.a.txt"
export paramlista_f000="${PARMgfs}/product/gfs.f000.paramlist.a.txt"
export paramlistb="${PARMgfs}/product/gfs.fFFF.paramlist.b.txt"

Missing: paramlistb_f000 and paramlistb_anl. These were only added to the GCAFS config variant.

Step 4 - FATAL: Unbound variable (line 20)

exglobal_atmos_products.sh: line 20: paramlistb_f000: unbound variable

The script enters the FORECAST_HOUR == 0 branch and attempts to assign:

paramlistb="${paramlistb_f000}"

Since paramlistb_f000 was never exported by the GFS/GDAS config.atmos_products, and set -u is active via preamble.sh, bash aborts immediately with return code 1.
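The abort can be reproduced outside the workflow with a small sketch. It runs the same assignment in a child bash process with set -u enabled, so the failure does not terminate the calling shell. The rc and err variable names and the "not reached" marker are illustrative only; the exact wording of the message prefix varies between bash versions.

```shell
#!/usr/bin/env bash
# Make sure the variable is really unset in this environment.
unset paramlistb_f000

rc=0
# The child shell enables nounset and then references the unset variable.
err=$(bash -c 'set -u; paramlistb="${paramlistb_f000}"; echo "not reached"' 2>&1) || rc=$?

echo "exit code: ${rc}"
echo "message:   ${err}"
```

The child shell stops at the assignment with a non-zero status and never prints "not reached", which mirrors how the ex-script failed before producing any output.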

Step 5 - Job cleanup and termination

FATAL ERROR: Job atmos_products_f000.1420408 failed RETURN CODE 1
ABNORMAL EXIT at Tue Dec 23 10:34:15 CST 2025 on hercules-04-28

The DATA directory contained only an empty ncepdate file (0 bytes). The OUTPUT.1420469 log was never created because the script failed before producing any output.

Slurm cancelled the job:

slurmstepd: error: *** JOB 7464840 ON hercules-04-28 CANCELLED AT 2025-12-23T10:34:15 ***

2. Root Cause: PR #4347 Cross-System Config Gap

The Offending PR

PR #4347 ("GCAFSv1 atmospheric products") by @bbakernoaa made three key changes:

  1. Modified the shared script scripts/exglobal_atmos_products.sh to reference new variables paramlistb_f000 and paramlistb_anl when FORECAST_HOUR <= 0
  2. Added exports in parm/config/gcafs/config.atmos_products for the new variables
  3. Did NOT add exports in parm/config/gfs/config.atmos_products (used by GFS, GDAS, GEFS)

Timeline

| Date | Event |
| --- | --- |
| Dec 17, 2025 | PR #4347 opened. Reviewed by CoryMartin-NOAA, aerorahul, lipan-NOAA |
| Dec 18, 2025 | PR #4347 merged after passing only GCAFS tests on Gaea C6 |
| Dec 18, 2025 | DavidHuber-NOAA discovers breakage: "I should have checked this more thoroughly on C6. I only ran the GCAFS tests" |
| Dec 18, 2025 | PR #4360 created and merged to revert PR #4347 (commit 3854275) |
| Dec 23, 2025 | PR #4359 CI job fails on Hercules; built against a develop snapshot between the #4347 merge and the #4360 revert |

Why PR #4359's CI Hit the Already-Reverted Bug

The CI pipeline for PR #4359 was triggered (or its develop merge-base snapshot was taken) in the window between the merge of #4347 and the revert #4360 becoming effective. The Hercules CI job executed asynchronously on Dec 23 using a develop snapshot that still contained the broken changes.

Code Symmetry Gap

The exglobal_atmos_products.sh script is shared across all systems (GFS, GDAS, GEFS, GCAFS, SFS). When PR #4347 added paramlistb_f000 usage to this shared script, it created an asymmetry:

| Config Variant | paramlistb | paramlistb_f000 | paramlistb_anl |
| --- | --- | --- | --- |
| gfs/config.atmos_products | ✅ Exported | ❌ Missing | ❌ Missing |
| gefs/config.atmos_products | ✅ Exported | ❌ Missing | ❌ Missing |
| sfs/config.atmos_products | ✅ Exported | ❌ Missing | ❌ Missing |
| gcafs/config.atmos_products | ✅ Exported | ✅ Exported | ✅ Exported |
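An asymmetry of this kind could be detected mechanically. The following is a rough sketch, not an existing CI check: check_parity is a hypothetical helper, the list of variables is supplied by hand, and the demo runs against throwaway mock files (only two variants, with placeholder values) rather than the real parm/config tree.

```shell
#!/usr/bin/env bash
# Print "variant:variable" for every config variant that lacks an export.
# Assumed layout: <config_root>/<variant>/config.atmos_products
check_parity() {
    local config_root=$1; shift
    local cfg variant var
    for cfg in "${config_root}"/*/config.atmos_products; do
        variant=$(basename "$(dirname "${cfg}")")
        for var in "$@"; do
            grep -Eq "^[[:space:]]*export[[:space:]]+${var}=" "${cfg}" \
                || echo "${variant}:${var}"
        done
    done
}

# Demo on mock configs; file contents are placeholders, not the real exports.
root=$(mktemp -d)
mkdir -p "${root}/gfs" "${root}/gcafs"
echo 'export paramlistb="b.txt"' > "${root}/gfs/config.atmos_products"
printf 'export paramlistb="b.txt"\nexport paramlistb_f000="b0.txt"\nexport paramlistb_anl="ba.txt"\n' \
    > "${root}/gcafs/config.atmos_products"

missing=$(check_parity "${root}" paramlistb paramlistb_f000 paramlistb_anl)
echo "${missing}"
rm -rf "${root}"
```

On the mock data this reports the two gaps in the gfs variant and nothing for gcafs. A real check would also need to derive the variable list from the shared scripts, which this sketch does not attempt.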

3. Execution Flow (MCP Tool-Verified)

The following execution chain was verified using EIB MCP tools:

JGLOBAL_ATMOS_PRODUCTS (J-Job)
  │
  ├── sources: jjob_header.sh
  │     ├── sources: config.base
  │     ├── sources: config.atmos_products  ← Missing paramlistb_f000
  │     └── sources: HERCULES.env
  │
  └── invokes: exglobal_atmos_products.sh
        │   Line 20: paramlistb="${paramlistb_f000}"  ← FATAL
        │
        ├── calls: interp_atmos_master.sh  (never reached)
        ├── calls: interp_atmos_sflux.sh   (never reached)
        ├── calls: run_mpmd.sh             (never reached)
        └── calls: product_functions.sh    (never reached)

Environment variable dependencies (from trace_execution_path):

  • COMIN_ATMOS_MASTER, COMIN_ATMOS_ANALYSIS, COMIN_ATMOS_HISTORY
  • FORECAST_HOUR, fhr3, USHgfs, WGNE, PGBS
  • paramlista_anl, FLUX_FILE, DBNROOT, input_file

4. Recommendations

Immediate (Prevent Recurrence)

  1. Cross-system CI testing for shared scripts: When a PR modifies any file under scripts/ or ush/, CI should automatically run all system variants (GFS, GDAS, GCAFS, SFS, GEFS), not just the variant being developed. This is the single most important fix.

  2. Defensive variable references: Use ${paramlistb_f000:-${paramlistb}} fallback patterns when adding system-specific overrides to shared scripts. This ensures unset variables gracefully fall back rather than crashing.

  3. Config parity validation: Add a pre-merge CI check that scans exglobal_*.sh scripts for variable references and verifies that every variable exists in ALL config variants under parm/config/{gfs,gefs,sfs,gcafs}/.
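The defensive reference in recommendation 2 can be sketched as follows. The select_paramlistb helper and the file names assigned here are illustrative assumptions, not code from exglobal_atmos_products.sh. The sketch runs under set -u to show that the fallback form does not trip nounset.

```shell
#!/usr/bin/env bash
set -u

# Hypothetical helper: prefer the f000-specific list, else the generic one.
select_paramlistb() {
    echo "${paramlistb_f000:-${paramlistb}}"
}

paramlistb="gfs.fFFF.paramlist.b.txt"

# Case 1: the override was never exported (the GFS/GDAS situation).
unset paramlistb_f000
without_override=$(select_paramlistb)

# Case 2: the override is exported (illustrative value only).
paramlistb_f000="gcafs.f000.paramlist.b.txt"
with_override=$(select_paramlistb)

echo "unset: ${without_override}"
echo "set:   ${with_override}"
```

Note that the :- form also falls back when the override is set but empty. If an empty value should be honored, ${paramlistb_f000-${paramlistb}} (without the colon) falls back only when the variable is unset.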

Structural (Long-Term)

  1. Atomic config+script changes: When a config variant introduces a new variable consumed by a shared script, the export must be added to ALL config variants in the same commit (or guarded with ${VAR:-default}).

  2. CI queue ordering for reverts: Revert PRs should be fast-tracked to invalidate or cancel in-flight CI pipelines built against the broken state.

  3. EE2 Bash compliance: The EE2 standard states "Use set to set shell options so that calling your script as bash script_name does not break its functionality." The set -u option is correct and should be maintained: the fix belongs in the config, not in weakening error detection.


5. MCP Tool Call Scorecard

30 EIB MCP tool calls were made during this analysis. Each is rated for usefulness.

| # | Tool | Query/Params | Return Summary | Rating |
| --- | --- | --- | --- | --- |
| 1 | get_job_details | JGLOBAL_ATMOS_PRODUCTS | Job path, config files (base, atmos_products), env vars, ChromaDB docs | HIGH |
| 2 | find_env_dependencies | paramlistb | 0 scripts found (not in env graph) | LOW |
| 3 | get_code_context | exglobal_atmos_products | File path, SOURCES/EXPORTS GGSR relationships | MEDIUM |
| 4 | explain_workflow_component | exglobal_atmos_products.sh | Generic expert notes only | LOW |
| 5 | search_architecture | atmos_products paramlist grib2 post-processing | 5 LLM community summaries (atmospheric ops, data integration) | MEDIUM |
| 6 | analyze_code_structure | scripts/exglobal_atmos_products.sh | File not found (path mismatch) | FAILED |
| 7 | find_callers_callees | exglobal_atmos_products | Caller: JGLOBAL_ATMOS_PRODUCTS. Callees: interp_atmos_master, run_mpmd, etc. | HIGH |
| 8 | trace_execution_path | exglobal_atmos_products | 12 script invocations, env deps (COMIN_ATMOS_MASTER, fhr3, etc.) | HIGH |
| 9 | search_documentation | paramlistb paramlist atmos_products config grib2 f000 | Error: resource not found | FAILED |
| 10 | find_dependencies | config.atmos_products | No imports/exporters found | LOW |
| 11 | search_issues | paramlistb unbound variable atmos_products | No issues found | LOW |
| 12 | get_operational_guidance | post-processing | Invalid input error (enum constraint) | FAILED |
| 13 | trace_data_flow | config.atmos_products | No data flow found | LOW |
| 14 | find_env_dependencies | paramlista | GGSR found exglobal_atmos_products.sh (weight 0.8) | MEDIUM |
| 15 | search_ee2_standards | unbound variable defensive coding bash set -u nounset | 10 standards (EE2 shell guidelines) | MEDIUM |
| 16 | get_change_impact | exglobal_atmos_products | Risk: LOW (0.10), 0 dependents (an underestimate) | LOW |
| 17 | explain_with_context | JGLOBAL_ATMOS_PRODUCTS job atmos grib2 post-processing | Empty summary returned | FAILED |
| 18 | describe_component | config.atmos_products | Full file content; revealed missing paramlistb_f000 export | HIGH |
| 19 | list_job_scripts | (none) | 90 jobs across 5 categories | MEDIUM |
| 20 | describe_component | exglobal_atmos_products.sh | Full script preview (13KB); showed f000 branching logic | HIGH |
| 21 | get_pull_requests | (none) | 10 open PRs, none related (already merged) | LOW |
| 22 | analyze_workflow_dependencies | exglobal_atmos_products.sh | No upstream/downstream found | LOW |
| 23 | search_issues | Various queries (×3) | No issues found | LOW |
| 24 | find_similar_code | paramlistb_f000 | No code above 0.7 similarity threshold | LOW |
| 25 | find_similar_code | config.atmos_products paramlistb paramlistb_f000 | No code above 0.7 similarity | LOW |
| 26 | analyze_ee2_compliance | Script/config mismatch description | Relevant EE2 shell standards returned | MEDIUM |
| 27 | trace_full_execution_chain | config.atmos_products | Invalid input error (enum constraint) | FAILED |
| 28 | get_knowledge_base_status | (none) | 64,657 docs, 589,396 relationships, all healthy | MEDIUM |
| 29 | validate_sdd_compliance | Config mismatch description | 2 passed, 1 warning (generic) | LOW |
| 30 | get_workflow_structure | (none) | System layout: jobs→scripts→ush→parm | MEDIUM |

Scorecard Summary

| Rating | Count | % |
| --- | --- | --- |
| HIGH | 5 | 17% |
| MEDIUM | 8 | 27% |
| LOW | 12 | 40% |
| FAILED | 5 | 17% |

Top 5 Most Useful Tools

  1. describe_component (config.atmos_products): revealed the missing paramlistb_f000 export, the smoking gun
  2. describe_component (exglobal_atmos_products.sh): full script showing the f000 branching logic at line 20
  3. find_callers_callees: complete call tree with GGSR weights, confirmed the JGLOBAL → exglobal chain
  4. trace_execution_path: full environment variable dependency mapping across the execution chain
  5. get_job_details: J-Job config sourcing chain (base + atmos_products)

Key Gaps Identified

  • find_env_dependencies doesn't track variables defined in config scripts (only direct export in tracked scripts)
  • get_change_impact underestimates blast radius for shared scripts (scored 0.10 for a script used by every forecast cycle)
  • explain_with_context and explain_workflow_component returned empty/generic content
  • get_operational_guidance has strict undiscoverable enum constraints
  • search_architecture LLM community summaries are well-written but none covered the specific atmos_products paramlist configuration pattern

6. Related PRs and Issues

| PR/Issue | Description | Status |
| --- | --- | --- |
| PR #4347 | GCAFSv1 atmospheric products (introduced the bug) | Merged Dec 18, then reverted |
| PR #4359 | SFS GLORe ICs (innocent bystander) | Merged Jan 7 |
| PR #4360 | Revert "GCAFSv1 atmospheric products" | Merged Dec 18 |
| PR #4407 | Adding GCAFS products v2 (fixed resubmission) | Open |
| Issue #4346 | GCAFSv1 product creation | Closed by #4347 |

Analysis performed using the EIB MCP-RAG platform v3.6.2 with 44 tools, 826 LLM-generated community summaries (Phase 24E-6), and Graph-Guided Semantic Retrieval (GGSR). February 25, 2026.