C96_atm3DVar gdas_atmos_prod_f000 Error Analysis PR4359 - TerrenceMcGuinness-NOAA/global-workflow GitHub Wiki
Error Analysis: C96_atm3DVar - gdas_atmos_prod_f000 Failure (PR #4359)
Date: 2025-12-23
Platform: Hercules (MSU)
CI Pipeline: PR #4359 (C96_atm3DVar test case)
PR: #4359 - SFS GLORe ICs
Commit: f4f1d1e0
Error Log: gdas_atmos_prod_f000.log
Slurm Job: 7464840 on hercules-04-28
Analysis Method: EIB MCP-RAG GraphRAG Toolset (v3.6.2, 44 tools, 826 LLM community summaries)
Executive Summary
The JGLOBAL_ATMOS_PRODUCTS job (atmosphere GRIB2 post-processing at forecast hour 000) failed with "paramlistb_f000: unbound variable" at line 20 of exglobal_atmos_products.sh during the C96_atm3DVar CI test case on Hercules.
Root cause: PR #4347 ("GCAFSv1 atmospheric products") modified the shared script exglobal_atmos_products.sh to reference a new variable paramlistb_f000, but only added the export in the GCAFS-specific config.atmos_products. The GFS/GDAS config.atmos_products was not updated. Because set -u (nounset) is active, bash aborted immediately when encountering the undefined variable.
PR #4359 is innocent: it only adds SFS/GCAFS CI test configs and MOM6_INTERP_ICS support. It makes zero changes to exglobal_atmos_products.sh or config.atmos_products.
The bug had already been discovered and reverted via PR #4360 on Dec 18, 2025, the same day PR #4347 was merged, but the CI pipeline for PR #4359 was built against a develop snapshot that still contained the broken code.
1. Error Chain (Chronological)
Step 1 - setpdy.sh fails silently (line 57)
setpdy.sh[57] sed 's/[0-9]\{8\}/20211220/' .../RUNTESTS/COMROOT/date/t18z
sed: can't read .../RUNTESTS/COMROOT/date/t18z: No such file or directory
The COMROOT/date/t18z file was missing. This is a known CI infrastructure issue: the setpdy.sh failure is suppressed by || true in jjob_header.sh (lines 95-96). The PDY variables were already set by the CI driver, so this did not cause the fatal error.
Step 2 - ./PDY source fails silently (line 96)
jjob_header.sh: line 96: ./PDY: No such file or directory
Because setpdy.sh failed, ./PDY was never created. The || true pattern suppresses this error. Extended date range variables (PDYm7…PDYp7) were undefined, but the critical $PDY was already exported.
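The suppression described in Steps 1 and 2 can be illustrated with a small sketch. The exact statement in jjob_header.sh is not shown in the log, so the `. ./PDY || true` form below is an assumption, as are the `demo_dir` and `suppressed_result` names. The sketch runs in a child bash from an empty directory so that ./PDY is guaranteed to be absent.

```shell
# Sketch of the suppression pattern; directory and variable names are hypothetical.
demo_dir=$(mktemp -d)

# Run in a child bash from an empty directory, so ./PDY does not exist.
suppressed_result=$(cd "${demo_dir}" && LC_ALL=C bash -c '
  unset PDYm1
  PDY=20211220                   # stands in for the value exported by the CI driver
  . ./PDY 2>/dev/null || true    # sourcing fails, but "|| true" hides the failure
  echo "PDY=${PDY} PDYm1=${PDYm1:-unset}"
')

echo "${suppressed_result}"
rm -rf "${demo_dir}"
```

In this sketch the script keeps running with $PDY intact while the extended date variable stays undefined, which matches the behaviour described above.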
Step 3 - Config files loaded successfully
jjob_header.sh successfully sources:
- `config.base`: sets `machine=HERCULES`, `RUN=gdas`, `CASE=C96`, `MODE=cycled`, `APP=ATM`
- `config.atmos_products`: sets `MAX_TASKS=25`, `FHOUT_PGBS=3`, `FLXGF=NO`, `WGNE=NO`, `downset=2`
Critical observation: config.atmos_products exports these paramlist variables:
export paramlista="${PARMgfs}/product/gfs.fFFF.paramlist.a.txt"
export paramlista_anl="${PARMgfs}/product/gfs.anl.paramlist.a.txt"
export paramlista_f000="${PARMgfs}/product/gfs.f000.paramlist.a.txt"
export paramlistb="${PARMgfs}/product/gfs.fFFF.paramlist.b.txt"
Missing: paramlistb_f000 and paramlistb_anl. These were only added to the GCAFS config variant.
Step 4 - FATAL: Unbound variable (line 20)
exglobal_atmos_products.sh: line 20: paramlistb_f000: unbound variable
The script enters the FORECAST_HOUR == 0 branch and attempts to assign:
paramlistb="${paramlistb_f000}"
Since paramlistb_f000 was never exported by the GFS/GDAS config.atmos_products, and set -u is active via preamble.sh, bash aborts immediately with return code 1.
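The abort can be reproduced in isolation. The following is a minimal sketch, not the workflow's own code: it runs the failing assignment in a child bash process with nounset enabled and only paramlistb exported, as in the GFS/GDAS config. The path value and the `repro_*` variable names are hypothetical.

```shell
# Minimal reproduction sketch; the path and the repro_* names are hypothetical.
# Only paramlistb is exported, mirroring the GFS/GDAS config.atmos_products.
export paramlistb="/hypothetical/parm/product/gfs.fFFF.paramlist.b.txt"
unset paramlistb_f000

# Run the assignment in a child bash so the abort does not end this shell.
if repro_output=$(LC_ALL=C bash -c 'set -u; paramlistb="${paramlistb_f000}"; echo reached' 2>&1); then
  repro_status=0
else
  repro_status=$?
fi

echo "exit status: ${repro_status}"
echo "output: ${repro_output}"
```

The child shell is expected to exit with a non-zero status and report the unbound variable without ever printing "reached", which is the same behaviour seen at line 20 of exglobal_atmos_products.sh.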
Step 5 - Job cleanup and termination
FATAL ERROR: Job atmos_products_f000.1420408 failed RETURN CODE 1
ABNORMAL EXIT at Tue Dec 23 10:34:15 CST 2025 on hercules-04-28
The DATA directory contained only an empty ncepdate file (0 bytes). The OUTPUT.1420469 log was never created because the script failed before producing any output.
Slurm cancelled the job:
slurmstepd: error: *** JOB 7464840 ON hercules-04-28 CANCELLED AT 2025-12-23T10:34:15 ***
2. Root Cause: PR #4347 Cross-System Config Gap
The Offending PR
PR #4347 ("GCAFSv1 atmospheric products") by @bbakernoaa made three key changes:
- Modified the shared script `scripts/exglobal_atmos_products.sh` to reference new variables `paramlistb_f000` and `paramlistb_anl` when `FORECAST_HOUR <= 0`
- Added exports in `parm/config/gcafs/config.atmos_products` for the new variables
- Did NOT add exports in `parm/config/gfs/config.atmos_products` (used by GFS, GDAS, GEFS)
Timeline
| Date | Event |
|---|---|
| Dec 17, 2025 | PR #4347 opened. Reviewed by CoryMartin-NOAA, aerorahul, lipan-NOAA |
| Dec 18, 2025 | PR #4347 merged after passing only GCAFS tests on Gaea C6 |
| Dec 18, 2025 | DavidHuber-NOAA discovers breakage: "I should have checked this more thoroughly on C6. I only ran the GCAFS tests" |
| Dec 18, 2025 | PR #4360 created and merged to revert PR #4347 (commit 3854275) |
| Dec 23, 2025 | PR #4359 CI job fails on Hercules; built against a develop snapshot between #4347 merge and #4360 revert |
Why PR #4359's CI Hit the Already-Reverted Bug
The CI pipeline for PR #4359 was triggered (or its develop merge-base snapshot was taken) in the window between the merge of #4347 and the revert #4360 becoming effective. The Hercules CI job executed asynchronously on Dec 23 using a develop snapshot that still contained the broken changes.
Code Symmetry Gap
The exglobal_atmos_products.sh script is shared across all systems (GFS, GDAS, GEFS, GCAFS, SFS). When PR #4347 added paramlistb_f000 usage to this shared script, it created an asymmetry:
| Config Variant | `paramlistb` | `paramlistb_f000` | `paramlistb_anl` |
|---|---|---|---|
| `gfs/config.atmos_products` | Exported | Missing | Missing |
| `gefs/config.atmos_products` | Exported | Missing | Missing |
| `sfs/config.atmos_products` | Exported | Missing | Missing |
| `gcafs/config.atmos_products` | Exported | Exported | Exported |
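An asymmetry like the one in this table can be detected mechanically. The sketch below is one possible approach, under stated assumptions: it builds a temporary directory tree that stands in for `parm/config/{gfs,gefs,sfs,gcafs}/config.atmos_products`, with placeholder values, and the `find_missing_exports` helper is a hypothetical name rather than an existing workflow tool.

```shell
# Sketch of a config parity check. The temporary tree below is a hypothetical
# stand-in for parm/config/{gfs,gefs,sfs,gcafs}/config.atmos_products.
parity_root=$(mktemp -d)
for variant in gfs gefs sfs gcafs; do
  mkdir -p "${parity_root}/${variant}"
  echo 'export paramlistb="placeholder"' > "${parity_root}/${variant}/config.atmos_products"
done
# Only the gcafs variant exports the two new variables, as in the table.
{
  echo 'export paramlistb_f000="placeholder"'
  echo 'export paramlistb_anl="placeholder"'
} >> "${parity_root}/gcafs/config.atmos_products"

# Print "<variant> <variable>" for every variable a variant fails to export.
find_missing_exports() {
  root=$1; shift
  for variant in gfs gefs sfs gcafs; do
    for var in "$@"; do
      grep -Eq "^[[:space:]]*export[[:space:]]+${var}=" "${root}/${variant}/config.atmos_products" \
        || echo "${variant} ${var}"
    done
  done
}

missing_exports=$(find_missing_exports "${parity_root}" paramlistb paramlistb_f000 paramlistb_anl)
echo "${missing_exports}"
rm -rf "${parity_root}"
```

With the stand-in tree, the helper reports the gfs, gefs, and sfs variants as missing both new variables and reports nothing for gcafs. A real check would point `root` at the repository's config directory instead of a temporary one.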
3. Execution Flow (MCP Tool-Verified)
The following execution chain was verified using EIB MCP tools:
JGLOBAL_ATMOS_PRODUCTS (J-Job)
│
├── sources: jjob_header.sh
│   ├── sources: config.base
│   ├── sources: config.atmos_products  ← Missing paramlistb_f000
│   └── sources: HERCULES.env
│
└── invokes: exglobal_atmos_products.sh
    │   Line 20: paramlistb="${paramlistb_f000}"  ← FATAL
    │
    ├── calls: interp_atmos_master.sh (never reached)
    ├── calls: interp_atmos_sflux.sh (never reached)
    ├── calls: run_mpmd.sh (never reached)
    └── calls: product_functions.sh (never reached)
Environment variable dependencies (from trace_execution_path):
- `COMIN_ATMOS_MASTER`, `COMIN_ATMOS_ANALYSIS`, `COMIN_ATMOS_HISTORY`
- `FORECAST_HOUR`, `fhr3`, `USHgfs`, `WGNE`, `PGBS`
- `paramlista_anl`, `FLUX_FILE`, `DBNROOT`, `input_file`
4. Recommendations
Immediate (Prevent Recurrence)
- Cross-system CI testing for shared scripts: When a PR modifies any file under `scripts/` or `ush/`, CI should automatically run all system variants (GFS, GDAS, GCAFS, SFS, GEFS), not just the variant being developed. This is the single most important fix.
- Defensive variable references: Use `${paramlistb_f000:-${paramlistb}}` fallback patterns when adding system-specific overrides to shared scripts. This ensures unset variables gracefully fall back rather than crashing.
- Config parity validation: Add a pre-merge CI check that scans `exglobal_*.sh` scripts for variable references and verifies that every variable exists in ALL config variants under `parm/config/{gfs,gefs,sfs,gcafs}/`.
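The fallback pattern from the second recommendation can be sketched as follows. This is an illustration only, run in a child bash with nounset enabled; the paths are hypothetical placeholders and `fallback_output` is an assumed name.

```shell
# Sketch of the fallback pattern under nounset; paths are hypothetical placeholders.
fallback_output=$(LC_ALL=C bash -c '
  set -u
  paramlistb="/hypothetical/gfs.fFFF.paramlist.b.txt"
  unset paramlistb_f000

  # Config without the override: the expansion falls back to the generic list.
  echo "without override: ${paramlistb_f000:-${paramlistb}}"

  # Config with the override, as the GCAFS variant would export it.
  paramlistb_f000="/hypothetical/gcafs.f000.paramlist.b.txt"
  echo "with override: ${paramlistb_f000:-${paramlistb}}"
')
echo "${fallback_output}"
```

The `:-` expansion does not trip nounset, so `set -u` stays active for every other variable in the script. The tradeoff is that a silent fallback can also hide a config export that was forgotten by mistake.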
Structural (Long-Term)
- Atomic config+script changes: When a config variant introduces a new variable consumed by a shared script, the export must be added to ALL config variants in the same commit (or guarded with `${VAR:-default}`).
- CI queue ordering for reverts: Revert PRs should be fast-tracked to invalidate or cancel in-flight CI pipelines built against the broken state.
- EE2 Bash compliance: The EE2 standard states "Use `set` to set shell options so that calling your script as `bash script_name` does not break its functionality." The `set -u` option is correct and should be maintained; the fix belongs in the config, not in weakening error detection.
5. MCP Tool Call Scorecard
30 EIB MCP tool calls were made during this analysis. Each is rated for usefulness.
| # | Tool | Query/Params | Return Summary | Rating |
|---|---|---|---|---|
| 1 | `get_job_details` | `JGLOBAL_ATMOS_PRODUCTS` | Job path, config files (base, atmos_products), env vars, ChromaDB docs | HIGH |
| 2 | `find_env_dependencies` | `paramlistb` | 0 scripts found (not in env graph) | LOW |
| 3 | `get_code_context` | `exglobal_atmos_products` | File path, SOURCES/EXPORTS GGSR relationships | MEDIUM |
| 4 | `explain_workflow_component` | `exglobal_atmos_products.sh` | Generic expert notes only | LOW |
| 5 | `search_architecture` | `atmos_products paramlist grib2 post-processing` | 5 LLM community summaries (atmospheric ops, data integration) | MEDIUM |
| 6 | `analyze_code_structure` | `scripts/exglobal_atmos_products.sh` | File not found (path mismatch) | FAILED |
| 7 | `find_callers_callees` | `exglobal_atmos_products` | Caller: JGLOBAL_ATMOS_PRODUCTS. Callees: interp_atmos_master, run_mpmd, etc. | HIGH |
| 8 | `trace_execution_path` | `exglobal_atmos_products` | 12 script invocations, env deps (COMIN_ATMOS_MASTER, fhr3, etc.) | HIGH |
| 9 | `search_documentation` | `paramlistb paramlist atmos_products config grib2 f000` | Error: resource not found | FAILED |
| 10 | `find_dependencies` | `config.atmos_products` | No imports/exporters found | LOW |
| 11 | `search_issues` | `paramlistb unbound variable atmos_products` | No issues found | LOW |
| 12 | `get_operational_guidance` | `post-processing` | Invalid input error (enum constraint) | FAILED |
| 13 | `trace_data_flow` | `config.atmos_products` | No data flow found | LOW |
| 14 | `find_env_dependencies` | `paramlista` | GGSR found exglobal_atmos_products.sh (weight 0.8) | MEDIUM |
| 15 | `search_ee2_standards` | `unbound variable defensive coding bash set -u nounset` | 10 standards (EE2 shell guidelines) | MEDIUM |
| 16 | `get_change_impact` | `exglobal_atmos_products` | Risk: LOW (0.10), 0 dependents (an underestimate) | LOW |
| 17 | `explain_with_context` | `JGLOBAL_ATMOS_PRODUCTS job atmos grib2 post-processing` | Empty summary returned | FAILED |
| 18 | `describe_component` | `config.atmos_products` | Full file content; revealed missing `paramlistb_f000` export | HIGH |
| 19 | `list_job_scripts` | (none) | 90 jobs across 5 categories | MEDIUM |
| 20 | `describe_component` | `exglobal_atmos_products.sh` | Full script preview (13KB); showed f000 branching logic | HIGH |
| 21 | `get_pull_requests` | (none) | 10 open PRs, none related (already merged) | LOW |
| 22 | `analyze_workflow_dependencies` | `exglobal_atmos_products.sh` | No upstream/downstream found | LOW |
| 23 | `search_issues` | Various queries (×3) | No issues found | LOW |
| 24 | `find_similar_code` | `paramlistb_f000` | No code above 0.7 similarity threshold | LOW |
| 25 | `find_similar_code` | `config.atmos_products paramlistb paramlistb_f000` | No code above 0.7 similarity | LOW |
| 26 | `analyze_ee2_compliance` | Script/config mismatch description | Relevant EE2 shell standards returned | MEDIUM |
| 27 | `trace_full_execution_chain` | `config.atmos_products` | Invalid input error (enum constraint) | FAILED |
| 28 | `get_knowledge_base_status` | (none) | 64,657 docs, 589,396 relationships, all healthy | MEDIUM |
| 29 | `validate_sdd_compliance` | Config mismatch description | 2 passed, 1 warning (generic) | LOW |
| 30 | `get_workflow_structure` | (none) | System layout: jobs → scripts → ush → parm | MEDIUM |
Scorecard Summary
| Rating | Count | % |
|---|---|---|
| HIGH | 5 | 17% |
| MEDIUM | 8 | 27% |
| LOW | 12 | 40% |
| FAILED | 5 | 17% |
Top 5 Most Useful Tools
- `describe_component` (config.atmos_products): Revealed the missing `paramlistb_f000` export, the smoking gun
- `describe_component` (exglobal_atmos_products.sh): Full script showing the f000 branching logic at line 20
- `find_callers_callees`: Complete call tree with GGSR weights, confirmed JGLOBAL → exglobal chain
- `trace_execution_path`: Full environment variable dependency mapping across the execution chain
- `get_job_details`: J-Job config sourcing chain (base + atmos_products)
Key Gaps Identified
- `find_env_dependencies` doesn't track variables defined in config scripts (only direct `export` in tracked scripts)
- `get_change_impact` underestimates blast radius for shared scripts (scored 0.10 for a script used by every forecast cycle)
- `explain_with_context` and `explain_workflow_component` returned empty/generic content
- `get_operational_guidance` has strict undiscoverable enum constraints
- `search_architecture` LLM community summaries are well-written but none covered the specific atmos_products paramlist configuration pattern
6. Related PRs and Issues
| PR/Issue | Description | Status |
|---|---|---|
| PR #4347 | GCAFSv1 atmospheric products (introduced the bug) | Merged Dec 18, then reverted |
| PR #4359 | SFS GLORe ICs (innocent bystander) | Merged Jan 7 |
| PR #4360 | Revert "GCAFSv1 atmospheric products" | Merged Dec 18 |
| PR #4407 | Adding GCAFS products v2 (fixed resubmission) | Open |
| Issue #4346 | GCAFSv1 product creation | Closed by #4347 |
Analysis performed using the EIB MCP-RAG platform v3.6.2 with 44 tools, 826 LLM-generated community summaries (Phase 24E-6), and Graph-Guided Semantic Retrieval (GGSR). February 25, 2026.