C96C48_hybatmDA JGLOBAL_ENKF_SFC Error Analysis PR4327 - TerrenceMcGuinness-NOAA/global-workflow GitHub Wiki
Error Analysis: C96C48_hybatmDA — JGLOBAL_ENKF_SFC Failure (PR #4327)
Date: 2025-12-16
Platform: Hercules (MSU)
CI Pipeline: #6388
PR: #4327
Commit: a05fd274
Gist: Error Log
Slurm Job: 7397598 on hercules-02-55
Analysis Method: EIB MCP-RAG GraphRAG Toolset (v3.6.2, 42 tools)
Executive Summary
The JGLOBAL_ENKF_SFC job (ensemble surface analysis) failed with "Failed to update surface fields! RETURN CODE 1" during the C96C48_hybatmDA CI test case on Hercules. Root cause: the RUNTESTS/COMROOT/date/t00z file was missing, so setpdy.sh failed. The jjob_header.sh error-suppression pattern (|| true) swallowed this failure, allowing the job to proceed with $PDY pre-set by the CI driver but without the extended date variables (PDYm7…PDYp7).
1. Error Chain (Chronological)
Step 1 — setpdy.sh fails (line 57)
setpdy.sh[57] sed 's/[0-9]\{8\}/20211221/' .../RUNTESTS/COMROOT/date/t00z
sed: can't read .../RUNTESTS/COMROOT/date/t00z: No such file or directory
setpdy.sh uses $COMDATEROOT/date/$cycle to create the PDY file. In production, this file is maintained by the NCO date-setting cron (updated at 11:30 and 23:30 UTC). In CI/CD, the test harness is responsible for provisioning RUNTESTS/COMROOT/date/t00z. This file was not created for the C96C48_hybatmDA test case.
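The dependency described above can be checked before setpdy.sh runs. The following is a minimal sketch, assuming only that the date file must exist and be non-empty; check_date_file is a hypothetical helper name and is not part of global-workflow or prod_util.

```shell
# Sketch: pre-flight check for the date file that setpdy.sh reads.
# check_date_file is a hypothetical helper (an assumption for illustration).
#   $1 = COMDATEROOT
#   $2 = cycle, e.g. t00z
# Returns 0 if $1/date/$2 is a readable, non-empty file; 1 otherwise.
check_date_file() {
  if [ -r "$1/date/$2" ] && [ -s "$1/date/$2" ]; then
    return 0
  fi
  echo "[WARN] date file missing or empty: $1/date/$2" >&2
  return 1
}
```

A CI setup step could call check_date_file "$COMDATEROOT" "$cycle" and report the result, which would surface the missing file before any j-job starts.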
Step 2 — jjob_header.sh swallows the error (lines 95–96)
setpdy.sh || true # line 95: setpdy.sh fails, but || true suppresses
source ./PDY || true # line 96: PDY file was never created, but || true suppresses
This is the critical design decision. The || true pattern exists because jjob_header.sh expects PDY to already be exported by the parent environment in CI contexts (the CI framework pre-sets $PDY via ecFlow/Rocoto). However, when setpdy.sh also fails to create the ./PDY file, the date variables PDYm7…PDYp7 are never defined.
Despite the suppressed failure, $PDY was already set (20211221) by the CI driver before jjob_header.sh was called. So the primary PDY variable was available, but the extended date range variables were not.
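To illustrate why the failure was invisible to the caller, the sketch below contrasts the || true guard with a variant that records the exit code. fake_setpdy is a hypothetical stand-in for setpdy.sh failing on a missing date file; it is not the real script.

```shell
# Hypothetical stand-in for setpdy.sh when the date file is absent.
fake_setpdy() { return 1; }

# Pattern used in jjob_header.sh: the failure is discarded.
fake_setpdy || true
guarded_status=$?   # 0, so the caller cannot tell that anything went wrong

# Alternative sketch: keep going, but remember the exit code for later checks.
setpdy_rc=0
fake_setpdy || setpdy_rc=$?
if [ "$setpdy_rc" -ne 0 ]; then
  echo "[WARN] setpdy step exited with code $setpdy_rc" >&2
fi
```

Both forms let the job continue; only the second leaves a trace that the date setup did not run.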
Step 3 — Job proceeds into exglobal_enkf_sfc.sh
The script sources config.base and config.esfc, sets up ensemble member directories, and enters the regrid_gsiSfcIncr_to_tile.sh processing loop for mem001 and mem002.
Step 4 — Surface field update fails
Failed to update surface fields! RETURN CODE 1
The global_cycle executable (called via MPMD through regrid_gsiSfcIncr_to_tile.sh) returned a non-zero exit code. The exact cause within global_cycle is not visible in the trace, but the missing date environment may have caused incorrect file paths for first-guess or analysis increment files.
Step 5 — Job cancelled
cat: OUTPUT.909007: No such file or directory
slurmstepd: error: *** JOB 7397598 ON hercules-02-55 CANCELLED AT 2025-12-16T10:11:13 ***
The pgmout file (OUTPUT.909007) was never written, confirming the job did not complete normally.
2. Root Cause Analysis
Primary Cause: Missing COMROOT/date/t00z in CI test environment
The CI test harness (dev/ci/) is responsible for creating the RUNTESTS/COMROOT/ directory tree. The date/t00z file is expected at:
/work2/noaa/global/role-global/GFS_CI_CD/HERCULES/BUILDS/GITLAB/pr_cases_4327_a05fd274_6388/RUNTESTS/COMROOT/date/t00z
This file was not provisioned by the test setup stage.
Contributing Factor: Silent error suppression in jjob_header.sh
The || true pattern on lines 95–96 of jjob_header.sh is intentional — it exists because:
- In the CI/CD pipeline, PDY is pre-exported by the workflow driver (ecFlow/Rocoto), so setpdy.sh is redundant
- The COMDATEROOT path may not exist in development/test environments
- Failing here would break every CI job even when PDY is correctly set externally
However, the pattern masks a real diagnostic signal. When setpdy.sh fails AND PDY is set externally, the extended date variables (PDYm1, PDYp1, etc.) are still missing, which can cascade into downstream failures.
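The gap described here (PDY present, extended variables absent) can be made visible with a small check. This is a sketch only; list_missing_pdy_vars is a hypothetical helper, and the variable names are taken from the PDYm7…PDYp7 range mentioned in this analysis.

```shell
# list_missing_pdy_vars is a hypothetical helper, not part of global-workflow.
# Prints the name of every unset or empty variable among
# PDYm7..PDYm1, PDY, PDYp1..PDYp7, one per line.
list_missing_pdy_vars() {
  for _name in PDYm7 PDYm6 PDYm5 PDYm4 PDYm3 PDYm2 PDYm1 PDY \
               PDYp1 PDYp2 PDYp3 PDYp4 PDYp5 PDYp6 PDYp7; do
    # Indirect lookup: read the value of the variable whose name is in _name.
    eval "_value=\${${_name}:-}"
    if [ -z "$_value" ]; then
      echo "$_name"
    fi
  done
}
```

Run after the header has finished, an empty result would mean the full date range is defined; in the failure analyzed here it would list every name except PDY.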
Secondary Cause: global_cycle failure
The surface field update via global_cycle returned code 1. Based on the directory listing, the input fix files (grids, vegetation types, orography) were staged correctly. The most likely failure point is either:
- Missing or malformed surface increment files from the upstream JGDAS_ENKF_UPDATE step
- A date-dependent path resolution failure within global_cycle itself
3. Affected Components
| Component | Role | Status |
|---|---|---|
| ush/jjob_header.sh | Universal j-job header, runs setpdy.sh, sources configs | Error suppressed (lines 95-96) |
| ush/setpdy.sh (prod_util) | Creates PDY date file from $COMDATEROOT/date/$cycle | Failed: missing input file |
| dev/jobs/JGLOBAL_ENKF_SFC | J-job driver for ensemble surface analysis | Invoked correctly |
| scripts/exglobal_enkf_sfc.sh | Ex-script for enkf surface analysis | Ran but failed at surface update |
| ush/regrid_gsiSfcIncr_to_tile.sh | Regrids GSI surface increments to FV3 tiles | Executed MPMD, returned code 1 |
| config.esfc | Configuration: DOSFCANL_ENKF, IAU settings, GCYCLE | Sourced successfully |
| dev/ci/ test harness | CI test provisioning | Did not create COMROOT/date/ |
4. Environment Snapshot
| Variable | Value |
|---|---|
| PDY | 20211221 (pre-set by CI driver) |
| cyc | 00 |
| cycle | t00z |
| RUN | enkfgdas |
| CASE_ENS | C48 |
| COMDATEROOT | .../RUNTESTS/COMROOT |
| DOSFCANL_ENKF | NO (when DOIAU_ENKF=YES) |
| NMEM_ENS | 2 (mem001, mem002) |
| machine | HERCULES |
| ACCOUNT | fv3-cpu |
| job | esfc |
| jobid | esfc.908890 |
| Module stack | 78 modules including gw_run.hercules, prod_util/2.1.1 |
5. Recommendations
Immediate (for PR #4327)
- Verify upstream test provisioning: Check that the CI setup_tests stage creates RUNTESTS/COMROOT/date/t{HH}z for all cycles
- Rerun the failed test: If this is a transient provisioning issue (e.g., race condition in parallel test setup), a rerun may succeed
- Check upstream JGDAS_ENKF_UPDATE: Verify that the surface increment files (enkfgdas.t00z.increment.sfc.i00{3,6,9}.nc) exist in COMROOT
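The increment-file check can be scripted. The sketch below is illustrative: check_sfc_increments is a hypothetical helper, and the directory passed to it is an assumption, since this analysis does not give the exact location of the increments inside COMROOT.

```shell
# check_sfc_increments is a hypothetical helper (an assumption for illustration).
#   $1 = directory expected to hold the surface increments (assumed, not confirmed)
#   $2 = file prefix, e.g. enkfgdas.t00z
# Returns 0 only if the i003, i006 and i009 increment files all exist and are non-empty.
check_sfc_increments() {
  _missing=0
  for _hr in 003 006 009; do
    _file="$1/$2.increment.sfc.i${_hr}.nc"
    if [ ! -s "$_file" ]; then
      echo "[WARN] missing or empty increment: $_file" >&2
      _missing=$((_missing + 1))
    fi
  done
  [ "$_missing" -eq 0 ]
}
```

A non-zero return would point toward the upstream JGDAS_ENKF_UPDATE step rather than the date environment.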
Systemic (long-term)
- Add diagnostic logging to jjob_header.sh: Before the || true guards, log when setpdy.sh fails:
    setpdy.sh || echo "[WARN] setpdy.sh failed; relying on pre-exported PDY=${PDY:-UNSET}" >&2
    source ./PDY || echo "[WARN] PDY file not found; extended date vars unavailable" >&2
- Add COMROOT/date/ creation to CI provisioning: In dev/ci/scripts/utils/ci_utils.sh (confirmed via find_env_dependencies as a COMROOT-dependent script), add explicit date file creation during test setup
- EE2 compliance note: Per EE2 standards, setpdy.sh requires $COMDATEROOT/date/$cycle. CI environments should honor this contract even when PDY is pre-set, to ensure PDYm*/PDYp* variables are available
6. MCP-RAG Tools Used for This Analysis
| Tool | Parameters | Contribution |
|---|---|---|
| get_job_details | job_name: "JGLOBAL_ENKF_SFC" | Retrieved job structure: 71-line J-job, sources jjob_header.sh -e esfc -c "base esfc", identified env vars (GDUMP, CASE, USE_CFP), retrieved config.esfc content showing IAU/DOSFCANL_ENKF logic |
| explain_workflow_component | component: "jjob_header.sh" | Confirmed role as universal j-job header |
| find_env_dependencies | variable_name: "PDY" | Found 50 j-jobs depend on PDY; classified as EE2 Standard; GGSR analysis identified regrid_gsiSfcIncr_to_tile.sh as a PDY-dependent script in the blast radius |
| find_env_dependencies | variable_name: "COMROOT" | Found 2 scripts depend on COMROOT: ci_utils.sh (CI provisioning) and bash_utils.sh; confirmed LOW impact but critical for CI setup |
| search_documentation | query: "setpdy.sh date initialization PDY COMROOT" | Retrieved EE2 docs (ee-docs 11.0.0) explaining setpdy.sh contract: requires $COMDATEROOT/date/$cycle, creates PDY file with PDYm7…PDYp7 range, date files set at 11:30/23:30 UTC |
| describe_component | component: "scripts/exglobal_enkf_sfc.sh" | Retrieved full 280-line script: confirmed DONST, DOSFCANL_ENKF, CASE, ntiles variables; showed NMEM_ENS loop over members with global_cycle invocation |
| describe_component | component: "ush/jjob_header.sh" | Retrieved 120-line source: confirmed setpdy.sh \|\| true / source ./PDY \|\| true pattern at lines 95-96; documented required env vars ($HOMEgfs, $DATAROOT, $jobid, $PDY, $cyc, $machine) |
| get_code_context | symbol: "regrid_gsiSfcIncr_to_tile.sh" | GGSR analysis: hop-1 exports in_dir, pgm; hop-2 connects to wave scripts via shared env vars; confirmed shell community 2704 membership |
| get_change_impact | symbol: "exglobal_enkf_sfc.sh" | LOW risk (0.10), 0 direct dependents (leaf node); confirmed community 2704 (32-node shell cluster with enkf scripts) |
| search_architecture | query: "enkf surface analysis global_cycle setpdy jjob_header error handling" | Identified community 2704 as the enkf shell community; found related Fortran communities (BUMP, GSI) |
| get_operational_guidance | operation: "CI/CD test case setup COMROOT initialization", platform: "hercules" | Retrieved CTest framework docs: cmake config with STAGED_CTESTS, GitLab CI pipeline stages (build → setup_tests → run_tests), Hercules-specific notes (Tier 1, no TC Tracker) |
| search_issues | query: "setpdy PDY COMROOT date surface enkf" | No matching issues found; confirms this is likely a new/unreported issue pattern |
| find_callers_callees | function_name: "global_cycle" | No graph edges found; global_cycle is a compiled Fortran executable, not a shell function, so this is expected |
| find_dependencies | target: "scripts/exglobal_enkf_sfc.sh" | No import graph edges; confirms ex-scripts are invoked by j-jobs, not imported; runtime dependency via source |
7. GraphRAG Analysis Summary
The GGSR (Graph-Guided Semantic Retrieval) analysis provided the following structural insights:
- PDY impact radius: 50 j-jobs directly depend on PDY, making it one of the highest-connectivity environment variables in the workflow. A setpdy.sh failure affects multiple downstream tasks beyond JGLOBAL_ENKF_SFC.
- Community 2704: The enkf scripts (exglobal_enkf_sfc.sh, exgdas_enkf_ecen.sh, exgdas_enkf_post.sh, exgdas_enkf_update.sh) form a tightly-coupled 32-node shell community with 50 internal SOURCES relationships. Changes to shared infrastructure like jjob_header.sh have cross-community impact.
- COMROOT has low direct connectivity (2 scripts) but is a critical CI/CD provisioning variable. The ci_utils.sh dependency confirms it is part of the test harness.
- No prior issues found: The search_issues tool returned no results for this error pattern, suggesting either (a) the COMROOT/date provisioning was recently changed, or (b) this failure mode is specific to the C96C48_hybatmDA test case configuration.
Analysis generated using EIB MCP-RAG Server v3.6.2 with 14 tool invocations across 5 tool modules (Workflow Info, Code Analysis, Semantic Search, GraphRAG, Operational). Total analysis time: ~45 seconds.