C96C48_hybatmDA JGLOBAL_ENKF_SFC Error Analysis PR4327 - TerrenceMcGuinness-NOAA/global-workflow GitHub Wiki

Error Analysis: C96C48_hybatmDA — JGLOBAL_ENKF_SFC Failure (PR #4327)

Date: 2025-12-16
Platform: Hercules (MSU)
CI Pipeline: #6388
PR: #4327
Commit: a05fd274
Gist: Error Log
Slurm Job: 7397598 on hercules-02-55
Analysis Method: EIB MCP-RAG GraphRAG Toolset (v3.6.2, 42 tools)


Executive Summary

The JGLOBAL_ENKF_SFC job (ensemble surface analysis) failed with "Failed to update surface fields! RETURN CODE 1" during the C96C48_hybatmDA CI test case on Hercules. Root cause: the RUNTESTS/COMROOT/date/t00z file was missing, so setpdy.sh failed. The error-suppression pattern in jjob_header.sh (|| true) swallowed this failure, allowing the job to proceed with only $PDY set and the extended date variables (PDYm7 through PDYp7) undefined.


1. Error Chain (Chronological)

Step 1 — setpdy.sh fails (line 57)

setpdy.sh[57] sed 's/[0-9]\{8\}/20211221/' .../RUNTESTS/COMROOT/date/t00z
sed: can't read .../RUNTESTS/COMROOT/date/t00z: No such file or directory

setpdy.sh uses $COMDATEROOT/date/$cycle to create the PDY file. In production, this file is maintained by the NCO date-setting cron (updated at 11:30 and 23:30 UTC). In CI/CD, the test harness is responsible for provisioning RUNTESTS/COMROOT/date/t00z. This file was not created for the C96C48_hybatmDA test case.
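
The failing step can be reproduced in isolation. The following is a minimal sketch, assuming the date file holds a single line that contains an 8-digit date. The file layout used here and the helper name substitute_pdy are illustrative assumptions; only the sed expression is taken from the trace.

```shell
#!/usr/bin/env bash
# Sketch: run the sed substitution from the setpdy.sh trace against a present
# and a missing cycle file. substitute_pdy is a hypothetical helper, not part
# of prod_util.

substitute_pdy() {
  local date_file="$1" pdy="$2"
  # Same expression as in the trace: replace the first run of 8 digits.
  sed "s/[0-9]\{8\}/${pdy}/" "${date_file}"
}

workdir="$(mktemp -d)"
mkdir -p "${workdir}/date"

# Assumed file content; only the 8-digit date matters to the sed expression.
echo "DATE  20200101000000WASHINGTON" > "${workdir}/date/t00z"

present_rc=0
present_out="$(substitute_pdy "${workdir}/date/t00z" 20211221)" || present_rc=$?

# A cycle file that was never provisioned, as in the failed CI run.
missing_rc=0
missing_out="$(substitute_pdy "${workdir}/date/t06z" 20211221 2>/dev/null)" || missing_rc=$?

echo "present: rc=${present_rc} out=${present_out}"
echo "missing: rc=${missing_rc}"
rm -rf "${workdir}"
```

With the file present, sed exits with status 0 and rewrites the date. With the file absent, sed exits non-zero, which is the status that setpdy.sh passed back to jjob_header.sh.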

Step 2 — jjob_header.sh swallows the error (lines 95–96)

setpdy.sh || true      # line 95: setpdy.sh fails, but || true suppresses
source ./PDY || true    # line 96: PDY file was never created, but || true suppresses

This is the critical design decision. The || true pattern exists because jjob_header.sh expects PDY to already be exported by the parent environment in CI contexts (the CI framework pre-sets $PDY via ecFlow/Rocoto). However, when setpdy.sh also fails to create the ./PDY file, the date variables PDYm7 through PDYp7 are never defined.

Despite the suppressed failure, $PDY was already set (20211221) by the CI driver before jjob_header.sh was called. So the primary PDY variable was available, but the extended date range variables were not.
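
The masking effect can be shown with a small stand-alone script. This is a sketch only: fake_setpdy and run_header_like are hypothetical stand-ins for setpdy.sh and the relevant part of jjob_header.sh, not the real code.

```shell
#!/usr/bin/env bash
# Sketch: "|| true" hides a failed date setup while a pre-exported PDY survives.
# fake_setpdy is a hypothetical stand-in for setpdy.sh; it fails without
# writing ./PDY, as in the failed CI run.

fake_setpdy() {
  echo "sed: can't read date/t00z: No such file or directory" >&2
  return 1
}

run_header_like() (
  set -e                       # even with errexit, "|| true" keeps the shell alive
  scratch="$(mktemp -d)"
  cd "${scratch}"              # empty scratch directory, so ./PDY does not exist
  export PDY=20211221          # pre-set by the CI driver, as in the failed run
  unset PDYm1 PDYp1

  fake_setpdy 2>/dev/null || true
  source ./PDY 2>/dev/null || true

  echo "PDY=${PDY} PDYm1=${PDYm1:-UNSET} PDYp1=${PDYp1:-UNSET}"
  cd / && rm -rf "${scratch}"
)

run_header_like
```

Both guarded commands fail, yet the function reaches its final echo and reports PDY as set with PDYm1 and PDYp1 unset, which matches the state described above.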

Step 3 — Job proceeds into exglobal_enkf_sfc.sh

The script sources config.base and config.esfc, sets up ensemble member directories, and enters the regrid_gsiSfcIncr_to_tile.sh processing loop for mem001 and mem002.

Step 4 — Surface field update fails

Failed to update surface fields! RETURN CODE 1

The global_cycle executable (called via MPMD through regrid_gsiSfcIncr_to_tile.sh) returned a non-zero exit code. The exact cause within global_cycle is not visible in the trace, but the missing date environment may have caused incorrect file paths for first-guess or analysis increment files.

Step 5 — Job cancelled

cat: OUTPUT.909007: No such file or directory
slurmstepd: error: *** JOB 7397598 ON hercules-02-55 CANCELLED AT 2025-12-16T10:11:13 ***

The pgmout file (OUTPUT.909007) was never written, confirming the job did not complete normally.


2. Root Cause Analysis

Primary Cause: Missing COMROOT/date/t00z in CI test environment

The CI test harness (dev/ci/) is responsible for creating the RUNTESTS/COMROOT/ directory tree. The date/t00z file is expected at:

/work2/noaa/global/role-global/GFS_CI_CD/HERCULES/BUILDS/GITLAB/
  pr_cases_4327_a05fd274_6388/RUNTESTS/COMROOT/date/t00z

This file was not provisioned by the test setup stage.

Contributing Factor: Silent error suppression in jjob_header.sh

The || true pattern on lines 95–96 of jjob_header.sh is intentional — it exists because:

  1. In the CI/CD pipeline, PDY is pre-exported by the workflow driver (ecFlow/Rocoto), so setpdy.sh is redundant
  2. The COMDATEROOT path may not exist in development/test environments
  3. Failing here would break every CI job even when PDY is correctly set externally

However, the pattern masks a real diagnostic signal. When setpdy.sh fails AND PDY is set externally, the extended date variables (PDYm1, PDYp1, etc.) are still missing, which can cascade into downstream failures.
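
One way to surface this condition is to check the full date range after the header has run. The sketch below assumes the PDYm7 through PDYp7 range described in the EE2 documentation cited later in this report; missing_date_vars is a hypothetical helper, not an existing global-workflow function.

```shell
#!/usr/bin/env bash
# Sketch: list which setpdy.sh-style date variables are absent from the
# environment. missing_date_vars is a hypothetical helper.

missing_date_vars() {
  local name i
  local missing=()
  local names=(PDY)
  for i in 7 6 5 4 3 2 1; do names+=("PDYm${i}"); done
  for i in 1 2 3 4 5 6 7; do names+=("PDYp${i}"); done
  for name in "${names[@]}"; do
    # ${!name:-} is bash indirect expansion: the value of the variable named by $name.
    [[ -n "${!name:-}" ]] || missing+=("${name}")
  done
  echo "${missing[*]}"
  # Succeed only when nothing is missing.
  (( ${#missing[@]} == 0 ))
}

# Example: the state seen in the failed run (PDY set, extended range unset).
export PDY=20211221
unset PDYm{1..7} PDYp{1..7}
if ! missing="$(missing_date_vars)"; then
  echo "[WARN] date variables not set: ${missing}" >&2
fi
```

The function only reports; whether a missing variable should be fatal is a policy choice left to the caller.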

Secondary Cause: global_cycle failure

The surface field update via global_cycle returned code 1. Based on the directory listing, the input fix files (grids, vegetation types, orography) were staged correctly. The most likely failure point is either:

  • Missing or malformed surface increment files from the upstream JGDAS_ENKF_UPDATE step
  • A date-dependent path resolution failure within global_cycle itself

3. Affected Components

| Component | Role | Status |
|---|---|---|
| ush/jjob_header.sh | Universal j-job header, runs setpdy.sh, sources configs | Error suppressed (lines 95-96) |
| ush/setpdy.sh (prod_util) | Creates PDY date file from $COMDATEROOT/date/$cycle | Failed: missing input file |
| dev/jobs/JGLOBAL_ENKF_SFC | J-job driver for ensemble surface analysis | Invoked correctly |
| scripts/exglobal_enkf_sfc.sh | Ex-script for enkf surface analysis | Ran but failed at surface update |
| ush/regrid_gsiSfcIncr_to_tile.sh | Regrids GSI surface increments to FV3 tiles | Executed MPMD, returned code 1 |
| config.esfc | Configuration: DOSFCANL_ENKF, IAU settings, GCYCLE | Sourced successfully |
| dev/ci/ test harness | CI test provisioning | Did not create COMROOT/date/ |

4. Environment Snapshot

| Variable | Value |
|---|---|
| PDY | 20211221 (pre-set by CI driver) |
| cyc | 00 |
| cycle | t00z |
| RUN | enkfgdas |
| CASE_ENS | C48 |
| COMDATEROOT | .../RUNTESTS/COMROOT |
| DOSFCANL_ENKF | NO (when DOIAU_ENKF=YES) |
| NMEM_ENS | 2 (mem001, mem002) |
| machine | HERCULES |
| ACCOUNT | fv3-cpu |
| job | esfc |
| jobid | esfc.908890 |
| Module stack | 78 modules including gw_run.hercules, prod_util/2.1.1 |

5. Recommendations

Immediate (for PR #4327)

  1. Verify upstream test provisioning: Check that the CI setup_tests stage creates RUNTESTS/COMROOT/date/t{HH}z for all cycles
  2. Rerun the failed test: If this is a transient provisioning issue (e.g., race condition in parallel test setup), a rerun may succeed
  3. Check upstream JGDAS_ENKF_UPDATE: Verify that the surface increment files (enkfgdas.t00z.increment.sfc.i00{3,6,9}.nc) exist in COMROOT
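
The file checks in items 1 and 3 can be scripted before a rerun. This is a minimal sketch: report_missing_files is a hypothetical helper, and the directory that holds the increment files is not stated in this analysis, so INCR_DIR is a placeholder to be replaced with the real location.

```shell
#!/usr/bin/env bash
# Sketch: report which expected inputs are absent or empty before a rerun.
# report_missing_files is a hypothetical helper; INCR_DIR is a placeholder.

report_missing_files() {
  local path rc=0
  for path in "$@"; do
    if [[ ! -s "${path}" ]]; then
      echo "MISSING: ${path}"
      rc=1
    fi
  done
  return "${rc}"
}

# Build the candidate list from file names given in this analysis.
comroot="${COMDATEROOT:-./RUNTESTS/COMROOT}"
incr_dir="${INCR_DIR:-${comroot}}"   # assumption: set INCR_DIR to the real directory

expected=("${comroot}/date/t00z")
for fhr in 003 006 009; do
  expected+=("${incr_dir}/enkfgdas.t00z.increment.sfc.i${fhr}.nc")
done

report_missing_files "${expected[@]}" || echo "one or more inputs are missing"
```

The -s test treats an empty file the same as a missing one, since a zero-length increment or date file would be no more usable than an absent one.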

Systemic (long-term)

  1. Add diagnostic logging to jjob_header.sh: Replace the bare || true guards with a warning, so the job still continues but the failure is recorded:
    setpdy.sh || echo "[WARN] setpdy.sh failed; relying on pre-exported PDY=${PDY:-UNSET}" >&2
    source ./PDY || echo "[WARN] PDY file not found; extended date vars unavailable" >&2
    
  2. Add COMROOT/date/ creation to CI provisioning: In dev/ci/scripts/utils/ci_utils.sh (confirmed via find_env_dependencies as a COMROOT-dependent script), add explicit date file creation during test setup
  3. EE2 compliance note: Per EE2 standards, setpdy.sh requires $COMDATEROOT/date/$cycle — CI environments should honor this contract even when PDY is pre-set, to ensure PDYm*/PDYp* variables are available
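
Item 2 could take roughly the following shape. This is a sketch under stated assumptions: provision_date_files is a hypothetical helper rather than an existing ci_utils.sh function, and the line written to each file is an assumed layout. The trace only shows that setpdy.sh runs sed to replace an 8-digit date in the file; any other fields setpdy.sh may read are not visible in the log.

```shell
#!/usr/bin/env bash
# Sketch: create COMROOT/date/t{HH}z files during CI test setup.
# provision_date_files is a hypothetical helper, not an existing ci_utils.sh
# function. The "DATE  <YYYYMMDD><HH>0000WASHINGTON" line is an assumed layout.

provision_date_files() {
  local comroot="$1" pdy="$2"
  shift 2
  local cyc
  mkdir -p "${comroot}/date"
  for cyc in "$@"; do
    printf 'DATE  %s%s0000WASHINGTON\n' "${pdy}" "${cyc}" > "${comroot}/date/t${cyc}z"
  done
}

# Example: provision four cycles for the test date used in this report,
# under a scratch directory instead of the real RUNTESTS tree.
comroot="$(mktemp -d)/RUNTESTS/COMROOT"
provision_date_files "${comroot}" 20211221 00 06 12 18
ls "${comroot}/date"
```

Passing the cycle list as arguments keeps the helper independent of how a given test case defines its cycles.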

6. MCP-RAG Tools Used for This Analysis

| Tool | Parameters | Contribution |
|---|---|---|
| get_job_details | job_name: "JGLOBAL_ENKF_SFC" | Retrieved job structure: 71-line J-job, sources jjob_header.sh -e esfc -c "base esfc", identified env vars (GDUMP, CASE, USE_CFP), retrieved config.esfc content showing IAU/DOSFCANL_ENKF logic |
| explain_workflow_component | component: "jjob_header.sh" | Confirmed role as universal j-job header |
| find_env_dependencies | variable_name: "PDY" | Found 50 j-jobs depend on PDY; classified as EE2 Standard; GGSR analysis identified regrid_gsiSfcIncr_to_tile.sh as a PDY-dependent script in the blast radius |
| find_env_dependencies | variable_name: "COMROOT" | Found 2 scripts depend on COMROOT: ci_utils.sh (CI provisioning) and bash_utils.sh; confirmed LOW impact but critical for CI setup |
| search_documentation | query: "setpdy.sh date initialization PDY COMROOT" | Retrieved EE2 docs (ee-docs 11.0.0) explaining setpdy.sh contract: requires $COMDATEROOT/date/$cycle, creates PDY file with PDYm7 through PDYp7 range, date files set at 11:30/23:30 UTC |
| describe_component | component: "scripts/exglobal_enkf_sfc.sh" | Retrieved full 280-line script: confirmed DONST, DOSFCANL_ENKF, CASE, ntiles variables; showed NMEM_ENS loop over members with global_cycle invocation |
| describe_component | component: "ush/jjob_header.sh" | Retrieved 120-line source: confirmed setpdy.sh \|\| true / source ./PDY \|\| true pattern at lines 95-96; documented required env vars ($HOMEgfs, $DATAROOT, $jobid, $PDY, $cyc, $machine) |
| get_code_context | symbol: "regrid_gsiSfcIncr_to_tile.sh" | GGSR analysis: hop-1 exports in_dir, pgm; hop-2 connects to wave scripts via shared env vars; confirmed shell community 2704 membership |
| get_change_impact | symbol: "exglobal_enkf_sfc.sh" | LOW risk (0.10), 0 direct dependents (leaf node); confirmed community 2704 (32-node shell cluster with enkf scripts) |
| search_architecture | query: "enkf surface analysis global_cycle setpdy jjob_header error handling" | Identified community 2704 as the enkf shell community; found related Fortran communities (BUMP, GSI) |
| get_operational_guidance | operation: "CI/CD test case setup COMROOT initialization", platform: "hercules" | Retrieved CTest framework docs: cmake config with STAGED_CTESTS, GitLab CI pipeline stages (build → setup_tests → run_tests), Hercules-specific notes (Tier 1, no TC Tracker) |
| search_issues | query: "setpdy PDY COMROOT date surface enkf" | No matching issues found; confirms this is likely a new/unreported issue pattern |
| find_callers_callees | function_name: "global_cycle" | No graph edges found; global_cycle is a compiled Fortran executable, not a shell function, so this is expected |
| find_dependencies | target: "scripts/exglobal_enkf_sfc.sh" | No import graph edges; confirms ex-scripts are invoked by j-jobs, not imported (runtime dependency via source) |

7. GraphRAG Analysis Summary

The GGSR (Graph-Guided Semantic Retrieval) analysis provided the following structural insights:

  • PDY impact radius: 50 j-jobs directly depend on PDY — making it one of the highest-connectivity environment variables in the workflow. A setpdy.sh failure affects multiple downstream tasks beyond JGLOBAL_ENKF_SFC.
  • Community 2704: The enkf scripts (exglobal_enkf_sfc.sh, exgdas_enkf_ecen.sh, exgdas_enkf_post.sh, exgdas_enkf_update.sh) form a tightly-coupled 32-node shell community with 50 internal SOURCES relationships. Changes to shared infrastructure like jjob_header.sh have cross-community impact.
  • COMROOT has low direct connectivity (2 scripts) but is a critical CI/CD provisioning variable. The ci_utils.sh dependency confirms it is part of the test harness.
  • No prior issues found: The search_issues tool returned no results for this error pattern, suggesting either (a) the COMROOT/date provisioning was recently changed, or (b) this failure mode is specific to the C96C48_hybatmDA test case configuration.

Analysis generated using EIB MCP-RAG Server v3.6.2 with 14 tool invocations across 5 tool modules (Workflow Info, Code Analysis, Semantic Search, GraphRAG, Operational). Total analysis time: ~45 seconds.