PR4327_C96C48_hybatmDA enkfgdas_esfc_GCYCLE_DATE_UNBOUND - TerrenceMcGuinness-NOAA/global-workflow GitHub Wiki

PR #4327 C96C48_hybatmDA — enkfgdas_esfc gcycle_date: unbound variable Failure

Test Case: C96C48_hybatmDA (Hybrid Atmospheric Data Assimilation)
Job: JGLOBAL_ENKF_SFC → exglobal_enkf_sfc.sh → global_cycle.sh
Platform: Hercules (node hercules-02-55)
PR: #4327 — "Fix two bugfixes to global_cycle.sh"
Author: ClaraDraper-NOAA
Commit: a05fd274
Date: December 16, 2025 16:10–16:11 CST
Log Source: Gist — enkfgdas_esfc.log
Status: RESOLVED (subsequent PR commits fixed the issue; CI-Hercules-Passed after fix)
Analysis Date: February 10, 2026


Executive Summary

The enkfgdas_esfc (EnKF surface update) job failed during CI testing of PR #4327 on Hercules with:

global_cycle.sh: line 281: gcycle_date: unbound variable
FATAL ERROR: Failed to update surface fields! RETURN CODE 1

Root Cause: PR #4327 modified global_cycle.sh to read a new variable, gcycle_date, when extracting the time for the NAMCYC namelist, so that the namelist time matches the restart time rather than the analysis time. The initial commit (a05fd274) did not include the corresponding export gcycle_date= in the calling script exglobal_enkf_sfc.sh. Under bash set -u (nounset), the unset variable triggered an immediate abort.

Classification: Code Bug — Incomplete Cross-Script Variable Interface Update


Failure Chain

JGLOBAL_ENKF_SFC (J-Job)
  │
  ├── jjob_header.sh → Sources config.base, config.esfc, HERCULES.env
  │     ├── setpdy.sh → WARNING: COMROOT/date/t00z missing (non-fatal)
  │     └── ./PDY → WARNING: No such file (non-fatal)
  │
  └── exglobal_enkf_sfc.sh
        │
        ├── regrid_gsiSfcIncr_to_tile.sh  ✅ exit 0 (3 seconds)
        │     ├── run_mpmd.sh (cmdfile_in)  ✅ exit 0 (cpreq orog/grid files)
        │     ├── regridStates.x (srun -n 12)  ✅ exit 0 (soil/snow increments)
        │     └── run_mpmd.sh (cmdfile_out)  ✅ exit 0 (cpfs output tiles)
        │
        ├── [MISSING: export gcycle_date=...]  ⚠️  NOT EXECUTED
        │
        └── global_cycle.sh  ❌ exit 1
              ├── line 281: ${gcycle_date:0:4}  →  "unbound variable"
              └── err_exit "Failed to update surface fields!" RETURN CODE 1

Detailed Analysis

1. Primary Error: gcycle_date: unbound variable

Location: global_cycle.sh line 281
Error Type: bash set -u (nounset) violation

The failing code in global_cycle.sh extracts date components from gcycle_date:

# global_cycle.sh lines 281-284
iy=${gcycle_date:0:4}    # year
im=${gcycle_date:4:2}    # month
id=${gcycle_date:6:2}    # day
ih=${gcycle_date:8:2}    # hour

These are used in the NAMCYC namelist (fort.36) passed to the global_cycle Fortran executable:

&NAMCYC
  idim=48, jdim=48, lsoil=4,
  iy=${iy}, im=${im}, id=${id}, ih=${ih}, fh=0,
  ...
/
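
As a minimal sketch of how the substring expansions above map a YYYYMMDDHH string onto the namelist date fields: the helper name split_gcycle_date and the example value 2025121600 are assumptions for illustration and are not part of global_cycle.sh.

```shell
#!/usr/bin/env bash
# Hypothetical helper: split a YYYYMMDDHH string into the NAMCYC date fields.
split_gcycle_date() {
  local date_in=$1
  # Bash substring expansion: ${var:offset:length}
  local iy=${date_in:0:4}   # year
  local im=${date_in:4:2}   # month
  local id=${date_in:6:2}   # day
  local ih=${date_in:8:2}   # hour
  printf 'iy=%s, im=%s, id=%s, ih=%s, fh=0,\n' "${iy}" "${im}" "${id}" "${ih}"
}

# Example value only; in the workflow the value is exported by the calling script.
split_gcycle_date 2025121600
```

With the example value, the function prints the date line of the namelist as iy=2025, im=12, id=16, ih=00, fh=0,.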

Why PR #4327 introduced gcycle_date: The PR fixes issue #4326 — the surface analysis namelist was using the analysis time (PDY/cyc) when it should use the restart time. The variable gcycle_date was introduced as a caller-provided override to pass the correct time.

2. Missing Variable Export in Calling Script

MCP Tool find_env_dependencies("gcycle_date") revealed:

| Script | Role | Line | Value |
| --- | --- | --- | --- |
| exglobal_enkf_sfc.sh | Exporter | 213 | ${bPDY}${bcyc} (IAU beginning-of-window time) |
| exglobal_enkf_sfc.sh | Exporter | 291 | ${PDY}${cyc} (analysis center time) |
| exglobal_atmos_sfcanl.sh | Exporter | 158 | ${gcycle_dates[hr]} (per-hour in loop) |
| global_cycle.sh | Consumer | 281 | ${gcycle_date:0:4} (substring extraction) |

In the initial commit a05fd274, the export gcycle_date= statements at lines 213 and 291 of exglobal_enkf_sfc.sh were not yet present. That commit modified only global_cycle.sh to consume the variable, without updating the caller to provide it.

The variable was later added in a follow-up commit, after which CI-Hercules passed.
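
The caller-side half of the contract can be sketched as follows. This is an illustration under stated assumptions, not the code from exglobal_enkf_sfc.sh: the date values are examples, and run_cycle_stub is a hypothetical stand-in for global_cycle.sh that runs a child bash process with nounset enabled.

```shell
#!/usr/bin/env bash
# Example values only (assumed for illustration).
PDY=20251216; cyc=00       # analysis center time
bPDY=20251215; bcyc=21     # IAU beginning-of-window time

# Hypothetical stand-in for global_cycle.sh: a child process under set -u
# that reads gcycle_date from its inherited environment.
run_cycle_stub() {
  bash -c 'set -u; echo "iy=${gcycle_date:0:4} ih=${gcycle_date:8:2}"'
}

# Export before each invocation so the child process inherits the value.
export gcycle_date="${bPDY}${bcyc}"
run_cycle_stub    # child sees 2025121521

export gcycle_date="${PDY}${cyc}"
run_cycle_stub    # child sees 2025121600
```

Because the consumer runs as a separate process, a plain assignment without export would not be visible to it; only exported variables cross the process boundary.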

3. Execution Context

Config loaded from log:

| Parameter | Value | Source |
| --- | --- | --- |
| CASE | C48 (ensemble resolution) | config.base line 207 |
| CASE_ENS | C48 | config.base line 207 |
| NMEM_ENS | 2 (CI runs with 2 members) | config |
| DOIAU_ENKF | YES | config.base line 356 |
| IAUFHRS_ENKF | 3,6,9 | config.base line 357 |
| GCYCLE_DO_SOILINCR | .false. | config.esfc line 37 |
| DO_GSISOILDA | NO | default |
| DONST | YES | config |
| APP | ATM | config.base line 156 |
| spack-stack | 1.9.2 (ue-oneapi-2024.1.0) | modules |
| APRUN_CYCLE | srun -l --export=ALL --hint=nomultithread -n 12 --cpus-per-task=1 | HERCULES.env |

Job timing:

  • Start: 16:10:57 CST
  • MPMD cmdfile_in: 16:11:05–16:11:08 (3s)
  • regridStates.x: completed successfully
  • MPMD cmdfile_out: 16:11:10–16:11:11 (1s)
  • global_cycle.sh: 16:11:13 — FATAL ERROR
  • Total wall time: ~16 seconds before failure
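
The wall-time figure can be checked from the two log timestamps. The sketch below uses a hypothetical helper, to_seconds, and assumes both timestamps fall on the same day.

```shell
#!/usr/bin/env bash
# Hypothetical helper: convert HH:MM:SS to seconds since midnight.
to_seconds() {
  local h m s
  IFS=: read -r h m s <<< "$1"
  # 10# forces base 10 so fields such as "08" are not parsed as octal.
  echo $(( 10#${h} * 3600 + 10#${m} * 60 + 10#${s} ))
}

start=$(to_seconds 16:10:57)    # job start
failed=$(to_seconds 16:11:13)   # global_cycle.sh fatal error
echo "elapsed: $(( failed - start )) s"
```

For the timestamps above this prints elapsed: 16 s.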

4. Pre-Existing Warnings (Non-Fatal)

| Log Line | Warning | Impact |
| --- | --- | --- |
| 104 | sed: can't read .../COMROOT/date/t00z: No such file or directory | Non-fatal; setpdy.sh, handled by \|\| true |
| 107 | ./PDY: No such file or directory | Non-fatal; jjob_header.sh, handled by \|\| true |
| 929 | Previous cycle snow file gdas.t18z.snogrb_t1534.3072.1536 missing | Non-fatal; snow update disabled (FSNOL=99999.,FSNOS=99999.) |
| 1336 | I_MPI_EXTRA_FILESYSTEM_LIST environment variable is not supported | Non-fatal; Intel MPI informational |

5. MCP Tool Analysis Summary

| MCP Tool | Key Finding |
| --- | --- |
| get_job_details("JGLOBAL_ENKF_SFC") | 71-line J-Job, configs: base + esfc, calls exglobal_enkf_sfc.sh via ENKFRESFCSH |
| describe_component("exglobal_enkf_sfc.sh") | 11,432 bytes, ensemble surface analysis on tiles, calls regrid + global_cycle |
| describe_component("global_cycle.sh") | 14,729 bytes, "pull script into global-workflow" refactor 2025-07-08 by Friedman |
| get_code_context("global_cycle") | Depends on: ftst_land_increments, ftst_read_increments, global_cycle_lib |
| find_env_dependencies("gcycle_date") | 2 exporters, 2 consumers: exglobal_enkf_sfc.sh (lines 213, 291), exglobal_atmos_sfcanl.sh (line 158), global_cycle.sh (line 281) |
| find_callers_callees("global_cycle") | Entry point (0 callers in graph), leaf function (0 callees) |
| trace_execution_path("JGLOBAL_ENKF_SFC") | 15 env dependencies: GDUMP_ENS, GDATE, ENKFRESFCSH, CASE_ENS, assim_freq, etc. |
| find_dependencies("exglobal_enkf_sfc.sh") | Exports: OMP_NUM_THREADS_CY, CYCLEXEC, FNACNA, CASE_IN, err → global_cycle.sh at hop 2 |
| search_pull_requests("4327") | Merged 2025-12-18, CI-Hercules-Passed (after fix), CI-Gaeac6-Failed (separate issue) |
| search_ee2_standards("unbound variable...") | EE2 requires descriptive error messages with "FATAL ERROR:" prefix; compliant |
| get_operational_guidance(...) | EE2 standards: recovery capability for jobs >15min, separate post-processing jobs |

Root Cause Assessment

Primary: Incomplete Cross-Script Variable Interface Change

PR #4327 introduced gcycle_date as a new caller-to-callee variable contract between:

  • Callers: exglobal_enkf_sfc.sh, exglobal_atmos_sfcanl.sh
  • Callee: global_cycle.sh

The initial commit (a05fd274) only updated the callee (global_cycle.sh) to consume gcycle_date but did not update all callers to export it. This is a classic interface contract violation — the producer side of the contract was not updated when the consumer side was changed.

Contributing: bash set -u Enforcement

The global-workflow uses set -u (nounset) via preamble.sh, which correctly catches unset variables as errors. This is the intended behavior: set -u prevented a silent failure in which iy, im, id, and ih would have been empty strings, producing a malformed NAMCYC namelist and potentially incorrect Fortran execution.
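
The difference can be reproduced in isolation. The sketch below runs the same expansion in two child shells, one without and one with nounset; the function names are hypothetical, and the child shells only mimic the failing expansion rather than running global_cycle.sh.

```shell
#!/usr/bin/env bash
# Child shells keep the nounset abort from terminating the current shell.
expand_without_nounset() {
  bash -c 'echo "iy=${gcycle_date:0:4}, im=${gcycle_date:4:2}"'
}

expand_with_nounset() {
  bash -c 'set -u; echo "iy=${gcycle_date:0:4}, im=${gcycle_date:4:2}"'
}

unset gcycle_date   # simulate the missing export

# Without set -u the expansion succeeds and yields empty fields.
expand_without_nounset

# With set -u the child shell aborts before echo runs.
expand_with_nounset || echo "child shell aborted with status $?"
```

The first call returns success with empty iy and im values, which is the silent failure mode described above; the second reports the unbound variable on stderr and returns a non-zero status.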

Why This Was NOT Caught Earlier

  • exglobal_atmos_sfcanl.sh (the non-EnKF caller) already had gcycle_date exports at the time of the PR, so the sfcanl path would not have failed
  • The EnKF path (exglobal_enkf_sfc.sh) is only exercised by the C96C48_hybatmDA CI test, not the simpler C48_ATM tests
  • The original PR commit only changed global_cycle.sh, creating an asymmetric update

Recommendations

Immediate (Applied — PR #4327 Fixed)

  1. Add export gcycle_date to exglobal_enkf_sfc.sh — Lines 213 and 291 now export the variable before calling ${CYCLESH}. ✅ Applied in subsequent commit.

Short-Term

  1. Add variable validation at entry to global_cycle.sh — Check required variables at script start:

    # At top of global_cycle.sh, after variable initializations
    : "${gcycle_date:?ERROR: gcycle_date must be set by caller}"
    

    This provides a clearer error message than the generic "unbound variable" from set -u.

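  2. Add shellcheck CI step for cross-script variable usage: shellcheck reports lowercase variables that are referenced but never assigned (SC2154), and source= directives extend that check across sourced files. Variables inherited through the environment of a called script, as gcycle_date is here, still need an explicit runtime check.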

  3. Document the variable contract — Add a comment block in global_cycle.sh listing required caller-provided variables:

    # Required from caller: gcycle_date (YYYYMMDDHH format)
    #   - exglobal_enkf_sfc.sh: exports as ${bPDY}${bcyc} or ${PDY}${cyc}
    #   - exglobal_atmos_sfcanl.sh: exports as ${gcycle_dates[hr]}
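
The entry check in item 1 could also verify the documented YYYYMMDDHH format. The following is a sketch only: check_gcycle_date is a hypothetical function, and the ten-digit pattern is an assumption drawn from the format named in the comment block above.

```shell
#!/usr/bin/env bash
# Hypothetical entry check: require gcycle_date and verify it looks like YYYYMMDDHH.
check_gcycle_date() {
  # ${gcycle_date:-} keeps the check itself safe under set -u.
  local value=${gcycle_date:-}
  local format_re='^[0-9]{10}$'   # assumed: exactly ten digits
  if [[ -z "${value}" ]]; then
    echo "FATAL ERROR: gcycle_date must be set by the caller" >&2
    return 1
  fi
  if [[ ! "${value}" =~ ${format_re} ]]; then
    echo "FATAL ERROR: gcycle_date='${value}' is not in YYYYMMDDHH format" >&2
    return 1
  fi
  return 0
}
```

The function returns a status instead of exiting, so the calling script decides how to abort, and the messages keep the "FATAL ERROR:" prefix noted in the EE2 finding.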
    

Long-Term

  1. Implement interface tests for shell script variable contracts — Create a test that verifies all callers of global_cycle.sh export all required variables before invocation.

  2. Consider passing gcycle_date as a function argument — Refactor global_cycle.sh to accept the date as a positional argument rather than relying on environment variable inheritance, reducing the risk of missing exports.

  3. Expand CI test matrix — Ensure that both IAU (DOIAU_ENKF=YES) and non-IAU paths are tested in at least one CI case, as the two paths have different gcycle_date values (bPDY/bcyc vs PDY/cyc).
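
One possible shape for the interface test in item 1 is a lexical scan of each caller. The sketch below is a heuristic under stated assumptions: check_exports_before_call is a hypothetical function, it only compares line order, and it does not follow control flow, so an export inside an untaken branch would still satisfy it.

```shell
#!/usr/bin/env bash
# check_exports_before_call FILE VAR CALL
# Succeeds when every non-comment line containing the literal string CALL
# appears after at least one "export VAR=" line in FILE.
check_exports_before_call() {
  local file=$1 var=$2 call=$3
  local export_re="^[[:space:]]*export[[:space:]]+${var}="
  local comment_re='^[[:space:]]*#'
  local line exported=0 lineno=0 status=0
  while IFS= read -r line || [[ -n "${line}" ]]; do
    lineno=$((lineno + 1))
    if [[ "${line}" =~ ${export_re} ]]; then
      exported=1
    elif [[ "${line}" == *"${call}"* && ! "${line}" =~ ${comment_re} ]]; then
      if (( exported == 0 )); then
        echo "${file}:${lineno}: ${call} used before export ${var}=" >&2
        status=1
      fi
    fi
  done < "${file}"
  return "${status}"
}

# Possible usage (path and call string are assumptions):
#   check_exports_before_call scripts/exglobal_enkf_sfc.sh gcycle_date '${CYCLESH}'
```

A check of this kind is cheap enough to run on every caller in CI, but it complements rather than replaces running the EnKF path itself.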


Affected Files

| File | Role | Lines / Details |
| --- | --- | --- |
| ush/global_cycle.sh | Consumer (changed in PR) | 281–284: gcycle_date substring extraction |
| scripts/exglobal_enkf_sfc.sh | Exporter (missing in initial commit) | 213: ${bPDY}${bcyc}, 291: ${PDY}${cyc} |
| scripts/exglobal_atmos_sfcanl.sh | Exporter (already present) | 158: ${gcycle_dates[hr]} |
| dev/jobs/JGLOBAL_ENKF_SFC | J-Job wrapper | Calls exglobal_enkf_sfc.sh |
| ush/regrid_gsiSfcIncr_to_tile.sh | Upstream step | Succeeded (exit 0) |
| config.esfc | Config | GCYCLE_DO_SOILINCR=.false. |

Related Issues

| Issue | Title | Relevance |
| --- | --- | --- |
| #4326 | global_cycle time mismatch | Direct: PR #4327 resolves this |
| #4325 | Soil moisture relaxation default | Direct: PR #4327 resolves this |
| #4327 | Fix two bugfixes to global_cycle.sh | This PR; merged 2025-12-18 |

Log Evidence Summary

| Log Line | Content | Significance |
| --- | --- | --- |
| 68 | Begin JGLOBAL_ENKF_SFC at 16:10:57 | Job start |
| 104 | sed: can't read .../COMROOT/date/t00z | setpdy.sh warning (non-fatal) |
| 107 | ./PDY: No such file or directory | jjob_header.sh warning (non-fatal) |
| 929 | WARNING: Previous cycle snow file ... is missing | Snow update disabled for this cycle |
| 1293 | End run_mpmd.sh at 16:11:08 with error code 0 | MPMD input staging succeeded |
| 1704 | End run_mpmd.sh at 16:11:11 with error code 0 | MPMD output copy succeeded |
| 1914 | global_cycle.sh: line 281: gcycle_date: unbound variable | PRIMARY FAILURE |
| 1918 | err_exit 'Failed to update surface fields!' | Error handler invoked |
| 1922 | FATAL ERROR: Failed to update surface fields! RETURN CODE 1 | EE2-compliant error message |
| 1923 | ABNORMAL EXIT at Tue Dec 16 10:11:13 CST 2025 on hercules-02-55 | Job abort |
| 1971 | Failed to update surface fields! RETURN CODE 1 | Final error propagation |
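
The failure lines in the table can be pulled from a job log with a single grep. This is a sketch: scan_failure_lines is a hypothetical helper, and the three patterns are simply the markers that appear in this log, not an exhaustive list of workflow error strings.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print numbered log lines that match the failure markers
# seen in this analysis.
scan_failure_lines() {
  grep -nE 'unbound variable|FATAL ERROR|ABNORMAL EXIT' "$1"
}

# Possible usage (file name taken from the Log Source field):
#   scan_failure_lines enkfgdas_esfc.log
```

grep -n prefixes each match with its line number, which gives the same log line references used in the table above.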

Generated by EIB MCP-RAG analysis using 15+ MCP tools against global-workflow knowledge base (60,404 docs, 484,901 relationships)