PR4451_C96C48mx500_S2SW_cyc_gfs gdas_anlstat_SLURM_CANCELLED - TerrenceMcGuinness-NOAA/global-workflow GitHub Wiki

PR #4451 C96C48mx500_S2SW_cyc_gfs: gdas_anlstat SLURM JOB CANCELLED Failure

Test Case: C96C48mx500_S2SW_cyc_gfs (S2SW Coupled Cycling with GFS)
Job: JGLOBAL_ANALYSIS_STATS → exglobal_analysis_stats.py
Platform: URSA (node u01c30)
PR: #4451, "Speed up tarring of the gsidiags"
Author: aerorahul
Commit: 53dc4019 (case ID 7396)
Date: January 20, 2026 18:57:11–18:57:24 UTC
Log Source: CI error log (local file)
Status: PR CLOSED (CI-Ursa-Failed, CI-Hercules-Failed)
Analysis Date: February 10, 2026
MCP Server: v3.6.2 (42 tools, Python GraphRAG validation)


Executive Summary

The gdas_anlstat (analysis statistics) job was externally cancelled by SLURM during CI testing of PR #4451 on URSA, 13 seconds after the job started and only 4 seconds into the Python ex-script. exglobal_analysis_stats.py had barely begun initialization, logging only BEGIN: AnalysisStats.__init__, before the SLURM scheduler terminated the entire job allocation:

slurmstepd: error: *** JOB 7652445 ON u01c30 CANCELLED AT 2026-01-20T18:57:24 ***

Root Cause: External SLURM job cancellation, not a code bug in the anlstat task itself. The cancellation was likely triggered by an upstream dependency failure or a CI pipeline-level timeout/abort decision. PR #4451's changes to exglobal_atmos_analysis.sh (parallel tarring of GSI diags via MPMD) may have caused earlier job failures that cascaded to cancel downstream dependent jobs including anlstat.

Classification: Infrastructure / CI Pipeline (External Job Cancellation)


Failure Chain

CI Pipeline (PR #4451, C96C48mx500_S2SW_cyc_gfs)
  │
  ├── [Upstream jobs: likely analysis/diag tasks]  ❌ (probable failure)
  │     └── Changes to exglobal_atmos_analysis.sh (parallel tarring)
  │
  └── JGLOBAL_ANALYSIS_STATS (SLURM Job 7652445)
        │
        ├── jjob_header.sh → Sources config.base, config.anlstat, URSA.env
        │     ├── setpdy.sh → WARNING: COMROOT/date/t18z missing (non-fatal)
        │     └── ./PDY → ERROR: No such file (non-fatal, suppressed)
        │
        ├── config.base → S2SW coupled app, DOIAU=YES, DO_JEDISNOWDA=YES
        ├── config.anlstat → TASK_CONFIG_YAML=.../anlstat_config.yaml.j2
        ├── URSA.env → APRUN_ANLSTAT="srun -n 20 --cpus-per-task=1"
        │
        └── exglobal_analysis_stats.py  ⚠️ STARTED BUT CANCELLED
              ├── Logger("DEBUG") initialized  ✅
              ├── AnalysisStats.__init__()  ⏳ IN PROGRESS (18:57:20)
              │
              └── *** CANCELLED BY SLURM *** (18:57:24, +4 seconds)

1. Detailed Analysis

1.1 Job Identity

| Field | Value |
| --- | --- |
| Job Name | anlstat |
| Job ID | anlstat.1551206 |
| SLURM Job ID | 7652445 |
| Process ID | 1551894 |
| Node | u01c30 |
| Cycle | t18z |
| PDY | 20211220 |
| RUN | gdas |
| APP | S2SW |
| CASE | C96 |
| CASE_ENS | C48 |
| OCNRES | 500 |

1.2 Resource Allocation

From config.resources and URSA.env:

| Resource | Value |
| --- | --- |
| walltime | 00:30:00 |
| ntasks | 20 |
| threads_per_task | 1 |
| tasks_per_node | 20 |
| memory | 150GB |
| max_tasks_per_node | 192 |
| OMP_STACKSIZE | 2048000 |
| launcher | srun -l --export=ALL --hint=nomultithread --distribution=block:block |

1.3 Python Environment

| Component | Version |
| --- | --- |
| Python | 3.11.7 |
| spack-stack | 1.9.2 (ue-oneapi-2024.2.1) |
| GDAS Module | GDAS/ursa.intel |
| Intel Compilers | oneapi/2024.2.1 |
| Intel MPI | 2021.13.1 |
| numpy | 1.26.4 |
| xarray | 2024.7.0 |
| scipy | 1.14.1 |
| netCDF4 | 1.7.1.post2 |

PYTHONPATH includes:

  • sorc/wxflow/src (wxflow library)
  • sorc/gdas.cd/build/lib/python3.11/ (pyioda)
  • sorc/gdas.cd/build/lib/python3.11/site-packages/ (bufr tools)
  • ush/python (workflow utilities)
  • 30+ spack-stack site-packages paths

1.4 What the Script Does

exglobal_analysis_stats.py is a Python ex-script that:

  1. Imports wxflow.Logger, wxflow.cast_strdict_as_dtypedict, and pygfs.task.analysis_stats.AnalysisStats
  2. Converts the full shell environment to a Python config dictionary via cast_strdict_as_dtypedict(os.environ)
  3. Builds a STAT_ANALYSES list based on DA configuration flags:
    • DO_AERO_ANL=NO → no aero (this run)
    • DO_JEDISNOWDA=YES → appends 'snow'
    • DO_JEDIATMVAR=NO → appends 'atmos_gsi' (GSI-based)
  4. Instantiates AnalysisStats(config) (THIS IS WHERE CANCELLATION OCCURRED)
  5. If atmos_gsi in list: calls AnlStats.convert_gsi_diags(), which converts GSI binary diags to IODA format
  6. Calls AnlStats.initialize(), which stages the runtime directory and renders the YAML config
  7. Loops: AnlStats.execute(anl) and AnlStats.finalize(anl) for each analysis type
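
Step 3 above can be sketched as follows. This is a minimal illustration, not the actual contents of exglobal_analysis_stats.py: the helper names is_yes and build_stat_analyses, the 'aero' label, and the handling of raw YES/NO strings are assumptions made for the example (the real script reads flags that have already passed through cast_strdict_as_dtypedict).

```python
def is_yes(value):
    """Interpret a workflow-style flag ("YES"/"NO" or a bool) as a boolean."""
    if isinstance(value, bool):
        return value
    return str(value).strip().upper() in ("YES", "TRUE", ".TRUE.")


def build_stat_analyses(config):
    """Return the analysis types to compute statistics for.

    Hypothetical helper mirroring the flag logic described above.
    'snow' and 'atmos_gsi' come from the log; 'aero' is an assumed label.
    """
    analyses = []
    if is_yes(config.get("DO_AERO_ANL", "NO")):
        analyses.append("aero")
    if is_yes(config.get("DO_JEDISNOWDA", "NO")):
        analyses.append("snow")
    # The GSI-based atmospheric statistics are selected when the JEDI
    # variational atmosphere analysis is switched off.
    if not is_yes(config.get("DO_JEDIATMVAR", "NO")):
        analyses.append("atmos_gsi")
    return analyses


# Flags as reported for this run
this_run = {"DO_AERO_ANL": "NO", "DO_JEDISNOWDA": "YES", "DO_JEDIATMVAR": "NO"}
print(build_stat_analyses(this_run))  # ['snow', 'atmos_gsi']
```

With the flags from this run the list contains 'snow' and 'atmos_gsi', so the GSI diag conversion in step 5 would have been the next call had the job not been cancelled.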

1.5 Non-Fatal Warnings in Log

| Warning | Location | Impact |
| --- | --- | --- |
| COMROOT/date/t18z: No such file | setpdy.sh:57 | Non-fatal: CI environment lacks pre-staged date file |
| ./PDY: No such file or directory | jjob_header.sh:96 | Non-fatal: suppressed by ` |

These are expected in CI and do not affect job execution.


2. Python GraphRAG Analysis (Phase 24I Validation)

This analysis served as a comprehensive validation of the newly updated Python GraphRAG. All tools were tested against this Python-centric workflow.

2.1 Code Context (GGSR)

get_code_context("exglobal_analysis_stats") returned:

| Entity | Relationship | Weight | Hop |
| --- | --- | --- | --- |
| finalize | CALLS | 1.000 | 1 |
| execute | CALLS | 1.000 | 1 |
| JGLOBAL_ANALYSIS_STATS | INVOKES | 0.900 | 1 |
| exglobal_snowens_analysis | CALLS→CALLS | 0.150 | 2 |
| exglobal_snow_analysis | CALLS→CALLS | 0.150 | 2 |
| exglobal_marine_analysis_* | CALLS→CALLS | 0.150 | 2 |
| (12 more hop-2 siblings) | | | |

Latency: 88ms | Hop1: 3 | Hop2: 12

2.2 Data Flow Trace

trace_data_flow("exglobal_analysis_stats") revealed 10 outgoing relationships:

| Target | Type | Relationship |
| --- | --- | --- |
| AnalysisStats | PythonFunction | CALLS |
| Logger | PythonFunction | CALLS |
| cast_strdict_as_dtypedict | PythonFunction | CALLS |
| convert_gsi_diags | PythonFunction | CALLS |
| execute | PythonFunction | CALLS |
| finalize | PythonFunction | CALLS |
| initialize | PythonFunction | CALLS |
| os | PythonModule | IMPORTS |
| pygfs.task.analysis_stats | PythonModule | IMPORTS |
| wxflow | PythonModule | IMPORTS |

2.3 Caller/Callee Analysis

| Function | Fan-In | Fan-Out | Complexity |
| --- | --- | --- | --- |
| AnalysisStats | 1 (exglobal_analysis_stats) | 0 | LOW (0) |
| cast_strdict_as_dtypedict | 62 (all Python ex-scripts) | 2 | HIGH (124) |
| Logger | 88 (workflow-wide) | 0 | LOW (0) |
| convert_gsi_diags | 1 (exglobal_analysis_stats) | 10 | LOW (10) |
| initialize | 173 (workflow-wide) | 100 | HIGH (17300) |
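
The Complexity values in the table are consistent with a simple product of fan-in and fan-out. The sketch below reproduces them under that assumption. The formula is inferred from the numbers, not taken from the tool's documentation, and the LOW/HIGH cut-off is left out because the source does not state it.

```python
# (function name, fan-in, fan-out) as listed in the table above
rows = [
    ("AnalysisStats", 1, 0),
    ("cast_strdict_as_dtypedict", 62, 2),
    ("Logger", 88, 0),
    ("convert_gsi_diags", 1, 10),
    ("initialize", 173, 100),
]


def complexity(fan_in, fan_out):
    """Assumed scoring: number of callers times number of callees."""
    return fan_in * fan_out


scores = {name: complexity(fan_in, fan_out) for name, fan_in, fan_out in rows}
for name, score in scores.items():
    print(f"{name}: {score}")
```

Read this way, the very large score for initialize reflects a generic method name shared across the workflow rather than anything specific to the anlstat task.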

2.4 Dependency Chain

find_dependencies("exglobal_analysis_stats.py"):

  • Hop 1: CALLS → finalize, execute; INVOKES ← JGLOBAL_ANALYSIS_STATS
  • Hop 2: 17 sibling Python ex-scripts that share the finalize/execute pattern

2.5 Execution Path

trace_execution_path("JGLOBAL_ANALYSIS_STATS"):

  • INVOKES → exglobal_analysis_stats
  • DEPENDS_ON_ENV → 15 environment variables (COMOUT_CONF, HOMEgfs, COMOUT_SNOW_ANALYSIS, cyc, pgmout, DO_AERO_ANL, PDY, DO_JEDISNOWDA, COMOUT_ATMOS_ANLMON, etc.)

2.6 Change Impact

get_change_impact("exglobal_analysis_stats"):

  • Risk Level: LOW (0.15)
  • Direct Dependents: 1 (JGLOBAL_ANALYSIS_STATS)

3. Root Cause Assessment

Primary Cause: External SLURM Cancellation

The job was not killed by a code error, timeout, or resource limit. Key evidence:

  1. Duration: The Python script ran for only 4 seconds (18:57:20 → 18:57:24), and the whole job for 13 seconds, against a 30-minute walltime allocation
  2. Error type: slurmstepd: error: *** JOB CANCELLED *** is the banner SLURM prints when a job is cancelled from outside (for example by scancel). A walltime kill normally appends DUE TO TIME LIMIT to the same banner, and an out-of-memory kill is normally reported as an oom-kill event; neither appears in this log
  3. Python output: The script successfully imported all modules and started AnalysisStats.__init__(); no Python errors were logged
  4. No FATAL ERROR: Unlike typical workflow failures, there is no err_exit or "FATAL ERROR" from the script itself
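
The distinction drawn in item 2 can be checked mechanically. The sketch below is a hypothetical helper (classify_slurm_termination is not part of the workflow) that sorts a log into external cancellation, time limit, or out-of-memory. The DUE TO TIME LIMIT and oom-kill patterns reflect commonly seen slurmstepd output and are assumptions that may need adjusting for the Slurm version on a given platform.

```python
import re

# Banner printed by slurmstepd when a job or step is cancelled, e.g.
# slurmstepd: error: *** JOB 7652445 ON u01c30 CANCELLED AT 2026-01-20T18:57:24 ***
CANCEL_RE = re.compile(
    r"slurmstepd: error: \*\*\* (?:JOB|STEP) (?P<job>\S+) ON (?P<node>\S+) "
    r"CANCELLED AT (?P<time>\S+)(?: DUE TO (?P<reason>[A-Z ]+?))? \*\*\*"
)


def classify_slurm_termination(log_text):
    """Classify how SLURM ended a job, based on slurmstepd messages."""
    # Assumed wording: OOM kills are reported as "oom-kill" or "oom_kill" events.
    if re.search(r"oom[-_]kill", log_text, re.IGNORECASE):
        return {"kind": "out_of_memory"}
    match = CANCEL_RE.search(log_text)
    if match is None:
        return {"kind": "unknown"}
    reason = match.group("reason")
    if reason is None:
        kind = "external_cancel"   # plain CANCELLED banner, no stated reason
    elif reason == "TIME LIMIT":
        kind = "time_limit"        # walltime exceeded
    else:
        kind = "cancelled_other"   # some other stated reason
    return {
        "kind": kind,
        "job": match.group("job"),
        "node": match.group("node"),
        "time": match.group("time"),
    }


log_line = (
    "slurmstepd: error: *** JOB 7652445 ON u01c30 "
    "CANCELLED AT 2026-01-20T18:57:24 ***"
)
print(classify_slurm_termination(log_line))
```

For the banner in this log the helper reports an external cancellation of job 7652445 on u01c30, which matches the manual reading above.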

Probable Trigger: Upstream CI Failure Cascade

PR #4451 ("Speed up tarring of the gsidiags") modifies exglobal_atmos_analysis.sh to:

  • Tar individual dir.???? directories in parallel using MPMD
  • Concatenate tarballs afterward
  • Add profiling (tick/tock) to shell scripts
  • Route stderr into mpmd.?.out files

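If the upstream atmos_analysis or atmos_analysis_diag job failed (e.g., due to MPMD tarring issues), the CI pipeline, which drives the experiment through Rocoto, would likely cancel the experiment's remaining jobs, including anlstat. The upstream logs were not examined for this analysis, so this trigger remains a hypothesis.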

Contributing Factors

  1. CI-Ursa-Failed and CI-Hercules-Failed labels on PR #4451 confirm systemic CI failure
  2. PR was closed (not merged) on 2026-01-20 β€” same day as this failure
  3. The anlstat job depends on analysis outputs from upstream jobs that PR #4451 modifies
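
The cascade reasoning can be illustrated with a small graph walk. The dependency map below is hypothetical and simplified: the task names come from this report, but the edges are assumed for illustration and are not read from the Rocoto XML for this experiment.

```python
from collections import deque

# Hypothetical map: task -> tasks that directly depend on it.
DOWNSTREAM = {
    "atmos_analysis": ["atmos_analysis_diag"],
    "atmos_analysis_diag": ["anlstat"],
    "anlstat": [],
}


def affected_tasks(failed, downstream):
    """Return every task reachable from a failed task, in breadth-first order."""
    seen = []
    queue = deque(downstream.get(failed, []))
    while queue:
        task = queue.popleft()
        if task in seen:
            continue
        seen.append(task)
        queue.extend(downstream.get(task, []))
    return seen


print(affected_tasks("atmos_analysis", DOWNSTREAM))
```

Under these assumed edges, a failure in atmos_analysis reaches anlstat in two hops, which is the chain the upstream logs would need to confirm.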

4. PR #4451 Context

| Field | Value |
| --- | --- |
| Title | Speed up tarring of the gsidiags |
| Author | aerorahul |
| Created | January 16, 2026 |
| Closed | January 20, 2026 |
| State | closed (not merged) |
| Labels | CI-Ursa-Failed, CI-Hercules-Failed |
| Changed Files | exglobal_atmos_analysis.sh (parallel tarring), shell profiling additions |
| Test Case | C96C48_hybatmDA (initial), C96C48mx500_S2SW_cyc_gfs (CI) |

Key PR Changes

  • Parallel tarring of GSI diagnostic files using MPMD
  • Shell profiling via tick/tock timing functions
  • stderr redirection to mpmd.?.out files via mpiexec
  • Concatenation of individual tarballs into final archive

5. Recommendations

5.1 For the SLURM Cancellation

  1. Examine upstream job logs: Check the atmos_analysis and atmos_analysis_diag logs from the same CI run to identify the root upstream failure
  2. Verify MPMD tarring: The parallel tarring in MPMD may have race conditions or filesystem contention, especially on URSA's shared scratch

5.2 For PR #4451

  1. Investigate CI failures on both platforms: The presence of both CI-Ursa-Failed and CI-Hercules-Failed suggests the issue is in the code rather than platform-specific
  2. Test MPMD tarring in isolation: Run the parallel tarring step independently to verify it works with the URSA SLURM configuration
  3. Check mpiexec stderr routing: The change to route stderr to mpmd.?.out may hide critical errors that would otherwise appear in the main log

5.3 For Python Workflow Robustness

  1. Add Python signal handling: The AnalysisStats class should catch SIGTERM/SIGINT for graceful shutdown on SLURM cancellation
  2. Add startup health check: Log completion of __init__ and key config validation before proceeding to compute-intensive work
  3. EE2 Compliance note: Python ex-scripts should follow the same error-handling contract as shell ex-scripts (descriptive FATAL ERROR: messages)

6. Affected Components

| Component | Path | Role |
| --- | --- | --- |
| JGLOBAL_ANALYSIS_STATS | dev/jobs/JGLOBAL_ANALYSIS_STATS | J-Job wrapper (65 lines) |
| exglobal_analysis_stats.py | dev/scripts/exglobal_analysis_stats.py | Python ex-script (1,495 bytes) |
| AnalysisStats | pygfs.task.analysis_stats | Python task class |
| cast_strdict_as_dtypedict | wxflow | Env→dict converter (62 callers) |
| Logger | wxflow | Structured logging (88 callers) |
| convert_gsi_diags | pygfs.task.analysis_stats | GSI diag converter |
| config.anlstat | EXPDIR/config.anlstat | Task config (sets TASK_CONFIG_YAML) |
| config.resources | EXPDIR/config.resources | Resource allocation |
| URSA.env | env/URSA.env | Platform environment |
| anlstat_config.yaml.j2 | parm/gdas/anlstat/ | YAML template for JEDI stats |
| exglobal_atmos_analysis.sh | scripts/ | Modified by PR #4451 (upstream) |

7. MCP Tool Coverage Report

This analysis used 35+ MCP tool invocations across the 9 tool modules listed below, serving as a comprehensive validation of the Python GraphRAG update (Phase 24I):

Tool Module Coverage

| Module | Tools Used | Status |
| --- | --- | --- |
| WorkflowInfoTools | get_workflow_structure, get_system_configs, describe_component | ✅ All working |
| CodeAnalysisTools | analyze_code_structure, find_dependencies, find_callers_callees, find_env_dependencies, trace_execution_path | ✅ All working (Python CALLS/IMPORTS visible) |
| SemanticSearchTools | search_documentation, explain_with_context, find_related_files, get_knowledge_base_status, list_ingested_urls | ✅ All working |
| EE2ComplianceTools | search_ee2_standards, analyze_ee2_compliance | ✅ Working |
| OperationalTools | get_operational_guidance, list_job_scripts, explain_workflow_component | ✅ Working |
| GraphRAGTools | get_code_context, search_architecture, find_similar_code, get_change_impact, trace_data_flow | ✅ All working (GGSR with Python nodes) |
| GitHubTools | search_issues, get_pull_requests, analyze_repository_structure, analyze_workflow_dependencies | ✅ Working |
| SDDTools | list_sdd_workflows, get_sdd_workflow, get_sdd_framework_status | ✅ Working (45 workflows, Phase 24I confirmed) |
| HealthTools | mcp_health_check, get_server_info | ✅ Healthy (42 tools, 60,404 docs, 483,754 rels) |

Python GraphRAG Validation Summary

| Feature | Status | Evidence |
| --- | --- | --- |
| Python CALLS relationships | ✅ | trace_data_flow shows 7 CALLS from exglobal_analysis_stats |
| Python IMPORTS relationships | ✅ | 3 IMPORTS (os, pygfs.task.analysis_stats, wxflow) |
| Cross-language INVOKES | ✅ | Shell J-Job INVOKES Python ex-script |
| Python Fan-in analysis | ✅ | cast_strdict_as_dtypedict: 62 callers, Logger: 88 callers |
| GGSR hop traversal | ✅ | Hop1=3, Hop2=12-17 entities at 77-88ms latency |
| Community detection | ✅ | Communities 57, 69, 1616, 3595, 3628 reachable |

8. Log Evidence Summary

| Timestamp | Event | Status |
| --- | --- | --- |
| 18:57:11 | JGLOBAL_ANALYSIS_STATS begins | ✅ |
| 18:57:11-18:57:19 | jjob_header.sh sources configs, sets up environment | ✅ |
| 18:57:11 | setpdy.sh: COMROOT/date/t18z not found | ⚠️ Non-fatal |
| 18:57:11 | ./PDY: No such file | ⚠️ Non-fatal |
| 18:57:11 | config.base sourced: S2SW, DOIAU=YES, 128 LEVS | ✅ |
| 18:57:11 | config.anlstat: TASK_CONFIG_YAML set | ✅ |
| 18:57:11 | URSA.env: APRUN_ANLSTAT configured | ✅ |
| 18:57:19 | COM directories created (atmos, snow analysis/anlmon) | ✅ |
| 18:57:19 | EXSCRIPT set to exglobal_analysis_stats.py | ✅ |
| 18:57:20 | Python: BEGIN: AnalysisStats.__init__ | ✅ |
| 18:57:24 | slurmstepd: error: *** JOB 7652445 ON u01c30 CANCELLED *** | ❌ |

Total execution time: 13 seconds (job start to SLURM cancellation)
Python execution time: 4 seconds (before external kill)
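
Both durations follow directly from the timestamps in the table. The short check below recomputes them; elapsed_seconds is a hypothetical helper, and it assumes both timestamps fall on the same day, as they do in this log.

```python
from datetime import datetime

FMT = "%H:%M:%S"


def elapsed_seconds(start, end):
    """Whole seconds between two same-day HH:MM:SS timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds())


# Job start -> SLURM cancellation
total = elapsed_seconds("18:57:11", "18:57:24")
# First Python log line -> SLURM cancellation
python_only = elapsed_seconds("18:57:20", "18:57:24")
print(total, python_only)  # 13 4
```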


9. Knowledge Base Statistics at Time of Analysis

| Category | Count |
| --- | --- |
| ChromaDB Documents | 60,404 |
| ChromaDB Collections | 5 |
| Neo4j Files | 2,744 |
| Neo4j Functions | 1,540 |
| Neo4j Classes | 54 |
| Neo4j Relationships | 483,754 |
| Shell Scripts | 314 (89 J-Jobs, 6 Ex-Scripts, 6 USH) |
| Environment Variables | 2,730 |
| SDD Workflows | 45 |
MCP Tools 42