PR4451_C96C48mx500_S2SW_cyc_gfs gdas_anlstat_SLURM_CANCELLED - TerrenceMcGuinness-NOAA/global-workflow GitHub Wiki
PR #4451 C96C48mx500_S2SW_cyc_gfs: gdas_anlstat SLURM JOB CANCELLED Failure
Test Case: C96C48mx500_S2SW_cyc_gfs (S2SW Coupled Cycling with GFS)
Job: JGLOBAL_ANALYSIS_STATS → exglobal_analysis_stats.py
Platform: URSA (node u01c30)
PR: #4451, "Speed up tarring of the gsidiags"
Author: aerorahul
Commit: 53dc4019 (case ID 7396)
Date: January 20, 2026 18:57:11-18:57:24 UTC
Log Source: CI error log (local file)
Status: PR CLOSED (CI-Ursa-Failed, CI-Hercules-Failed)
Analysis Date: February 10, 2026
MCP Server: v3.6.2 (42 tools, Python GraphRAG validation)
Executive Summary
The gdas_anlstat (analysis statistics) job was externally cancelled by SLURM during CI testing of PR #4451 on URSA, 13 seconds after the job started and only 4 seconds into the Python ex-script. exglobal_analysis_stats.py had barely begun initialization, logging only BEGIN: AnalysisStats.__init__ before the SLURM scheduler terminated the entire job allocation:
slurmstepd: error: *** JOB 7652445 ON u01c30 CANCELLED AT 2026-01-20T18:57:24 ***
Root Cause: External SLURM job cancellation, not a code bug in the anlstat task itself. The cancellation was likely triggered by an upstream dependency failure or a CI pipeline-level timeout/abort decision. PR #4451's changes to exglobal_atmos_analysis.sh (parallel tarring of GSI diags via MPMD) may have caused earlier job failures that cascaded to cancel downstream dependent jobs including anlstat.
Classification: Infrastructure / CI Pipeline (External Job Cancellation)
Failure Chain
CI Pipeline (PR #4451 → C96C48mx500_S2SW_cyc_gfs)
│
├── [Upstream jobs: likely analysis/diag tasks] ❌ (probable failure)
│   └── Changes to exglobal_atmos_analysis.sh (parallel tarring)
│
└── JGLOBAL_ANALYSIS_STATS (SLURM Job 7652445)
    │
    ├── jjob_header.sh → Sources config.base, config.anlstat, URSA.env
    │   ├── setpdy.sh → WARNING: COMROOT/date/t18z missing (non-fatal)
    │   └── ./PDY → ERROR: No such file (non-fatal, suppressed)
    │
    ├── config.base → S2SW coupled app, DOIAU=YES, DO_JEDISNOWDA=YES
    ├── config.anlstat → TASK_CONFIG_YAML=.../anlstat_config.yaml.j2
    ├── URSA.env → APRUN_ANLSTAT="srun -n 20 --cpus-per-task=1"
    │
    └── exglobal_analysis_stats.py ⚠️ STARTED BUT CANCELLED
        ├── Logger("DEBUG") initialized ✅
        ├── AnalysisStats.__init__() ⏳ IN PROGRESS (18:57:20)
        │
        └── *** CANCELLED BY SLURM *** (18:57:24, +4 seconds)
1. Detailed Analysis
1.1 Job Identity
| Field | Value |
|---|---|
| Job Name | anlstat |
| Job ID | anlstat.1551206 |
| SLURM Job ID | 7652445 |
| Process ID | 1551894 |
| Node | u01c30 |
| Cycle | t18z |
| PDY | 20211220 |
| RUN | gdas |
| APP | S2SW |
| CASE | C96 |
| CASE_ENS | C48 |
| OCNRES | 500 |
1.2 Resource Allocation
From config.resources and URSA.env:
| Resource | Value |
|---|---|
| walltime | 00:30:00 |
| ntasks | 20 |
| threads_per_task | 1 |
| tasks_per_node | 20 |
| memory | 150GB |
| max_tasks_per_node | 192 |
| OMP_STACKSIZE | 2048000 |
| launcher | srun -l --export=ALL --hint=nomultithread --distribution=block:block |
1.3 Python Environment
| Component | Version |
|---|---|
| Python | 3.11.7 |
| spack-stack | 1.9.2 (ue-oneapi-2024.2.1) |
| GDAS Module | GDAS/ursa.intel |
| Intel Compilers | oneapi/2024.2.1 |
| Intel MPI | 2021.13.1 |
| numpy | 1.26.4 |
| xarray | 2024.7.0 |
| scipy | 1.14.1 |
| netCDF4 | 1.7.1.post2 |
PYTHONPATH includes:
- `sorc/wxflow/src` (wxflow library)
- `sorc/gdas.cd/build/lib/python3.11/` (pyioda)
- `sorc/gdas.cd/build/lib/python3.11/site-packages/` (bufr tools)
- `ush/python` (workflow utilities)
- 30+ spack-stack site-packages paths
1.4 What the Script Does
exglobal_analysis_stats.py is a Python ex-script that:
- Imports `wxflow.Logger`, `wxflow.cast_strdict_as_dtypedict`, and `pygfs.task.analysis_stats.AnalysisStats`
- Converts the full shell environment to a Python config dictionary via `cast_strdict_as_dtypedict(os.environ)`
- Builds a `STAT_ANALYSES` list based on DA configuration flags:
  - `DO_AERO_ANL=NO` → no aero (this run)
  - `DO_JEDISNOWDA=YES` → appends `'snow'`
  - `DO_JEDIATMVAR=NO` → appends `'atmos_gsi'` (GSI-based)
- Instantiates `AnalysisStats(config)`: THIS IS WHERE CANCELLATION OCCURRED
- If `atmos_gsi` is in the list: calls `AnlStats.convert_gsi_diags()`, which converts GSI binary diags to IODA format
- Calls `AnlStats.initialize()`, which stages the runtime directory and renders the YAML config
- Loops: `AnlStats.execute(anl)` and `AnlStats.finalize(anl)` for each analysis type
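The list-building step above can be sketched as follows. This is a minimal illustration and not the actual ex-script: the helper names `to_bool` and `build_stat_analyses` and the `'aero'` label are assumptions made for the example, and only `'snow'` and `'atmos_gsi'` are confirmed by the analysis above. In the real script the YES/NO conversion is presumably handled by `cast_strdict_as_dtypedict`.

```python
def to_bool(value) -> bool:
    """Interpret workflow-style YES/NO flags (hypothetical stand-in for the wxflow casting step)."""
    return str(value).strip().upper() in {"YES", "TRUE", ".TRUE.", "1"}


def build_stat_analyses(config: dict) -> list[str]:
    """Return the analysis types to compute statistics for, in a fixed order."""
    analyses = []
    if to_bool(config.get("DO_AERO_ANL", "NO")):
        analyses.append("aero")       # assumed label; not shown in the source
    if to_bool(config.get("DO_JEDISNOWDA", "NO")):
        analyses.append("snow")
    if not to_bool(config.get("DO_JEDIATMVAR", "NO")):
        analyses.append("atmos_gsi")  # GSI-based atmosphere analysis
    return analyses


# Flags as reported for this run
this_run = {"DO_AERO_ANL": "NO", "DO_JEDISNOWDA": "YES", "DO_JEDIATMVAR": "NO"}
print(build_stat_analyses(this_run))
```

With this run's flags the sketch yields `['snow', 'atmos_gsi']`, which is why the `convert_gsi_diags()` branch would have been taken had the job not been cancelled.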
1.5 Non-Fatal Warnings in Log
| Warning | Location | Impact |
|---|---|---|
| `COMROOT/date/t18z: No such file` | `setpdy.sh:57` | Non-fatal: CI environment lacks pre-staged date file |
| `./PDY: No such file or directory` | `jjob_header.sh:96` | Non-fatal: suppressed |
These are expected in CI and do not affect job execution.
2. Python GraphRAG Analysis (Phase 24I Validation)
This analysis served as a comprehensive validation of the newly updated Python GraphRAG. All tools were tested against this Python-centric workflow.
2.1 Code Context (GGSR)
get_code_context("exglobal_analysis_stats") returned:
| Entity | Relationship | Weight | Hop |
|---|---|---|---|
| `finalize` | CALLS | 1.000 | 1 |
| `execute` | CALLS | 1.000 | 1 |
| `JGLOBAL_ANALYSIS_STATS` | INVOKES | 0.900 | 1 |
| `exglobal_snowens_analysis` | CALLS→CALLS | 0.150 | 2 |
| `exglobal_snow_analysis` | CALLS→CALLS | 0.150 | 2 |
| `exglobal_marine_analysis_*` | CALLS→CALLS | 0.150 | 2 |
| (12 more hop-2 siblings) | | | |
Latency: 88ms | Hop1: 3 | Hop2: 12
2.2 Data Flow Trace
trace_data_flow("exglobal_analysis_stats") revealed 10 outgoing relationships:
| Target | Type | Relationship |
|---|---|---|
| `AnalysisStats` | PythonFunction | CALLS |
| `Logger` | PythonFunction | CALLS |
| `cast_strdict_as_dtypedict` | PythonFunction | CALLS |
| `convert_gsi_diags` | PythonFunction | CALLS |
| `execute` | PythonFunction | CALLS |
| `finalize` | PythonFunction | CALLS |
| `initialize` | PythonFunction | CALLS |
| `os` | PythonModule | IMPORTS |
| `pygfs.task.analysis_stats` | PythonModule | IMPORTS |
| `wxflow` | PythonModule | IMPORTS |
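As a quick cross-check of the table, the ten outgoing relationships can be tallied by type. The snippet is only a sketch that re-types the rows by hand; it does not query the graph, and the variable names are illustrative.

```python
from collections import Counter

# (target, relationship) pairs copied from the trace_data_flow table
edges = [
    ("AnalysisStats", "CALLS"),
    ("Logger", "CALLS"),
    ("cast_strdict_as_dtypedict", "CALLS"),
    ("convert_gsi_diags", "CALLS"),
    ("execute", "CALLS"),
    ("finalize", "CALLS"),
    ("initialize", "CALLS"),
    ("os", "IMPORTS"),
    ("pygfs.task.analysis_stats", "IMPORTS"),
    ("wxflow", "IMPORTS"),
]

# Count how many edges carry each relationship label
counts = Counter(relationship for _, relationship in edges)
print(dict(counts))
```

The tally gives 7 CALLS and 3 IMPORTS, matching the totals quoted in the validation summary.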
2.3 Caller/Callee Analysis
| Function | Fan-In | Fan-Out | Complexity |
|---|---|---|---|
| `AnalysisStats` | 1 (exglobal_analysis_stats) | 0 | LOW (0) |
| `cast_strdict_as_dtypedict` | 62 (all Python ex-scripts) | 2 | HIGH (124) |
| `Logger` | 88 (workflow-wide) | 0 | LOW (0) |
| `convert_gsi_diags` | 1 (exglobal_analysis_stats) | 10 | LOW (10) |
| `initialize` | 173 (workflow-wide) | 100 | HIGH (17300) |
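The Complexity figures in the table are consistent with a simple fan-in × fan-out product. The tool's actual scoring formula is not documented here, so the formula and the function name `coupling_score` below are assumptions, checked only against these five rows.

```python
def coupling_score(fan_in: int, fan_out: int) -> int:
    """Assumed scoring rule: callers multiplied by callees."""
    return fan_in * fan_out


# name: (fan_in, fan_out, complexity value reported in the table)
rows = {
    "AnalysisStats": (1, 0, 0),
    "cast_strdict_as_dtypedict": (62, 2, 124),
    "Logger": (88, 0, 0),
    "convert_gsi_diags": (1, 10, 10),
    "initialize": (173, 100, 17300),
}

# Collect any row whose reported value differs from the assumed product
mismatches = [
    name
    for name, (fan_in, fan_out, reported) in rows.items()
    if coupling_score(fan_in, fan_out) != reported
]
print(mismatches)
```

An empty list means every reported value equals the product. One consequence of such a rule is that a widely used leaf such as `Logger` (88 callers, no callees) scores 0 despite its reach.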
2.4 Dependency Chain
find_dependencies("exglobal_analysis_stats.py"):
- Hop 1: CALLS → `finalize`, `execute`; INVOKES → `JGLOBAL_ANALYSIS_STATS`
- Hop 2: 17 sibling Python ex-scripts that share the `finalize`/`execute` pattern
2.5 Execution Path
trace_execution_path("JGLOBAL_ANALYSIS_STATS"):
- INVOKES → `exglobal_analysis_stats`
- DEPENDS_ON_ENV → 15 environment variables (COMOUT_CONF, HOMEgfs, COMOUT_SNOW_ANALYSIS, cyc, pgmout, DO_AERO_ANL, PDY, DO_JEDISNOWDA, COMOUT_ATMOS_ANLMON, etc.)
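The environment dependencies listed above lend themselves to a cheap pre-flight check. The following is a sketch under stated assumptions: `REQUIRED_VARS` contains only the nine names quoted in this section (the full set of 15 is not listed), and `missing_env` is a hypothetical helper, not part of wxflow or the workflow.

```python
import os

# Only the variables named in this report; the remaining ones are not listed here
REQUIRED_VARS = [
    "COMOUT_CONF", "HOMEgfs", "COMOUT_SNOW_ANALYSIS", "cyc", "pgmout",
    "DO_AERO_ANL", "PDY", "DO_JEDISNOWDA", "COMOUT_ATMOS_ANLMON",
]


def missing_env(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Example with a partial, made-up environment
sample_env = {"PDY": "20211220", "cyc": "18", "HOMEgfs": "/path/to/HOMEgfs"}
print(missing_env(REQUIRED_VARS, sample_env))
```

Passing an explicit mapping keeps the helper testable without touching the real process environment.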
2.6 Change Impact
get_change_impact("exglobal_analysis_stats"):
- Risk Level: LOW (0.15)
- Direct Dependents: 1 (`JGLOBAL_ANALYSIS_STATS`)
3. Root Cause Assessment
Primary Cause: External SLURM Cancellation
The job was not killed by a code error, timeout, or resource limit. Key evidence:
- Duration: The Python script ran for only 4 seconds (18:57:20 → 18:57:24), and the whole job for 13 seconds, against a 30-minute walltime allocation
- Error type: `slurmstepd: error: *** JOB CANCELLED ***` is SLURM's external cancellation message, not an OOM kill or a walltime timeout (a timeout is reported with a `DUE TO TIME LIMIT` suffix)
- Python output: The script successfully imported all modules and started `AnalysisStats.__init__()` with no Python errors
- No `FATAL ERROR`: Unlike typical workflow failures, there is no `err_exit` or "FATAL ERROR" from the script itself
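The distinction drawn in the evidence list can be automated when triaging many logs. The sketch below classifies a single `slurmstepd` line; the regular expressions reflect commonly seen SLURM message shapes (a bare `CANCELLED AT`, a `DUE TO TIME LIMIT` suffix, an `oom-kill` report), and exact wording may differ between SLURM versions. The function and category names are illustrative.

```python
import re

# "*** JOB <id> ON <node> CANCELLED AT <time> [DUE TO <REASON>] ***"
CANCEL_RE = re.compile(
    r"\*\*\* (?:JOB|STEP) (?P<id>[\w.+]+) ON (?P<node>\S+) "
    r"CANCELLED AT (?P<when>\S+)(?: DUE TO (?P<reason>[A-Z ]+?))? \*\*\*"
)


def classify_slurm_line(line: str) -> str:
    """Map a slurmstepd message to a coarse failure category."""
    if re.search(r"oom[-_ ]?kill", line, re.IGNORECASE):
        return "oom"
    match = CANCEL_RE.search(line)
    if match is None:
        return "unknown"
    reason = match.group("reason")
    if reason is None:
        return "external_cancel"   # no reason suffix: scancel or equivalent
    if reason == "TIME LIMIT":
        return "time_limit"
    return "cancelled_" + reason.lower().replace(" ", "_")


log_line = ("slurmstepd: error: *** JOB 7652445 ON u01c30 "
            "CANCELLED AT 2026-01-20T18:57:24 ***")
print(classify_slurm_line(log_line))
```

Applied to the line from this log, the sketch returns `external_cancel`, in line with the assessment above.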
Probable Trigger: Upstream CI Failure Cascade
PR #4451 ("Speed up tarring of the gsidiags") modifies exglobal_atmos_analysis.sh to:
- Tar individual `dir.????` directories in parallel using MPMD
- Concatenate tarballs afterward
- Add profiling (`tick`/`tock`) to shell scripts
- Route `stderr` into `mpmd.?.out` files
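If the upstream atmos_analysis or atmos_analysis_diag job failed (e.g., due to MPMD tarring issues), the CI pipeline would likely have cancelled all remaining jobs for the case, including anlstat. Rocoto by itself only withholds tasks whose dependencies are unmet, so cancelling an already-running job such as anlstat implies an explicit cancellation (for example scancel) issued by the CI driver or an operator.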
Contributing Factors
- CI-Ursa-Failed and CI-Hercules-Failed labels on PR #4451 confirm systemic CI failure
- PR was closed (not merged) on 2026-01-20, the same day as this failure
- The `anlstat` job depends on analysis outputs from upstream jobs that PR #4451 modifies
4. PR #4451 Context
| Field | Value |
|---|---|
| Title | Speed up tarring of the gsidiags |
| Author | aerorahul |
| Created | January 16, 2026 |
| Closed | January 20, 2026 |
| State | closed (not merged) |
| Labels | CI-Ursa-Failed, CI-Hercules-Failed |
| Changed Files | exglobal_atmos_analysis.sh (parallel tarring), shell profiling additions |
| Test Case | C96C48_hybatmDA (initial), C96C48mx500_S2SW_cyc_gfs (CI) |
Key PR Changes
- Parallel tarring of GSI diagnostic files using MPMD
- Shell profiling via `tick`/`tock` timing functions
- `stderr` redirection to `mpmd.?.out` files via `mpiexec`
- Concatenation of individual tarballs into final archive
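To make the tar-then-concatenate pattern concrete, here is a small Python stand-in. It is not the PR's implementation: the PR works in shell with MPMD, whereas this sketch swaps in a thread pool and Python's `tarfile` module. The function names and the `gsidiags.tar` output name are hypothetical.

```python
import glob
import os
import tarfile
import tempfile
from concurrent.futures import ThreadPoolExecutor


def tar_one(directory: str) -> str:
    """Write <directory>.tar next to the directory and return its path."""
    part = directory + ".tar"
    with tarfile.open(part, "w") as tf:
        tf.add(directory, arcname=os.path.basename(directory))
    return part


def tar_parallel_then_concat(workdir: str, final_name: str = "gsidiags.tar",
                             workers: int = 4) -> str:
    """Tar each dir.???? in parallel, then merge the parts into one archive."""
    dirs = sorted(d for d in glob.glob(os.path.join(workdir, "dir.????"))
                  if os.path.isdir(d))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(tar_one, dirs))

    final_path = os.path.join(workdir, final_name)
    with tarfile.open(final_path, "w") as final:
        for part in parts:
            with tarfile.open(part, "r") as src:
                for member in src.getmembers():
                    payload = src.extractfile(member) if member.isfile() else None
                    final.addfile(member, payload)
            os.remove(part)  # per-directory tarball is no longer needed
    return final_path


# Small demo on throwaway data
with tempfile.TemporaryDirectory() as workdir:
    for i in range(3):
        d = os.path.join(workdir, f"dir.{i:04d}")
        os.makedirs(d)
        with open(os.path.join(d, "diag.txt"), "w") as fh:
            fh.write(f"diag {i}\n")
    with tarfile.open(tar_parallel_then_concat(workdir)) as tf:
        names = tf.getnames()
print(names)
```

The merge step is serial by construction, so any failure in one of the parallel parts surfaces only when `pool.map` results are collected. This is the kind of ordering and error-propagation detail worth checking in the shell version as well.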
5. Recommendations
5.1 For the SLURM Cancellation
- Examine upstream job logs: Check the `atmos_analysis` and `atmos_analysis_diag` logs from the same CI run to identify the root upstream failure
- Verify MPMD tarring: The parallel tarring in MPMD may have race conditions or filesystem contention, especially on URSA's shared scratch
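One additional way to establish who cancelled the job is SLURM accounting: `sacct` reports a state of the form `CANCELLED by <uid>` for externally cancelled jobs. The sketch below builds the query and parses one pipe-delimited record. The sample record is illustrative only (the UID and exit code are made up and are not taken from this CI run), and the helper names are hypothetical.

```python
import re
import subprocess


def sacct_command(job_id) -> list[str]:
    """sacct invocation: allocation only (-X), no header (-n), pipe-delimited (-P)."""
    return ["sacct", "-j", str(job_id), "-X", "-n", "-P",
            "--format=JobID,State,Start,End,ExitCode"]


def parse_sacct_line(line: str) -> dict:
    """Parse one JobID|State|Start|End|ExitCode record."""
    job_id, state, start, end, exit_code = line.strip().split("|")
    match = re.fullmatch(r"CANCELLED by (\d+)", state)
    return {
        "job_id": job_id,
        "state": state.split()[0],
        "cancelled_by_uid": int(match.group(1)) if match else None,
        "start": start,
        "end": end,
        "exit_code": exit_code,
    }


def fetch_job_record(job_id) -> dict:
    """Run sacct on a SLURM host and parse the first record (not exercised here)."""
    result = subprocess.run(sacct_command(job_id), check=True,
                            capture_output=True, text=True)
    return parse_sacct_line(result.stdout.splitlines()[0])


# Illustrative record; UID 1234 and exit code 0:0 are invented for the example
sample = "7652445|CANCELLED by 1234|2026-01-20T18:57:11|2026-01-20T18:57:24|0:0"
record = parse_sacct_line(sample)
print(record["state"], record["cancelled_by_uid"])
```

If the reported UID belongs to the CI service account, that would support the pipeline-cancellation hypothesis; a different UID would point to a manual `scancel`.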
5.2 For PR #4451
- Investigate CI failures on both platforms: CI-Ursa-Failed and CI-Hercules-Failed together suggest the issue is in the code rather than platform-specific
- Test MPMD tarring in isolation: Run the parallel tarring step independently to verify it works with the URSA SLURM configuration
- Check mpiexec stderr routing: The change to route stderr to `mpmd.?.out` may hide critical errors that would otherwise appear in the main log
5.3 For Python Workflow Robustness
- Add Python signal handling: The AnalysisStats class should catch `SIGTERM`/`SIGINT` for graceful shutdown on SLURM cancellation
- Add startup health check: Log completion of `__init__` and key config validation before proceeding to compute-intensive work
- EE2 Compliance note: Python ex-scripts should follow the same error-handling contract as shell ex-scripts (descriptive `FATAL ERROR:` messages)
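The signal-handling recommendation could look roughly like the following. This is a minimal sketch using only the standard library. The names `install_cancel_handlers`, `cancelled`, and the logger name are hypothetical, and whether SLURM actually delivers SIGTERM to the Python process depends on how the job step is launched.

```python
import logging
import signal

logger = logging.getLogger("anlstat_sketch")  # hypothetical logger name
cancel_state = {"signal": None}


def _on_cancel(signum, frame):
    """Record the signal and emit an EE2-style message; keep the handler short."""
    cancel_state["signal"] = signal.Signals(signum).name
    logger.error("FATAL ERROR: received %s, job is being cancelled",
                 cancel_state["signal"])


def install_cancel_handlers():
    """Register the handler for the signals a scheduler or user typically sends."""
    for sig in (signal.SIGTERM, signal.SIGINT):
        signal.signal(sig, _on_cancel)


def cancelled() -> bool:
    """Long-running loops can poll this and exit cleanly between work items."""
    return cancel_state["signal"] is not None
```

A task loop would call `install_cancel_handlers()` once at startup and check `cancelled()` between analysis types, so that a cancellation leaves a clear message in the log rather than a silent stop.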
6. Affected Components
| Component | Path | Role |
|---|---|---|
| `JGLOBAL_ANALYSIS_STATS` | `dev/jobs/JGLOBAL_ANALYSIS_STATS` | J-Job wrapper (65 lines) |
| `exglobal_analysis_stats.py` | `dev/scripts/exglobal_analysis_stats.py` | Python ex-script (1,495 bytes) |
| `AnalysisStats` | `pygfs.task.analysis_stats` | Python task class |
| `cast_strdict_as_dtypedict` | `wxflow` | Env→dict converter (62 callers) |
| `Logger` | `wxflow` | Structured logging (88 callers) |
| `convert_gsi_diags` | `pygfs.task.analysis_stats` | GSI diag converter |
| `config.anlstat` | `EXPDIR/config.anlstat` | Task config (sets TASK_CONFIG_YAML) |
| `config.resources` | `EXPDIR/config.resources` | Resource allocation |
| `URSA.env` | `env/URSA.env` | Platform environment |
| `anlstat_config.yaml.j2` | `parm/gdas/anlstat/` | YAML template for JEDI stats |
| `exglobal_atmos_analysis.sh` | `scripts/` | Modified by PR #4451 (upstream) |
7. MCP Tool Coverage Report
This analysis used 35+ MCP tool invocations across the 9 tool modules listed below, serving as a comprehensive validation of the Python GraphRAG update (Phase 24I):
Tool Module Coverage
| Module | Tools Used | Status |
|---|---|---|
| WorkflowInfoTools | get_workflow_structure, get_system_configs, describe_component | ✅ All working |
| CodeAnalysisTools | analyze_code_structure, find_dependencies, find_callers_callees, find_env_dependencies, trace_execution_path | ✅ All working (Python CALLS/IMPORTS visible) |
| SemanticSearchTools | search_documentation, explain_with_context, find_related_files, get_knowledge_base_status, list_ingested_urls | ✅ All working |
| EE2ComplianceTools | search_ee2_standards, analyze_ee2_compliance | ✅ Working |
| OperationalTools | get_operational_guidance, list_job_scripts, explain_workflow_component | ✅ Working |
| GraphRAGTools | get_code_context, search_architecture, find_similar_code, get_change_impact, trace_data_flow | ✅ All working (GGSR with Python nodes) |
| GitHubTools | search_issues, get_pull_requests, analyze_repository_structure, analyze_workflow_dependencies | ✅ Working |
| SDDTools | list_sdd_workflows, get_sdd_workflow, get_sdd_framework_status | ✅ Working (45 workflows, Phase 24I confirmed) |
| HealthTools | mcp_health_check, get_server_info | ✅ Healthy (42 tools, 60,404 docs, 483,754 rels) |
Python GraphRAG Validation Summary
| Feature | Status | Evidence |
|---|---|---|
| Python CALLS relationships | ✅ | trace_data_flow shows 7 CALLS from exglobal_analysis_stats |
| Python IMPORTS relationships | ✅ | 3 IMPORTS (os, pygfs.task.analysis_stats, wxflow) |
| Cross-language INVOKES | ✅ | Shell J-Job INVOKES Python ex-script |
| Python Fan-in analysis | ✅ | cast_strdict_as_dtypedict: 62 callers, Logger: 88 callers |
| GGSR hop traversal | ✅ | Hop1=3, Hop2=12-17 entities at 77-88ms latency |
| Community detection | ✅ | Communities 57, 69, 1616, 3595, 3628 reachable |
8. Log Evidence Summary
| Timestamp | Event | Status |
|---|---|---|
| 18:57:11 | JGLOBAL_ANALYSIS_STATS begins | ✅ |
| 18:57:11-18:57:19 | jjob_header.sh sources configs, sets up environment | ✅ |
| 18:57:11 | setpdy.sh: COMROOT/date/t18z not found | ⚠️ Non-fatal |
| 18:57:11 | ./PDY: No such file | ⚠️ Non-fatal |
| 18:57:11 | config.base sourced: S2SW, DOIAU=YES, 128 LEVS | ✅ |
| 18:57:11 | config.anlstat: TASK_CONFIG_YAML set | ✅ |
| 18:57:11 | URSA.env: APRUN_ANLSTAT configured | ✅ |
| 18:57:19 | COM directories created (atmos, snow analysis/anlmon) | ✅ |
| 18:57:19 | EXSCRIPT set to exglobal_analysis_stats.py | ✅ |
| 18:57:20 | Python: BEGIN: AnalysisStats.__init__ | ✅ |
| 18:57:24 | slurmstepd: error: *** JOB 7652445 ON u01c30 CANCELLED *** | ❌ |
Total execution time: 13 seconds (job start to SLURM cancellation)
Python execution time: 4 seconds (before external kill)
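The two durations follow directly from the log timestamps. The short calculation below is a sketch using the times quoted in this report and the 30-minute walltime from the resource table; the variable names are illustrative.

```python
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%S"
job_start = datetime.strptime("2026-01-20T18:57:11", FMT)     # J-Job begins
python_start = datetime.strptime("2026-01-20T18:57:20", FMT)  # BEGIN: AnalysisStats.__init__
cancelled_at = datetime.strptime("2026-01-20T18:57:24", FMT)  # slurmstepd CANCELLED

total_seconds = (cancelled_at - job_start).total_seconds()
python_seconds = (cancelled_at - python_start).total_seconds()
walltime_seconds = 30 * 60  # walltime 00:30:00

print(f"total: {total_seconds:.0f} s, python: {python_seconds:.0f} s, "
      f"walltime used: {100 * total_seconds / walltime_seconds:.2f}%")
```

The job consumed well under 1% of its walltime, which is why a time-limit kill can be ruled out on timing alone.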
9. Knowledge Base Statistics at Time of Analysis
| Category | Count |
|---|---|
| ChromaDB Documents | 60,404 |
| ChromaDB Collections | 5 |
| Neo4j Files | 2,744 |
| Neo4j Functions | 1,540 |
| Neo4j Classes | 54 |
| Neo4j Relationships | 483,754 |
| Shell Scripts | 314 (89 J-Jobs, 6 Ex-Scripts, 6 USH) |
| Environment Variables | 2,730 |
| SDD Workflows | 45 |
| MCP Tools | 42 |