EE2_COMPLIANCE_ANALYSIS_RRFS - TerrenceMcGuinness-NOAA/global-workflow GitHub Wiki
RRFS Workflow EE2 Compliance Analysis
Date: November 3, 2025
Repository: NOAA-EMC/rrfs-workflow
Analysis Scope: Top 5 Critical Compliance Issues
Executive Summary
This analysis identifies the top 5 EE2 (NCEP Central Operations' Environmental Equivalence 2 implementation standards) compliance issues in the RRFS (Rapid Refresh Forecast System) workflow repository. Based on a comprehensive review, using MCP RAG tools, of 26 job scripts, 27 execution scripts, 35+ utility scripts, and 54 Python modules, these issues represent the highest-priority remediation targets for operational readiness.
Key Finding: RRFS has BETTER baseline EE2 compliance than global-workflow, with set -xue consistently used and custom error handling utilities. However, critical gaps remain that could impact operational stability.
Total Estimated Remediation Effort: 6-10 weeks
Comparison with Global Workflow
| Feature | Global Workflow | RRFS Workflow |
|---|---|---|
| Shell Error Handling | `set -e` with `&& true` workarounds | ✅ `set -xue` consistently |
| Python Error Handling | ❌ Minimal try-except | ❌ Minimal try-except |
| Custom Error Functions | `err_exit` from prod_util | ✅ `print_err_msg_exit` custom |
| Environment Variable Validation | ❌ Minimal validation | ❌ Minimal validation |
| Trap Handlers | ❌ Not implemented | ❌ Not implemented |
| Error Context Logging | ❌ Minimal context | ⚠️ Partial (custom messages) |
RRFS Advantage: The consistent use of set -xue and custom error messaging functions provides a stronger foundation for EE2 compliance compared to global-workflow's set -e with && true workarounds.
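The behavioral difference is easy to demonstrate. A minimal standalone sketch (plain bash, no RRFS code assumed) contrasting the two styles:

```shell
# global-workflow style: `set -e` with an `&& true` mask. The mask keeps
# errexit from firing, so the script continues past the failure unless
# the author remembers to inspect the status by hand.
out_masked=$(bash -c 'set -e; false && true; echo "still running"')

# RRFS style: `set -ue` with no mask. The first failing command aborts
# the script immediately, so nothing downstream runs on bad state.
rc=0
bash -c 'set -ue; false; echo "never printed"' 2>/dev/null || rc=$?

echo "masked:   ${out_masked}"
echo "unmasked: exit code ${rc}"
```

The masked run prints "still running" despite the failure; the unmasked run stops with exit code 1.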
1. MISSING err_chk FUNCTION IMPLEMENTATION ⚠️ CRITICAL
Finding
All 26 job scripts call err_chk after script execution, but the function is not defined anywhere in the repository. This creates a critical operational risk where errors are captured but never processed.
Evidence from Job Scripts
Example 1: jobs/JRRFS_FORECAST (line 222)

```bash
${HOMErrfs}/scripts/exrrfs_forecast.sh
export err=$?; err_chk
```

Example 2: jobs/JRRFS_POST (line 137)

```bash
${HOMErrfs}/scripts/exrrfs_post.sh
export err=$?; err_chk
```

Example 3: jobs/JRRFS_ANALYSIS_GSI (line 180)

```bash
$HOMErrfs/scripts/exrrfs_analysis_gsi.sh
export err=$?; err_chk
```
All 26 job scripts follow this pattern:
- JRRFS_FORECAST
- JRRFS_POST
- JRRFS_ANALYSIS_GSI
- JRRFS_ANALYSIS_ENKF
- JRRFS_ANALYSIS_GSIDIAG
- JRRFS_ANALYSIS_NONVARCLD
- JRRFS_BLEND_ICS
- JRRFS_BUFRSND
- JRRFS_CALC_ENSMEAN
- JRRFS_FSM
- JRRFS_GEMPAK
- JRRFS_MAKE_GRID
- JRRFS_MAKE_ICS
- JRRFS_MAKE_LBCS
- JRRFS_MAKE_OROG
- JRRFS_MAKE_SFC_CLIMO
- JRRFS_PRDGEN
- JRRFS_PREP_CYC
- JRRFS_PROCESS_BUFR
- JRRFS_PROCESS_LIGHTNING
- JRRFS_PROCESS_RADAR
- JRRFS_PROCESS_SMOKE
- JRRFS_RECENTER
- JRRFS_SAVE_DA_OUTPUT
- JRRFS_SAVE_RESTART
- JRRFS_UPDATE_LBC_SOIL
EE2 Violations
- ❌ Undefined function called in all job scripts - bash reports "command not found" (exit status 127), which aborts the job under `set -e`
- ❌ No error checking actually happens despite `export err=$?`
- ❌ Scripts may fail silently if `err_chk` is expected as an external command but is not in `PATH`
- ❌ Error codes captured but never logged or acted upon
- ❌ No classification of error types (transient vs. permanent)
- ❌ No operator guidance when errors occur
grep Evidence
```bash
$ grep -r "function err_chk" rrfs-workflow/
# No results

$ grep -r "err_chk()" rrfs-workflow/
# No results

$ grep -r "err_chk" rrfs-workflow/jobs/
# 26 matches - all calling undefined function
```
Recommended Fix
Create: ush/bash_utils/err_chk.sh
```bash
#!/bin/bash

function err_chk() {
  #
  # Check the error code and take appropriate action
  #
  # Usage: Called after "export err=$?"
  #
  set +x  # Don't trace error checking
  local err_code="${err:-0}"
  if [[ ${err_code} -ne 0 ]]; then
    # Get calling script information
    local caller_fp=$( readlink -f "${BASH_SOURCE[1]}" )
    local caller_fn=$( basename "${caller_fp}" )
    local caller_line="${BASH_LINENO[0]}"
    # Build comprehensive error message
    >&2 echo "================================================================"
    >&2 echo "FATAL ERROR DETECTED"
    >&2 echo "----------------------------------------------------------------"
    >&2 echo "Script:            ${caller_fn}"
    >&2 echo "Location:          ${caller_fp}:${caller_line}"
    >&2 echo "Exit code:         ${err_code}"
    >&2 echo "Time:              $(date -u '+%Y-%m-%d %H:%M:%S UTC')"
    >&2 echo "Host:              ${HOSTNAME}"
    >&2 echo "Working directory: ${PWD}"
    >&2 echo "----------------------------------------------------------------"
    # Classify error type
    if [[ ${err_code} -eq 137 ]]; then
      >&2 echo "ERROR TYPE: Process killed (SIGKILL)"
      >&2 echo "LIKELY CAUSE: Out of memory or walltime exceeded"
      >&2 echo "RECOMMENDATION: Check resource allocations"
    elif [[ ${err_code} -eq 143 ]]; then
      >&2 echo "ERROR TYPE: Process terminated (SIGTERM)"
      >&2 echo "LIKELY CAUSE: Job cancelled or walltime limit"
    elif [[ ${err_code} -ge 128 ]]; then
      local signal=$((err_code - 128))
      >&2 echo "ERROR TYPE: Terminated by signal ${signal}"
    elif [[ ${err_code} -eq 1 ]]; then
      >&2 echo "ERROR TYPE: General script error"
    else
      >&2 echo "ERROR TYPE: Script-specific error code ${err_code}"
    fi
    # Display recent output if available
    if [[ -n "${pgmout:-}" && -s "${pgmout}" ]]; then
      >&2 echo "----------------------------------------------------------------"
      >&2 echo "Last 30 lines of output:"
      >&2 echo "----------------------------------------------------------------"
      >&2 tail -30 "${pgmout}"
    fi
    >&2 echo "================================================================"
    # Send alert to workflow manager if configured
    if [[ "${WORKFLOW_MANAGER:-}" == "ecflow" && -n "${ECF_NAME:-}" ]]; then
      ecflow_client --abort="${ECF_NAME}: Script failed with error ${err_code}"
    fi
    # Exit with the original error code
    exit ${err_code}
  fi
  set -x  # Resume tracing
  return 0
}

# Export function for use in all scripts
declare -fx err_chk
```
Update: ush/source_util_funcs.sh (add after line 122)
```bash
#-----------------------------------------------------------------------
#
# Source the file containing the err_chk function
#
#-----------------------------------------------------------------------
#
. ${bashutils_dir}/err_chk.sh
```
Impact Assessment
- Severity: CRITICAL
- Files Affected: All 26 job scripts
- Production Risk: Errors currently go undetected - scripts may fail with cryptic "command not found" messages or silently continue
- Estimated Fix Effort: 1-2 days (create function + testing)
- Priority: IMMEDIATE - This is a fundamental operational safety issue
2. INADEQUATE PYTHON ERROR HANDLING ⚠️ HIGH
Finding
Python utility modules lack comprehensive exception handling, following the same pattern as global-workflow but with better utility structure.
Example 1: ush/python_utils/run_command.py
```python
import subprocess

def run_command(cmd):
    """Run system command in a subprocess

    Args:
        cmd: command to execute
    Returns:
        Tuple of (exit code, std_out, std_err)
    """
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        shell=True,
        universal_newlines=True,
    )
    std_out, std_err = proc.communicate()
    # strip trailing newline character
    return (proc.returncode, std_out.rstrip("\n"), std_err.rstrip("\n"))
```
EE2 Violations
- ❌ No exception handling for `subprocess.Popen` failures
- ❌ No timeout mechanism for long-running commands
- ❌ No validation that `cmd` is non-empty or safe
- ❌ Shell injection vulnerability (`shell=True` without sanitization)
- ❌ No logging of command execution
- ❌ Return code not checked by caller
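The injection risk is concrete: because `run_command` hands its string straight to `shell=True`, any metacharacters in the argument become shell syntax. A standalone sketch (not RRFS code) of the failure mode and the list-form alternative:

```python
import subprocess

# A filename containing shell metacharacters, as an attacker (or an odd
# data file) might supply it.
user_input = "file.txt; echo INJECTED"

# shell=True interpolates the string into a shell: the semicolon splits
# it into two commands, so "echo INJECTED" actually runs.
unsafe = subprocess.run(f"ls {user_input}", shell=True,
                        capture_output=True, text=True)
print("INJECTED" in unsafe.stdout)   # True: the injected command executed

# Passing an argument list with shell=False treats the whole string as a
# single literal filename; no second command can sneak in.
safe = subprocess.run(["ls", user_input], capture_output=True, text=True)
print(safe.returncode != 0)          # True: no such file, nothing injected
```

This is why the recommended fix below routes callers toward list-form commands and flags suspicious `shell=True` strings.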
Example 2: ush/python_utils/filesys_cmds_vrfy.py
```python
import os
from .print_msg import print_err_msg_exit

def cmd_vrfy(cmd, *args):
    """Execute system command

    Args:
        cmd: the command
        *args: its arguments
    Returns:
        Exit code
    """
    cmd += " " + " ".join([str(a) for a in args])
    ret = os.system(cmd)
    if ret != 0:
        print_err_msg_exit(f"System call '{cmd}' failed.")
    return ret
```
EE2 Violations (Better than global-workflow)
- ✅ Does check return code and calls error exit
- ✅ Has error message with command context
- ❌ Uses `os.system()`, for which the `subprocess` module is the recommended replacement
- ❌ Shell injection vulnerability (command string concatenation)
- ❌ No distinction between different error types
- ❌ No retry logic for transient failures
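One subtlety worth noting when migrating off `os.system()`: on POSIX it returns the raw wait status, not the exit code, so the `ret != 0` check above works but the value itself is misleading in logs. A small standalone illustration (requires Python 3.9+ for `os.waitstatus_to_exitcode`):

```python
import os
import subprocess

# os.system() returns the wait status: an exit code of 3 encodes as 3 << 8.
status = os.system("exit 3")
print(status)                                # 768 on POSIX, not 3
print(os.waitstatus_to_exitcode(status))     # 3

# subprocess reports the exit code directly, with no decoding needed.
print(subprocess.run(["false"]).returncode)  # 1
```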
Recommended Fix for run_command.py
```python
#!/usr/bin/env python3
import logging
import shlex
import subprocess

from .print_msg import print_err_msg_exit

logger = logging.getLogger(__name__)


def run_command(cmd, timeout=None, check=False, shell=False, safe_shell=True):
    """Run system command in a subprocess with proper error handling

    Args:
        cmd: command to execute (string if shell=True, list otherwise)
        timeout: maximum execution time in seconds (default: no timeout)
        check: if True, raise exception on non-zero exit code
        shell: if True, execute through shell (security risk - use sparingly)
        safe_shell: if True and shell=True, validate command safety
    Returns:
        Tuple of (exit code, std_out, std_err)
    Raises:
        ValueError: if command is invalid
        subprocess.TimeoutExpired: if timeout exceeded
        subprocess.CalledProcessError: if check=True and command fails
    """
    # Validate input
    if not cmd:
        raise ValueError("Command cannot be empty")

    # Log command execution
    logger.info(f"Executing command: {cmd}")

    # Security check for shell commands
    if shell and safe_shell and isinstance(cmd, str):
        # Check for dangerous patterns
        dangerous_patterns = [";", "&&", "||", "|", ">", "<", "`", "$("]
        if any(pattern in cmd for pattern in dangerous_patterns):
            logger.warning(f"Potentially unsafe shell command: {cmd}")

    try:
        proc = subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            shell=shell,
            text=True,
        )
        try:
            std_out, std_err = proc.communicate(timeout=timeout)
        except subprocess.TimeoutExpired:
            proc.kill()
            std_out, std_err = proc.communicate()
            logger.error(f"Command timed out after {timeout}s: {cmd}")
            raise

        # Strip trailing newlines
        std_out = std_out.rstrip("\n") if std_out else ""
        std_err = std_err.rstrip("\n") if std_err else ""

        # Log results
        if proc.returncode != 0:
            logger.error(f"Command failed with exit code {proc.returncode}")
            logger.error(f"stderr: {std_err}")
            if check:
                raise subprocess.CalledProcessError(
                    proc.returncode, cmd, std_out, std_err
                )
        return (proc.returncode, std_out, std_err)
    except OSError as e:
        logger.error(f"Failed to execute command: {e}")
        print_err_msg_exit(f"Command execution failed: {cmd}\nError: {e}")


def run_command_safe(cmd_list, **kwargs):
    """Run command with automatic argument escaping (no shell injection)

    Args:
        cmd_list: list of command and arguments
        **kwargs: passed to run_command
    Returns:
        Tuple of (exit code, std_out, std_err)
    """
    if isinstance(cmd_list, str):
        # Parse string into safe list
        cmd_list = shlex.split(cmd_list)
    return run_command(cmd_list, shell=False, **kwargs)
```
Recommended Fix for filesys_cmds_vrfy.py
```python
#!/usr/bin/env python3
import errno
import logging
import subprocess
import time

from .print_msg import print_err_msg_exit

logger = logging.getLogger(__name__)


def cmd_vrfy(cmd, *args, retry_count=0, retry_delay=1):
    """Execute system command with retry logic for transient failures

    Args:
        cmd: the command
        *args: its arguments
        retry_count: number of retries for transient failures
        retry_delay: seconds to wait between retries
    Returns:
        Exit code (always 0 on success, exits on permanent failure)
    Raises:
        SystemExit: on permanent failure or exceeded retries
    """
    # Build command list
    cmd_list = [cmd] + [str(a) for a in args]

    # Execute with retries
    for attempt in range(retry_count + 1):
        try:
            result = subprocess.run(
                cmd_list,
                capture_output=True,
                text=True,
                check=False,
            )
            if result.returncode == 0:
                if result.stdout:
                    logger.info(result.stdout)
                return 0

            # Check if error is transient
            is_transient = False
            if result.returncode in [errno.EAGAIN, errno.EBUSY, errno.ETIMEDOUT]:
                is_transient = True
            elif "temporarily unavailable" in result.stderr.lower():
                is_transient = True
            elif "resource busy" in result.stderr.lower():
                is_transient = True

            if is_transient and attempt < retry_count:
                logger.warning(
                    f"Transient error on attempt {attempt + 1}/{retry_count + 1}, "
                    f"retrying in {retry_delay}s..."
                )
                time.sleep(retry_delay)
                continue

            # Permanent failure or retries exhausted
            error_msg = f"System call failed: {' '.join(cmd_list)}\n"
            error_msg += f"Exit code: {result.returncode}\n"
            if result.stderr:
                error_msg += f"Error output: {result.stderr}"
            print_err_msg_exit(error_msg)
        except FileNotFoundError:
            print_err_msg_exit(f"Command not found: {cmd}")
        except PermissionError:
            print_err_msg_exit(f"Permission denied: {cmd}")
        except Exception as e:
            print_err_msg_exit(f"Unexpected error executing {cmd}: {e}")
    return 0  # Never reached; print_err_msg_exit terminates first


def cp_vrfy(*args):
    return cmd_vrfy("cp", *args, retry_count=2)


def rsync_vrfy(*args):
    return cmd_vrfy("rsync", *args, retry_count=3)


def mv_vrfy(*args):
    return cmd_vrfy("mv", *args, retry_count=2)


def rm_vrfy(*args):
    return cmd_vrfy("rm", *args)


def ln_vrfy(*args):
    return cmd_vrfy("ln", *args)


def mkdir_vrfy(*args):
    return cmd_vrfy("mkdir", *args, retry_count=1)
```
Impact Assessment
- Severity: HIGH
- Files Affected: 54 Python scripts/modules
- Production Risk: Unhandled exceptions, no retry for transient failures, security vulnerabilities
- Estimated Fix Effort: 2-3 weeks
- Advantage over Global Workflow: Already has `cmd_vrfy` checking return codes
3. ENVIRONMENT VARIABLE VALIDATION GAPS ⚠️ HIGH
Finding
Job scripts use environment variables with defaults but without validation, similar to global-workflow but with better consistency.
Example: jobs/JRRFS_FORECAST (lines 7-29)
```bash
export NET='rrfs'
export MACHINE="WCOSS2"
export WORKFLOW_MANAGER="ecflow"
export HOMErrfs=${HOMErrfs:-""}
export EXECrrfs=${EXECrrfs:-${HOMErrfs}/exec}
export FIXrrfs=${FIXrrfs:-${HOMErrfs}/fix}
export PARMrrfs=${PARMrrfs:-${HOMErrfs}/parm}
export USHrrfs=${USHrrfs:-${HOMErrfs}/ush}
export SORCrrfs=${SORCrrfs:-${HOMErrfs}/sorc}
export FIXam=${FIXam:-${HOMErrfs}/fix/am}
export FIXLAM=${FIXLAM:-${HOMErrfs}/fix/lam/RRFS_NA_3km}
export FIXgsm=${FIXgsm:-${HOMErrfs}/fix/am}
export FIX_GSI="${FIXrrfs}/gsi"
export FIX_UPP="${FIXrrfs}/upp"
export FIXprdgen="${FIXrrfs}/prdgen"
export FIX_SMOKE_DUST="${FIXrrfs}/smoke_dust"
export GRID_DIR="${HOMErrfs}/fix/lam/RRFS_NA_3km"
export OROG_DIR="${HOMErrfs}/fix/lam/RRFS_NA_3km"
export SFC_CLIMO_DIR="${HOMErrfs}/fix/lam/RRFS_NA_3km"
```
EE2 Violations
- ❌ `HOMErrfs` defaults to empty string `""` instead of failing
- ❌ All dependent paths become invalid if `HOMErrfs=""` (e.g., `/exec`, `/fix`)
- ❌ No validation that directories exist before use
- ❌ `MACHINE` hardcoded to "WCOSS2" - no runtime detection
- ❌ `WGF` variable used without validation (lines 68-78, 94)
- ❌ `cyc`, `PDY`, `CDATE` used without validation
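The first two violations compound: an empty `HOMErrfs` does not fail, it silently produces root-anchored paths. A quick demonstration, independent of any RRFS script:

```shell
# Simulate a job script started without HOMErrfs in the environment.
unset HOMErrfs EXECrrfs FIXrrfs

# The ${VAR:-""} idiom swallows the problem: HOMErrfs becomes "".
export HOMErrfs=${HOMErrfs:-""}
export EXECrrfs=${EXECrrfs:-${HOMErrfs}/exec}
export FIXrrfs=${FIXrrfs:-${HOMErrfs}/fix}

# Every derived path is now a bogus root-level directory.
echo "EXECrrfs=${EXECrrfs}"   # EXECrrfs=/exec
echo "FIXrrfs=${FIXrrfs}"     # FIXrrfs=/fix
```

A `${HOMErrfs:?HOMErrfs must be set}` expansion, or the validation function recommended below, would fail fast instead.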
Example: jobs/JRRFS_MAKE_GRID (lines 14-20)
```bash
export HOMErrfs=${HOMErrfs:-""}
export EXECrrfs=${EXECrrfs:-${HOMErrfs}/exec}
export FIXrrfs=${FIXrrfs:-${HOMErrfs}/fix}
export PARMrrfs=${PARMrrfs:-${HOMErrfs}/parm}
export USHrrfs=${USHrrfs:-${HOMErrfs}/ush}
export FIXam=${FIXam:-${HOMErrfs}/fix/am}
export FIXLAM=${FIXLAM:-${HOMErrfs}/fix/lam/RRFS_NA_3km}
```
Same pattern repeated in all 26 job scripts!
Recommended Fix
Create: ush/bash_utils/validate_env.sh
```bash
#!/bin/bash

function validate_rrfs_environment() {
  #
  # Validate critical RRFS environment variables
  #
  # Usage: Call at start of job scripts before any path usage
  #
  set +x  # Don't trace validation
  local missing_vars=()
  local invalid_vars=()
  local missing_dirs=()

  # Critical variables that must be set
  local required_vars=(
    "HOMErrfs"
    "cyc"
    "PDY"
    "WGF"
    "DATAROOT"
    "job"
  )

  # Check required variables are non-empty
  for var in "${required_vars[@]}"; do
    if [[ -z "${!var:-}" ]]; then
      missing_vars+=("${var}")
    fi
  done

  # Report missing variables and exit
  if [[ ${#missing_vars[@]} -gt 0 ]]; then
    >&2 echo "================================================================"
    >&2 echo "FATAL ERROR: Required environment variables not set"
    >&2 echo "----------------------------------------------------------------"
    >&2 printf "  %s\n" "${missing_vars[@]}"
    >&2 echo "================================================================"
    exit 1
  fi

  # Validate PDY format (YYYYMMDD)
  if [[ ! "${PDY}" =~ ^[0-9]{8}$ ]]; then
    invalid_vars+=("PDY=${PDY} (expected YYYYMMDD format)")
  fi

  # Validate cyc format (HH, 00-23); the 10# prefix forces base-10 so
  # leading-zero hours (08, 09) are not misparsed as octal
  if [[ ! "${cyc}" =~ ^[0-9]{2}$ ]] || (( 10#${cyc} > 23 )); then
    invalid_vars+=("cyc=${cyc} (expected HH format, 00-23)")
  fi

  # Validate WGF value
  local valid_wgf=("det" "enkf" "ensf" "firewx")
  if [[ ! " ${valid_wgf[*]} " =~ " ${WGF} " ]]; then
    invalid_vars+=("WGF=${WGF} (expected: det, enkf, ensf, or firewx)")
  fi

  # Report invalid formats
  if [[ ${#invalid_vars[@]} -gt 0 ]]; then
    >&2 echo "================================================================"
    >&2 echo "FATAL ERROR: Invalid environment variable formats"
    >&2 echo "----------------------------------------------------------------"
    >&2 printf "  %s\n" "${invalid_vars[@]}"
    >&2 echo "================================================================"
    exit 1
  fi

  # Validate critical directories exist
  local critical_dirs=(
    "${HOMErrfs}"
    "${HOMErrfs}/exec"
    "${HOMErrfs}/fix"
    "${HOMErrfs}/parm"
    "${HOMErrfs}/ush"
    "${HOMErrfs}/scripts"
  )
  for dir in "${critical_dirs[@]}"; do
    if [[ ! -d "${dir}" ]]; then
      missing_dirs+=("${dir}")
    fi
  done

  # Report missing directories
  if [[ ${#missing_dirs[@]} -gt 0 ]]; then
    >&2 echo "================================================================"
    >&2 echo "FATAL ERROR: Required directories do not exist"
    >&2 echo "----------------------------------------------------------------"
    >&2 printf "  %s\n" "${missing_dirs[@]}"
    >&2 echo "================================================================"
    >&2 echo "Check that HOMErrfs is set correctly: ${HOMErrfs}"
    exit 1
  fi

  # Validate DATAROOT exists and is writable
  if [[ ! -d "${DATAROOT}" ]]; then
    >&2 echo "FATAL ERROR: DATAROOT does not exist: ${DATAROOT}"
    exit 1
  fi
  if [[ ! -w "${DATAROOT}" ]]; then
    >&2 echo "FATAL ERROR: DATAROOT is not writable: ${DATAROOT}"
    exit 1
  fi

  set -x  # Resume tracing
  echo "Environment validation: PASSED"
  return 0
}

# Export function
declare -fx validate_rrfs_environment
```
Update all job scripts (add after line 6, before variable exports):
```bash
date
export PS4='+ $SECONDS + '
set -xue

# Validate environment before proceeding
. ${HOMErrfs}/ush/bash_utils/validate_env.sh
validate_rrfs_environment

#-----------------------------------------------------------------------
# Specify Package Location
#-----------------------------------------------------------------------
```
Impact Assessment
- Severity: HIGH
- Files Affected: All 26 job scripts
- Production Risk: Invalid paths cause cascading failures, operations on wrong directories
- Estimated Fix Effort: 2 weeks (centralized validation + script updates)
- Better than Global Workflow: Uses `set -u`, so undefined variables fail immediately
4. NO TRAP HANDLERS FOR CLEANUP ⚠️ MEDIUM-HIGH
Finding
RRFS workflow has no trap handlers for cleanup on error, interrupt, or exit - identical issue to global-workflow.
Evidence
```bash
$ grep -r "trap " rrfs-workflow/ush/bash_utils/*.sh
# No results

$ grep -r "trap " rrfs-workflow/jobs/
# No results

$ grep -r "trap " rrfs-workflow/scripts/
# No results
```
Job Script Pattern (All 26 Jobs)
```bash
export DATA=${DATA:-${DATAROOT}/${jobid}}
mkdir -p ${DATA}
cd ${DATA}

# ... work happens ...

# No cleanup code at all
# No trap handlers
# Temporary files/directories left behind on error
```
EE2 Violations
- ❌ No cleanup on abnormal termination (SIGINT, SIGTERM; SIGKILL cannot be trapped)
- ❌ No cleanup on script errors
- ❌ Temporary files left in the `${DATA}` directory on failure
- ❌ No tracking of background processes
- ❌ Resource leaks (disk space) accumulate over time
- ❌ No graceful shutdown mechanism
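The core mechanism the recommended fix relies on is bash's EXIT trap, which fires on normal exit, on errors, and on trappable signals alike. A minimal standalone sketch:

```shell
workdir=$(mktemp -d)
rc=0

# Run the "job" in a subshell so the trap's effect is easy to observe.
(
  trap 'rm -rf "${workdir}"' EXIT   # cleanup runs no matter how we exit
  touch "${workdir}/scratch.txt"
  exit 1                            # simulated mid-job failure
  echo "never reached"
) || rc=$?

echo "subshell exit code: ${rc}"                            # 1
[ -d "${workdir}" ] && echo "leaked" || echo "cleaned up"   # cleaned up
```

Even though the job aborts partway through, the scratch directory is gone afterward; without the trap, it would be left behind, which is exactly the leak described above.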
Recommended Fix
Add to: ush/bash_utils/cleanup.sh
```bash
#!/bin/bash

# Global arrays to track resources
declare -a CLEANUP_DIRS=()
declare -a CLEANUP_FILES=()
declare -a BACKGROUND_PIDS=()

function register_cleanup_dir() {
  # Register directory for cleanup
  # Usage: register_cleanup_dir /path/to/dir
  CLEANUP_DIRS+=("$1")
}

function register_cleanup_file() {
  # Register file for cleanup
  # Usage: register_cleanup_file /path/to/file
  CLEANUP_FILES+=("$1")
}

function register_background_job() {
  # Register background PID
  # Usage: some_command & register_background_job $!
  BACKGROUND_PIDS+=("$1")
}

function cleanup_on_exit() {
  #
  # Cleanup function for exit, error, or interrupt
  #
  local exit_code=$?
  set +x
  echo "================================================================"
  echo "Running cleanup (exit code: ${exit_code})"
  echo "================================================================"

  # Kill background jobs
  if [[ ${#BACKGROUND_PIDS[@]} -gt 0 ]]; then
    echo "Killing ${#BACKGROUND_PIDS[@]} background job(s)..."
    for pid in "${BACKGROUND_PIDS[@]}"; do
      if kill -0 "${pid}" 2>/dev/null; then
        echo "  Killing PID ${pid}"
        kill "${pid}" 2>/dev/null || true
      fi
    done
  fi

  # Kill untracked background jobs
  local untracked=$(jobs -p)
  if [[ -n "${untracked}" ]]; then
    echo "Killing untracked background jobs..."
    echo "${untracked}" | xargs -r kill 2>/dev/null || true
  fi

  # Remove temporary files
  if [[ ${#CLEANUP_FILES[@]} -gt 0 ]]; then
    echo "Removing ${#CLEANUP_FILES[@]} temporary file(s)..."
    for file in "${CLEANUP_FILES[@]}"; do
      [[ -f "${file}" ]] && rm -f "${file}" 2>/dev/null || true
    done
  fi

  # Remove temporary directories
  if [[ "${KEEPDATA:-NO}" == "NO" ]]; then
    if [[ ${#CLEANUP_DIRS[@]} -gt 0 ]]; then
      echo "Removing ${#CLEANUP_DIRS[@]} temporary director(ies)..."
      for dir in "${CLEANUP_DIRS[@]}"; do
        if [[ -d "${dir}" ]]; then
          echo "  Removing ${dir}"
          rm -rf "${dir}" 2>/dev/null || true
        fi
      done
    fi
    # Clean up DATA directory
    if [[ -n "${DATA:-}" && -d "${DATA}" ]]; then
      echo "Removing working directory: ${DATA}"
      cd "${DATAROOT:-/tmp}" 2>/dev/null || cd /tmp
      rm -rf "${DATA}" 2>/dev/null || true
    fi
  else
    echo "KEEPDATA=YES: Preserving directories"
  fi

  echo "================================================================"
  echo "Cleanup complete"
  echo "================================================================"
  return ${exit_code}
}

function error_trap() {
  #
  # Trap handler for ERR signal
  #
  local exit_code=$?
  local line_number=$1
  set +x
  >&2 echo "================================================================"
  >&2 echo "ERROR TRAPPED"
  >&2 echo "----------------------------------------------------------------"
  >&2 echo "Exit code:   ${exit_code}"
  >&2 echo "Line number: ${line_number}"
  >&2 echo "Script:      ${BASH_SOURCE[1]:-unknown}"
  >&2 echo "================================================================"
  export err=${exit_code}
}

# Set trap handlers
trap cleanup_on_exit EXIT
trap 'error_trap ${LINENO}' ERR
trap 'echo "Interrupt received"; exit 130' INT
trap 'echo "Termination signal received"; exit 143' TERM

# Export functions
declare -fx register_cleanup_dir
declare -fx register_cleanup_file
declare -fx register_background_job
declare -fx cleanup_on_exit
```
Update: ush/source_util_funcs.sh (add cleanup sourcing)
```bash
#-----------------------------------------------------------------------
#
# Source the file containing cleanup functions and set traps
#
#-----------------------------------------------------------------------
#
. ${bashutils_dir}/cleanup.sh
```
Update job scripts (add after creating DATA directory):
```bash
export DATA=${DATA:-${DATAROOT}/${jobid}}
mkdir -p ${DATA}
cd ${DATA}

# Register for cleanup
register_cleanup_dir "${DATA}"
```
Impact Assessment
- Severity: MEDIUM-HIGH
- Files Affected: All job and execution scripts
- Production Risk: Disk space exhaustion, orphaned processes, resource leaks
- Estimated Fix Effort: 1-2 weeks
- Same Issue as Global Workflow: No advantage
5. INSUFFICIENT ERROR CONTEXT IN EXECUTION SCRIPTS ⚠️ MEDIUM
Finding
Execution scripts use print_err_msg_exit which is better than global-workflow's err_exit, but error messages lack operational context.
Example: scripts/exrrfs_post.sh (line 92)
```bash
*)
  err_exit "\
Run command has not been specified for this machine:
  MACHINE = \"$MACHINE\"
  APRUN = \"$APRUN\""
  ;;
```
Better Error Function (Already Exists)
ush/bash_utils/print_msg.sh has comprehensive error messaging:
```bash
function print_err_msg_exit() {
  # Prints detailed error with:
  # - Calling function/script name
  # - Full file path
  # - Error message
  # - "Exiting with nonzero status"
}
```
EE2 Violations (Partial Compliance)
- ✅ Better than global-workflow - has detailed error function
- ✅ Shows calling script/function information
- ❌ No error classification (transient vs permanent)
- ❌ No timestamps in error messages
- ❌ No hostname/location information
- ❌ No recent output context
- ❌ No retry suggestions
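The caller information in these messages comes from bash's introspection arrays (`FUNCNAME`, `BASH_SOURCE`, `BASH_LINENO`), the same mechanism `print_err_msg_exit` uses. A standalone sketch of how they identify the call site:

```shell
where_called() {
  # FUNCNAME[1] names the caller; BASH_LINENO[0] is the line of the call.
  echo "called from ${FUNCNAME[1]:-main} at line ${BASH_LINENO[0]}"
}

outer() {
  where_called
}

msg=$(outer)
echo "${msg}"   # e.g. "called from outer at line 7"
```

Timestamps, hostname, and cycle context are not in these arrays, which is why the enhancement below has to gather them separately from `date`, `HOSTNAME`, `PDY`, and friends.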
Recommended Enhancement
Update: ush/bash_utils/print_msg.sh - Enhance print_err_msg_exit
```bash
function print_err_msg_exit() {
  set +x  # Don't trace while composing the error message

  local scrfunc_fp=$( readlink -f "${BASH_SOURCE[0]}" )
  local scrfunc_fn=$( basename "${scrfunc_fp}" )
  local scrfunc_dir=$( dirname "${scrfunc_fp}" )
  local func_name="${FUNCNAME[0]}"

  local caller_fp=$( readlink -f "${BASH_SOURCE[1]}" )
  local caller_fn=$( basename "${caller_fp}" )
  local caller_dir=$( dirname "${caller_fp}" )
  local caller_name="${FUNCNAME[1]}"

  local msg_header \
        msg_footer \
        err_msg

  # Enhanced header with operational context
  if [ "${caller_name}" = "main" ] || [ "${caller_name}" = "script" ]; then
    msg_header=$( printf "\n\
================================================================
FATAL ERROR
----------------------------------------------------------------
From script: \"${caller_fn}\"
Full path:   \"${caller_fp}\"
Time:        $(date -u '+%Y-%m-%d %H:%M:%S UTC')
Host:        ${HOSTNAME}
Working dir: ${PWD}
User:        ${USER}
PDY/cyc:     ${PDY:-unknown}/${cyc:-unknown}
WGF:         ${WGF:-unknown}
Job:         ${job:-unknown}
================================================================
" )
  else
    msg_header=$( printf "\n\
================================================================
FATAL ERROR
----------------------------------------------------------------
From function: \"${caller_name}\"
In file:       \"${caller_fn}\"
Full path:     \"${caller_fp}\"
Time:          $(date -u '+%Y-%m-%d %H:%M:%S UTC')
Host:          ${HOSTNAME}
Working dir:   ${PWD}
User:          ${USER}
PDY/cyc:       ${PDY:-unknown}/${cyc:-unknown}
WGF:           ${WGF:-unknown}
Job:           ${job:-unknown}
================================================================
" )
  fi

  # Enhanced footer with guidance
  msg_footer=$( printf "\n\
----------------------------------------------------------------
OPERATOR ACTIONS:
  1. Check log files in: ${DATA:-unknown}
  2. Review pgmout: ${pgmout:-OUTPUT.$$}
  3. Check disk space: df -h ${DATA:-/tmp}
  4. Contact on-call developer if issue persists
================================================================
Exiting with nonzero status." )

  if [ "$#" -eq 0 ]; then
    err_msg=""
  elif [ "$#" -eq 1 ]; then
    err_msg="\n$1\n"
  fi

  # Show recent output if available
  if [[ -n "${pgmout:-}" && -s "${pgmout}" ]]; then
    err_msg+=$( printf "\n\
----------------------------------------------------------------
Last 20 lines of output (${pgmout}):
----------------------------------------------------------------\n" )
    err_msg+=$(tail -20 "${pgmout}" 2>/dev/null)
    err_msg+=$( printf "\n----------------------------------------------------------------" )
  fi

  printf "${msg_header}${err_msg}${msg_footer}\n" 1>&2
  exit 1
}
```
Impact Assessment
- Severity: MEDIUM
- Files Affected: `ush/bash_utils/print_msg.sh` (1 file, affects all scripts)
- Production Risk: Incomplete error information slows debugging
- Estimated Fix Effort: 1 week
- Advantage: Already has better foundation than global-workflow
Implementation Priority and Timeline
Phase 1: Critical Safety (Weeks 1-2)
Priority: IMMEDIATE
1. Implement err_chk function - Scripts currently call undefined function
- Create `ush/bash_utils/err_chk.sh`
- Update `source_util_funcs.sh` to source it
- Test with all 26 job scripts
- Deliverable: err_chk function operational in all jobs
Phase 2: Python Error Handling (Weeks 3-4)
Priority: HIGH
2. Enhanced Python utilities - Fix security and error handling gaps
- Update `run_command.py` with timeout and validation
- Update `filesys_cmds_vrfy.py` with retry logic
- Add logging throughout Python modules
- Deliverable: Robust Python utility library
Phase 3: Environment Validation (Weeks 5-6)
Priority: HIGH
3. Environment variable validation - Prevent invalid path usage
- Create `validate_env.sh` function
- Update all job scripts to call validation
- Test with missing/invalid variables
- Deliverable: 100% validation coverage
Phase 4: Cleanup Infrastructure (Weeks 7-8)
Priority: MEDIUM-HIGH
4. Trap handlers and cleanup - Prevent resource leaks
- Implement cleanup functions
- Add trap handlers to all scripts
- Test cleanup in failure scenarios
- Deliverable: Automatic cleanup on all exit conditions
Phase 5: Error Messaging (Weeks 9-10)
Priority: MEDIUM
5. Enhanced error context - Improve debuggability
- Update `print_err_msg_exit` with operational context
- Add timestamps and location information
- Include recent output in error messages
- Deliverable: Comprehensive error reporting
Testing Strategy
Unit Testing
- Test err_chk with various error codes
- Test validation with missing/invalid variables
- Test Python utilities with failures and timeouts
- Test cleanup in all exit scenarios
Integration Testing
- Run full RRFS workflow cycles with modifications
- Test error propagation through workflow
- Verify cleanup happens correctly
- Test ecFlow integration
Failure Testing
- Kill jobs at various stages
- Introduce resource exhaustion
- Test signal handling (INT, TERM)
- Verify error messages are actionable
Operational Testing
- Run parallel to production for 1 week
- Compare error messages and MTTR
- Gather operator feedback
Success Metrics
Quantitative
- Zero "command not found" errors for err_chk
- 100% environment validation before path usage
- <5 minute MTTR reduction from better error messages
- Zero resource leaks after failures
- 100% cleanup execution on abnormal termination
Qualitative
- Operators can diagnose failures without developer assistance
- Error messages provide actionable guidance
- Log files contain sufficient context
- Scripts fail fast with clear messages
- No disk space issues from accumulated temporary files
RRFS Advantages Over Global Workflow
- Consistent `set -xue` - Catches undefined variables and command failures
- Custom error functions - `print_err_msg_exit` provides detailed context
- Verified filesystem operations - `*_vrfy` functions check return codes
- Better Python structure - Centralized utility modules
- Consistent variable naming - RRFS-specific prefix conventions
Appendix: EE2 Compliance Checklist
Critical (Must Fix)
- Implement err_chk function (WEEKS 1-2)
- Add environment validation (WEEKS 5-6)
- Fix Python error handling (WEEKS 3-4)
High Priority
- Add trap handlers and cleanup (WEEKS 7-8)
- Enhance error messaging (WEEKS 9-10)
Medium Priority
- Add retry logic for transient failures
- Implement timeout mechanisms
- Add security validation for shell commands
Best Practices
- Add unit tests for utility functions
- Document error codes and meanings
- Create operator troubleshooting guide
- Set up automated compliance checking
Document Metadata
- Author: AI Analysis System (MCP RAG)
- Date: November 3, 2025
- Repository: NOAA-EMC/rrfs-workflow
- Branch: Analyzed main branch
- Analysis Tools: MCP RAG System (ChromaDB + Neo4j), grep_search, file reading
- Files Analyzed: 142+ (26 jobs, 27 scripts, 35 ush, 54 Python)
- Review Status: Initial Draft
- Next Review: TBD
Contact and Questions
For questions about this analysis or implementation guidance, contact:
- RRFS Team: [email protected]
- NCO EE2 Compliance: [email protected]
END OF DOCUMENT