EE2_COMPLIANCE_ANALYSIS_RRFS - TerrenceMcGuinness-NOAA/global-workflow GitHub Wiki

RRFS Workflow EE2 Compliance Analysis

Date: November 3, 2025
Repository: NOAA-EMC/rrfs-workflow
Analysis Scope: Top 5 Critical Compliance Issues


Executive Summary

This analysis identifies the top 5 EE2 (NCEP Central Operations) compliance issues in the RRFS (Rapid Refresh Forecast System) workflow repository. Based on a comprehensive review using MCP RAG tools examining 26 job scripts, 27 execution scripts, 35+ utility scripts, and 54 Python modules, these issues represent the highest-priority remediation targets for operational readiness.

Key Finding: RRFS has BETTER baseline EE2 compliance than global-workflow, with set -xue consistently used and custom error handling utilities. However, critical gaps remain that could impact operational stability.

Total Estimated Remediation Effort: 6-10 weeks


Comparison with Global Workflow

| Feature | Global Workflow | RRFS Workflow |
| --- | --- | --- |
| Shell Error Handling | set -e with && true workarounds | set -xue consistently |
| Python Error Handling | ❌ Minimal try-except | ❌ Minimal try-except |
| Custom Error Functions | err_exit from prod_util | print_err_msg_exit custom |
| Environment Variable Validation | ❌ Minimal validation | ❌ Minimal validation |
| Trap Handlers | ❌ Not implemented | ❌ Not implemented |
| Error Context Logging | ❌ Minimal context | ⚠️ Partial (custom messages) |

RRFS Advantage: The consistent use of set -xue and custom error messaging functions provides a stronger foundation for EE2 compliance compared to global-workflow's set -e with && true workarounds.
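The distinction matters in practice. A minimal, hypothetical sketch (not taken from either repository) of how the && true workaround discards failures while explicit handling under set -ue does not:

```shell
#!/bin/bash
# Hypothetical sketch contrasting the two styles; no repo code is used.

# global-workflow style: "cmd && true" keeps set -e from aborting,
# silently discarding the failure.
set -e
false && true               # failing command masked; execution continues
masked_status=$?            # the non-zero status is still there
after_masked="reached"      # proves the script kept going

# RRFS style: fail fast on unset variables; handle expected failures
# explicitly instead of blanket-masking them.
set -u
unset -v POSSIBLY_UNSET
value="${POSSIBLY_UNSET:-fallback}"   # explicit default, safe under -u
if ! false; then
    handled="explicitly"
fi

echo "masked_status=${masked_status} after_masked=${after_masked}"
echo "value=${value} handled=${handled}"
```

The key subtlety: under set -e, a failing command on the left of && is exempt from triggering an exit, so the non-zero status is quietly lost unless the script checks it.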


1. MISSING err_chk FUNCTION IMPLEMENTATION ⚠️ CRITICAL

Finding

All 26 job scripts call err_chk after script execution, but the function is not defined anywhere in the repository. This creates a critical operational risk where errors are captured but never processed.

Evidence from Job Scripts

Example 1: jobs/JRRFS_FORECAST (line 222)

${HOMErrfs}/scripts/exrrfs_forecast.sh
export err=$?; err_chk

Example 2: jobs/JRRFS_POST (line 137)

${HOMErrfs}/scripts/exrrfs_post.sh
export err=$?; err_chk

Example 3: jobs/JRRFS_ANALYSIS_GSI (line 180)

$HOMErrfs/scripts/exrrfs_analysis_gsi.sh
export err=$?; err_chk

All 26 job scripts follow this pattern:

  • JRRFS_FORECAST
  • JRRFS_POST
  • JRRFS_ANALYSIS_GSI
  • JRRFS_ANALYSIS_ENKF
  • JRRFS_ANALYSIS_GSIDIAG
  • JRRFS_ANALYSIS_NONVARCLD
  • JRRFS_BLEND_ICS
  • JRRFS_BUFRSND
  • JRRFS_CALC_ENSMEAN
  • JRRFS_FSM
  • JRRFS_GEMPAK
  • JRRFS_MAKE_GRID
  • JRRFS_MAKE_ICS
  • JRRFS_MAKE_LBCS
  • JRRFS_MAKE_OROG
  • JRRFS_MAKE_SFC_CLIMO
  • JRRFS_PRDGEN
  • JRRFS_PREP_CYC
  • JRRFS_PROCESS_BUFR
  • JRRFS_PROCESS_LIGHTNING
  • JRRFS_PROCESS_RADAR
  • JRRFS_PROCESS_SMOKE
  • JRRFS_RECENTER
  • JRRFS_SAVE_DA_OUTPUT
  • JRRFS_SAVE_RESTART
  • JRRFS_UPDATE_LBC_SOIL

EE2 Violations

  • ❌ Undefined function called in all job scripts - bash reports "command not found" (exit code 127) and, with set -e active, the script exits
  • ❌ No error checking actually happens despite export err=$?
  • ❌ Scripts may fail silently if err_chk is an external command not in PATH
  • ❌ Error codes captured but never logged or acted upon
  • ❌ No classification of error types (transient vs permanent)
  • ❌ No operator guidance when errors occur
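The failure mode is easy to reproduce. A hypothetical standalone sketch (no repo code) showing what bash does when a script running under set -ue calls the undefined function:

```shell
#!/bin/bash
# Reproduce the job-script pattern in a child shell so the outcome
# can be inspected; err_chk is intentionally undefined.

output=$(bash -c '
    set -ue
    export err=0
    err_chk    # undefined function: "command not found", exit 127
' 2>&1)
status=$?

echo "exit status: ${status}"
echo "bash said: ${output}"
case "${output}" in
    *"command not found"*) diagnosis="undefined function detected" ;;
    *)                     diagnosis="unexpected" ;;
esac
echo "${diagnosis}"
```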

grep Evidence

$ grep -r "function err_chk" rrfs-workflow/
# No results

$ grep -r "err_chk()" rrfs-workflow/
# No results

$ grep -r "err_chk" rrfs-workflow/jobs/
# 26 matches - all calling undefined function

Recommended Fix

Create: ush/bash_utils/err_chk.sh

#!/bin/bash

function err_chk() {
    #
    # Check the error code and take appropriate action
    #
    # Usage: Called after "export err=$?"
    #
    set +x  # Don't trace error checking
    
    local err_code="${err:-0}"
    
    if [[ ${err_code} -ne 0 ]]; then
        # Get calling script information
        local caller_fp=$( readlink -f "${BASH_SOURCE[1]}" )
        local caller_fn=$( basename "${caller_fp}" )
        local caller_line="${BASH_LINENO[0]}"
        
        # Build comprehensive error message
        >&2 echo "================================================================"
        >&2 echo "FATAL ERROR DETECTED"
        >&2 echo "----------------------------------------------------------------"
        >&2 echo "Script: ${caller_fn}"
        >&2 echo "Location: ${caller_fp}:${caller_line}"
        >&2 echo "Exit code: ${err_code}"
        >&2 echo "Time: $(date -u '+%Y-%m-%d %H:%M:%S UTC')"
        >&2 echo "Host: ${HOSTNAME}"
        >&2 echo "Working directory: ${PWD}"
        >&2 echo "----------------------------------------------------------------"
        
        # Classify error type
        if [[ ${err_code} -eq 137 ]]; then
            >&2 echo "ERROR TYPE: Process killed (SIGKILL)"
            >&2 echo "LIKELY CAUSE: Out of memory or walltime exceeded"
            >&2 echo "RECOMMENDATION: Check resource allocations"
        elif [[ ${err_code} -eq 143 ]]; then
            >&2 echo "ERROR TYPE: Process terminated (SIGTERM)"
            >&2 echo "LIKELY CAUSE: Job cancelled or walltime limit"
        elif [[ ${err_code} -ge 128 ]]; then
            local signal=$((err_code - 128))
            >&2 echo "ERROR TYPE: Terminated by signal ${signal}"
        elif [[ ${err_code} -eq 1 ]]; then
            >&2 echo "ERROR TYPE: General script error"
        else
            >&2 echo "ERROR TYPE: Script-specific error code ${err_code}"
        fi
        
        # Display recent output if available
        if [[ -n "${pgmout:-}" && -s "${pgmout}" ]]; then
            >&2 echo "----------------------------------------------------------------"
            >&2 echo "Last 30 lines of output:"
            >&2 echo "----------------------------------------------------------------"
            >&2 tail -30 "${pgmout}"
        fi
        
        >&2 echo "================================================================"
        
        # Send alert to workflow manager if configured
        if [[ "${WORKFLOW_MANAGER:-}" == "ecflow" && -n "${ECF_NAME:-}" ]]; then
            ecflow_client --abort="${ECF_NAME}: Script failed with error ${err_code}"
        fi
        
        # Exit with the original error code
        exit ${err_code}
    fi
    
    set -x  # Resume tracing
    return 0
}

# Export function for use in all scripts
declare -fx err_chk

Update: ush/source_util_funcs.sh (add after line 122)

#-----------------------------------------------------------------------
#
# Source the file containing the err_chk function
#
#-----------------------------------------------------------------------
#
  . ${bashutils_dir}/err_chk.sh
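For a quick sanity check of the err/err_chk contract before wiring in the full function, a runnable miniature (hypothetical, far smaller than the implementation above) behaves as follows:

```shell
#!/bin/bash
# Miniature err_chk honoring the same contract: read ${err}, act on
# non-zero. A real err_chk would exit; this one returns so the demo
# can continue.

err_chk_mini() {
    local code="${err:-0}"
    if [ "${code}" -ne 0 ]; then
        echo "FATAL: previous step failed with exit code ${code}" >&2
        return "${code}"
    fi
    return 0
}

false                      # simulate a failing ex-script
export err=$?
if ! err_chk_mini 2>/dev/null; then
    caught="yes"
else
    caught="no"
fi
echo "caught=${caught}"    # the non-zero err was acted upon
```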

Impact Assessment

  • Severity: CRITICAL
  • Files Affected: All 26 job scripts
  • Production Risk: Errors currently go undetected - scripts may fail with cryptic "command not found" messages or silently continue
  • Estimated Fix Effort: 1-2 days (create function + testing)
  • Priority: IMMEDIATE - This is a fundamental operational safety issue

2. INADEQUATE PYTHON ERROR HANDLING ⚠️ HIGH

Finding

Python utility modules lack comprehensive exception handling, following the same pattern as global-workflow but with better utility structure.

Example 1: ush/python_utils/run_command.py

def run_command(cmd):
    """Run system command in a subprocess

    Args:
        cmd: command to execute
    Returns:
        Tuple of (exit code, std_out, std_err)
    """
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        shell=True,
        universal_newlines=True,
    )

    std_out, std_err = proc.communicate()

    # strip trailing newline character
    return (proc.returncode, std_out.rstrip("\n"), std_err.rstrip("\n"))

EE2 Violations

  • ❌ No exception handling for subprocess.Popen failures
  • ❌ No timeout mechanism for long-running commands
  • ❌ No validation that cmd is non-empty or safe
  • ❌ Shell injection vulnerability (shell=True without sanitization)
  • ❌ No logging of command execution
  • ❌ Return code not checked by caller

Example 2: ush/python_utils/filesys_cmds_vrfy.py

def cmd_vrfy(cmd, *args):
    """Execute system command

    Args:
        cmd: the command
        *args: its arguments
    Returns:
        Exit code
    """

    cmd += " " + " ".join([str(a) for a in args])
    ret = os.system(cmd)
    if ret != 0:
        print_err_msg_exit(f"System call '{cmd}' failed.")
    return ret

EE2 Violations (Better than global-workflow)

  • ✅ Does check return code and calls error exit
  • ✅ Has error message with command context
  • ❌ Uses deprecated os.system() instead of subprocess
  • ❌ Shell injection vulnerability (command string concatenation)
  • ❌ No distinction between different error types
  • ❌ No retry logic for transient failures
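The concatenation risk is concrete. A hypothetical sketch (names invented, no repo code) showing how a crafted argument, once joined into a command string and handed to a shell, executes extra commands:

```shell
#!/bin/bash
# cmd_vrfy-style concatenation: arguments are joined into one string
# and handed to a shell, so metacharacters in an argument become code.

unsafe_run() {
    local cmd="$1 $2"       # mirrors: cmd += " " + " ".join(args)
    sh -c "${cmd}"
}

marker="/tmp/injected_$$"
crafted_arg="harmless; touch ${marker}"   # attacker-controlled value

unsafe_run echo "${crafted_arg}"

if [ -e "${marker}" ]; then
    result="injected"       # the "; touch" ran as a second command
    rm -f "${marker}"
else
    result="safe"
fi
echo "result=${result}"
```

Passing arguments as a list (as the recommended fix does with subprocess.run) avoids this entirely, because no shell ever reinterprets the string.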

Recommended Fix for run_command.py

#!/usr/bin/env python3

import subprocess
import shlex
import logging
from .print_msg import print_err_msg_exit

logger = logging.getLogger(__name__)


def run_command(cmd, timeout=None, check=False, shell=False, safe_shell=True):
    """Run system command in a subprocess with proper error handling
    
    Args:
        cmd: command to execute (string if shell=True, list otherwise)
        timeout: maximum execution time in seconds (default: no timeout)
        check: if True, raise exception on non-zero exit code
        shell: if True, execute through shell (security risk - use sparingly)
        safe_shell: if True and shell=True, validate command safety
        
    Returns:
        Tuple of (exit code, std_out, std_err)
        
    Raises:
        ValueError: if command is invalid
        subprocess.TimeoutExpired: if timeout exceeded
        subprocess.CalledProcessError: if check=True and command fails
    """
    
    # Validate input
    if not cmd:
        raise ValueError("Command cannot be empty")
    
    # Log command execution
    logger.info(f"Executing command: {cmd}")
    
    # Security check for shell commands
    if shell and safe_shell:
        if isinstance(cmd, str):
            # Check for dangerous patterns
            dangerous_patterns = [';', '&&', '||', '|', '>', '<', '`', '$(']
            if any(pattern in cmd for pattern in dangerous_patterns):
                logger.warning(f"Potentially unsafe shell command: {cmd}")
    
    try:
        proc = subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            shell=shell,
            text=True
        )
        
        try:
            std_out, std_err = proc.communicate(timeout=timeout)
        except subprocess.TimeoutExpired:
            proc.kill()
            std_out, std_err = proc.communicate()
            logger.error(f"Command timed out after {timeout}s: {cmd}")
            raise
        
        # Strip trailing newlines
        std_out = std_out.rstrip("\n") if std_out else ""
        std_err = std_err.rstrip("\n") if std_err else ""
        
        # Log results
        if proc.returncode != 0:
            logger.error(f"Command failed with exit code {proc.returncode}")
            logger.error(f"stderr: {std_err}")
            
            if check:
                raise subprocess.CalledProcessError(
                    proc.returncode, cmd, std_out, std_err
                )
        
        return (proc.returncode, std_out, std_err)
        
    except OSError as e:
        logger.error(f"Failed to execute command: {e}")
        print_err_msg_exit(f"Command execution failed: {cmd}\nError: {e}")
    except Exception as e:
        logger.error(f"Unexpected error executing command: {e}")
        raise


def run_command_safe(cmd_list, **kwargs):
    """Run command with automatic argument escaping (no shell injection)
    
    Args:
        cmd_list: list of command and arguments
        **kwargs: passed to run_command
        
    Returns:
        Tuple of (exit code, std_out, std_err)
    """
    if isinstance(cmd_list, str):
        # Parse string into safe list
        cmd_list = shlex.split(cmd_list)
    
    return run_command(cmd_list, shell=False, **kwargs)

Recommended Fix for filesys_cmds_vrfy.py

#!/usr/bin/env python3

import subprocess
import shlex
import logging
import errno
from .print_msg import print_err_msg_exit

logger = logging.getLogger(__name__)


def cmd_vrfy(cmd, *args, retry_count=0, retry_delay=1):
    """Execute system command with retry logic for transient failures

    Args:
        cmd: the command
        *args: its arguments
        retry_count: number of retries for transient failures
        retry_delay: seconds to wait between retries
        
    Returns:
        Exit code (always 0 on success, exits on permanent failure)
        
    Raises:
        SystemExit: on permanent failure or exceeded retries
    """
    import time
    
    # Build command list
    cmd_list = [cmd] + [str(a) for a in args]
    
    # Execute with retries
    for attempt in range(retry_count + 1):
        try:
            result = subprocess.run(
                cmd_list,
                capture_output=True,
                text=True,
                check=False
            )
            
            if result.returncode == 0:
                if result.stdout:
                    logger.info(result.stdout)
                return 0
            
            # Heuristic transient-error check: some tools propagate
            # errno-style exit codes, others describe the condition on stderr
            is_transient = False
            if result.returncode in [errno.EAGAIN, errno.EBUSY, errno.ETIMEDOUT]:
                is_transient = True
            elif "temporarily unavailable" in result.stderr.lower():
                is_transient = True
            elif "resource busy" in result.stderr.lower():
                is_transient = True
            
            if is_transient and attempt < retry_count:
                logger.warning(
                    f"Transient error on attempt {attempt + 1}/{retry_count + 1}, "
                    f"retrying in {retry_delay}s..."
                )
                time.sleep(retry_delay)
                continue
            
            # Permanent failure or retries exhausted
            error_msg = f"System call failed: {' '.join(cmd_list)}\n"
            error_msg += f"Exit code: {result.returncode}\n"
            if result.stderr:
                error_msg += f"Error output: {result.stderr}"
            
            print_err_msg_exit(error_msg)
            
        except FileNotFoundError:
            print_err_msg_exit(f"Command not found: {cmd}")
        except PermissionError:
            print_err_msg_exit(f"Permission denied: {cmd}")
        except Exception as e:
            print_err_msg_exit(f"Unexpected error executing {cmd}: {e}")
    
    return 0  # Never reached due to print_err_msg_exit


def cp_vrfy(*args):
    return cmd_vrfy("cp", *args, retry_count=2)


def rsync_vrfy(*args):
    return cmd_vrfy("rsync", *args, retry_count=3)


def mv_vrfy(*args):
    return cmd_vrfy("mv", *args, retry_count=2)


def rm_vrfy(*args):
    return cmd_vrfy("rm", *args)


def ln_vrfy(*args):
    return cmd_vrfy("ln", *args)


def mkdir_vrfy(*args):
    return cmd_vrfy("mkdir", *args, retry_count=1)

Impact Assessment

  • Severity: HIGH
  • Files Affected: 54 Python scripts/modules
  • Production Risk: Unhandled exceptions, no retry for transient failures, security vulnerabilities
  • Estimated Fix Effort: 2-3 weeks
  • Advantage over Global Workflow: Already has cmd_vrfy checking return codes

3. ENVIRONMENT VARIABLE VALIDATION GAPS ⚠️ HIGH

Finding

Job scripts use environment variables with defaults but without validation, similar to global-workflow but with better consistency.

Example: jobs/JRRFS_FORECAST (lines 7-29)

export NET='rrfs'
export MACHINE="WCOSS2"
export WORKFLOW_MANAGER="ecflow"
export HOMErrfs=${HOMErrfs:-""}
export EXECrrfs=${EXECrrfs:-${HOMErrfs}/exec}
export FIXrrfs=${FIXrrfs:-${HOMErrfs}/fix}
export PARMrrfs=${PARMrrfs:-${HOMErrfs}/parm}
export USHrrfs=${USHrrfs:-${HOMErrfs}/ush}
export SORCrrfs=${SORCrrfs:-${HOMErrfs}/sorc}
export FIXam=${FIXam:-${HOMErrfs}/fix/am}
export FIXLAM=${FIXLAM:-${HOMErrfs}/fix/lam/RRFS_NA_3km}
export FIXgsm=${FIXgsm:-${HOMErrfs}/fix/am}
export FIX_GSI="${FIXrrfs}/gsi"
export FIX_UPP="${FIXrrfs}/upp"
export FIXprdgen="${FIXrrfs}/prdgen"
export FIX_SMOKE_DUST="${FIXrrfs}/smoke_dust"
export GRID_DIR="${HOMErrfs}/fix/lam/RRFS_NA_3km"
export OROG_DIR="${HOMErrfs}/fix/lam/RRFS_NA_3km"
export SFC_CLIMO_DIR="${HOMErrfs}/fix/lam/RRFS_NA_3km"

EE2 Violations

  • ❌ HOMErrfs defaults to empty string "" instead of failing
  • ❌ All dependent paths become invalid if HOMErrfs="" (e.g., /exec, /fix)
  • ❌ No validation that directories exist before use
  • ❌ MACHINE hardcoded to "WCOSS2" - no runtime detection
  • ❌ WGF variable used without validation (lines 68-78, 94)
  • ❌ cyc, PDY, CDATE used without validation
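The empty-default cascade can be demonstrated in isolation (hypothetical sketch; the repo variable names are reused for illustration only):

```shell
#!/bin/bash
# With HOMErrfs unset, the ${VAR:-""} pattern yields "" rather than
# failing, and every dependent path silently becomes root-relative.

unset -v HOMErrfs EXECrrfs FIXrrfs

export HOMErrfs=${HOMErrfs:-""}
export EXECrrfs=${EXECrrfs:-${HOMErrfs}/exec}
export FIXrrfs=${FIXrrfs:-${HOMErrfs}/fix}

echo "EXECrrfs=${EXECrrfs}"   # "/exec": a path at the filesystem root
echo "FIXrrfs=${FIXrrfs}"     # "/fix"
```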

Example: jobs/JRRFS_MAKE_GRID (lines 14-20)

export HOMErrfs=${HOMErrfs:-""}
export EXECrrfs=${EXECrrfs:-${HOMErrfs}/exec}
export FIXrrfs=${FIXrrfs:-${HOMErrfs}/fix}
export PARMrrfs=${PARMrrfs:-${HOMErrfs}/parm}
export USHrrfs=${USHrrfs:-${HOMErrfs}/ush}
export FIXam=${FIXam:-${HOMErrfs}/fix/am}
export FIXLAM=${FIXLAM:-${HOMErrfs}/fix/lam/RRFS_NA_3km}

Same pattern repeated in all 26 job scripts!

Recommended Fix

Create: ush/bash_utils/validate_env.sh

#!/bin/bash

function validate_rrfs_environment() {
    #
    # Validate critical RRFS environment variables
    #
    # Usage: Call at start of job scripts before any path usage
    #
    set +x  # Don't trace validation
    
    local missing_vars=()
    local invalid_vars=()
    local missing_dirs=()
    
    # Critical variables that must be set
    local required_vars=(
        "HOMErrfs"
        "cyc"
        "PDY"
        "WGF"
        "DATAROOT"
        "job"
    )
    
    # Check required variables are non-empty
    for var in "${required_vars[@]}"; do
        if [[ -z "${!var:-}" ]]; then
            missing_vars+=("${var}")
        fi
    done
    
    # Report missing variables and exit
    if [[ ${#missing_vars[@]} -gt 0 ]]; then
        >&2 echo "================================================================"
        >&2 echo "FATAL ERROR: Required environment variables not set"
        >&2 echo "----------------------------------------------------------------"
        >&2 printf "  %s\n" "${missing_vars[@]}"
        >&2 echo "================================================================"
        exit 1
    fi
    
    # Validate PDY format (YYYYMMDD)
    if [[ ! "${PDY}" =~ ^[0-9]{8}$ ]]; then
        invalid_vars+=("PDY=${PDY} (expected YYYYMMDD format)")
    fi
    
    # Validate cyc format (HH)
    if [[ ! "${cyc}" =~ ^[0-9]{2}$ ]] || [[ $((10#${cyc})) -gt 23 ]]; then
        invalid_vars+=("cyc=${cyc} (expected HH format, 00-23)")
    fi
    
    # Validate WGF value
    local valid_wgf=("det" "enkf" "ensf" "firewx")
    if [[ ! " ${valid_wgf[*]} " =~ " ${WGF} " ]]; then
        invalid_vars+=("WGF=${WGF} (expected: det, enkf, ensf, or firewx)")
    fi
    
    # Report invalid formats
    if [[ ${#invalid_vars[@]} -gt 0 ]]; then
        >&2 echo "================================================================"
        >&2 echo "FATAL ERROR: Invalid environment variable formats"
        >&2 echo "----------------------------------------------------------------"
        >&2 printf "  %s\n" "${invalid_vars[@]}"
        >&2 echo "================================================================"
        exit 1
    fi
    
    # Validate critical directories exist
    local critical_dirs=(
        "${HOMErrfs}"
        "${HOMErrfs}/exec"
        "${HOMErrfs}/fix"
        "${HOMErrfs}/parm"
        "${HOMErrfs}/ush"
        "${HOMErrfs}/scripts"
    )
    
    for dir in "${critical_dirs[@]}"; do
        if [[ ! -d "${dir}" ]]; then
            missing_dirs+=("${dir}")
        fi
    done
    
    # Report missing directories
    if [[ ${#missing_dirs[@]} -gt 0 ]]; then
        >&2 echo "================================================================"
        >&2 echo "FATAL ERROR: Required directories do not exist"
        >&2 echo "----------------------------------------------------------------"
        >&2 printf "  %s\n" "${missing_dirs[@]}"
        >&2 echo "================================================================"
        >&2 echo "Check that HOMErrfs is set correctly: ${HOMErrfs}"
        exit 1
    fi
    
    # Validate DATAROOT exists and is writable
    if [[ ! -d "${DATAROOT}" ]]; then
        >&2 echo "FATAL ERROR: DATAROOT does not exist: ${DATAROOT}"
        exit 1
    fi
    
    if [[ ! -w "${DATAROOT}" ]]; then
        >&2 echo "FATAL ERROR: DATAROOT is not writable: ${DATAROOT}"
        exit 1
    fi
    
    set -x  # Resume tracing
    echo "Environment validation: PASSED"
    return 0
}

# Export function
declare -fx validate_rrfs_environment

Update all job scripts (add after line 6, before variable exports):

date
export PS4='+ $SECONDS + ' 
set -xue

# Validate environment before proceeding
. ${HOMErrfs}/ush/bash_utils/validate_env.sh
validate_rrfs_environment

#-----------------------------------------------------------------------
# Specify Package Location
#-----------------------------------------------------------------------
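The format checks can also be exercised standalone before touching the job scripts; a tiny hypothetical sketch using the same regexes (helper name invented):

```shell
#!/bin/bash
# Standalone versions of the PDY/cyc format checks (hypothetical
# helper name; same patterns as the validation function).

validate_pdy_cyc() {
    [[ "${PDY:-}" =~ ^[0-9]{8}$ ]] || { echo "bad PDY: '${PDY:-}'"; return 1; }
    [[ "${cyc:-}" =~ ^[0-9]{2}$ ]] || { echo "bad cyc: '${cyc:-}'"; return 1; }
    return 0
}

PDY="2025113"; cyc="06"   # 7-digit PDY: must be rejected
if validate_pdy_cyc; then ok_short="accepted"; else ok_short="rejected"; fi

PDY="20251103"; cyc="06"  # well-formed values: must pass
if validate_pdy_cyc; then ok_good="accepted"; else ok_good="rejected"; fi

echo "short=${ok_short} good=${ok_good}"
```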

Impact Assessment

  • Severity: HIGH
  • Files Affected: All 26 job scripts
  • Production Risk: Invalid paths cause cascading failures, operations on wrong directories
  • Estimated Fix Effort: 2 weeks (centralized validation + script updates)
  • Better than Global Workflow: Uses set -u so undefined variables will fail immediately

4. NO TRAP HANDLERS FOR CLEANUP ⚠️ MEDIUM-HIGH

Finding

RRFS workflow has no trap handlers for cleanup on error, interrupt, or exit - identical issue to global-workflow.

Evidence

$ grep -r "trap " rrfs-workflow/ush/bash_utils/*.sh
# No results

$ grep -r "trap " rrfs-workflow/jobs/
# No results

$ grep -r "trap " rrfs-workflow/scripts/
# No results

Job Script Pattern (All 26 Jobs)

export DATA=${DATA:-${DATAROOT}/${jobid}}
mkdir -p ${DATA}
cd ${DATA}

# ... work happens ...

# No cleanup code at all
# No trap handlers
# Temporary files/directories left behind on error

EE2 Violations

  • ❌ No cleanup on abnormal termination (SIGINT, SIGTERM, SIGKILL)
  • ❌ No cleanup on script errors
  • ❌ Temporary files left in ${DATA} directory on failure
  • ❌ No tracking of background processes
  • ❌ Resource leaks (disk space) accumulate over time
  • ❌ No graceful shutdown mechanism
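The mechanism the fix relies on is bash's EXIT trap, which fires on any exit path, including mid-run failures. A minimal hypothetical sketch:

```shell
#!/bin/bash
# An EXIT trap registered inside the (sub)shell runs on any exit path,
# so scratch space is reclaimed even after a mid-run failure.

workdir=$(mktemp -d)

(
    trap 'rm -rf "${workdir}"' EXIT   # cleanup registered up front
    touch "${workdir}/partial_output"
    false                             # simulated mid-run failure
    exit 1                            # abnormal exit; trap still fires
) || true

if [ -d "${workdir}" ]; then
    result="left behind"
else
    result="cleaned up"
fi
echo "workdir was ${result}"
```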

Recommended Fix

Add to: ush/bash_utils/cleanup.sh

#!/bin/bash

# Global arrays to track resources
declare -a CLEANUP_DIRS=()
declare -a CLEANUP_FILES=()
declare -a BACKGROUND_PIDS=()

function register_cleanup_dir() {
    # Register directory for cleanup
    # Usage: register_cleanup_dir /path/to/dir
    CLEANUP_DIRS+=("$1")
}

function register_cleanup_file() {
    # Register file for cleanup
    # Usage: register_cleanup_file /path/to/file
    CLEANUP_FILES+=("$1")
}

function register_background_job() {
    # Register background PID
    # Usage: some_command & register_background_job $!
    BACKGROUND_PIDS+=("$1")
}

function cleanup_on_exit() {
    #
    # Cleanup function for exit, error, or interrupt
    #
    local exit_code=$?
    set +x
    
    echo "================================================================"
    echo "Running cleanup (exit code: ${exit_code})"
    echo "================================================================"
    
    # Kill background jobs
    if [[ ${#BACKGROUND_PIDS[@]} -gt 0 ]]; then
        echo "Killing ${#BACKGROUND_PIDS[@]} background job(s)..."
        for pid in "${BACKGROUND_PIDS[@]}"; do
            if kill -0 "${pid}" 2>/dev/null; then
                echo "  Killing PID ${pid}"
                kill "${pid}" 2>/dev/null || true
            fi
        done
    fi
    
    # Kill untracked background jobs
    local untracked=$(jobs -p)
    if [[ -n "${untracked}" ]]; then
        echo "Killing untracked background jobs..."
        echo "${untracked}" | xargs -r kill 2>/dev/null || true
    fi
    
    # Remove temporary files
    if [[ ${#CLEANUP_FILES[@]} -gt 0 ]]; then
        echo "Removing ${#CLEANUP_FILES[@]} temporary file(s)..."
        for file in "${CLEANUP_FILES[@]}"; do
            [[ -f "${file}" ]] && rm -f "${file}" 2>/dev/null || true
        done
    fi
    
    # Remove temporary directories
    if [[ "${KEEPDATA:-NO}" == "NO" ]]; then
        if [[ ${#CLEANUP_DIRS[@]} -gt 0 ]]; then
            echo "Removing ${#CLEANUP_DIRS[@]} temporary director(ies)..."
            for dir in "${CLEANUP_DIRS[@]}"; do
                if [[ -d "${dir}" ]]; then
                    echo "  Removing ${dir}"
                    rm -rf "${dir}" 2>/dev/null || true
                fi
            done
        fi
        
        # Clean up DATA directory
        if [[ -n "${DATA:-}" && -d "${DATA}" ]]; then
            echo "Removing working directory: ${DATA}"
            cd "${DATAROOT:-/tmp}" 2>/dev/null || cd /tmp
            rm -rf "${DATA}" 2>/dev/null || true
        fi
    else
        echo "KEEPDATA=YES: Preserving directories"
    fi
    
    echo "================================================================"
    echo "Cleanup complete"
    echo "================================================================"
    
    return ${exit_code}
}

function error_trap() {
    #
    # Trap handler for ERR signal
    #
    local exit_code=$?
    local line_number=$1
    
    set +x
    >&2 echo "================================================================"
    >&2 echo "ERROR TRAPPED"
    >&2 echo "----------------------------------------------------------------"
    >&2 echo "Exit code: ${exit_code}"
    >&2 echo "Line number: ${line_number}"
    >&2 echo "Script: ${BASH_SOURCE[1]:-unknown}"
    >&2 echo "================================================================"
    
    export err=${exit_code}
}

# Set trap handlers
trap cleanup_on_exit EXIT
trap 'error_trap ${LINENO}' ERR
trap 'echo "Interrupt received"; exit 130' INT
trap 'echo "Termination signal received"; exit 143' TERM

# Export functions
declare -fx register_cleanup_dir
declare -fx register_cleanup_file
declare -fx register_background_job
declare -fx cleanup_on_exit

Update: ush/source_util_funcs.sh (add cleanup sourcing)

#-----------------------------------------------------------------------
#
# Source the file containing cleanup functions and set traps
#
#-----------------------------------------------------------------------
#
  . ${bashutils_dir}/cleanup.sh

Update job scripts (add after creating DATA directory):

export DATA=${DATA:-${DATAROOT}/${jobid}}
mkdir -p ${DATA}
cd ${DATA}

# Register for cleanup
register_cleanup_dir "${DATA}"

Impact Assessment

  • Severity: MEDIUM-HIGH
  • Files Affected: All job and execution scripts
  • Production Risk: Disk space exhaustion, orphaned processes, resource leaks
  • Estimated Fix Effort: 1-2 weeks
  • Same Issue as Global Workflow: No advantage

5. INSUFFICIENT ERROR CONTEXT IN EXECUTION SCRIPTS ⚠️ MEDIUM

Finding

Execution scripts use print_err_msg_exit which is better than global-workflow's err_exit, but error messages lack operational context.

Example: scripts/exrrfs_post.sh (line 92)

  *)
    err_exit "\
Run command has not been specified for this machine:
  MACHINE = \"$MACHINE\"
  APRUN = \"$APRUN\""
    ;;

Better Error Function (Already Exists)

ush/bash_utils/print_msg.sh has comprehensive error messaging:

function print_err_msg_exit() {
  # Prints detailed error with:
  # - Calling function/script name
  # - Full file path
  # - Error message
  # - "Exiting with nonzero status"
}

EE2 Violations (Partial Compliance)

  • ✅ Better than global-workflow - has detailed error function
  • ✅ Shows calling script/function information
  • ❌ No error classification (transient vs permanent)
  • ❌ No timestamps in error messages
  • ❌ No hostname/location information
  • ❌ No recent output context
  • ❌ No retry suggestions

Recommended Enhancement

Update: ush/bash_utils/print_msg.sh - Enhance print_err_msg_exit

function print_err_msg_exit() {
    set +x  # Don't trace while assembling the error report
    
    local scrfunc_fp=$( readlink -f "${BASH_SOURCE[0]}" )
    local scrfunc_fn=$( basename "${scrfunc_fp}" )
    local scrfunc_dir=$( dirname "${scrfunc_fp}" )
    local func_name="${FUNCNAME[0]}"
    
    local caller_fp=$( readlink -f "${BASH_SOURCE[1]}" )
    local caller_fn=$( basename "${caller_fp}" )
    local caller_dir=$( dirname "${caller_fp}" )
    local caller_name="${FUNCNAME[1]}"
    
    local msg_header \
          msg_footer \
          err_msg
    
    # Enhanced header with operational context
    if [ "${caller_name}" = "main" ] || [ "${caller_name}" = "script" ]; then
        msg_header=$( printf "\n\
================================================================
FATAL ERROR
----------------------------------------------------------------
From script:  \"${caller_fn}\"
Full path:    \"${caller_fp}\"
Time:         $(date -u '+%Y-%m-%d %H:%M:%S UTC')
Host:         ${HOSTNAME}
Working dir:  ${PWD}
User:         ${USER}
PDY/cyc:      ${PDY:-unknown}/${cyc:-unknown}
WGF:          ${WGF:-unknown}
Job:          ${job:-unknown}
================================================================
"
        )
    else
        msg_header=$( printf "\n\
================================================================
FATAL ERROR
----------------------------------------------------------------
From function:  \"${caller_name}\"
In file:        \"${caller_fn}\"
Full path:      \"${caller_fp}\"
Time:           $(date -u '+%Y-%m-%d %H:%M:%S UTC')
Host:           ${HOSTNAME}
Working dir:    ${PWD}
User:           ${USER}
PDY/cyc:        ${PDY:-unknown}/${cyc:-unknown}
WGF:            ${WGF:-unknown}
Job:            ${job:-unknown}
================================================================
"
        )
    fi
    
    # Enhanced footer with guidance
    msg_footer=$( printf "\n\
----------------------------------------------------------------
OPERATOR ACTIONS:
  1. Check log files in: ${DATA:-unknown}
  2. Review pgmout: ${pgmout:-OUTPUT.$$}
  3. Check disk space: df -h ${DATA:-/tmp}
  4. Contact on-call developer if issue persists
================================================================
Exiting with nonzero status." )
    
    if [ "$#" -eq 0 ]; then
        err_msg=""
    elif [ "$#" -eq 1 ]; then
        err_msg="\n$1\n"
    fi
    
    # Show recent output if available
    if [ -n "${pgmout:-}" && -s "${pgmout}" ](/TerrenceMcGuinness-NOAA/global-workflow/wiki/--n-"${pgmout:-}"-&&--s-"${pgmout}"-); then
        err_msg+=$( printf "\n\
----------------------------------------------------------------
Last 20 lines of output (${pgmout}):
----------------------------------------------------------------\n" )
        err_msg+=$(tail -20 "${pgmout}" 2>/dev/null)
        err_msg+=$( printf "\n----------------------------------------------------------------" )
    fi
    
    printf "FATAL ERROR: ${msg_header}${err_msg}${msg_footer}\n" 1>&2
    exit 1
}

Impact Assessment

  • Severity: MEDIUM
  • Files Affected: ush/bash_utils/print_msg.sh (1 file, affects all scripts)
  • Production Risk: Incomplete error information slows debugging and increases time to restore service
  • Estimated Fix Effort: 1 week
  • Advantage: Already has better foundation than global-workflow

Implementation Priority and Timeline

Phase 1: Critical Safety (Weeks 1-2)

Priority: IMMEDIATE

  1. Implement err_chk function - Scripts currently call undefined function
    • Create ush/bash_utils/err_chk.sh
    • Update source_util_funcs.sh to source it
    • Test with all 26 job scripts
    • Deliverable: err_chk function operational in all jobs
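
The call pattern in the job scripts (export err=$?; err_chk) suggests a shape like the following. This is a minimal sketch, modeled on the behavior of NCO's prod_util err_chk; the function body, message layout, and variable fallbacks are assumptions, not the repository's actual implementation.

```shell
#!/usr/bin/env bash
# Sketch of an err_chk implementation (hypothetical; modeled on the
# prod_util convention of checking the exported $err status).
err_chk() {
    local status="${err:-0}"
    if [ "${status}" -ne 0 ]; then
        printf "FATAL ERROR: previous step exited with status %s\n" "${status}" 1>&2
        printf "  job:     %s\n  PDY/cyc: %s/%s\n" \
            "${job:-unknown}" "${PDY:-unknown}" "${cyc:-unknown}" 1>&2
        # Surface the tail of pgmout, when available, for quick triage
        if [ -n "${pgmout:-}" ] && [ -s "${pgmout}" ]; then
            tail -20 "${pgmout}" 1>&2
        fi
        exit "${status}"
    fi
}
```

This shape is drop-in compatible with the existing job-script call sites, e.g. ${HOMErrfs}/scripts/exrrfs_forecast.sh followed by export err=$?; err_chk.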

Phase 2: Python Error Handling (Weeks 3-4)

Priority: HIGH

  2. Enhanced Python utilities - Fix security and error handling gaps

  • Update run_command.py with timeout and validation
  • Update filesys_cmds_vrfy.py with retry logic
  • Add logging throughout Python modules
  • Deliverable: Robust Python utility library

Phase 3: Environment Validation (Weeks 5-6)

Priority: HIGH

  3. Environment variable validation - Prevent invalid path usage

  • Create validate_env.sh function
  • Update all job scripts to call validation
  • Test with missing/invalid variables
  • Deliverable: 100% validation coverage
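
A validate_env.sh helper could look like the sketch below. The function names and the two-tier check (set/non-empty, then directory existence) are assumptions for illustration, not the planned implementation.

```shell
#!/usr/bin/env bash
# Hypothetical validate_env sketch. Fails fast when a required variable
# is unset/empty, or (for the _dirs variant) does not name a directory.
validate_env() {
    local var val
    for var in "$@"; do
        val="${!var:-}"            # bash indirect expansion
        if [ -z "${val}" ]; then
            printf "FATAL ERROR: required variable %s is unset or empty\n" "${var}" 1>&2
            return 1
        fi
    done
    return 0
}

validate_env_dirs() {
    local var
    for var in "$@"; do
        validate_env "${var}" || return 1
        if [ ! -d "${!var}" ]; then
            printf "FATAL ERROR: %s=%s is not a directory\n" "${var}" "${!var}" 1>&2
            return 1
        fi
    done
    return 0
}
```

A job script would call it before using any path, e.g. validate_env PDY cyc job || exit 1 and validate_env_dirs HOMErrfs DATA || exit 1 (variable names here are examples).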

Phase 4: Cleanup Infrastructure (Weeks 7-8)

Priority: MEDIUM-HIGH

  4. Trap handlers and cleanup - Prevent resource leaks

  • Implement cleanup functions
  • Add trap handlers to all scripts
  • Test cleanup in failure scenarios
  • Deliverable: Automatic cleanup on all exit conditions
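
One way to structure this is a single cleanup function registered on EXIT, with INT/TERM routed through it so signals also trigger cleanup. The sketch below is illustrative; the function name, the work_dir variable, and the signal exit codes are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical cleanup/trap sketch. An EXIT trap runs on normal exit,
# on errors under set -e, and (via the signal traps) on INT/TERM,
# so temporary directories are removed in every termination path.
cleanup() {
    local rc=$?
    # Only remove directories this script created itself
    if [ -n "${work_dir:-}" ] && [ -d "${work_dir}" ]; then
        rm -rf "${work_dir}"
    fi
    exit "${rc}"
}
# Typical registration near the top of a job script:
#   work_dir=$(mktemp -d)
#   trap cleanup EXIT
#   trap 'exit 130' INT
#   trap 'exit 143' TERM
```

Routing signals through exit (rather than trapping cleanup on each signal directly) ensures the EXIT trap fires exactly once.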

Phase 5: Error Messaging (Weeks 9-10)

Priority: MEDIUM

  5. Enhanced error context - Improve debuggability

  • Update print_err_msg_exit with operational context
  • Add timestamps and location information
  • Include recent output in error messages
  • Deliverable: Comprehensive error reporting

Testing Strategy

Unit Testing

  • Test err_chk with various error codes
  • Test validation with missing/invalid variables
  • Test Python utilities with failures and timeouts
  • Test cleanup in all exit scenarios

Integration Testing

  • Run full RRFS workflow cycles with modifications
  • Test error propagation through workflow
  • Verify cleanup happens correctly
  • Test ecFlow integration

Failure Testing

  • Kill jobs at various stages
  • Introduce resource exhaustion
  • Test signal handling (INT, TERM)
  • Verify error messages are actionable

Operational Testing

  • Run parallel to production for 1 week
  • Compare error messages and MTTR
  • Gather operator feedback

Success Metrics

Quantitative

  • Zero "command not found" errors for err_chk
  • 100% environment validation before path usage
  • MTTR reduced to under 5 minutes for common failures through better error messages
  • Zero resource leaks after failures
  • 100% cleanup execution on abnormal termination

Qualitative

  • Operators can diagnose failures without developer assistance
  • Error messages provide actionable guidance
  • Log files contain sufficient context
  • Scripts fail fast with clear messages
  • No disk space issues from accumulated temporary files

RRFS Advantages Over Global Workflow

  1. Consistent set -xue - Catches undefined variables and command failures
  2. Custom error functions - print_err_msg_exit provides detailed context
  3. Verified filesystem operations - *_vrfy functions check return codes
  4. Better Python structure - Centralized utility modules
  5. Consistent variable naming - RRFS-specific prefix conventions

Appendix: EE2 Compliance Checklist

Critical (Must Fix)

  • Implement err_chk function (WEEKS 1-2)
  • Fix Python error handling (WEEKS 3-4)
  • Add environment validation (WEEKS 5-6)

High Priority

  • Add trap handlers and cleanup (WEEKS 7-8)
  • Enhance error messaging (WEEKS 9-10)

Medium Priority

  • Add retry logic for transient failures
  • Implement timeout mechanisms
  • Add security validation for shell commands
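
For the retry item, a small wrapper is usually enough for transient filesystem or network hiccups. The helper below is a sketch; the name retry and the attempt/delay parameters are illustrative, and the timeout item could be layered on by prefixing the wrapped command with GNU coreutils timeout.

```shell
#!/usr/bin/env bash
# Hypothetical retry helper: rerun a command up to $1 times,
# sleeping $2 seconds between attempts.
retry() {
    local attempts="$1" delay="$2"
    shift 2
    local n=1
    while true; do
        "$@" && return 0
        if [ "${n}" -ge "${attempts}" ]; then
            printf "ERROR: command failed after %s attempts: %s\n" \
                "${attempts}" "$*" 1>&2
            return 1
        fi
        n=$((n + 1))
        sleep "${delay}"
    done
}
```

Example usage: retry 3 5 cp "${src}" "${dst}" retries a flaky copy three times with 5-second pauses (src/dst are placeholders).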

Best Practices

  • Add unit tests for utility functions
  • Document error codes and meanings
  • Create operator troubleshooting guide
  • Set up automated compliance checking

Document Metadata

  • Author: AI Analysis System (MCP RAG)
  • Date: November 3, 2025
  • Repository: NOAA-EMC/rrfs-workflow
  • Branch: Analyzed main branch
  • Analysis Tools: MCP RAG System (ChromaDB + Neo4j), grep_search, file reading
  • Files Analyzed: 142+ (26 jobs, 27 scripts, 35 ush, 54 Python)
  • Review Status: Initial Draft
  • Next Review: TBD

Contact and Questions

For questions about this analysis or implementation guidance, contact:


END OF DOCUMENT