EE2_COMPLIANCE_ANALYSIS_RRFS - TerrenceMcGuinness-NOAA/global-workflow GitHub Wiki

RRFS Workflow EE2 Compliance Analysis

Date: November 3, 2025
Repository: NOAA-EMC/rrfs-workflow
Analysis Scope: Top 5 Critical Compliance Issues


Executive Summary

This analysis identifies the top 5 EE2 (NCEP Central Operations) compliance issues in the RRFS (Rapid Refresh Forecast System) workflow repository. Based on a comprehensive review using MCP RAG tools examining 26 job scripts, 27 execution scripts, 35+ utility scripts, and 54 Python modules, these issues represent the highest-priority remediation targets for operational readiness.

Key Finding: RRFS has BETTER baseline EE2 compliance than global-workflow, with set -xue consistently used and custom error handling utilities. However, critical gaps remain that could impact operational stability.

Total Estimated Remediation Effort: 6-10 weeks


Comparison with Global Workflow

| Feature | Global Workflow | RRFS Workflow |
| --- | --- | --- |
| Shell Error Handling | set -e with && true workarounds | set -xue consistently |
| Python Error Handling | ❌ Minimal try-except | ❌ Minimal try-except |
| Custom Error Functions | err_exit from prod_util | print_err_msg_exit custom |
| Environment Variable Validation | ❌ Minimal validation | ❌ Minimal validation |
| Trap Handlers | ❌ Not implemented | ❌ Not implemented |
| Error Context Logging | ❌ Minimal context | ⚠️ Partial (custom messages) |

RRFS Advantage: The consistent use of set -xue and custom error messaging functions provides a stronger foundation for EE2 compliance compared to global-workflow's set -e with && true workarounds.
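The distinction matters in practice. A minimal, hypothetical sketch (not taken from either repository) of how the && true workaround discards failures while explicit handling under set -ue does not:

```shell
#!/bin/bash
# Hypothetical sketch contrasting the two styles; no repo code is used.

# global-workflow style: "cmd && true" keeps set -e from aborting,
# silently discarding the failure.
set -e
false && true               # failing command masked; execution continues
masked_status=$?            # the non-zero status is still there
after_masked="reached"      # proves the script kept going

# RRFS style: fail fast on unset variables; handle expected failures
# explicitly instead of blanket-masking them.
set -u
unset -v POSSIBLY_UNSET
value="${POSSIBLY_UNSET:-fallback}"   # explicit default, safe under -u
if ! false; then
    handled="explicitly"
fi

echo "masked_status=${masked_status} after_masked=${after_masked}"
echo "value=${value} handled=${handled}"
```

The key subtlety: under set -e, a failing command on the left of && is exempt from triggering an exit, so the non-zero status is quietly lost unless the script checks it.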


1. MISSING err_chk FUNCTION IMPLEMENTATION ⚠️ CRITICAL

Finding

All 26 job scripts call err_chk after script execution, but the function is not defined anywhere in the repository. This creates a critical operational risk where errors are captured but never processed.

Evidence from Job Scripts

Example 1: jobs/JRRFS_FORECAST (line 222)

${HOMErrfs}/scripts/exrrfs_forecast.sh
export err=$?; err_chk

Example 2: jobs/JRRFS_POST (line 137)

${HOMErrfs}/scripts/exrrfs_post.sh
export err=$?; err_chk

Example 3: jobs/JRRFS_ANALYSIS_GSI (line 180)

$HOMErrfs/scripts/exrrfs_analysis_gsi.sh
export err=$?; err_chk

All 26 job scripts follow this pattern:

  • JRRFS_FORECAST
  • JRRFS_POST
  • JRRFS_ANALYSIS_GSI
  • JRRFS_ANALYSIS_ENKF
  • JRRFS_ANALYSIS_GSIDIAG
  • JRRFS_ANALYSIS_NONVARCLD
  • JRRFS_BLEND_ICS
  • JRRFS_BUFRSND
  • JRRFS_CALC_ENSMEAN
  • JRRFS_FSM
  • JRRFS_GEMPAK
  • JRRFS_MAKE_GRID
  • JRRFS_MAKE_ICS
  • JRRFS_MAKE_LBCS
  • JRRFS_MAKE_OROG
  • JRRFS_MAKE_SFC_CLIMO
  • JRRFS_PRDGEN
  • JRRFS_PREP_CYC
  • JRRFS_PROCESS_BUFR
  • JRRFS_PROCESS_LIGHTNING
  • JRRFS_PROCESS_RADAR
  • JRRFS_PROCESS_SMOKE
  • JRRFS_RECENTER
  • JRRFS_SAVE_DA_OUTPUT
  • JRRFS_SAVE_RESTART
  • JRRFS_UPDATE_LBC_SOIL

EE2 Violations

  • ❌ Undefined function called in all job scripts - bash reports "command not found" (exit code 127) and, with set -e active, the script exits
  • ❌ No error checking actually happens despite export err=$?
  • ❌ Scripts may fail silently if err_chk is an external command not in PATH
  • ❌ Error codes captured but never logged or acted upon
  • ❌ No classification of error types (transient vs permanent)
  • ❌ No operator guidance when errors occur
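The failure mode is easy to reproduce. A hypothetical standalone sketch (no repo code) showing what bash does when a script running under set -ue calls the undefined function:

```shell
#!/bin/bash
# Reproduce the job-script pattern in a child shell so the outcome
# can be inspected; err_chk is intentionally undefined.

output=$(bash -c '
    set -ue
    export err=0
    err_chk    # undefined function: "command not found", exit 127
' 2>&1)
status=$?

echo "exit status: ${status}"
echo "bash said: ${output}"
case "${output}" in
    *"command not found"*) diagnosis="undefined function detected" ;;
    *)                     diagnosis="unexpected" ;;
esac
echo "${diagnosis}"
```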

grep Evidence

$ grep -r "function err_chk" rrfs-workflow/
# No results

$ grep -r "err_chk()" rrfs-workflow/
# No results

$ grep -r "err_chk" rrfs-workflow/jobs/
# 26 matches - all calling undefined function

Recommended Fix

Create: ush/bash_utils/err_chk.sh

#!/bin/bash

function err_chk() {
    #
    # Check the error code and take appropriate action
    #
    # Usage: Called after "export err=$?"
    #
    set +x  # Don't trace error checking
    
    local err_code="${err:-0}"
    
    if [[ ${err_code} -ne 0 ]]; then
        # Get calling script information
        local caller_fp=$( readlink -f "${BASH_SOURCE[1]}" )
        local caller_fn=$( basename "${caller_fp}" )
        local caller_line="${BASH_LINENO[0]}"
        
        # Build comprehensive error message
        >&2 echo "================================================================"
        >&2 echo "FATAL ERROR DETECTED"
        >&2 echo "----------------------------------------------------------------"
        >&2 echo "Script: ${caller_fn}"
        >&2 echo "Location: ${caller_fp}:${caller_line}"
        >&2 echo "Exit code: ${err_code}"
        >&2 echo "Time: $(date -u '+%Y-%m-%d %H:%M:%S UTC')"
        >&2 echo "Host: ${HOSTNAME}"
        >&2 echo "Working directory: ${PWD}"
        >&2 echo "----------------------------------------------------------------"
        
        # Classify error type
        if [[ ${err_code} -eq 137 ]]; then
            >&2 echo "ERROR TYPE: Process killed (SIGKILL)"
            >&2 echo "LIKELY CAUSE: Out of memory or walltime exceeded"
            >&2 echo "RECOMMENDATION: Check resource allocations"
        elif [[ ${err_code} -eq 143 ]]; then
            >&2 echo "ERROR TYPE: Process terminated (SIGTERM)"
            >&2 echo "LIKELY CAUSE: Job cancelled or walltime limit"
        elif [[ ${err_code} -ge 128 ]]; then
            local signal=$((err_code - 128))
            >&2 echo "ERROR TYPE: Terminated by signal ${signal}"
        elif [[ ${err_code} -eq 1 ]]; then
            >&2 echo "ERROR TYPE: General script error"
        else
            >&2 echo "ERROR TYPE: Script-specific error code ${err_code}"
        fi
        
        # Display recent output if available
        if [[ -n "${pgmout:-}" && -s "${pgmout}" ]]; then
            >&2 echo "----------------------------------------------------------------"
            >&2 echo "Last 30 lines of output:"
            >&2 echo "----------------------------------------------------------------"
            >&2 tail -30 "${pgmout}"
        fi
        
        >&2 echo "================================================================"
        
        # Send alert to workflow manager if configured
        if [[ "${WORKFLOW_MANAGER:-}" == "ecflow" && -n "${ECF_NAME:-}" ]]; then
            ecflow_client --abort="${ECF_NAME}: Script failed with error ${err_code}"
        fi
        
        # Exit with the original error code
        exit ${err_code}
    fi
    
    set -x  # Resume tracing
    return 0
}

# Export function for use in all scripts
declare -fx err_chk

Update: ush/source_util_funcs.sh (add after line 122)

#-----------------------------------------------------------------------
#
# Source the file containing the err_chk function
#
#-----------------------------------------------------------------------
#
  . ${bashutils_dir}/err_chk.sh
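For a quick sanity check of the err/err_chk contract before wiring in the full function, a runnable miniature (hypothetical, far smaller than the implementation above) behaves as follows:

```shell
#!/bin/bash
# Miniature err_chk honoring the same contract: read ${err}, act on
# non-zero. A real err_chk would exit; this one returns so the demo
# can continue.

err_chk_mini() {
    local code="${err:-0}"
    if [ "${code}" -ne 0 ]; then
        echo "FATAL: previous step failed with exit code ${code}" >&2
        return "${code}"
    fi
    return 0
}

false                      # simulate a failing ex-script
export err=$?
if ! err_chk_mini 2>/dev/null; then
    caught="yes"
else
    caught="no"
fi
echo "caught=${caught}"    # the non-zero err was acted upon
```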

Impact Assessment

  • Severity: CRITICAL
  • Files Affected: All 26 job scripts
  • Production Risk: Errors currently go undetected - scripts may fail with cryptic "command not found" messages or silently continue
  • Estimated Fix Effort: 1-2 days (create function + testing)
  • Priority: IMMEDIATE - This is a fundamental operational safety issue

2. INADEQUATE PYTHON ERROR HANDLING ⚠️ HIGH

Finding

Python utility modules lack comprehensive exception handling, following the same pattern as global-workflow but with better utility structure.

Example 1: ush/python_utils/run_command.py

def run_command(cmd):
    """Run system command in a subprocess

    Args:
        cmd: command to execute
    Returns:
        Tuple of (exit code, std_out, std_err)
    """
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        shell=True,
        universal_newlines=True,
    )

    std_out, std_err = proc.communicate()

    # strip trailing newline character
    return (proc.returncode, std_out.rstrip("\n"), std_err.rstrip("\n"))

EE2 Violations

  • ❌ No exception handling for subprocess.Popen failures
  • ❌ No timeout mechanism for long-running commands
  • ❌ No validation that cmd is non-empty or safe
  • ❌ Shell injection vulnerability (shell=True without sanitization)
  • ❌ No logging of command execution
  • ❌ Return code not checked by caller

Example 2: ush/python_utils/filesys_cmds_vrfy.py

def cmd_vrfy(cmd, *args):
    """Execute system command

    Args:
        cmd: the command
        *args: its arguments
    Returns:
        Exit code
    """

    cmd += " " + " ".join([str(a) for a in args])
    ret = os.system(cmd)
    if ret != 0:
        print_err_msg_exit(f"System call '{cmd}' failed.")
    return ret

EE2 Violations (Better than global-workflow)

  • ✅ Does check return code and calls error exit
  • ✅ Has error message with command context
  • ❌ Uses deprecated os.system() instead of subprocess
  • ❌ Shell injection vulnerability (command string concatenation)
  • ❌ No distinction between different error types
  • ❌ No retry logic for transient failures
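The concatenation risk is concrete. A hypothetical sketch (names invented, no repo code) showing how a crafted argument, once joined into a command string and handed to a shell, executes extra commands:

```shell
#!/bin/bash
# cmd_vrfy-style concatenation: arguments are joined into one string
# and handed to a shell, so metacharacters in an argument become code.

unsafe_run() {
    local cmd="$1 $2"       # mirrors: cmd += " " + " ".join(args)
    sh -c "${cmd}"
}

marker="/tmp/injected_$$"
crafted_arg="harmless; touch ${marker}"   # attacker-controlled value

unsafe_run echo "${crafted_arg}"

if [ -e "${marker}" ]; then
    result="injected"       # the "; touch" ran as a second command
    rm -f "${marker}"
else
    result="safe"
fi
echo "result=${result}"
```

Passing arguments as a list (as the recommended fix does with subprocess.run) avoids this entirely, because no shell ever reinterprets the string.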

Recommended Fix for run_command.py

#!/usr/bin/env python3

import subprocess
import shlex
import logging
from .print_msg import print_err_msg_exit

logger = logging.getLogger(__name__)


def run_command(cmd, timeout=None, check=False, shell=False, safe_shell=True):
    """Run system command in a subprocess with proper error handling
    
    Args:
        cmd: command to execute (string if shell=True, list otherwise)
        timeout: maximum execution time in seconds (default: no timeout)
        check: if True, raise exception on non-zero exit code
        shell: if True, execute through shell (security risk - use sparingly)
        safe_shell: if True and shell=True, validate command safety
        
    Returns:
        Tuple of (exit code, std_out, std_err)
        
    Raises:
        ValueError: if command is invalid
        subprocess.TimeoutExpired: if timeout exceeded
        subprocess.CalledProcessError: if check=True and command fails
    """
    
    # Validate input
    if not cmd:
        raise ValueError("Command cannot be empty")
    
    # Log command execution
    logger.info(f"Executing command: {cmd}")
    
    # Security check for shell commands
    if shell and safe_shell:
        if isinstance(cmd, str):
            # Check for dangerous patterns
            dangerous_patterns = [';', '&&', '||', '|', '>', '<', '`', '$(']
            if any(pattern in cmd for pattern in dangerous_patterns):
                logger.warning(f"Potentially unsafe shell command: {cmd}")
    
    try:
        proc = subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            shell=shell,
            text=True
        )
        
        try:
            std_out, std_err = proc.communicate(timeout=timeout)
        except subprocess.TimeoutExpired:
            proc.kill()
            std_out, std_err = proc.communicate()
            logger.error(f"Command timed out after {timeout}s: {cmd}")
            raise
        
        # Strip trailing newlines
        std_out = std_out.rstrip("\n") if std_out else ""
        std_err = std_err.rstrip("\n") if std_err else ""
        
        # Log results
        if proc.returncode != 0:
            logger.error(f"Command failed with exit code {proc.returncode}")
            logger.error(f"stderr: {std_err}")
            
            if check:
                raise subprocess.CalledProcessError(
                    proc.returncode, cmd, std_out, std_err
                )
        
        return (proc.returncode, std_out, std_err)
        
    except OSError as e:
        logger.error(f"Failed to execute command: {e}")
        print_err_msg_exit(f"Command execution failed: {cmd}\nError: {e}")
    except Exception as e:
        logger.error(f"Unexpected error executing command: {e}")
        raise


def run_command_safe(cmd_list, **kwargs):
    """Run command with automatic argument escaping (no shell injection)
    
    Args:
        cmd_list: list of command and arguments
        **kwargs: passed to run_command
        
    Returns:
        Tuple of (exit code, std_out, std_err)
    """
    if isinstance(cmd_list, str):
        # Parse string into safe list
        cmd_list = shlex.split(cmd_list)
    
    return run_command(cmd_list, shell=False, **kwargs)

Recommended Fix for filesys_cmds_vrfy.py

#!/usr/bin/env python3

import subprocess
import shlex
import logging
import errno
from .print_msg import print_err_msg_exit

logger = logging.getLogger(__name__)


def cmd_vrfy(cmd, *args, retry_count=0, retry_delay=1):
    """Execute system command with retry logic for transient failures

    Args:
        cmd: the command
        *args: its arguments
        retry_count: number of retries for transient failures
        retry_delay: seconds to wait between retries
        
    Returns:
        Exit code (always 0 on success, exits on permanent failure)
        
    Raises:
        SystemExit: on permanent failure or exceeded retries
    """
    import time
    
    # Build command list
    cmd_list = [cmd] + [str(a) for a in args]
    
    # Execute with retries
    for attempt in range(retry_count + 1):
        try:
            result = subprocess.run(
                cmd_list,
                capture_output=True,
                text=True,
                check=False
            )
            
            if result.returncode == 0:
                if result.stdout:
                    logger.info(result.stdout)
                return 0
            
            # Heuristic transient-error check: some tools propagate
            # errno-style exit codes, others describe the condition on stderr
            is_transient = False
            if result.returncode in [errno.EAGAIN, errno.EBUSY, errno.ETIMEDOUT]:
                is_transient = True
            elif "temporarily unavailable" in result.stderr.lower():
                is_transient = True
            elif "resource busy" in result.stderr.lower():
                is_transient = True
            
            if is_transient and attempt < retry_count:
                logger.warning(
                    f"Transient error on attempt {attempt + 1}/{retry_count + 1}, "
                    f"retrying in {retry_delay}s..."
                )
                time.sleep(retry_delay)
                continue
            
            # Permanent failure or retries exhausted
            error_msg = f"System call failed: {' '.join(cmd_list)}\n"
            error_msg += f"Exit code: {result.returncode}\n"
            if result.stderr:
                error_msg += f"Error output: {result.stderr}"
            
            print_err_msg_exit(error_msg)
            
        except FileNotFoundError:
            print_err_msg_exit(f"Command not found: {cmd}")
        except PermissionError:
            print_err_msg_exit(f"Permission denied: {cmd}")
        except Exception as e:
            print_err_msg_exit(f"Unexpected error executing {cmd}: {e}")
    
    return 0  # Never reached due to print_err_msg_exit


def cp_vrfy(*args):
    return cmd_vrfy("cp", *args, retry_count=2)


def rsync_vrfy(*args):
    return cmd_vrfy("rsync", *args, retry_count=3)


def mv_vrfy(*args):
    return cmd_vrfy("mv", *args, retry_count=2)


def rm_vrfy(*args):
    return cmd_vrfy("rm", *args)


def ln_vrfy(*args):
    return cmd_vrfy("ln", *args)


def mkdir_vrfy(*args):
    return cmd_vrfy("mkdir", *args, retry_count=1)

Impact Assessment

  • Severity: HIGH
  • Files Affected: 54 Python scripts/modules
  • Production Risk: Unhandled exceptions, no retry for transient failures, security vulnerabilities
  • Estimated Fix Effort: 2-3 weeks
  • Advantage over Global Workflow: Already has cmd_vrfy checking return codes

3. ENVIRONMENT VARIABLE VALIDATION GAPS ⚠️ HIGH

Finding

Job scripts use environment variables with defaults but without validation, similar to global-workflow but with better consistency.

Example: jobs/JRRFS_FORECAST (lines 7-29)

export NET='rrfs'
export MACHINE="WCOSS2"
export WORKFLOW_MANAGER="ecflow"
export HOMErrfs=${HOMErrfs:-""}
export EXECrrfs=${EXECrrfs:-${HOMErrfs}/exec}
export FIXrrfs=${FIXrrfs:-${HOMErrfs}/fix}
export PARMrrfs=${PARMrrfs:-${HOMErrfs}/parm}
export USHrrfs=${USHrrfs:-${HOMErrfs}/ush}
export SORCrrfs=${SORCrrfs:-${HOMErrfs}/sorc}
export FIXam=${FIXam:-${HOMErrfs}/fix/am}
export FIXLAM=${FIXLAM:-${HOMErrfs}/fix/lam/RRFS_NA_3km}
export FIXgsm=${FIXgsm:-${HOMErrfs}/fix/am}
export FIX_GSI="${FIXrrfs}/gsi"
export FIX_UPP="${FIXrrfs}/upp"
export FIXprdgen="${FIXrrfs}/prdgen"
export FIX_SMOKE_DUST="${FIXrrfs}/smoke_dust"
export GRID_DIR="${HOMErrfs}/fix/lam/RRFS_NA_3km"
export OROG_DIR="${HOMErrfs}/fix/lam/RRFS_NA_3km"
export SFC_CLIMO_DIR="${HOMErrfs}/fix/lam/RRFS_NA_3km"

EE2 Violations

  • ❌ HOMErrfs defaults to empty string "" instead of failing
  • ❌ All dependent paths become invalid if HOMErrfs="" (e.g., /exec, /fix)
  • ❌ No validation that directories exist before use
  • ❌ MACHINE hardcoded to "WCOSS2" - no runtime detection
  • ❌ WGF variable used without validation (lines 68-78, 94)
  • ❌ cyc, PDY, CDATE used without validation
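The empty-default cascade can be demonstrated in isolation (hypothetical sketch; the repo variable names are reused for illustration only):

```shell
#!/bin/bash
# With HOMErrfs unset, the ${VAR:-""} pattern yields "" rather than
# failing, and every dependent path silently becomes root-relative.

unset -v HOMErrfs EXECrrfs FIXrrfs

export HOMErrfs=${HOMErrfs:-""}
export EXECrrfs=${EXECrrfs:-${HOMErrfs}/exec}
export FIXrrfs=${FIXrrfs:-${HOMErrfs}/fix}

echo "EXECrrfs=${EXECrrfs}"   # "/exec": a path at the filesystem root
echo "FIXrrfs=${FIXrrfs}"     # "/fix"
```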

Example: jobs/JRRFS_MAKE_GRID (lines 14-20)

export HOMErrfs=${HOMErrfs:-""}
export EXECrrfs=${EXECrrfs:-${HOMErrfs}/exec}
export FIXrrfs=${FIXrrfs:-${HOMErrfs}/fix}
export PARMrrfs=${PARMrrfs:-${HOMErrfs}/parm}
export USHrrfs=${USHrrfs:-${HOMErrfs}/ush}
export FIXam=${FIXam:-${HOMErrfs}/fix/am}
export FIXLAM=${FIXLAM:-${HOMErrfs}/fix/lam/RRFS_NA_3km}

Same pattern repeated in all 26 job scripts!

Recommended Fix

Create: ush/bash_utils/validate_env.sh

#!/bin/bash

function validate_rrfs_environment() {
    #
    # Validate critical RRFS environment variables
    #
    # Usage: Call at start of job scripts before any path usage
    #
    set +x  # Don't trace validation
    
    local missing_vars=()
    local invalid_vars=()
    local missing_dirs=()
    
    # Critical variables that must be set
    local required_vars=(
        "HOMErrfs"
        "cyc"
        "PDY"
        "WGF"
        "DATAROOT"
        "job"
    )
    
    # Check required variables are non-empty
    for var in "${required_vars[@]}"; do
        if [[ -z "${!var:-}" ]]; then
            missing_vars+=("${var}")
        fi
    done
    
    # Report missing variables and exit
    if [[ ${#missing_vars[@]} -gt 0 ]]; then
        >&2 echo "================================================================"
        >&2 echo "FATAL ERROR: Required environment variables not set"
        >&2 echo "----------------------------------------------------------------"
        >&2 printf "  %s\n" "${missing_vars[@]}"
        >&2 echo "================================================================"
        exit 1
    fi
    
    # Validate PDY format (YYYYMMDD)
    if [[ ! "${PDY}" =~ ^[0-9]{8}$ ]]; then
        invalid_vars+=("PDY=${PDY} (expected YYYYMMDD format)")
    fi
    
    # Validate cyc format (HH)
    if [[ ! "${cyc}" =~ ^[0-9]{2}$ ]] || [[ $((10#${cyc})) -gt 23 ]]; then
        invalid_vars+=("cyc=${cyc} (expected HH format, 00-23)")
    fi
    
    # Validate WGF value
    local valid_wgf=("det" "enkf" "ensf" "firewx")
    if [[ ! " ${valid_wgf[*]} " =~ " ${WGF} " ]]; then
        invalid_vars+=("WGF=${WGF} (expected: det, enkf, ensf, or firewx)")
    fi
    
    # Report invalid formats
    if [[ ${#invalid_vars[@]} -gt 0 ]]; then
        >&2 echo "================================================================"
        >&2 echo "FATAL ERROR: Invalid environment variable formats"
        >&2 echo "----------------------------------------------------------------"
        >&2 printf "  %s\n" "${invalid_vars[@]}"
        >&2 echo "================================================================"
        exit 1
    fi
    
    # Validate critical directories exist
    local critical_dirs=(
        "${HOMErrfs}"
        "${HOMErrfs}/exec"
        "${HOMErrfs}/fix"
        "${HOMErrfs}/parm"
        "${HOMErrfs}/ush"
        "${HOMErrfs}/scripts"
    )
    
    for dir in "${critical_dirs[@]}"; do
        if [[ ! -d "${dir}" ]]; then
            missing_dirs+=("${dir}")
        fi
    done
    
    # Report missing directories
    if [[ ${#missing_dirs[@]} -gt 0 ]]; then
        >&2 echo "================================================================"
        >&2 echo "FATAL ERROR: Required directories do not exist"
        >&2 echo "----------------------------------------------------------------"
        >&2 printf "  %s\n" "${missing_dirs[@]}"
        >&2 echo "================================================================"
        >&2 echo "Check that HOMErrfs is set correctly: ${HOMErrfs}"
        exit 1
    fi
    
    # Validate DATAROOT exists and is writable
    if [[ ! -d "${DATAROOT}" ]]; then
        >&2 echo "FATAL ERROR: DATAROOT does not exist: ${DATAROOT}"
        exit 1
    fi
    
    if [[ ! -w "${DATAROOT}" ]]; then
        >&2 echo "FATAL ERROR: DATAROOT is not writable: ${DATAROOT}"
        exit 1
    fi
    
    set -x  # Resume tracing
    echo "Environment validation: PASSED"
    return 0
}

# Export function
declare -fx validate_rrfs_environment

Update all job scripts (add after line 6, before variable exports):

date
export PS4='+ $SECONDS + ' 
set -xue

# Validate environment before proceeding
. ${HOMErrfs}/ush/bash_utils/validate_env.sh
validate_rrfs_environment

#-----------------------------------------------------------------------
# Specify Package Location
#-----------------------------------------------------------------------
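The format checks can also be exercised standalone before touching the job scripts; a tiny hypothetical sketch using the same regexes (helper name invented):

```shell
#!/bin/bash
# Standalone versions of the PDY/cyc format checks (hypothetical
# helper name; same patterns as the validation function).

validate_pdy_cyc() {
    [[ "${PDY:-}" =~ ^[0-9]{8}$ ]] || { echo "bad PDY: '${PDY:-}'"; return 1; }
    [[ "${cyc:-}" =~ ^[0-9]{2}$ ]] || { echo "bad cyc: '${cyc:-}'"; return 1; }
    return 0
}

PDY="2025113"; cyc="06"   # 7-digit PDY: must be rejected
if validate_pdy_cyc; then ok_short="accepted"; else ok_short="rejected"; fi

PDY="20251103"; cyc="06"  # well-formed values: must pass
if validate_pdy_cyc; then ok_good="accepted"; else ok_good="rejected"; fi

echo "short=${ok_short} good=${ok_good}"
```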

Impact Assessment

  • Severity: HIGH
  • Files Affected: All 26 job scripts
  • Production Risk: Invalid paths cause cascading failures, operations on wrong directories
  • Estimated Fix Effort: 2 weeks (centralized validation + script updates)
  • Better than Global Workflow: Uses set -u so undefined variables will fail immediately

4. NO TRAP HANDLERS FOR CLEANUP ⚠️ MEDIUM-HIGH

Finding

RRFS workflow has no trap handlers for cleanup on error, interrupt, or exit - identical issue to global-workflow.

Evidence

$ grep -r "trap " rrfs-workflow/ush/bash_utils/*.sh
# No results

$ grep -r "trap " rrfs-workflow/jobs/
# No results

$ grep -r "trap " rrfs-workflow/scripts/
# No results

Job Script Pattern (All 26 Jobs)

export DATA=${DATA:-${DATAROOT}/${jobid}}
mkdir -p ${DATA}
cd ${DATA}

# ... work happens ...

# No cleanup code at all
# No trap handlers
# Temporary files/directories left behind on error

EE2 Violations

  • ❌ No cleanup on abnormal termination (SIGINT, SIGTERM, SIGKILL)
  • ❌ No cleanup on script errors
  • ❌ Temporary files left in ${DATA} directory on failure
  • ❌ No tracking of background processes
  • ❌ Resource leaks (disk space) accumulate over time
  • ❌ No graceful shutdown mechanism
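The mechanism the fix relies on is bash's EXIT trap, which fires on any exit path, including mid-run failures. A minimal hypothetical sketch:

```shell
#!/bin/bash
# An EXIT trap registered inside the (sub)shell runs on any exit path,
# so scratch space is reclaimed even after a mid-run failure.

workdir=$(mktemp -d)

(
    trap 'rm -rf "${workdir}"' EXIT   # cleanup registered up front
    touch "${workdir}/partial_output"
    false                             # simulated mid-run failure
    exit 1                            # abnormal exit; trap still fires
) || true

if [ -d "${workdir}" ]; then
    result="left behind"
else
    result="cleaned up"
fi
echo "workdir was ${result}"
```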

Recommended Fix

Add to: ush/bash_utils/cleanup.sh

#!/bin/bash

# Global arrays to track resources
declare -a CLEANUP_DIRS=()
declare -a CLEANUP_FILES=()
declare -a BACKGROUND_PIDS=()

function register_cleanup_dir() {
    # Register directory for cleanup
    # Usage: register_cleanup_dir /path/to/dir
    CLEANUP_DIRS+=("$1")
}

function register_cleanup_file() {
    # Register file for cleanup
    # Usage: register_cleanup_file /path/to/file
    CLEANUP_FILES+=("$1")
}

function register_background_job() {
    # Register background PID
    # Usage: some_command & register_background_job $!
    BACKGROUND_PIDS+=("$1")
}

function cleanup_on_exit() {
    #
    # Cleanup function for exit, error, or interrupt
    #
    local exit_code=$?
    set +x
    
    echo "================================================================"
    echo "Running cleanup (exit code: ${exit_code})"
    echo "================================================================"
    
    # Kill background jobs
    if [[ ${#BACKGROUND_PIDS[@]} -gt 0 ]]; then
        echo "Killing ${#BACKGROUND_PIDS[@]} background job(s)..."
        for pid in "${BACKGROUND_PIDS[@]}"; do
            if kill -0 "${pid}" 2>/dev/null; then
                echo "  Killing PID ${pid}"
                kill "${pid}" 2>/dev/null || true
            fi
        done
    fi
    
    # Kill untracked background jobs
    local untracked=$(jobs -p)
    if [[ -n "${untracked}" ]]; then
        echo "Killing untracked background jobs..."
        echo "${untracked}" | xargs -r kill 2>/dev/null || true
    fi
    
    # Remove temporary files
    if [[ ${#CLEANUP_FILES[@]} -gt 0 ]]; then
        echo "Removing ${#CLEANUP_FILES[@]} temporary file(s)..."
        for file in "${CLEANUP_FILES[@]}"; do
            [[ -f "${file}" ]] && rm -f "${file}" 2>/dev/null || true
        done
    fi
    
    # Remove temporary directories
    if [[ "${KEEPDATA:-NO}" == "NO" ]]; then
        if [[ ${#CLEANUP_DIRS[@]} -gt 0 ]]; then
            echo "Removing ${#CLEANUP_DIRS[@]} temporary director(ies)..."
            for dir in "${CLEANUP_DIRS[@]}"; do
                if [[ -d "${dir}" ]]; then
                    echo "  Removing ${dir}"
                    rm -rf "${dir}" 2>/dev/null || true
                fi
            done
        fi
        
        # Clean up DATA directory
        if [[ -n "${DATA:-}" && -d "${DATA}" ]]; then
            echo "Removing working directory: ${DATA}"
            cd "${DATAROOT:-/tmp}" 2>/dev/null || cd /tmp
            rm -rf "${DATA}" 2>/dev/null || true
        fi
    else
        echo "KEEPDATA=YES: Preserving directories"
    fi
    
    echo "================================================================"
    echo "Cleanup complete"
    echo "================================================================"
    
    return ${exit_code}
}

function error_trap() {
    #
    # Trap handler for ERR signal
    #
    local exit_code=$?
    local line_number=$1
    
    set +x
    >&2 echo "================================================================"
    >&2 echo "ERROR TRAPPED"
    >&2 echo "----------------------------------------------------------------"
    >&2 echo "Exit code: ${exit_code}"
    >&2 echo "Line number: ${line_number}"
    >&2 echo "Script: ${BASH_SOURCE[1]:-unknown}"
    >&2 echo "================================================================"
    
    export err=${exit_code}
}

# Set trap handlers
trap cleanup_on_exit EXIT
trap 'error_trap ${LINENO}' ERR
trap 'echo "Interrupt received"; exit 130' INT
trap 'echo "Termination signal received"; exit 143' TERM

# Export functions
declare -fx register_cleanup_dir
declare -fx register_cleanup_file
declare -fx register_background_job
declare -fx cleanup_on_exit

Update: ush/source_util_funcs.sh (add cleanup sourcing)

#-----------------------------------------------------------------------
#
# Source the file containing cleanup functions and set traps
#
#-----------------------------------------------------------------------
#
  . ${bashutils_dir}/cleanup.sh

Update job scripts (add after creating DATA directory):

export DATA=${DATA:-${DATAROOT}/${jobid}}
mkdir -p ${DATA}
cd ${DATA}

# Register for cleanup
register_cleanup_dir "${DATA}"

Impact Assessment

  • Severity: MEDIUM-HIGH
  • Files Affected: All job and execution scripts
  • Production Risk: Disk space exhaustion, orphaned processes, resource leaks
  • Estimated Fix Effort: 1-2 weeks
  • Same Issue as Global Workflow: No advantage

5. INSUFFICIENT ERROR CONTEXT IN EXECUTION SCRIPTS ⚠️ MEDIUM

Finding

Execution scripts use print_err_msg_exit which is better than global-workflow's err_exit, but error messages lack operational context.

Example: scripts/exrrfs_post.sh (line 92)

  *)
    err_exit "\
Run command has not been specified for this machine:
  MACHINE = \"$MACHINE\"
  APRUN = \"$APRUN\""
    ;;

Better Error Function (Already Exists)

ush/bash_utils/print_msg.sh has comprehensive error messaging:

function print_err_msg_exit() {
  # Prints detailed error with:
  # - Calling function/script name
  # - Full file path
  # - Error message
  # - "Exiting with nonzero status"
}

EE2 Violations (Partial Compliance)

  • ✅ Better than global-workflow - has detailed error function
  • ✅ Shows calling script/function information
  • ❌ No error classification (transient vs permanent)
  • ❌ No timestamps in error messages
  • ❌ No hostname/location information
  • ❌ No recent output context
  • ❌ No retry suggestions

Recommended Enhancement

Update: ush/bash_utils/print_msg.sh - Enhance print_err_msg_exit

function print_err_msg_exit() {
    set +x  # Don't trace while assembling the error report
    
    local scrfunc_fp=$( readlink -f "${BASH_SOURCE[0]}" )
    local scrfunc_fn=$( basename "${scrfunc_fp}" )
    local scrfunc_dir=$( dirname "${scrfunc_fp}" )
    local func_name="${FUNCNAME[0]}"
    
    local caller_fp=$( readlink -f "${BASH_SOURCE[1]}" )
    local caller_fn=$( basename "${caller_fp}" )
    local caller_dir=$( dirname "${caller_fp}" )
    local caller_name="${FUNCNAME[1]}"
    
    local msg_header \
          msg_footer \
          err_msg
    
    # Enhanced header with operational context
    if [ "${caller_name}" = "main" ] || [ "${caller_name}" = "script" ]; then
        msg_header=$( printf "\n\
================================================================
FATAL ERROR
----------------------------------------------------------------
From script:  \"${caller_fn}\"
Full path:    \"${caller_fp}\"
Time:         $(date -u '+%Y-%m-%d %H:%M:%S UTC')
Host:         ${HOSTNAME}
Working dir:  ${PWD}
User:         ${USER}
PDY/cyc:      ${PDY:-unknown}/${cyc:-unknown}
WGF:          ${WGF:-unknown}
Job:          ${job:-unknown}
================================================================
"
        )
    else
        msg_header=$( printf "\n\
================================================================
FATAL ERROR
----------------------------------------------------------------
From function:  \"${caller_name}\"
In file:        \"${caller_fn}\"
Full path:      \"${caller_fp}\"
Time:           $(date -u '+%Y-%m-%d %H:%M:%S UTC')
Host:           ${HOSTNAME}
Working dir:    ${PWD}
User:           ${USER}
PDY/cyc:        ${PDY:-unknown}/${cyc:-unknown}
WGF:            ${WGF:-unknown}
Job:            ${job:-unknown}
================================================================
"
        )
    fi
    
    # Enhanced footer with guidance
    msg_footer=$( printf "\n\
----------------------------------------------------------------
OPERATOR ACTIONS:
  1. Check log files in: ${DATA:-unknown}
  2. Review pgmout: ${pgmout:-OUTPUT.$$}
  3. Check disk space: df -h ${DATA:-/tmp}
  4. Contact on-call developer if issue persists
================================================================
Exiting with nonzero status." )
    
    if [ "$#" -eq 0 ]; then
        err_msg=""
    elif [ "$#" -eq 1 ]; then
        err_msg="\n$1\n"
    fi
    
    # Show recent output if available
    if [ -n "${pgmout:-}" && -s "${pgmout}" ](/TerrenceMcGuinness-NOAA/global-workflow/wiki/--n-"${pgmout:-}"-&&--s-"${pgmout}"-); then
        err_msg+=$( printf "\n\
----------------------------------------------------------------
Last 20 lines of output (${pgmout}):
----------------------------------------------------------------\n" )
        err_msg+=$(tail -20 "${pgmout}" 2>/dev/null)
        err_msg+=$( printf "\n----------------------------------------------------------------" )
    fi
    
    printf "FATAL ERROR: ${msg_header}${err_msg}${msg_footer}\n" 1>&2
    exit 1
}

Impact Assessment

  • Severity: MEDIUM
  • Files Affected: ush/bash_utils/print_msg.sh (1 file, affects all scripts)
  • Production Risk: Incomplete error information slows debugging and increases time to restore service
  • Estimated Fix Effort: 1 week
  • Advantage: Already has better foundation than global-workflow

Implementation Priority and Timeline

Phase 1: Critical Safety (Weeks 1-2)

Priority: IMMEDIATE

  1. Implement err_chk function - Scripts currently call undefined function
    • Create ush/bash_utils/err_chk.sh
    • Update source_util_funcs.sh to source it
    • Test with all 26 job scripts
    • Deliverable: err_chk function operational in all jobs
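
The call pattern in the job scripts (export err=$?; err_chk) suggests a shape like the following. This is a minimal sketch, modeled on the behavior of NCO's prod_util err_chk; the function body, message layout, and variable fallbacks are assumptions, not the repository's actual implementation.

```shell
#!/usr/bin/env bash
# Sketch of an err_chk implementation (hypothetical; modeled on the
# prod_util convention of checking the exported $err status).
err_chk() {
    local status="${err:-0}"
    if [ "${status}" -ne 0 ]; then
        printf "FATAL ERROR: previous step exited with status %s\n" "${status}" 1>&2
        printf "  job:     %s\n  PDY/cyc: %s/%s\n" \
            "${job:-unknown}" "${PDY:-unknown}" "${cyc:-unknown}" 1>&2
        # Surface the tail of pgmout, when available, for quick triage
        if [ -n "${pgmout:-}" ] && [ -s "${pgmout}" ]; then
            tail -20 "${pgmout}" 1>&2
        fi
        exit "${status}"
    fi
}
```

This shape is drop-in compatible with the existing job-script call sites, e.g. ${HOMErrfs}/scripts/exrrfs_forecast.sh followed by export err=$?; err_chk.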

Phase 2: Python Error Handling (Weeks 3-4)

Priority: HIGH

  2. Enhanced Python utilities - Fix security and error handling gaps

  • Update run_command.py with timeout and validation
  • Update filesys_cmds_vrfy.py with retry logic
  • Add logging throughout Python modules
  • Deliverable: Robust Python utility library

Phase 3: Environment Validation (Weeks 5-6)

Priority: HIGH

  3. Environment variable validation - Prevent invalid path usage

  • Create validate_env.sh function
  • Update all job scripts to call validation
  • Test with missing/invalid variables
  • Deliverable: 100% validation coverage
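
A validate_env.sh helper could look like the sketch below. The function names and the two-tier check (set/non-empty, then directory existence) are assumptions for illustration, not the planned implementation.

```shell
#!/usr/bin/env bash
# Hypothetical validate_env sketch. Fails fast when a required variable
# is unset/empty, or (for the _dirs variant) does not name a directory.
validate_env() {
    local var val
    for var in "$@"; do
        val="${!var:-}"            # bash indirect expansion
        if [ -z "${val}" ]; then
            printf "FATAL ERROR: required variable %s is unset or empty\n" "${var}" 1>&2
            return 1
        fi
    done
    return 0
}

validate_env_dirs() {
    local var
    for var in "$@"; do
        validate_env "${var}" || return 1
        if [ ! -d "${!var}" ]; then
            printf "FATAL ERROR: %s=%s is not a directory\n" "${var}" "${!var}" 1>&2
            return 1
        fi
    done
    return 0
}
```

A job script would call it before using any path, e.g. validate_env PDY cyc job || exit 1 and validate_env_dirs HOMErrfs DATA || exit 1 (variable names here are examples).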

Phase 4: Cleanup Infrastructure (Weeks 7-8)

Priority: MEDIUM-HIGH

  4. Trap handlers and cleanup - Prevent resource leaks

  • Implement cleanup functions
  • Add trap handlers to all scripts
  • Test cleanup in failure scenarios
  • Deliverable: Automatic cleanup on all exit conditions
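
One way to structure this is a single cleanup function registered on EXIT, with INT/TERM routed through it so signals also trigger cleanup. The sketch below is illustrative; the function name, the work_dir variable, and the signal exit codes are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical cleanup/trap sketch. An EXIT trap runs on normal exit,
# on errors under set -e, and (via the signal traps) on INT/TERM,
# so temporary directories are removed in every termination path.
cleanup() {
    local rc=$?
    # Only remove directories this script created itself
    if [ -n "${work_dir:-}" ] && [ -d "${work_dir}" ]; then
        rm -rf "${work_dir}"
    fi
    exit "${rc}"
}
# Typical registration near the top of a job script:
#   work_dir=$(mktemp -d)
#   trap cleanup EXIT
#   trap 'exit 130' INT
#   trap 'exit 143' TERM
```

Routing signals through exit (rather than trapping cleanup on each signal directly) ensures the EXIT trap fires exactly once.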

Phase 5: Error Messaging (Weeks 9-10)

Priority: MEDIUM

  5. Enhanced error context - Improve debuggability

  • Update print_err_msg_exit with operational context
  • Add timestamps and location information
  • Include recent output in error messages
  • Deliverable: Comprehensive error reporting

Testing Strategy

Unit Testing

  • Test err_chk with various error codes
  • Test validation with missing/invalid variables
  • Test Python utilities with failures and timeouts
  • Test cleanup in all exit scenarios

Integration Testing

  • Run full RRFS workflow cycles with modifications
  • Test error propagation through workflow
  • Verify cleanup happens correctly
  • Test ecFlow integration

Failure Testing

  • Kill jobs at various stages
  • Introduce resource exhaustion
  • Test signal handling (INT, TERM)
  • Verify error messages are actionable

Operational Testing

  • Run parallel to production for 1 week
  • Compare error messages and MTTR
  • Gather operator feedback

Success Metrics

Quantitative

  • Zero "command not found" errors for err_chk
  • 100% environment validation before path usage
  • MTTR reduced to under 5 minutes for common failures through better error messages
  • Zero resource leaks after failures
  • 100% cleanup execution on abnormal termination

Qualitative

  • Operators can diagnose failures without developer assistance
  • Error messages provide actionable guidance
  • Log files contain sufficient context
  • Scripts fail fast with clear messages
  • No disk space issues from accumulated temporary files

RRFS Advantages Over Global Workflow

  1. Consistent set -xue - Catches undefined variables and command failures
  2. Custom error functions - print_err_msg_exit provides detailed context
  3. Verified filesystem operations - *_vrfy functions check return codes
  4. Better Python structure - Centralized utility modules
  5. Consistent variable naming - RRFS-specific prefix conventions

Appendix: EE2 Compliance Checklist

Critical (Must Fix)

  • Implement err_chk function (WEEKS 1-2)
  • Fix Python error handling (WEEKS 3-4)
  • Add environment validation (WEEKS 5-6)

High Priority

  • Add trap handlers and cleanup (WEEKS 7-8)
  • Enhance error messaging (WEEKS 9-10)

Medium Priority

  • Add retry logic for transient failures
  • Implement timeout mechanisms
  • Add security validation for shell commands
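
For the retry item, a small wrapper is usually enough for transient filesystem or network hiccups. The helper below is a sketch; the name retry and the attempt/delay parameters are illustrative, and the timeout item could be layered on by prefixing the wrapped command with GNU coreutils timeout.

```shell
#!/usr/bin/env bash
# Hypothetical retry helper: rerun a command up to $1 times,
# sleeping $2 seconds between attempts.
retry() {
    local attempts="$1" delay="$2"
    shift 2
    local n=1
    while true; do
        "$@" && return 0
        if [ "${n}" -ge "${attempts}" ]; then
            printf "ERROR: command failed after %s attempts: %s\n" \
                "${attempts}" "$*" 1>&2
            return 1
        fi
        n=$((n + 1))
        sleep "${delay}"
    done
}
```

Example usage: retry 3 5 cp "${src}" "${dst}" retries a flaky copy three times with 5-second pauses (src/dst are placeholders).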

Best Practices

  • Add unit tests for utility functions
  • Document error codes and meanings
  • Create operator troubleshooting guide
  • Set up automated compliance checking

Document Metadata

  • Author: AI Analysis System (MCP RAG)
  • Date: November 3, 2025
  • Repository: NOAA-EMC/rrfs-workflow
  • Branch: Analyzed main branch
  • Analysis Tools: MCP RAG System (ChromaDB + Neo4j), grep_search, file reading
  • Files Analyzed: 142+ (26 jobs, 27 scripts, 35 ush, 54 Python)
  • Review Status: Initial Draft
  • Next Review: TBD

Contact and Questions

For questions about this analysis or implementation guidance, contact:


END OF DOCUMENT