EE2_COMPLIANCE_ANALYSIS_GLOBAL_WORKFLOW - TerrenceMcGuinness-NOAA/global-workflow GitHub Wiki
Global Workflow EE2 Compliance Analysis
Date: November 3, 2025
Repository: NOAA-EMC/global-workflow (forked)
Analysis Scope: Top 5 Critical Compliance Issues
Executive Summary
This analysis identifies the top 5 EE2 (NCEP Central Operations' Environmental Equivalence 2 implementation standards) compliance issues in the global-workflow repository. Based on a comprehensive review of 172+ job scripts, 83+ execution scripts, and supporting utilities, these issues represent the highest-priority remediation targets for operational readiness.
Total Estimated Remediation Effort: 9-14 weeks
1. INADEQUATE ERROR HANDLING IN PYTHON SCRIPTS ⚠️ CRITICAL
Finding
Python execution scripts lack comprehensive try-except blocks and specific error handling, violating EE2 requirements for explicit error management and graceful failure modes.
Example: scripts/exglobal_forecast.py
@logit(logger)
def main():
    config = cast_strdict_as_dtypedict(os.environ)
    save_as_yaml(config, f'{config.EXPDIR}/fcst.yaml')
    fcst = GFSForecast(config)
    fcst.initialize()
    fcst.configure()

if __name__ == '__main__':
    main()
EE2 Violations
- ❌ No try-except blocks for exception handling
- ❌ No validation of critical environment variables before use
- ❌ Missing error logging with specific error codes
- ❌ No graceful degradation or cleanup on failure
- ❌ config.EXPDIR accessed without validation (KeyError risk)
- ❌ Missing return code to shell environment
Recommended Fix
#!/usr/bin/env python3
import os
import sys

from wxflow import Logger, logit, save_as_yaml, cast_strdict_as_dtypedict
from pygfs.task.gfs_forecast import GFSForecast

# Initialize root logger
logger = Logger(level=os.environ.get("LOGGING_LEVEL", "INFO"), colored_log=True)

@logit(logger)
def main():
    """
    Main entry point for GFS forecast task.

    Returns:
        int: Exit code (0 for success, non-zero for failure)
    """
    try:
        # Validate critical environment variables
        required_vars = ['EXPDIR', 'PDY', 'cyc', 'HOMEgfs', 'DATA', 'RUN']
        missing_vars = [var for var in required_vars if var not in os.environ]
        if missing_vars:
            logger.error(f"FATAL: Required environment variables not set: {', '.join(missing_vars)}")
            return 1

        # Cast environment to configuration dictionary
        config = cast_strdict_as_dtypedict(os.environ)

        # Validate critical config keys
        if 'EXPDIR' not in config:
            raise ValueError("EXPDIR not found in configuration after casting")

        # Save configuration for debugging
        config_file = f'{config.EXPDIR}/fcst.yaml'
        logger.info(f"Saving configuration to {config_file}")
        save_as_yaml(config, config_file)

        # Instantiate and run forecast
        logger.info("Initializing GFS forecast task")
        fcst = GFSForecast(config)
        fcst.initialize()
        logger.info("Configuring forecast")
        fcst.configure()

        logger.info("Forecast configuration completed successfully")
        return 0

    except KeyError as e:
        logger.error(f"Configuration key error: {e}")
        logger.error("Check that all required environment variables are set")
        return 2
    except ValueError as e:
        logger.error(f"Configuration value error: {e}")
        return 3
    except IOError as e:
        logger.error(f"File I/O error: {e}")
        logger.error(f"Check permissions and disk space in {os.environ.get('EXPDIR', 'unknown')}")
        return 4
    except Exception as e:
        logger.error(f"Unexpected error in forecast task: {e}")
        logger.exception("Full traceback:")
        return 99

if __name__ == '__main__':
    sys.exit(main())
Impact Assessment
- Severity: CRITICAL
- Files Affected: ~42 Python execution scripts in scripts/exglobal_*.py and scripts/exgdas_*.py
- Production Risk: Unhandled exceptions cause silent failures or cryptic error messages, leading to delayed incident response
- Estimated Fix Effort: 2-3 weeks (pattern can be templated and applied systematically)
2. INCONSISTENT ERROR EXIT HANDLING IN SHELL SCRIPTS ⚠️ HIGH
Finding
Job scripts use the && true pattern to prevent set -e from terminating on errors, but then fail to provide adequate error context or classification when errors are detected.
Example: jobs/JGLOBAL_FORECAST (lines 110-115)
###############################################################
# Run relevant exglobal script
###############################################################
"${FORECASTSH:-${SCRgfs}/exglobal_forecast.sh}" && true # The && true prevents the shell from exiting when set -e
export err=$?
if [[ ${err} -ne 0 ]]; then
    err_exit
fi
Example: jobs/JGLOBAL_ATM_ANALYSIS_INITIALIZE (lines 32-36)
###############################################################
# Run relevant script
EXSCRIPT=${GDASATMINITPY:-${SCRgfs}/exglobal_atm_analysis_initialize.py}
${EXSCRIPT} && true
export err=$?
if [[ ${err} -ne 0 ]]; then
    err_exit
fi
EE2 Violations
- ❌ The && true pattern defeats set -e error detection
- ❌ Error code captured but not logged before err_exit
- ❌ No context about what operation failed
- ❌ Missing error classification (transient vs. permanent)
- ❌ No retry logic for transient failures (network, filesystem)
- ❌ Operator receives minimal information for incident response
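These violations hinge on subtle shell semantics that are worth pinning down before reviewing a fix. The standalone sketch below (using `false` as a stand-in for a failing exglobal script) shows why `cmd && true` survives `set -e` while still exposing the real status in `$?`, why explicit capture with `|| rc=$?` avoids `$?` entirely, and why the natural-looking `if ! cmd` alternative silently loses the exit code:

```shell
#!/usr/bin/env bash
set -e

# "cmd && true" does not abort under set -e, because the failing command is
# part of an && list; the real status is still available in $?.
false && true
err=$?
echo "captured via && true: err=${err}"        # err=1

# Explicit capture avoids relying on $? at all:
rc=0
false || rc=$?
echo "captured via || rc=\$?: rc=${rc}"        # rc=1

# Pitfall: inside "if ! cmd", $? holds the NEGATED status (0), not cmd's code.
if ! false; then
    echo "inside 'if ! cmd': \$?=$?"           # prints 0, the code is lost
fi
```

This is why the remediation pattern must capture the exit code on the same command list as the script invocation, rather than after an inverted `if` test.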
Recommended Fix
###############################################################
# Run relevant exglobal script
###############################################################
# Capture script name and location for logging
EXSCRIPT="${FORECASTSH:-${SCRgfs}/exglobal_forecast.sh}"

# Log execution attempt
echo "================================================================"
echo "Executing: ${EXSCRIPT}"
echo "Start time: $(date -u '+%Y-%m-%d %H:%M:%S UTC')"
echo "Host: ${HOSTNAME}"
echo "Working directory: ${PWD}"
echo "Cycle: PDY=${PDY} cyc=${cyc} RUN=${RUN}"
echo "================================================================"

# Execute with explicit error capture; "|| err=$?" records the real exit
# code without tripping set -e (inside "if ! cmd", $? would hold the
# negated status, so the code must be captured this way instead)
err=0
"${EXSCRIPT}" || err=$?
export err

if [[ ${err} -ne 0 ]]; then
    # Log comprehensive error information
    echo "================================================================"
    echo "FATAL ERROR: Script execution failed"
    echo "----------------------------------------------------------------"
    echo "Script: ${EXSCRIPT}"
    echo "Exit code: ${err}"
    echo "Error time: $(date -u '+%Y-%m-%d %H:%M:%S UTC')"
    echo "Error location: ${HOSTNAME}:${PWD}"
    echo "Cycle info: PDY=${PDY} cyc=${cyc} RUN=${RUN}"
    echo "================================================================"

    # Classify error type for operator guidance
    if [[ ${err} -eq 137 ]]; then
        echo "ERROR TYPE: Process killed (SIGKILL)"
        echo "LIKELY CAUSE: Out of memory or walltime exceeded"
        echo "RECOMMENDATION: Check resource allocations in job card"
    elif [[ ${err} -eq 143 ]]; then
        echo "ERROR TYPE: Process terminated (SIGTERM)"
        echo "LIKELY CAUSE: Job cancelled or walltime limit"
    elif [[ ${err} -ge 128 ]]; then
        signal=$((err - 128))
        echo "ERROR TYPE: Terminated by signal ${signal}"
    elif [[ ${err} -eq 1 ]]; then
        echo "ERROR TYPE: General script error"
        echo "RECOMMENDATION: Check script logs for details"
    else
        echo "ERROR TYPE: Script-specific error code ${err}"
    fi

    # Display last 50 lines of output if available
    if [[ -s "${pgmout}" ]]; then
        echo "----------------------------------------------------------------"
        echo "Last 50 lines of output:"
        echo "----------------------------------------------------------------"
        tail -50 "${pgmout}"
    fi
    echo "================================================================"

    # Call error exit with context
    err_exit "Forecast execution failed with error code ${err}"
fi

# Log successful completion
echo "================================================================"
echo "Script completed successfully: ${EXSCRIPT}"
echo "End time: $(date -u '+%Y-%m-%d %H:%M:%S UTC')"
echo "================================================================"
Impact Assessment
- Severity: HIGH
- Files Affected: ~89 job scripts in jobs/, ~83 execution scripts in scripts/
- Production Risk: Difficult debugging, extended Mean Time To Repair (MTTR), operator frustration
- Estimated Fix Effort: 3-4 weeks (requires testing each modified script)
3. MISSING ENVIRONMENT VARIABLE VALIDATION ⚠️ HIGH
Finding
Scripts use critical environment variables without validation, using default values that can lead to silent failures or incorrect behavior.
Example: jobs/JGLOBAL_FORECAST (lines 3-11)
if ((10#${ENSMEM:-0} > 0)); then
    export DATAjob="${DATAROOT}/${RUN}efcs${ENSMEM}.${PDY:-}${cyc}"
    export DATA="${DATAjob}/${jobid}"
    source "${HOMEgfs}/ush/jjob_header.sh" -e "efcs" -c "base fcst efcs"
else
    export DATAjob="${DATAROOT}/${RUN}fcst.${PDY:-}${cyc}"
    export DATA="${DATAjob}/${jobid}"
    source "${HOMEgfs}/ush/jjob_header.sh" -e "fcst" -c "base fcst"
fi
Example: scripts/exglobal_atmos_analysis.sh (lines 22-28)
# Base variables
rCDUMP=${rCDUMP:-"gdas"}
GDUMP=${GDUMP:-"gdas"}
# Derived base variables
# shellcheck disable=SC2153
GDATE=$(date --utc +%Y%m%d%H -d "${PDY} ${cyc} - ${assim_freq} hours")
BDATE=$(date --utc +%Y%m%d%H -d "${PDY} ${cyc} - 3 hours")
EE2 Violations
- ❌ ${PDY:-} defaults to empty string instead of failing fast
- ❌ Critical variables (DATAROOT, RUN, jobid, cyc, HOMEgfs) used without validation
- ❌ No check if required directories exist
- ❌ Date calculations can fail silently with malformed inputs
- ❌ assim_freq used without validation in date arithmetic
- ❌ Silent path construction errors lead to wrong directory usage
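For simple cases, bash's built-in `${VAR:?message}` expansion is a lighter-weight fail-fast alternative to the `${VAR:-}` empty-default pattern: it aborts the (sub)shell with a diagnostic on stderr whenever the variable is unset or empty, instead of silently defaulting. A minimal standalone contrast (the paths are illustrative, not actual workflow locations):

```shell
#!/usr/bin/env bash
# With the :- default, an unset PDY silently produces a malformed path:
unset PDY
cyc="00"
bad_path="/lfs/data/gfsfcst.${PDY:-}${cyc}"
echo "silent default: ${bad_path}"   # the date portion is simply missing

# With :?, expansion aborts with a diagnostic instead. Demonstrated in a
# subshell so this script survives to report the captured exit code:
rc=0
( good_path="/lfs/data/gfsfcst.${PDY:?PDY must be set (YYYYMMDD)}${cyc}" ) 2>/dev/null || rc=$?
echo "fail-fast exit code: ${rc}"    # non-zero: the path was never constructed
```

The `:?` form cannot check formats (that still needs a validation function), but it guarantees no path is ever built from an empty date.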
Recommended Fix
#!/usr/bin/env bash
#======================================================================
# Environment Variable Validation
#======================================================================

# Function to validate required environment variables
validate_required_vars() {
    local missing_vars=()
    local invalid_vars=()

    # Check required variables are set and non-empty
    local required_vars=(
        "HOMEgfs"
        "DATAROOT"
        "RUN"
        "PDY"
        "cyc"
        "jobid"
        "assim_freq"
    )

    for var in "${required_vars[@]}"; do
        if [[ -z "${!var:-}" ]]; then
            missing_vars+=("${var}")
        fi
    done

    # Report missing variables
    if [[ ${#missing_vars[@]} -gt 0 ]]; then
        echo "================================================================"
        echo "FATAL ERROR: Required environment variables not set"
        echo "----------------------------------------------------------------"
        printf "  %s\n" "${missing_vars[@]}"
        echo "================================================================"
        exit 1
    fi

    # Validate date format (PDY must be YYYYMMDD)
    if [[ ! "${PDY}" =~ ^[0-9]{8}$ ]]; then
        invalid_vars+=("PDY=${PDY} (expected YYYYMMDD format)")
    fi

    # Validate cycle (cyc must be HH, 00-23; 10# forces decimal so "08" is not octal)
    if [[ ! "${cyc}" =~ ^[0-9]{2}$ ]] || (( 10#${cyc} > 23 )); then
        invalid_vars+=("cyc=${cyc} (expected HH format, 00-23)")
    fi

    # Validate assim_freq is a positive integer
    if [[ ! "${assim_freq}" =~ ^[0-9]+$ ]] || (( 10#${assim_freq} < 1 )); then
        invalid_vars+=("assim_freq=${assim_freq} (expected positive integer)")
    fi

    # Report invalid formats
    if [[ ${#invalid_vars[@]} -gt 0 ]]; then
        echo "================================================================"
        echo "FATAL ERROR: Invalid environment variable formats"
        echo "----------------------------------------------------------------"
        printf "  %s\n" "${invalid_vars[@]}"
        echo "================================================================"
        exit 1
    fi

    # Validate critical directories exist
    if [[ ! -d "${HOMEgfs}" ]]; then
        echo "FATAL ERROR: HOMEgfs directory does not exist: ${HOMEgfs}"
        exit 1
    fi
    if [[ ! -d "${DATAROOT}" ]]; then
        echo "FATAL ERROR: DATAROOT directory does not exist: ${DATAROOT}"
        exit 1
    fi

    # Validate date arithmetic will work
    if ! GDATE=$(date --utc +%Y%m%d%H -d "${PDY} ${cyc} - ${assim_freq} hours" 2>&1); then
        echo "FATAL ERROR: Date calculation failed"
        echo "  PDY=${PDY}, cyc=${cyc}, assim_freq=${assim_freq}"
        echo "  Error: ${GDATE}"
        exit 1
    fi

    echo "Environment variable validation: PASSED"
}

# Run validation at script start
validate_required_vars

#======================================================================
# Safe path construction
#======================================================================
# Now safely construct paths (no defaults, fail if unset)
if ((10#${ENSMEM:-0} > 0)); then
    export DATAjob="${DATAROOT}/${RUN}efcs${ENSMEM}.${PDY}${cyc}"
    export DATA="${DATAjob}/${jobid}"
    echo "Ensemble member ${ENSMEM}: DATA=${DATA}"
    source "${HOMEgfs}/ush/jjob_header.sh" -e "efcs" -c "base fcst efcs"
else
    export DATAjob="${DATAROOT}/${RUN}fcst.${PDY}${cyc}"
    export DATA="${DATAjob}/${jobid}"
    echo "Deterministic forecast: DATA=${DATA}"
    source "${HOMEgfs}/ush/jjob_header.sh" -e "fcst" -c "base fcst"
fi

# Declare derived date variables (assignment split from declare so a
# failed command substitution is not masked by declare's exit status)
GDATE=$(date --utc +%Y%m%d%H -d "${PDY} ${cyc} - ${assim_freq} hours")
declare -rx GDATE
declare -rx gPDY="${GDATE:0:8}"
declare -rx gcyc="${GDATE:8:2}"
echo "Derived dates: GDATE=${GDATE} (gPDY=${gPDY} gcyc=${gcyc})"
Impact Assessment
- Severity: HIGH
- Files Affected: All job and execution scripts (~172 files)
- Production Risk: Undefined variables cause cascading failures, data corruption, or operations on wrong directories/dates
- Estimated Fix Effort: 2-3 weeks (validation function can be centralized in preamble.sh)
4. WEAK ERROR HANDLING IN UTILITY FUNCTIONS ⚠️ MEDIUM-HIGH
Finding
Critical utility functions that all scripts depend on lack proper error handling, validation, and debugging support.
Example: ush/bash_utils.sh - declare_from_tmpl() function (lines 52-72)
for input in "$@"; do
    IFS=':' read -ra args <<< "${input}"
    local com_var="${args[0]}"
    local template
    local value
    if (( ${#args[@]} > 1 )); then
        template="${args[1]}"
    else
        template="${com_var}_TMPL"
    fi
    if [[ ! -v "${template}" ]]; then
        echo "FATAL ERROR in declare_from_tmpl: Requested template ${template} not defined!"
        exit 2
    fi
    value=$(echo "${!template}" | envsubst)
    # shellcheck disable=SC2086
    declare ${opts} "${com_var}"="${value}"
    echo "declare_from_tmpl :: ${com_var}=${value}"
done
EE2 Violations
- ❌ envsubst can fail silently if substitutions are malformed
- ❌ No validation that the resulting value is non-empty
- ❌ No check if the envsubst command is available
- ❌ Error messages go to stdout instead of stderr
- ❌ Exit code 2 is not documented or standardized
- ❌ No logging of template expansion details for debugging
- ❌ Template errors don't show calling context
Recommended Fix
function declare_from_tmpl() {
    #
    # Define variables from corresponding templates by substituting in env variables.
    #
    # Each template must already be defined. Any variables in the template are replaced
    # with their values. Undefined variables are just removed WITHOUT raising an error.
    #
    # Template can be used implicitly; however, all declared COM variables must be
    # defined as either COMIN or COMOUT, therefore the template should be explicit.
    #
    # Accepts as options `-r` and `-x`, which do the same thing as the same options in
    # `declare`. Variables are automatically marked as `-g` so the variable is visible
    # in the calling script.
    #
    # Syntax:
    #   declare_from_tmpl [-rx] $var1[:$tmpl1] [$var2[:$tmpl2]] [...]
    #
    # Options:
    #   -r: Make variable read-only (same as `declare -r`)
    #   -x: Mark variable for export (same as `declare -x`)
    #   var1, var2, etc: Variable names whose values will be generated from a template
    #                    and declared
    #   tmpl1, tmpl2, etc: Specify the template to use (default is "${var}_TMPL")
    #
    # Exit codes:
    #   0: Success
    #   1: Invalid arguments
    #   2: Template not defined
    #   3: envsubst not found
    #   4: envsubst failed
    #   5: Template expansion resulted in empty value for required variable
    #
    if [[ ${DEBUG_WORKFLOW:-"NO"} == "NO" ]]; then set +x; fi

    local opts="-g"
    local OPTIND=1
    while getopts "rx" option; do
        opts="${opts}${option}"
    done
    shift $((OPTIND-1))

    # Validate envsubst is available
    if ! command -v envsubst &> /dev/null; then
        >&2 echo "================================================================"
        >&2 echo "FATAL ERROR in declare_from_tmpl: envsubst command not found"
        >&2 echo "================================================================"
        exit 3
    fi

    for input in "$@"; do
        IFS=':' read -ra args <<< "${input}"
        local com_var="${args[0]}"
        local template
        local value
        local template_value

        # Validate we have a variable name
        if [[ -z "${com_var}" ]]; then
            >&2 echo "================================================================"
            >&2 echo "FATAL ERROR in declare_from_tmpl: Empty variable name"
            >&2 echo "  Called from: ${BASH_SOURCE[1]:-unknown}:${BASH_LINENO[0]:-unknown}"
            >&2 echo "================================================================"
            exit 1
        fi

        # Determine template name
        if (( ${#args[@]} > 1 )); then
            template="${args[1]}"
        else
            template="${com_var}_TMPL"
        fi

        # Validate template exists and is non-empty
        if [[ ! -v "${template}" ]]; then
            >&2 echo "================================================================"
            >&2 echo "FATAL ERROR in declare_from_tmpl: Template not defined"
            >&2 echo "----------------------------------------------------------------"
            >&2 echo "  Variable: ${com_var}"
            >&2 echo "  Template: ${template}"
            >&2 echo "  Called from: ${BASH_SOURCE[1]:-unknown}:${BASH_LINENO[0]:-unknown}"
            >&2 echo "----------------------------------------------------------------"
            >&2 echo "Available templates matching pattern:"
            >&2 compgen -v | grep "_TMPL$" | sed 's/^/  /'
            >&2 echo "================================================================"
            exit 2
        fi

        template_value="${!template}"
        if [[ -z "${template_value}" ]]; then
            >&2 echo "WARNING in declare_from_tmpl: Template ${template} is empty"
        fi

        # Expand template with error checking
        if ! value=$(echo "${template_value}" | envsubst 2>&1); then
            >&2 echo "================================================================"
            >&2 echo "FATAL ERROR in declare_from_tmpl: envsubst failed"
            >&2 echo "----------------------------------------------------------------"
            >&2 echo "  Variable: ${com_var}"
            >&2 echo "  Template: ${template}"
            >&2 echo "  Template value: ${template_value}"
            >&2 echo "  Error output: ${value}"
            >&2 echo "  Called from: ${BASH_SOURCE[1]:-unknown}:${BASH_LINENO[0]:-unknown}"
            >&2 echo "================================================================"
            exit 4
        fi

        # Validate result is non-empty for COMIN/COMOUT variables
        if [[ "${com_var}" =~ ^COM(IN|OUT)_ ]] && [[ -z "${value}" ]]; then
            >&2 echo "================================================================"
            >&2 echo "FATAL ERROR in declare_from_tmpl: Empty value after expansion"
            >&2 echo "----------------------------------------------------------------"
            >&2 echo "  Variable: ${com_var}"
            >&2 echo "  Template: ${template}"
            >&2 echo "  Template value: ${template_value}"
            >&2 echo "  Expanded value: (empty)"
            >&2 echo "  Called from: ${BASH_SOURCE[1]:-unknown}:${BASH_LINENO[0]:-unknown}"
            >&2 echo "----------------------------------------------------------------"
            >&2 echo "Check that all required environment variables are set:"
            >&2 echo "  RUN, YMD, HH, MEMDIR, etc."
            >&2 echo "================================================================"
            exit 5
        fi

        # Log expansion for debugging (to stdout)
        echo "declare_from_tmpl :: ${com_var}=${value}"
        if [[ "${DEBUG_WORKFLOW:-NO}" == "YES" ]]; then
            echo "  (from template ${template}=${template_value})"
        fi

        # Declare the variable with specified options
        # shellcheck disable=SC2086
        declare ${opts} "${com_var}"="${value}"
    done

    set_trace
}
Impact Assessment
- Severity: MEDIUM-HIGH
- Files Affected: ~10-15 utility scripts in ush/, impacting all scripts that use them
- Production Risk: Template expansion failures cause silent data misplacement or access to wrong COM directories
- Estimated Fix Effort: 1-2 weeks (centralized fixes with comprehensive testing)
5. INCONSISTENT SET -E USAGE AND MISSING TRAP HANDLERS ⚠️ MEDIUM
Finding
Scripts toggle set -e behavior inconsistently and don't implement proper trap handlers for cleanup on exit, error, or interrupt.
Example: ush/preamble.sh (lines 32-42)
set_strict() {
    if [[ ${STRICT:-"YES"} == "YES" ]]; then
        # Exit on error and undefined variable
        set -eu
    fi
}
Example: scripts/exglobal_atmos_products.sh (lines 95-100)
# grep returns 1 if no match is found, so temporarily turn off exit on non-zero rc
set +e
# shellcheck disable=SC2312
${WGRIB2} -d "${last}" "${tmpfile}" | grep -E -i "ugrd|ustm|uflx|u-gwd|land|maxuw"
rc=$?
set_strict
if [[ ${rc} == 0 ]]; then  # Matched the grep
Example: jobs/JGLOBAL_FORECAST (lines 131-143) - cleanup without trap
##########################################
# Remove the Temporary working directory
##########################################
cd "${DATAROOT}" || true
# do not remove DATAjob. It contains DATAoutput
if [[ "${KEEPDATA}" == "NO" ]]; then
    rm -rf "${DATA}"

    # Determine if this is the last segment
    commas="${FCST_SEGMENTS//[^,]/}"
    n_segs=${#commas}
    if ((n_segs - 1 == ${FCST_SEGMENT:-0})); then
        # Only delete temporary restarts if it is the last segment
        rm -rf "${DATArestart}"
    fi
fi
EE2 Violations
- ❌ Toggling set -e on/off is error-prone and hard to audit
- ❌ The set +e ... set_strict pattern doesn't guarantee state restoration if the script exits between them
- ❌ No trap handlers for cleanup on EXIT, ERR, INT, or TERM signals
- ❌ Temporary files not cleaned up if script exits unexpectedly
- ❌ Background processes not tracked or killed on error
- ❌ Cleanup only runs if script reaches the end normally
- ❌ Resource leaks (disk space, processes) in failure scenarios
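The key guarantee trap-based cleanup provides, and end-of-script cleanup cannot, is that an EXIT trap fires on every termination path: normal fall-through, a set -e abort, or an explicit non-zero exit. A toy demonstration (the cleanup function here is a stand-in for real resource-release logic):

```shell
#!/usr/bin/env bash
# Run a small script body in a fresh bash and show that the EXIT trap
# (our stand-in cleanup) fires regardless of how the script terminates.
demo() {
    bash -c "
        set -eu
        cleanup() { echo \"cleanup ran (exit code \$?)\"; }
        trap cleanup EXIT
        $1
    " || true
}

demo 'echo normal completion'   # trap fires after the body finishes
demo 'false'                    # set -e aborts mid-script; trap still fires
demo 'exit 7'                   # explicit non-zero exit; trap still fires
```

By contrast, cleanup code placed at the bottom of a script only runs in the first of these three scenarios, which is exactly the resource-leak gap identified above.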
Recommended Fix
Add to ush/preamble.sh:
# Global arrays to track temporary files, directories, and background jobs
declare -a CLEANUP_FILES=()
declare -a CLEANUP_DIRS=()
declare -a BACKGROUND_PIDS=()

function register_cleanup_file() {
    # Register a file for cleanup on exit
    # Usage: register_cleanup_file /path/to/file
    CLEANUP_FILES+=("$1")
}

function register_cleanup_dir() {
    # Register a directory for cleanup on exit
    # Usage: register_cleanup_dir /path/to/dir
    CLEANUP_DIRS+=("$1")
}

function register_background_job() {
    # Register a background PID for cleanup on exit
    # Usage: register_background_job $!
    BACKGROUND_PIDS+=("$1")
}
function cleanup_on_exit() {
    #
    # Cleanup function to run on script exit (normal or error).
    # Kills background jobs, removes temporary files/directories.
    #
    local exit_code=$?
    set +x  # Don't trace cleanup operations

    echo "================================================================"
    echo "Running cleanup (exit code: ${exit_code})"
    echo "================================================================"

    # Kill any tracked background jobs
    if [[ ${#BACKGROUND_PIDS[@]} -gt 0 ]]; then
        echo "Killing ${#BACKGROUND_PIDS[@]} background job(s)..."
        for pid in "${BACKGROUND_PIDS[@]}"; do
            if kill -0 "${pid}" 2>/dev/null; then
                echo "  Killing PID ${pid}"
                kill "${pid}" 2>/dev/null || true
            fi
        done
    fi

    # Kill any untracked background jobs from this script
    local untracked_jobs
    untracked_jobs=$(jobs -p)
    if [[ -n "${untracked_jobs}" ]]; then
        echo "Killing untracked background jobs..."
        echo "${untracked_jobs}" | xargs -r kill 2>/dev/null || true
    fi

    # Remove temporary files
    if [[ ${#CLEANUP_FILES[@]} -gt 0 ]]; then
        echo "Removing ${#CLEANUP_FILES[@]} temporary file(s)..."
        for file in "${CLEANUP_FILES[@]}"; do
            if [[ -f "${file}" ]]; then
                rm -f "${file}" 2>/dev/null || true
            fi
        done
    fi

    # Remove temporary directories based on KEEPDATA
    if [[ "${KEEPDATA:-NO}" == "NO" ]]; then
        if [[ ${#CLEANUP_DIRS[@]} -gt 0 ]]; then
            echo "Removing ${#CLEANUP_DIRS[@]} temporary director(ies)..."
            for dir in "${CLEANUP_DIRS[@]}"; do
                if [[ -d "${dir}" ]]; then
                    echo "  Removing ${dir}"
                    rm -rf "${dir}" 2>/dev/null || true
                fi
            done
        fi
        # Clean up DATA if defined
        if [[ -n "${DATA:-}" && -d "${DATA}" ]]; then
            echo "Removing working directory: ${DATA}"
            cd "${DATAROOT:-/tmp}" 2>/dev/null || cd /tmp
            rm -rf "${DATA}" 2>/dev/null || true
        fi
    else
        echo "KEEPDATA=YES: Preserving temporary directories"
        echo "  DATA=${DATA:-not set}"
        [[ ${#CLEANUP_DIRS[@]} -gt 0 ]] && printf "  %s\n" "${CLEANUP_DIRS[@]}"
    fi

    echo "================================================================"
    echo "Cleanup complete"
    echo "================================================================"
    return ${exit_code}
}
function error_trap() {
    #
    # Trap handler for ERR signal - provides context when errors occur
    #
    local exit_code=$?
    local line_number=$1
    set +x
    >&2 echo "================================================================"
    >&2 echo "ERROR TRAPPED"
    >&2 echo "----------------------------------------------------------------"
    >&2 echo "Exit code: ${exit_code}"
    >&2 echo "Line number: ${line_number}"
    >&2 echo "Script: ${BASH_SOURCE[1]:-unknown}"
    >&2 echo "Function: ${FUNCNAME[1]:-main}"
    >&2 echo "Command: ${BASH_COMMAND}"
    >&2 echo "================================================================"
    # Export error for err_exit
    export err=${exit_code}
}
# Set trap handlers early (before set -e takes effect)
trap cleanup_on_exit EXIT
trap 'error_trap ${LINENO}' ERR
trap 'echo "Interrupt received, cleaning up..."; exit 130' INT
trap 'echo "Termination signal received, cleaning up..."; exit 143' TERM
# Export cleanup functions for use in scripts
declare -fx register_cleanup_file
declare -fx register_cleanup_dir
declare -fx register_background_job
declare -fx cleanup_on_exit
Usage in scripts:
# Register working directory for cleanup
register_cleanup_dir "${DATA}"
register_cleanup_dir "${DATArestart}"

# Register temporary files
tmpfile="${DATA}/tmpfile_$$"
register_cleanup_file "${tmpfile}"

# Register background jobs
./long_running_process.sh &
register_background_job $!

# For operations that might fail (instead of set +e)
if ${WGRIB2} -d "${last}" "${tmpfile}" | grep -E -i "pattern"; then
    # Pattern found
    last=$((last + 1))
fi

# Or explicitly check the exit code (initialize rc so a prior value cannot leak in)
rc=0
${WGRIB2} -d "${last}" "${tmpfile}" | grep -E -i "pattern" || rc=$?
if [[ ${rc} -eq 0 ]]; then
    # Pattern found
    last=$((last + 1))
fi
Impact Assessment
- Severity: MEDIUM
- Files Affected: All scripts, but changes can be centralized in preamble.sh
- Production Risk: Incomplete cleanup fills disk space, orphaned processes waste resources, inconsistent error behavior complicates debugging
- Estimated Fix Effort: 1-2 weeks (infrastructure changes with selective script updates)
Implementation Priority and Timeline
Phase 1: Foundation (Weeks 1-3)
Priority: CRITICAL
1. Python error handling - Highest production risk
- Create template for exception handling
- Apply to top 10 most-used scripts
- Expand to all Python scripts
- Deliverable: All Python scripts with comprehensive error handling
Phase 2: Environment Validation (Weeks 4-6)
Priority: HIGH
2. Environment variable validation - Prevents cascading failures
- Add validation function to preamble.sh
- Update job scripts to call validation
- Add validation to execution scripts
- Deliverable: Centralized validation with 100% script coverage
Phase 3: Error Reporting (Weeks 7-9)
Priority: HIGH
3. Shell error exit handling - Improves debuggability
- Create enhanced error reporting functions
- Update top 20 most-critical job scripts
- Expand to all scripts progressively
- Deliverable: Consistent error reporting across all scripts
Phase 4: Utilities (Weeks 10-11)
Priority: MEDIUM-HIGH
4. Utility function error handling - Foundation fixes
- Fix declare_from_tmpl() and other core functions
- Add comprehensive error checking
- Add debug mode support
- Deliverable: Robust utility library
Phase 5: Cleanup Infrastructure (Weeks 12-14)
Priority: MEDIUM
5. Trap handlers and cleanup - Quality of life
- Implement cleanup infrastructure in preamble.sh
- Update scripts to register resources
- Test cleanup in failure scenarios
- Deliverable: Automatic resource cleanup
Testing Strategy
Unit Testing
- Test each modified script in isolation
- Verify error handling with intentional failures
- Validate cleanup runs in all exit scenarios
Integration Testing
- Run full workflow cycles with modified scripts
- Test interaction between jobs and execution scripts
- Verify error propagation through workflow
Failure Testing
- Kill jobs at various stages
- Introduce resource exhaustion scenarios
- Test signal handling (INT, TERM, KILL)
- Verify cleanup happens correctly
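Signal-handling checks like these can be scripted rather than performed by hand: start an instrumented stand-in job in the background, deliver SIGTERM, then assert on the exit code and on evidence that cleanup ran. A sketch of such a harness (the marker-file approach and the stand-in job are illustrative, not actual workflow components):

```shell
#!/usr/bin/env bash
# Harness: verify that a TERM-trapped "job" cleans up and exits 143.
marker=$(mktemp -u)   # path the job creates from its cleanup trap

# Stand-in job: installs a TERM trap, then idles.
bash -c "
    trap 'touch ${marker}; exit 143' TERM
    sleep 5 &
    wait \$!
" &
job_pid=$!

sleep 1                         # give the job time to install its trap
kill -TERM "${job_pid}"

rc=0
wait "${job_pid}" || rc=$?
echo "job exit code: ${rc}"     # expect 143 (128 + SIGTERM)

if [[ -f "${marker}" ]]; then
    echo "cleanup evidence found"
fi
rm -f "${marker}"
```

The same shape works for SIGINT (expect 130); SIGKILL cannot be trapped, so the corresponding check is that no cleanup evidence appears and the exit code is 137.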
Operational Testing
- Run parallel to production for 1-2 weeks
- Compare error messages and debugging time
- Gather operator feedback
Success Metrics
Quantitative
- Zero unhandled exceptions in Python scripts
- 100% environment variable validation before use
- <5 minute MTTR reduction from improved error messages
- Zero resource leaks (disk, processes) after failures
- 100% cleanup execution on abnormal termination
Qualitative
- Operators can diagnose failures without developer assistance
- Error messages provide actionable guidance
- Log files contain sufficient context for root cause analysis
- Scripts fail fast with clear error messages
- Development/testing cycles faster due to better error reporting
Appendix: EE2 Compliance Standards Reference
Key EE2 Requirements
- Error Handling: All errors must be trapped and handled explicitly
- Exit Codes: Use standardized exit codes (0=success, 1-99=specific errors, 100+=signals)
- Logging: All errors must be logged to stderr with timestamps and context
- Validation: All inputs and environment variables must be validated before use
- Cleanup: All resources must be cleaned up on normal exit and error conditions
- Traceability: All operations must be traceable through log files
- Graceful Degradation: Scripts must fail gracefully with clear error messages
Standard Exit Codes
- 0: Success
- 1: General error
- 2: Configuration error
- 3: Missing dependency/command
- 4: I/O error (file not found, permission denied)
- 5: Validation error (invalid data)
- 99: Unexpected error
- 130: Interrupted (SIGINT)
- 137: Killed (SIGKILL)
- 143: Terminated (SIGTERM)
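To keep this mapping from being re-implemented ad hoc in each script, it can live in a single helper sourced from the preamble. The function below is an illustrative sketch of such a helper, not an existing workflow utility:

```shell
#!/usr/bin/env bash
# Map a numeric exit code to its standardized EE2 category.
classify_exit_code() {
    local code=$1
    case "${code}" in
        0)   echo "success" ;;
        1)   echo "general error" ;;
        2)   echo "configuration error" ;;
        3)   echo "missing dependency/command" ;;
        4)   echo "I/O error" ;;
        5)   echo "validation error" ;;
        99)  echo "unexpected error" ;;
        130) echo "interrupted (SIGINT)" ;;
        137) echo "killed (SIGKILL)" ;;
        143) echo "terminated (SIGTERM)" ;;
        *)
            if (( code > 128 )); then
                echo "terminated by signal $(( code - 128 ))"
            else
                echo "script-specific error ${code}"
            fi
            ;;
    esac
}

classify_exit_code 137    # killed (SIGKILL)
classify_exit_code 42     # script-specific error 42
```

Job scripts could then call it when logging a failure, so operators always see the same classification wording regardless of which script failed.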
Document Metadata
- Author: AI Analysis System
- Date: November 3, 2025
- Repository: NOAA-EMC/global-workflow
- Branch: Analyzed fork
- Analysis Tools: MCP RAG System, Static Analysis
- Files Analyzed: 255+ (jobs, scripts, utilities)
- Review Status: Initial Draft
- Next Review: TBD
Contact and Questions
For questions about this analysis or implementation guidance, contact:
- Global Workflow Team: [email protected]
- NCO EE2 Compliance: [email protected]
END OF DOCUMENT