Rocoto --dryrun Mode: Restore and Harden PR

Branch: feature/dryrun_nodaemon → christopherwharrop/rocoto:develop
Author: Terry McGuinness
Date: March 27, 2026
Files Changed: 17 files, +131 / -30


Summary

This PR restores the --dryrun / -n flag (originally merged as PR #117, reverted in #119) with comprehensive fixes for the daemon interaction bugs that caused the revert.

The original implementation added the flag and batch system guards but did not account for Rocoto's DRb server architecture — when BatchQueueServer=true (the default), rocotorun forks separate rocotobqserver, rocotodbserver, and rocotoioserver daemon processes. In dryrun mode these daemons are unnecessary and their launch/interaction caused NameError crashes, thread pool deadlocks, and DRb connection failures.

This branch re-lands the original dryrun work and adds 14 targeted fixes across the daemon proxy layer, the workflow engine, and the server entry point to make --dryrun mode safe regardless of configuration.


What --dryrun Does

When rocotorun -n or rocotorun --dryrun is invoked, Rocoto processes the workflow XML, evaluates task dependencies, and logs which jobs would be submitted — but never calls sbatch, qsub, bsub, etc. No jobs are submitted to any batch scheduler. This is essential for:

  • CI/CD validation — verify workflow XML parses and dependency logic is correct without consuming HPC allocation
  • Pre-flight checks — operators can confirm task ordering before committing to a production run
  • Debugging — trace Rocoto's decision-making without side effects

Changes by Category

1. Central dryrun_mode? Helper (utilities.rb)

  • Added WorkflowMgr.dryrun_mode? class method that checks the DRYRUN constant
  • Defensive const_defined?(:DRYRUN) guard so any execution context (including forked daemons) defaults to false instead of crashing with NameError

    # True only when DRYRUN is defined and positive; an undefined constant means "not dryrun".
    def self.dryrun_mode?
      return false unless const_defined?(:DRYRUN)
      DRYRUN > 0
    end
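The effect of the guard can be seen in a small standalone sketch. The module name WorkflowMgrSketch and the unguarded_dryrun_mode? method are hypothetical, introduced only for contrast; neither is part of Rocoto.

```ruby
# Standalone sketch; WorkflowMgrSketch is a hypothetical stand-in for WorkflowMgr.
module WorkflowMgrSketch
  # Guarded helper, same shape as the one added in utilities.rb.
  def self.dryrun_mode?
    return false unless const_defined?(:DRYRUN)
    DRYRUN > 0
  end

  # Hypothetical unguarded variant, shown only to illustrate the failure mode.
  def self.unguarded_dryrun_mode?
    DRYRUN > 0
  end
end

# The constant has not been defined yet, as in a daemon that never parsed CLI options.
puts WorkflowMgrSketch.dryrun_mode?   # prints false

begin
  WorkflowMgrSketch.unguarded_dryrun_mode?
rescue NameError => e
  puts "unguarded check failed: #{e.class}"   # prints "unguarded check failed: NameError"
end

# Once the option parser defines the constant, the helper reflects it.
WorkflowMgrSketch.const_set("DRYRUN", 1)
puts WorkflowMgrSketch.dryrun_mode?   # prints true
```

The guarded form trades a hard failure for a safe default: code that never saw the flag behaves as a normal, non-dryrun run.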

2. CLI Option Parsing (workflowoption.rb, reportoption.rb, workflowsubsetoptions.rb)

  • Added -n / --dryrun option to rocotorun, rocotoboot, rocotorewind, rocotostat, and subset commands
  • Sets WorkflowMgr.const_set("DRYRUN", 1) when flag is present, 0 when absent
  • Updated usage banner to include [-n]
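A minimal sketch of how such a flag could be wired up with Ruby's standard OptionParser. Whether the Rocoto option classes are structured this way is an assumption; the OptionSketch module, its banner text, and the help string are all hypothetical.

```ruby
require 'optparse'

# Hypothetical stand-in for the WorkflowMgr module.
module OptionSketch
  # Parse argv and define DRYRUN as 1 when -n/--dryrun is present, 0 otherwise.
  def self.parse(argv)
    dryrun = 0
    parser = OptionParser.new do |opts|
      opts.banner = "Usage: sketch [-n]"   # illustrative banner, not Rocoto's
      opts.on("-n", "--dryrun", "Process the workflow without submitting jobs") do
        dryrun = 1
      end
    end
    parser.parse(argv)

    # Sketch-only convenience so parse can be called repeatedly without warnings.
    remove_const(:DRYRUN) if const_defined?(:DRYRUN, false)
    const_set("DRYRUN", dryrun)
    dryrun
  end
end

OptionSketch.parse(["--dryrun"])
puts OptionSketch::DRYRUN   # prints 1
```

Defining the constant in both branches (1 or 0) means later code in the same process can read it without a defined-check.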

3. Daemon Launch Suppression (bqsproxy.rb, dbproxy.rb, workflowioproxy.rb)

This is the core fix for the bugs that caused the original revert:

| Proxy | Guard Added | Effect in Dryrun |
| --- | --- | --- |
| bqsproxy.rb | && !WorkflowMgr.dryrun_mode? | Skip forking rocotobqserver; use in-process BQS object |
| dbproxy.rb | && !WorkflowMgr.dryrun_mode? | Skip forking rocotodbserver; use in-process database object |
| workflowioproxy.rb | && !WorkflowMgr.dryrun_mode? | Skip forking rocotoioserver; use in-process I/O object |

Without these guards, dryrun mode forked daemon processes that either crashed (missing DRYRUN constant) or created unnecessary DRb servers.
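The shape of the proxy guard can be sketched as follows. All class names here are hypothetical placeholders, and ForkedServerBackend only marks the branch where a real proxy would launch a daemon; nothing is actually forked.

```ruby
# Hypothetical placeholders; neither backend class exists in Rocoto.
class InProcessBackend
  def kind
    :in_process
  end
end

class ForkedServerBackend
  def kind
    :forked_daemon
  end
end

class ProxySketch
  attr_reader :backend

  # use_server mirrors a config flag such as BatchQueueServer;
  # dryrun mirrors WorkflowMgr.dryrun_mode?.
  def initialize(use_server:, dryrun:)
    @backend =
      if use_server && !dryrun
        ForkedServerBackend.new   # normal run: talk to a separate daemon
      else
        InProcessBackend.new      # dryrun (or servers disabled): stay in-process
      end
  end
end

puts ProxySketch.new(use_server: true, dryrun: true).backend.kind   # prints in_process
```

Because the dryrun check is ANDed with the existing condition, non-dryrun behavior is unchanged for every configuration.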

4. Daemon Constant for Non-Dryrun Mode (sbin/rocotoserver)

  • Set WorkflowMgr.const_set("DRYRUN", 0) in the shared server entry point used by rocotobqserver, rocotodbserver, and rocotoioserver (via symlinks)
  • Ensures dryrun_mode? works inside daemon processes when they are launched (non-dryrun mode)

5. Workflow Engine Dryrun Guards (workflowengine.rb — 10 sites)

| Code Path | Guard Added | Why |
| --- | --- | --- |
| @bqServer.__drburi (3 sites) | && !WorkflowMgr.dryrun_mode? | No DRb server exists; use 0 as placeholder jobid |
| @dbServer.add_cycles | unless dryrun_mode? | No database writes in dryrun |
| @dbServer.update_cycles | unless dryrun_mode? | No database writes in dryrun |
| @dbServer.add_jobs | unless dryrun_mode? | No database writes in dryrun |
| @dbServer.delete_jobs | unless dryrun_mode? | No database writes in dryrun |
| @dbServer.update_jobs | unless dryrun_mode? | No database writes in dryrun |
| @dbServer.add_bqservers | && !WorkflowMgr.dryrun_mode? | No BQS URI to register |
| Thread.list.each { t.join } (2 sites) | \|\| WorkflowMgr.dryrun_mode? | Thread pool workers sleep indefinitely in dryrun, causing deadlock |
| harvest_pending_jobids | !dryrun_mode? && before __drburi | Skip BQS server filter when no server exists |
| ensure block server shutdown (3 sites) | \|\| WorkflowMgr.dryrun_mode? | No servers to shut down |

Dryrun-specific log messages added: "Dryrun: would submit..." and "would be booted (dryrun mode)".
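The "unless dryrun_mode?" pattern on database writes can be illustrated with an in-memory stand-in. RecordingDB, EngineSketch, and the log wording below are hypothetical and are not taken from workflowengine.rb.

```ruby
# Hypothetical in-memory stand-in for the database server object.
class RecordingDB
  attr_reader :writes

  def initialize
    @writes = []
  end

  def add_jobs(jobs)
    @writes << [:add_jobs, jobs]
  end
end

class EngineSketch
  attr_reader :log

  def initialize(db:, dryrun:)
    @db = db
    @dryrun = dryrun
    @log = []
  end

  # Log what would happen, and persist the job only on a real run.
  def record_submission(task)
    if @dryrun
      @log << "dryrun: #{task} would be submitted"   # illustrative wording
    else
      @log << "submitted #{task}"
    end
    @db.add_jobs([task]) unless @dryrun
  end
end

db = RecordingDB.new
EngineSketch.new(db: db, dryrun: true).record_submission("fcst")
puts db.writes.empty?   # prints true
```

Keeping the log line outside the guard is what makes dryrun useful: the decision is still reported even though no state changes.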

6. Workflow Report Thread Guard (workflowreport.rb)

  • Same thread-join deadlock fix as the engine — skip Thread.list.each { t.join } in dryrun mode
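A simplified sketch of why the join must be skipped: a worker that sleeps indefinitely never finishes, so joining it blocks the caller forever. The join_workers helper is hypothetical and much simpler than the actual guard on Thread.list.

```ruby
# Join worker threads, unless dryrun mode is active.
# In dryrun the workers are assumed to be idle and sleeping, so joining would hang.
def join_workers(threads, dryrun:)
  return :skipped if dryrun
  threads.each(&:join)
  :joined
end

idle_worker = Thread.new { sleep }              # sleeps until explicitly woken or killed
puts join_workers([idle_worker], dryrun: true)  # prints skipped
idle_worker.kill                                # sketch-only cleanup
```

With dryrun false and the same idle worker, the call would not return, which is the hang described above.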

7. Batch System Submit Guards (7 Schedulers)

Each batch system's submit() method now checks dryrun_mode? and returns output="This is a dryrun" instead of executing the shell command:

| Scheduler | File | Command Guarded |
| --- | --- | --- |
| Slurm | slurmbatchsystem.rb | sbatch |
| PBS Pro | pbsprobatchsystem.rb | qsub |
| Torque | torquebatchsystem.rb | qsub |
| Moab | moabbatchsystem.rb | msub |
| LSF | lsfbatchsystem.rb | bsub |
| LSF Cray | lsfcraybatchsystem.rb | bsub |
| Cobalt | cobaltbatchsystem.rb | qsub |
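A minimal sketch of the submit guard, assuming a Slurm-style scheduler. SchedulerSketch, its injectable runner, and the plain-string return value are assumptions made for illustration; the real submit() methods build full command lines and may return a different shape.

```ruby
class SchedulerSketch
  # runner executes a shell command and returns its combined output.
  # It is injectable so the sketch can be exercised without a scheduler.
  def initialize(dryrun:, runner: ->(cmd) { %x(#{cmd} 2>&1) })
    @dryrun = dryrun
    @runner = runner
  end

  # Return the dryrun marker instead of invoking sbatch when dryrun is active.
  def submit(script)
    return "This is a dryrun" if @dryrun
    @runner.call("sbatch #{script}")
  end
end

puts SchedulerSketch.new(dryrun: true).submit("job.sh")   # prints "This is a dryrun"
```

Returning early, before the command is run, is what guarantees that no scheduler call can happen in dryrun mode.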

Files Changed

lib/workflowmgr/utilities.rb             +8
lib/workflowmgr/workflowoption.rb       +10
lib/workflowmgr/reportoption.rb         +11
lib/workflowmgr/workflowsubsetoptions.rb +1/-1
lib/workflowmgr/bqsproxy.rb             +2/-1
lib/workflowmgr/dbproxy.rb              +2/-1
lib/workflowmgr/workflowioproxy.rb       +2/-1
lib/workflowmgr/workflowengine.rb       +52/-19
lib/workflowmgr/workflowreport.rb       +4/-1
lib/workflowmgr/slurmbatchsystem.rb     +5/-1
lib/workflowmgr/pbsprobatchsystem.rb    +5/-1
lib/workflowmgr/torquebatchsystem.rb    +5/-1
lib/workflowmgr/moabbatchsystem.rb      +5/-1
lib/workflowmgr/lsfbatchsystem.rb       +5/-1
lib/workflowmgr/lsfcraybatchsystem.rb   +5/-1
lib/workflowmgr/cobaltbatchsystem.rb    +5/-1
sbin/rocotoserver                        +3

Testing

This feature branch is deployed and actively running across the NOAA Global Workflow GitLab CI/CD pipeline infrastructure on 5 RDHPCS platforms, exercised nightly by automated pipelines and on-demand by PR-triggered builds. The custom Rocoto build from feature/dryrun_nodaemon is injected into the pipeline PATH via the GFS_CI_ROCOTO_PATH environment variable, replacing the system-installed Rocoto for all rocotorun, rocotostat, and rocotocheck invocations.

RDHPCS Platform Deployment

| Platform | Scheduler | Custom Build Path | Status |
| --- | --- | --- | --- |
| Hera | Slurm | ${GFS_CI_UTIL_PATH}/src/rocoto-1.3.7-dryrun_nodaemon/bin | Active |
| Orion | Slurm | ${GFS_CI_UTIL_PATH}/src/rocoto_dryrun_nodaemon/bin | Active |
| Hercules | Slurm | ${GFS_CI_UTIL_PATH}/src/rocoto_dryrun_nodaemon/bin | Active |
| Gaea C6 | Slurm | ${GFS_CI_UTIL_PATH}/rocoto_dryrun_nodaemon/bin | Active |
| Ursa | Slurm | ${GFS_CI_UTIL_PATH}/src/rocoto-1.3.7-dryrun_nodaemon/bin | Active |
| WCOSS2 | PBS | (commented out; unset GFS_CI_ROCOTO_PATH) | Disabled |

Each platform's configuration lives in global-workflow/dev/ci/platforms/config.<platform> and exports GFS_CI_ROCOTO_PATH pointing to the feature branch build.

GitLab CI/CD Pipeline Integration

The custom Rocoto build is injected at three pipeline entry points via PATH prepend:

  1. gitlab-ci-cases.yml (experiment setup and execution stage), which sources the platform config and then runs:

    if [[ -n "${GFS_CI_ROCOTO_PATH:-}" && -d "${GFS_CI_ROCOTO_PATH}" ]]; then
      export PATH="${GFS_CI_ROCOTO_PATH}:${PATH}"
      echo "Using custom Rocoto build at: $(which rocotorun 2>/dev/null || echo 'NOT FOUND')"
    fi

  2. gitlab-ci-ctests.yml (CTest execution stage), which applies the same injection in both the setup and run phases

  3. run_check_ci.sh / run_check_gitlab_ci.sh (experiment execution scripts), which invoke the Rocoto toolchain:

    • rocotorun -v ${ROCOTO_VERBOSE:-0} -w <xml> -d <db> (launch/advance workflows)
    • rocotostat.py -w <xml> -d <db> (monitor workflow state and job counts)
    • rocotostat -d <db> -w <xml> (extract FAIL/DEAD job details)
    • rocotocheck -d <db> -w <xml> (retrieve error logs for failed tasks)

rocotostat.py — Workflow Status Monitor

The CI pipeline relies on global-workflow/dev/ci/scripts/utils/rocotostat.py to determine workflow completion, detect stalls, and count failures. This Python utility wraps rocotostat --summary and rocotostat --all with built-in retry logic (telescoping delays):

| Operation | Max Attempts | Delay Between |
| --- | --- | --- |
| rocotostat --summary | 3 | 120 sec |
| rocotostat --all (job counts) | 4 | 120 sec |
| rocotorun retry on stall | 2 | 120 sec |

The utility returns the workflow state as one of DONE, FAIL, RUNNING, STALLED, UNAVAILABLE, or UNKNOWN, along with per-state job counts (SUCCEEDED, FAIL, DEAD, RUNNING, SUBMITTING, QUEUED). The --thread-logging flag monitors process thread counts during execution to detect daemon thread leaks, which is directly relevant to the thread pool deadlock fix in this branch. Unit tests are in dev/ci/scripts/unittests/test_rocotostat.py.
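The bounded-retry behavior summarized in the table can be sketched generically. rocotostat.py itself is Python; the Ruby helper below is a hypothetical illustration written to match the other examples on this page, and it uses a fixed delay per the table rather than reproducing the utility's actual delay schedule.

```ruby
# Run the block up to max_attempts times, pausing `delay` seconds between failures.
# sleeper is injectable so the sketch can be exercised without real waiting.
def with_retries(max_attempts:, delay:, sleeper: ->(seconds) { sleep(seconds) })
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue StandardError
    raise if attempt >= max_attempts   # out of attempts: surface the last error
    sleeper.call(delay)
    retry
  end
end

# Example with the "rocotostat --summary" row: 3 attempts, 120 seconds apart.
waits = []
result = with_retries(max_attempts: 3, delay: 120, sleeper: ->(s) { waits << s }) do |attempt|
  raise "transient failure" if attempt < 3
  "summary ok"
end
puts result          # prints "summary ok"
puts waits.inspect   # prints [120, 120]
```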

Test Case Coverage Across Platforms

The nightly and PR-triggered pipelines execute a matrix of Global Workflow experiment cases per platform:

| Platform | Test Cases | Count |
| --- | --- | --- |
| Hera | C48_ATM, C48_S2SW, C48_S2SWA_gefs, C48mx500_3DVarAOWCDA, C48mx500_hybAOWCDA, C96C48_hybatmDA, C96C48_hybatmsnowDA, C96C48_hybatmsoilDA, C96C48_ufsgsi_hybatmDA, C96C48_ufs_hybatmDA, C96C48mx500_S2SW_cyc_gfs, C96_atm3DVar, C96_gcafs_cycled, C96_gcafs_cycled_noDA, C96mx100_S2S, C48_gsienkf_atmDA, C48_ufsenkf_atmDA | 17 |
| Ursa | Same as Hera | 17 |
| Gaea C6 | Hera set minus DA variants | 15 |
| Hercules | Reduced DA set | 10 |
| Orion | Minimal set | 8 |

Each case exercises the full Rocoto lifecycle: XML parsing → dependency evaluation → job submission (rocotorun) → status polling (rocotostat.py) → error extraction (rocotocheck) → completion/failure determination. The feature branch build handles all of these operations across all active platforms.


Relationship to PR #117 / #119

| PR | Commit | Status | Notes |
| --- | --- | --- | --- |
| #117 | 438143b | Merged → Reverted | Original --dryrun implementation |
| #119 | 79304a1 | Merged | Reverted #117 due to daemon interaction issues |
| This PR | abba59a (HEAD) | Ready | Re-applies #117 with all daemon-layer fixes |

Why PR #117 Was Reverted

The original PR missed a critical architectural detail: Rocoto's BatchQueueServer, DatabaseServer, and WorkflowIOServer config flags cause rocotorun to fork child daemon processes via WorkflowMgr.launchServer(). These daemons run sbin/rocotoserver (shared via symlinks) which never set the DRYRUN constant. When submit() inside the daemon checked WorkflowMgr.dryrun_mode?, it hit an uninitialized constant NameError, silently killing the thread pool task and leaving jobs stuck at PENDING forever.

What This PR Adds Beyond #117

  1. Daemon suppression: Don't fork DRb servers at all in dryrun mode (3 proxy files)
  2. Constant initialization: Set DRYRUN=0 in rocotoserver for non-dryrun daemon processes
  3. Defensive helper: const_defined? guard prevents NameError in any context
  4. Engine hardening: 10 guard sites in workflowengine.rb to skip DB writes, DRb URI access, and thread joins
  5. Deadlock prevention: Thread pool workers sleep forever in dryrun; skip Thread.join to avoid hang