Rocoto Dryrun Mode PR Restore and Harden
Branch: feature/dryrun_nodaemon → christopherwharrop/rocoto:develop
Author: Terry McGuinness
Date: March 27, 2026
Files Changed: 17 files, +131 / -30
This PR restores the --dryrun / -n flag (originally merged as PR #117, reverted in #119) with comprehensive fixes for the daemon interaction bugs that caused the revert.
The original implementation added the flag and batch system guards but did not account for Rocoto's DRb server architecture — when BatchQueueServer=true (the default), rocotorun forks separate rocotobqserver, rocotodbserver, and rocotoioserver daemon processes. In dryrun mode these daemons are unnecessary and their launch/interaction caused NameError crashes, thread pool deadlocks, and DRb connection failures.
This branch re-lands the original dryrun work and adds 14 targeted fixes across the daemon proxy layer, the workflow engine, and the server entry point to make --dryrun mode safe regardless of configuration.
When rocotorun -n or rocotorun --dryrun is invoked, Rocoto processes the workflow XML, evaluates task dependencies, and logs which jobs would be submitted — but never calls sbatch, qsub, bsub, etc. No jobs are submitted to any batch scheduler. This is essential for:
- CI/CD validation — verify workflow XML parses and dependency logic is correct without consuming HPC allocation
- Pre-flight checks — operators can confirm task ordering before committing to a production run
- Debugging — trace Rocoto's decision-making without side effects
- Added `WorkflowMgr.dryrun_mode?` class method that checks the `DRYRUN` constant
- Defensive `const_defined?(:DRYRUN)` guard so any execution context (including forked daemons) defaults to `false` instead of crashing with `NameError`

```ruby
def self.dryrun_mode?
  return false unless const_defined?(:DRYRUN)
  DRYRUN > 0
end
```

- Added `-n`/`--dryrun` option to `rocotorun`, `rocotoboot`, `rocotorewind`, `rocotostat`, and subset commands
- Sets `WorkflowMgr.const_set("DRYRUN", 1)` when flag is present, `0` when absent
- Updated usage banner to include `[-n]`
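A minimal sketch of how the flag and the helper could fit together, assuming a standard `OptionParser` setup. `DemoWorkflowMgr` and `parse_dryrun_flag` are hypothetical names used only for illustration; Rocoto's actual option classes may be organized differently.

```ruby
require 'optparse'

# Hypothetical stand-in for the WorkflowMgr module; not Rocoto's actual code.
module DemoWorkflowMgr
  # Mirrors the helper shown above: an undefined constant means "not a dryrun".
  def self.dryrun_mode?
    return false unless const_defined?(:DRYRUN)
    DRYRUN > 0
  end
end

# Returns 1 when -n/--dryrun appears in argv, 0 otherwise.
def parse_dryrun_flag(argv)
  dryrun = 0
  parser = OptionParser.new do |opts|
    opts.banner = "Usage: demo [-n]"
    opts.on("-n", "--dryrun", "Report what would be submitted; submit nothing") do
      dryrun = 1
    end
  end
  parser.parse(argv)
  dryrun
end

# A command-line entry point would set the constant exactly once at startup:
#   DemoWorkflowMgr.const_set("DRYRUN", parse_dryrun_flag(ARGV))
```

Because the helper checks `const_defined?` first, calling it before the constant is set simply yields `false` rather than raising.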
This is the core fix for the original revert:
| Proxy | Guard Added | Effect in Dryrun |
|---|---|---|
| `bqsproxy.rb` | `&& !WorkflowMgr.dryrun_mode?` | Skip forking `rocotobqserver`; use in-process BQS object |
| `dbproxy.rb` | `&& !WorkflowMgr.dryrun_mode?` | Skip forking `rocotodbserver`; use in-process database object |
| `workflowioproxy.rb` | `&& !WorkflowMgr.dryrun_mode?` | Skip forking `rocotoioserver`; use in-process I/O object |
Without these guards, dryrun mode forked daemon processes that either crashed (missing DRYRUN constant) or created unnecessary DRb servers.
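The shape of the proxy guard can be sketched as follows. This is an illustration under assumptions, not the code in the three proxy files: `ProxyDemoMgr` and `DemoServerProxy` are hypothetical, and the symbols stand in for the real fork-and-connect and in-process construction paths.

```ruby
# Hypothetical stand-ins; class and module names are illustrative only.
module ProxyDemoMgr
  def self.dryrun_mode?
    const_defined?(:DRYRUN) && DRYRUN > 0
  end
end

class DemoServerProxy
  attr_reader :backend

  # use_server plays the role of a config flag such as BatchQueueServer=true.
  def initialize(use_server)
    if use_server && !ProxyDemoMgr.dryrun_mode?
      # Normal mode: this is where a daemon would be forked and reached over DRb.
      @backend = :forked_daemon
    else
      # Dryrun (or server disabled): keep everything in the current process.
      @backend = :in_process
    end
  end
end
```

With the guard in place, the server config flag alone no longer decides whether a daemon is launched; dryrun mode always wins.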
- Set `WorkflowMgr.const_set("DRYRUN", 0)` in the shared server entry point used by `rocotobqserver`, `rocotodbserver`, and `rocotoioserver` (via symlinks)
- Ensures `dryrun_mode?` works inside daemon processes when they are launched (non-dryrun mode)
| Code Path | Guard Added | Why |
|---|---|---|
| `@bqServer.__drburi` (3 sites) | `&& !WorkflowMgr.dryrun_mode?` | No DRb server exists; use `0` as placeholder jobid |
| `@dbServer.add_cycles` | `unless dryrun_mode?` | No database writes in dryrun |
| `@dbServer.update_cycles` | `unless dryrun_mode?` | No database writes in dryrun |
| `@dbServer.add_jobs` | `unless dryrun_mode?` | No database writes in dryrun |
| `@dbServer.delete_jobs` | `unless dryrun_mode?` | No database writes in dryrun |
| `@dbServer.update_jobs` | `unless dryrun_mode?` | No database writes in dryrun |
| `@dbServer.add_bqservers` | `&& !WorkflowMgr.dryrun_mode?` | No BQS URI to register |
| `Thread.list.each { t.join }` (2 sites) | `\|\| WorkflowMgr.dryrun_mode?` | Thread pool workers sleep indefinitely in dryrun, causing deadlock |
| `harvest_pending_jobids` | `!dryrun_mode? &&` before `__drburi` | Skip BQS server filter when no server exists |
| `ensure` block server shutdown (3 sites) | `\|\| WorkflowMgr.dryrun_mode?` | No servers to shut down |
Dryrun-specific log messages added: "Dryrun: would submit..." and "would be booted (dryrun mode)".
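The thread-join row in the table above is the least obvious guard, so here is a small sketch of the idea. `join_workers` is a hypothetical helper written for this example; it is not the engine's actual code, which applies the guard at the existing join sites.

```ruby
# Hypothetical helper, not the engine's actual code.
def join_workers(workers, dryrun)
  # In dryrun the pool workers never finish, so joining them would hang.
  return :skipped if dryrun
  workers.each(&:join)
  :joined
end
```

Calling `join_workers` with a thread that sleeps forever returns immediately when `dryrun` is true, whereas an unguarded `join` on the same thread would block indefinitely.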
- Same thread-join deadlock fix as the engine: skip `Thread.list.each { t.join }` in dryrun mode
Each batch system's submit() method now checks dryrun_mode? and returns output="This is a dryrun" instead of executing the shell command:
| Scheduler | File | Command Guarded |
|---|---|---|
| Slurm | `slurmbatchsystem.rb` | `sbatch` |
| PBS Pro | `pbsprobatchsystem.rb` | `qsub` |
| Torque | `torquebatchsystem.rb` | `qsub` |
| Moab | `moabbatchsystem.rb` | `msub` |
| LSF | `lsfbatchsystem.rb` | `bsub` |
| LSF Cray | `lsfcraybatchsystem.rb` | `bsub` |
| Cobalt | `cobaltbatchsystem.rb` | `qsub` |
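As a rough sketch of the per-scheduler guard, assuming a `submit` method that shells out to the scheduler command: `DemoBatchSystem` is hypothetical, and the `[output, exit_status]` return shape and the zero exit status in dryrun are assumptions made for the example. Only the `"This is a dryrun"` output string comes from the PR.

```ruby
require 'open3'

# Hypothetical batch-system class; names and the return shape are assumptions.
class DemoBatchSystem
  def initialize(dryrun)
    @dryrun = dryrun
  end

  # Returns [output, exit_status] for the given submit command line.
  def submit(command)
    if @dryrun
      # Never touch the scheduler; report a canned message instead.
      ["This is a dryrun", 0]
    else
      output, status = Open3.capture2e(command)
      [output, status.exitstatus]
    end
  end
end
```

In dryrun mode `DemoBatchSystem.new(true).submit("sbatch job.sh")` returns the canned message without spawning any process, so it is safe to run on a machine with no scheduler installed.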
lib/workflowmgr/utilities.rb +8
lib/workflowmgr/workflowoption.rb +10
lib/workflowmgr/reportoption.rb +11
lib/workflowmgr/workflowsubsetoptions.rb +1/-1
lib/workflowmgr/bqsproxy.rb +2/-1
lib/workflowmgr/dbproxy.rb +2/-1
lib/workflowmgr/workflowioproxy.rb +2/-1
lib/workflowmgr/workflowengine.rb +52/-19
lib/workflowmgr/workflowreport.rb +4/-1
lib/workflowmgr/slurmbatchsystem.rb +5/-1
lib/workflowmgr/pbsprobatchsystem.rb +5/-1
lib/workflowmgr/torquebatchsystem.rb +5/-1
lib/workflowmgr/moabbatchsystem.rb +5/-1
lib/workflowmgr/lsfbatchsystem.rb +5/-1
lib/workflowmgr/lsfcraybatchsystem.rb +5/-1
lib/workflowmgr/cobaltbatchsystem.rb +5/-1
sbin/rocotoserver +3
This feature branch is deployed and actively running across the NOAA Global Workflow GitLab CI/CD pipeline infrastructure on 5 RDHPCS platforms, exercised nightly by automated pipelines and on-demand by PR-triggered builds. The custom Rocoto build from feature/dryrun_nodaemon is injected into the pipeline PATH via the GFS_CI_ROCOTO_PATH environment variable, replacing the system-installed Rocoto for all rocotorun, rocotostat, and rocotocheck invocations.
| Platform | Scheduler | Custom Build Path | Status |
|---|---|---|---|
| Hera | Slurm | `${GFS_CI_UTIL_PATH}/src/rocoto-1.3.7-dryrun_nodaemon/bin` | Active |
| Orion | Slurm | `${GFS_CI_UTIL_PATH}/src/rocoto_dryrun_nodaemon/bin` | Active |
| Hercules | Slurm | `${GFS_CI_UTIL_PATH}/src/rocoto_dryrun_nodaemon/bin` | Active |
| Gaea C6 | Slurm | `${GFS_CI_UTIL_PATH}/rocoto_dryrun_nodaemon/bin` | Active |
| Ursa | Slurm | `${GFS_CI_UTIL_PATH}/src/rocoto-1.3.7-dryrun_nodaemon/bin` | Active |
| WCOSS2 | PBS | (commented out, unset `GFS_CI_ROCOTO_PATH`) | Disabled |
Each platform's configuration lives in global-workflow/dev/ci/platforms/config.<platform> and exports GFS_CI_ROCOTO_PATH pointing to the feature branch build.
The custom Rocoto build is injected at three pipeline entry points via PATH prepend:
- `gitlab-ci-cases.yml` (experiment setup and execution stage): sources the platform config, then:

  ```bash
  if [[ -n "${GFS_CI_ROCOTO_PATH:-}" && -d "${GFS_CI_ROCOTO_PATH}" ]]; then
    export PATH="${GFS_CI_ROCOTO_PATH}:${PATH}"
    echo "Using custom Rocoto build at: $(which rocotorun 2>/dev/null || echo 'NOT FOUND')"
  fi
  ```

- `gitlab-ci-ctests.yml` (CTest execution stage): same injection in both the setup and run phases
- `run_check_ci.sh` / `run_check_gitlab_ci.sh` (experiment execution scripts): invoke the Rocoto toolchain:
  - `rocotorun -v ${ROCOTO_VERBOSE:-0} -w <xml> -d <db>`: launch/advance workflows
  - `rocotostat.py -w <xml> -d <db>`: monitor workflow state and job counts
  - `rocotostat -d <db> -w <xml>`: extract FAIL/DEAD job details
  - `rocotocheck -d <db> -w <xml>`: retrieve error logs for failed tasks
The CI pipeline relies on global-workflow/dev/ci/scripts/utils/rocotostat.py to determine workflow completion, stall detection, and failure counts. This Python utility wraps rocotostat --summary and rocotostat --all with built-in retry logic (telescoping delays):
| Operation | Max Attempts | Delay Between |
|---|---|---|
| `rocotostat --summary` | 3 | 120 sec |
| `rocotostat --all` (job counts) | 4 | 120 sec |
| `rocotorun` retry on stall | 2 | 120 sec |
The utility returns the workflow state as one of DONE, FAIL, RUNNING, STALLED, UNAVAILABLE, or UNKNOWN, together with per-state job counts (SUCCEEDED, FAIL, DEAD, RUNNING, SUBMITTING, QUEUED). Its --thread-logging flag monitors process thread counts during execution to detect daemon thread leaks, which is directly relevant to the thread pool deadlock fix in this branch. Unit tests are in dev/ci/scripts/unittests/test_rocotostat.py.
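The retry limits in the table above amount to a bounded retry loop with a fixed pause between attempts. The sketch below shows that pattern in Ruby, the language of Rocoto itself; the real utility is written in Python and may be structured differently. `with_retries` and its injectable `sleeper` argument are hypothetical.

```ruby
# Hypothetical helper; the real utility is Python and may be structured differently.
# sleeper is injectable so the delay can be stubbed out in tests.
def with_retries(max_attempts, delay_sec, sleeper = ->(s) { sleep(s) })
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue StandardError
    # Give up once the attempt budget is spent; otherwise pause and try again.
    raise if attempt >= max_attempts
    sleeper.call(delay_sec)
    retry
  end
end
```

For the `rocotostat --summary` row, a caller would wrap the status query in `with_retries(3, 120) { ... }`, so a transient failure costs at most two 120-second pauses before the error propagates.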
The nightly and PR-triggered pipelines execute a matrix of Global Workflow experiment cases per platform:
| Platform | Test Cases | Count |
|---|---|---|
| Hera | C48_ATM, C48_S2SW, C48_S2SWA_gefs, C48mx500_3DVarAOWCDA, C48mx500_hybAOWCDA, C96C48_hybatmDA, C96C48_hybatmsnowDA, C96C48_hybatmsoilDA, C96C48_ufsgsi_hybatmDA, C96C48_ufs_hybatmDA, C96C48mx500_S2SW_cyc_gfs, C96_atm3DVar, C96_gcafs_cycled, C96_gcafs_cycled_noDA, C96mx100_S2S, C48_gsienkf_atmDA, C48_ufsenkf_atmDA | 17 |
| Ursa | Same as Hera | 17 |
| Gaea C6 | Hera set minus DA variants | 15 |
| Hercules | Reduced DA set | 10 |
| Orion | Minimal set | 8 |
Each case exercises the full Rocoto lifecycle: XML parsing → dependency evaluation → job submission (rocotorun) → status polling (rocotostat.py) → error extraction (rocotocheck) → completion/failure determination. The feature branch build handles all of these operations across all active platforms.
| PR | Commit | Status | Notes |
|---|---|---|---|
| #117 | `438143b` | Merged → Reverted | Original `--dryrun` implementation |
| #119 | `79304a1` | Merged | Reverted #117 due to daemon interaction issues |
| This PR | `abba59a` (HEAD) | Ready | Re-applies #117 with all daemon-layer fixes |
The original PR missed a critical architectural detail: Rocoto's BatchQueueServer, DatabaseServer, and WorkflowIOServer config flags cause rocotorun to fork child daemon processes via WorkflowMgr.launchServer(). These daemons run sbin/rocotoserver (shared via symlinks) which never set the DRYRUN constant. When submit() inside the daemon checked WorkflowMgr.dryrun_mode?, it hit an uninitialized constant NameError, silently killing the thread pool task and leaving jobs stuck at PENDING forever.
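The failure mode described above can be reproduced in isolation. In this sketch `DaemonDemo` is a hypothetical module in which `DRYRUN` is deliberately never defined, as in the daemon processes, and `run_in_worker` is an illustrative helper that runs a block in a thread and reports how it ended.

```ruby
# Hypothetical module; DRYRUN is deliberately never defined, as in the daemons.
module DaemonDemo
  # Original behavior: references the constant unconditionally.
  def self.unguarded_dryrun?
    DRYRUN > 0
  end

  # Hardened behavior: an undefined constant means "not a dryrun".
  def self.guarded_dryrun?
    return false unless const_defined?(:DRYRUN)
    DRYRUN > 0
  end
end

# Runs the block in a worker thread and reports how it ended.
def run_in_worker
  worker = Thread.new do
    Thread.current.report_on_exception = false # keep stderr quiet for the demo
    yield
  end
  [:ok, worker.value] # value joins the thread and re-raises its exception
rescue NameError => e
  [:error, e.class]
end
```

The unguarded check dies with `NameError` inside the worker, while the guarded one returns `false`. In a thread pool where nobody inspects the worker's result, the first case goes unnoticed, which matches the stuck-at-PENDING symptom.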
- Daemon suppression: Don't fork DRb servers at all in dryrun mode (3 proxy files)
- Constant initialization: Set `DRYRUN=0` in `rocotoserver` for non-dryrun daemon processes
- Defensive helper: `const_defined?` guard prevents `NameError` in any context
- Engine hardening: 10 guard sites in `workflowengine.rb` to skip DB writes, DRb URI access, and thread joins
- Deadlock prevention: Thread pool workers sleep forever in dryrun; skip `Thread.join` to avoid hang