Logbook 2026 H1 - cardano-scaling/hydra GitHub Wiki
# February 2026

## 2026-02-25

### SB on decommit race bug
## The Bug: Wrong Snapshot Number After DecommitFinalized/CommitFinalized

### Root Cause

The Hydra Head tracks two snapshot-related pieces of state:

- `confirmedSnapshot` - the last snapshot that was fully confirmed (all signatures collected, finalized)
- `seenSnapshot` - the current snapshot activity (what is being requested/signed)

These two can get out of sync when chain events arrive at a different time than off-chain consensus completes.
### The Scenario That Triggers The Bug

Let me walk through a concrete example.

**Initial state:** `confirmedSnapshot.number = 0`, `seenSnapshot = LastSeenSnapshot{lastSeen = 0}`

**Step 1:** Alice initiates a decommit. `ReqDec` arrives → the decommit is stored in state.

**Step 2:** The leader requests snapshot 1 with the decommit. Network effect: `ReqSn(version=0, number=1, decommit=...)`; state change: `SnapshotRequestDecided{snapshotNumber=1}`. After `aggregate`: `confirmedSnapshot.number = 0` (unchanged), `seenSnapshot = RequestedSnapshot{lastSeen=0, requested=1}`.

**Step 3:** All parties start signing snapshot 1: each party receives `ReqSn` → sends `AckSn`. After processing `ReqSn`: `confirmedSnapshot.number = 0` (still unchanged), `seenSnapshot = SeenSnapshot{snapshot.number=1, signatures=...}`.

**Step 4:** Snapshot 1 confirms (all `AckSn` received). After `SnapshotConfirmed`: `confirmedSnapshot.number = 1` ✅ updated, `seenSnapshot = LastSeenSnapshot{lastSeen=1}`.

**Step 5:** `DecrementTx` posted to chain. Effect: `PostTxOnChain DecrementTx` (posts the decommit transaction to L1).
**Step 6:** 🔥 The race condition 🔥

This is where the bug happens. There are two possible orderings:

**Ordering A (normal - no bug):**

- All `AckSn` messages arrive → `SnapshotConfirmed`
- `DecommitFinalized` observed on chain

Final state: `confirmedSnapshot.number = 1`, `seenSnapshot = LastSeenSnapshot{lastSeen=1}` ✅ everything in sync.

**Ordering B (race - bug!):**

- `DecommitFinalized` observed on chain BEFORE `AckSn` completes
- The chain event arrives while still in the `RequestedSnapshot` or `SeenSnapshot` state

Current state when `DecommitFinalized` arrives: `confirmedSnapshot.number = 0` (snapshot not confirmed yet!), `seenSnapshot = RequestedSnapshot{lastSeen=0, requested=1}`.

What happens in the `DecommitFinalized` aggregate:
```haskell
DecommitFinalized{chainState, newVersion} ->
  coordinatedHeadState =
    coordinatedHeadState
      { decommitTx = Nothing
      , version = newVersion
      , seenSnapshot = toLastSeenSnapshot (seenSnapshot coordinatedHeadState)
        -- 👆 this is the key function
      }
```
The `toLastSeenSnapshot` function (exists in BOTH master and this branch):

```haskell
toLastSeenSnapshot :: SeenSnapshot tx -> SeenSnapshot tx
toLastSeenSnapshot = \case
  RequestedSnapshot{requested} ->
    LastSeenSnapshot{lastSeen = requested}
    -- 👆 uses 'requested', not 'lastSeen': requested=1 becomes lastSeen=1
  SeenSnapshot{snapshot = Snapshot{number}} ->
    LastSeenSnapshot{lastSeen = number}
  -- ... other cases
```
After the `DecommitFinalized` aggregate: `confirmedSnapshot.number = 0` ❌ still at the old value, `seenSnapshot = LastSeenSnapshot{lastSeen=1}` ✅ updated to the snapshot that was being processed.

🔥 `seenSnapshot` is AHEAD of `confirmedSnapshot`! 🔥
**Step 7:** A new L2 transaction arrives and triggers `onOpenNetworkReqTx`.

Master code (buggy):

```haskell
onOpenNetworkReqTx ... =
  waitApplyTx $ \newLocalUTxO ->
    newState TransactionAppliedToLocalUTxO{...}
      & maybeRequestSnapshot (confirmedSn + 1)
        -- 👆 uses confirmedSn only! = 0 + 1 = 1

maybeRequestSnapshot nextSn outcome =
  if not (snapshotInFlight seenSnapshot nextSn) && isLeader ...
    then -- emit ReqSn for snapshot 1
      cause (NetworkEffect $ ReqSn version 1 ...)
```
**Step 8:** `ReqSn 1` is broadcast and received.

When the `ReqSn` arrives back at the node, it goes through validation in `onOpenNetworkReqSn`:

```haskell
onOpenNetworkReqSn ... =
  requireReqSn $ \continue ->
    -- validation checks, where seenSn = seenSnapshotNumber seenSnapshot = 1

requireReqSn continue
  | sn /= seenSn + 1 =
      Error $ RequireFailed $ ReqSnNumberInvalid{requestedSn = sn, lastSeenSn = seenSn}
      -- 👆 1 /= 1 + 1 → 1 /= 2 → TRUE → error!
```
💥 The bug manifests:

- Trying to request snapshot 1
- But `seenSn = 1` (from `LastSeenSnapshot{lastSeen=1}`)
- Validation expects `sn = seenSn + 1 = 2`
- Error: `ReqSnNumberInvalid{requestedSn=1, lastSeenSn=1}`
- The head can't request new snapshots → STUCK!
### The Fix

This branch's code (fixed):

```haskell
onOpenNetworkReqTx ... =
  waitApplyTx $ \newLocalUTxO ->
    newState TransactionAppliedToLocalUTxO{...}
      & maybeRequestSnapshot (max confirmedSn (latestSeenSnapshotNumber seenSnapshot) + 1)
        -- 👆 NEW: use the max of both!
```

New helper function:

```haskell
latestSeenSnapshotNumber :: SeenSnapshot tx -> SnapshotNumber
latestSeenSnapshotNumber = \case
  NoSeenSnapshot -> 0
  LastSeenSnapshot{lastSeen} -> lastSeen
  RequestedSnapshot{lastSeen} -> lastSeen -- use lastSeen (confirmed), not requested!
  SeenSnapshot{snapshot = Snapshot{number}} -> number - 1 -- snapshot N is being signed, confirmed is N-1
```
With the fix: `confirmedSn = 0`, `seenSn = latestSeenSnapshotNumber (LastSeenSnapshot{lastSeen=1}) = 1`, so `nextSn = max(0, 1) + 1 = 2` ✅ correct!

**Step 9:** `ReqSn 2` is broadcast and validated. In `requireReqSn`: `sn = 2`, `seenSn = 1`; check `2 /= 1 + 1` → `2 /= 2` → `FALSE` ✅ validation passes!
### Why This Fix Works

The fix works because it correctly handles all possible states:

| Scenario | confirmedSn | seenSn | nextSn (master) | nextSn (fixed) | Result |
|---|---|---|---|---|---|
| Normal flow | 0 | 0 | 0+1=1 ✅ | max(0,0)+1=1 ✅ | Both work |
| After DecommitFinalized race | 0 | 1 | 0+1=1 ❌ | max(0,1)+1=2 ✅ | Fixed! |
| Normal after confirm | 1 | 1 | 1+1=2 ✅ | max(1,1)+1=2 ✅ | Both work |

### Related Changes
The same fix was applied everywhere snapshot numbers are calculated:
- `onOpenNetworkReqTx` (L2 transactions)
- `onOpenNetworkAckSn` (after a snapshot confirms, request the next one if leader + has txs)
- `onOpenChainTick` (periodic snapshot requests for deposits)
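The comparison above can be reproduced with a small standalone model. This is a sketch with simplified, hypothetical types (the real `SeenSnapshot` in `Hydra.HeadLogic` is parameterised over `tx` and carries more fields); `nextSnMaster` and `nextSnFixed` are made-up names for the two calculations:

```haskell
{-# LANGUAGE LambdaCase #-}
{-# LANGUAGE NamedFieldPuns #-}

-- Standalone model of the next-snapshot-number calculation.
-- Names mirror HeadLogic, but this is a simplified sketch, not the real code.
module Main where

type SnapshotNumber = Int

data SeenSnapshot
  = NoSeenSnapshot
  | LastSeenSnapshot {lastSeen :: SnapshotNumber}
  | RequestedSnapshot {lastSeen :: SnapshotNumber, requested :: SnapshotNumber}
  | SeenSnapshot {number :: SnapshotNumber} -- snapshot currently being signed
  deriving (Show)

latestSeenSnapshotNumber :: SeenSnapshot -> SnapshotNumber
latestSeenSnapshotNumber = \case
  NoSeenSnapshot -> 0
  LastSeenSnapshot{lastSeen} -> lastSeen
  RequestedSnapshot{lastSeen} -> lastSeen -- lastSeen (confirmed), not requested
  SeenSnapshot{number} -> number - 1 -- N is being signed, so N-1 is confirmed

-- Master behaviour: looks at confirmedSn only.
nextSnMaster :: SnapshotNumber -> SeenSnapshot -> SnapshotNumber
nextSnMaster confirmedSn _ = confirmedSn + 1

-- Fixed behaviour: take the max of both sources of truth.
nextSnFixed :: SnapshotNumber -> SeenSnapshot -> SnapshotNumber
nextSnFixed confirmedSn seen =
  max confirmedSn (latestSeenSnapshotNumber seen) + 1

main :: IO ()
main = do
  -- After the DecommitFinalized race: confirmedSn = 0, lastSeen = 1.
  print (nextSnMaster 0 (LastSeenSnapshot 1)) -- 1, rejected (seenSn + 1 = 2 expected)
  print (nextSnFixed 0 (LastSeenSnapshot 1)) -- 2, accepted
```

Running `main` shows the race row of the table: the master calculation yields 1 (which the `requireReqSn` check rejects), the fixed one yields 2.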
### Summary

**The bug:** After `DecommitFinalized`/`CommitFinalized` arrives via chain before off-chain snapshot consensus completes, `seenSnapshot` gets ahead of `confirmedSnapshot`, causing a wrong snapshot number calculation.

**The fix:** Use `max confirmedSn (latestSeenSnapshotNumber seenSnapshot) + 1` to always calculate the correct next snapshot number regardless of which state is ahead.
**Why it matters:** Under heavy L2 load, snapshots complete rapidly and chain events can easily arrive out of order, making this race condition common rather than rare.
## 2026-02-12

### SB on Decommit Stuck Head Analysis (DeltaDeFi bug)

## Bug Summary

Under heavy L2 transaction load, a decommit can permanently stall the Hydra Head due to a race condition between the off-chain snapshot protocol and on-chain decrement observation.

## Timeline (from hydra-node-alice-logs.txt)
| Timestamp | Log Line | Event | Key Data |
|---|---|---|---|
| 10:09:47.157 | 247117 | AckSn received → Snapshot confirmed | Effects: DecrementTx (on-chain) + ReqSn(sn=2790, v=3) |
| 10:09:47.412 | 247133 | Outcome processed | SnapshotConfirmed + DecommitApproved(txId=1a87c616...) + SnapshotRequestDecided(sn=2790) |
| 10:09:47.919 | 247188 | ReqSn sent on network | sn=2790, version=3, includes decommitTx + 8 L2 txIds |
| 10:09:47.919 | 247205 | DecommitFinalized chain event | newVersion=4 (bumped from 3) |
| 10:09:48.165 | 247253 | ReqSn received from network | sn=2790, version=3 |
| 10:09:48.165 | 247254 | ERROR: ReqSvNumberInvalid | requestedSv=3, lastSeenSv=4 |
| 10:09:48+ | 247265+ | Only TransactionAppliedToLocalUTxO events | No more snapshots — HEAD STUCK |
## Root Cause

### The Race

1. `onOpenNetworkAckSn` (HeadLogic.hs:607) confirms a snapshot. This triggers `maybeRequestNextSnapshot` (line 687), which creates `SnapshotRequestDecided{sn=2790}` and sends `ReqSn(v=3, sn=2790)`.
2. Almost simultaneously, the `DecommitFinalized` chain event is processed. The `aggregate` function (HeadLogic.hs:2003-2016) updates `version` from 3 to 4 and clears `decommitTx`.
3. When `ReqSn(v=3, sn=2790)` is received (even by the sender itself), `onOpenNetworkReqSn` (HeadLogic.hs:461) checks `sv /= version` → `3 /= 4` → rejects with `ReqSvNumberInvalid`.
4. The error is discarded with no state change, but `seenSnapshot` remains `RequestedSnapshot{lastSeen=2789, requested=2790}`.
### Why No Recovery

- `snapshotInFlight` (HeadLogic.hs:360-364) returns `True` for the `RequestedSnapshot` state
- `maybeRequestSnapshot` (line 331) and `maybeRequestNextSnapshot` (line 687) both check `not snapshotInFlight`
- Since `seenSnapshot` is stuck in `RequestedSnapshot`, no new snapshot can ever be requested
- L2 transactions keep arriving but pile up without ever being snapshotted
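The deadlock can be seen in a tiny model of this guard (a simplification: the real `snapshotInFlight` also takes the candidate snapshot number, and the leader check has more inputs; names here are illustrative):

```haskell
{-# LANGUAGE LambdaCase #-}

-- Simplified model: while seenSnapshot stays in RequestedSnapshot,
-- the leader's guard never allows a new ReqSn to be emitted.
module Main where

data SeenSnapshot
  = NoSeenSnapshot
  | LastSeenSnapshot Int -- lastSeen
  | RequestedSnapshot Int Int -- lastSeen, requested
  | SeenSnapshot Int -- number being signed
  deriving (Show)

snapshotInFlight :: SeenSnapshot -> Bool
snapshotInFlight = \case
  NoSeenSnapshot -> False
  LastSeenSnapshot _ -> False
  RequestedSnapshot _ _ -> True
  SeenSnapshot _ -> True

-- The guard applied by maybeRequestSnapshot / maybeRequestNextSnapshot.
mayRequestSnapshot :: Bool -> SeenSnapshot -> Bool
mayRequestSnapshot isLeader seen = isLeader && not (snapshotInFlight seen)

main :: IO ()
main = do
  -- Stuck: ReqSn 2790 was rejected, but the state still says "requested".
  print (mayRequestSnapshot True (RequestedSnapshot 2789 2790)) -- False, forever
  print (mayRequestSnapshot True (LastSeenSnapshot 2790)) -- True, after a reset
```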
### The Missing Piece

The `DecommitFinalized` aggregate (HeadLogic.hs:2003-2016) bumps the version but does NOT reset `seenSnapshot`:
```haskell
DecommitFinalized{chainState, newVersion} ->
  case st of
    Open os@OpenState{coordinatedHeadState} ->
      Open
        os
          { chainState
          , coordinatedHeadState =
              coordinatedHeadState
                { decommitTx = Nothing
                , version = newVersion
                -- BUG: seenSnapshot is NOT reset here
                }
          }
    _otherState -> st
```
The same issue exists in `CommitFinalized` (HeadLogic.hs:1753-1781) for the increment path.

## Fix

Reset `seenSnapshot` to `LastSeenSnapshot` when `DecommitFinalized` or `CommitFinalized` bumps the version:
```haskell
DecommitFinalized{chainState, newVersion} ->
  case st of
    Open os@OpenState{coordinatedHeadState} ->
      Open
        os
          { chainState
          , coordinatedHeadState =
              coordinatedHeadState
                { decommitTx = Nothing
                , version = newVersion
                , seenSnapshot =
                    LastSeenSnapshot
                      { lastSeen = seenSnapshotNumber (seenSnapshot coordinatedHeadState)
                      }
                }
          }
    _otherState -> st
```
### Why This Is Safe

- **Old AckSn messages:** dropped by `waitOnSeenSnapshot` (no-op if `sn <= lastSeen`) or put in wait. Old signatures won't verify against a new-version snapshot anyway.
- **All parties see chain events:** `DecommitFinalized` is observed by all nodes simultaneously, so all reset their state.
- **Automatic recovery:** under load, the next `ReqTx` triggers `maybeRequestSnapshot`, which now sees `snapshotInFlight = False` and re-requests with the updated version.
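The reset can be sketched end-to-end with hypothetical simplified types (here `seenSnapshotNumber` is assumed to return the requested number for an in-flight snapshot, matching the fix above):

```haskell
{-# LANGUAGE LambdaCase #-}

-- Sketch: DecommitFinalized bumps the version AND resets seenSnapshot,
-- so the snapshot-in-flight guard opens up again. Simplified types.
module Main where

data SeenSnapshot
  = NoSeenSnapshot
  | LastSeenSnapshot Int -- lastSeen
  | RequestedSnapshot Int Int -- lastSeen, requested
  deriving (Show, Eq)

data CoordState = CoordState
  { version :: Int
  , seenSnapshot :: SeenSnapshot
  }
  deriving (Show, Eq)

seenSnapshotNumber :: SeenSnapshot -> Int
seenSnapshotNumber = \case
  NoSeenSnapshot -> 0
  LastSeenSnapshot n -> n
  RequestedSnapshot _ requested -> requested

snapshotInFlight :: SeenSnapshot -> Bool
snapshotInFlight = \case
  RequestedSnapshot _ _ -> True
  _ -> False

-- Fixed aggregate step for DecommitFinalized.
onDecommitFinalized :: Int -> CoordState -> CoordState
onDecommitFinalized newVersion st =
  st
    { version = newVersion
    , seenSnapshot = LastSeenSnapshot (seenSnapshotNumber (seenSnapshot st))
    }

main :: IO ()
main = do
  let stuck = CoordState 3 (RequestedSnapshot 2789 2790)
      after = onDecommitFinalized 4 stuck
  print after
  print (snapshotInFlight (seenSnapshot after)) -- False: head can progress again
```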
## 2026-02-06

### SB on DeltaDeFi bugs

## Snapshot Stuck Analysis: `SnapshotConfirmed` stops appearing
## Summary

After a certain point in the Alice node logs, `SnapshotConfirmed` events stop appearing entirely. The node enters an infinite loop of `ReqSn` followed by `WaitOnDepositObserved`, processing ~18,000 retries with no progress. The root cause is a stale `currentDepositTxId` in `CoordinatedHeadState` that references a deposit already recovered on-chain.

## Timeline
| Time | Line | Event | Detail |
|---|---|---|---|
| 06:39 | 10625 | `OnDepositTx` | Deposit `5655641e` observed on-chain |
| 06:39 | 10626 | `DepositRecorded` | Added to `pendingDeposits` |
| 07:14 | 11087 | `DepositActivated` | Became active, triggered `ReqSn sn=21 deposit=5655641e` |
| 07:14 | 11121 | `CommitApproved` | Snapshot sn=21 confirmed with this deposit |
| 07:47 | 11874 | `DepositExpired` | Deposit expired |
| 07:49-08:17 | 11944-12673 | `PostTxOnChainFailed` | Multiple failed `RecoverTx` attempts |
| 08:23 | 12863 | `OnRecoverTx` | Deposit recovered on-chain |
| 08:23 | 12864 | `DepositRecovered` | Removed from `pendingDeposits` |
| 08:28 | 12911 | `ReqSn sn=22` | Still references deposit=`5655641e` -- stuck begins |
| 08:37 | 29453 | `SideLoadSnapshot` | State reset via side-load |
| 08:37 | 29456 | `LocalStateCleared` | Clears localTxs, allTxs, seenSnapshot only |
| 09:45 | 50047 | `SideLoadSnapshot` | Another side-load |
| 09:45 | 50050 | `LocalStateCleared` | Same partial clear |
| 09:46 | 50062 | `SnapshotConfirmed` | Last ever SnapshotConfirmed (from side-loaded replay) |
| 09:46 | 50089 | `SnapshotRequestDecided sn=22` | Node decides to request snapshot again |
| 09:46 | 50094 | `ReqSn sn=22 deposit=5655641e` | Still references the recovered deposit |
| 09:46+ | 50097+ | `WaitOnDepositObserved` forever | Deposit not in `pendingDeposits`, infinite retry loop (18,003 times) |
## Deposit Lifecycle Comparison
Three deposits were referenced in ReqSn messages during this session:
| Deposit | Recorded | Activated | CommitApproved | OnIncrementTx | CommitFinalized | Recovered |
|---|---|---|---|---|---|---|
| `866fdf31` | 04:34 | yes | yes | 05:08 (v=1) | 05:09 | - |
| `b34d4dbb` | 05:14 | yes | yes | 05:48 (v=3) | 05:49 | - |
| `5655641e` | 06:39 | 07:14 | 07:14 | never | never | 08:23 |
Deposit `5655641e` was approved, but the `IncrementTx` was never observed on-chain, so `CommitFinalized` never fired. The normal path to clear `currentDepositTxId` (via `CommitFinalized` setting it to `Nothing`) never executed. Meanwhile the deposit was recovered on-chain, removing it from `pendingDeposits` but leaving the stale reference in `CoordinatedHeadState.currentDepositTxId`.
## Why `OnIncrementTx` Was Never Observed: Chain Rollback

The key question is: if `CommitApproved` happened (meaning the snapshot was confirmed with the deposit), why was the `IncrementTx` never observed on-chain? The detailed sequence around the deposit reveals the answer:
| Time | Line | Event | Detail |
|---|---|---|---|
| 07:14:13.482 | 11103 | `ReqSn sn=21` | Snapshot request with deposit=`5655641e` |
| 07:14:13.503 | 11104 | `AckSn sn=21` | Acknowledged |
| 07:14:13.565 | 11109 | `SnapshotRequested` | Snapshot seen |
| 07:14:13.606 | 11116 | `IncrementTx` | IncrementTx posted on-chain |
| 07:14:13.631 | 11119 | `PartySignedSnapshot` | Signature collected |
| 07:14:13.643 | 11121 | `SnapshotConfirmed` | Snapshot sn=21 confirmed |
| 07:14:13.716 | 11131 | `CommitApproved` | Commit approved, more IncrementTx postings follow |
| 07:14:16.751 | 11142 | `SnapshotConfirmed` | Last confirmed snapshot in this window |
| ... | (ticks) | Normal tick processing, deposit still active | |
| 07:20:58.651 | 11272 | `Rollback` | Chain rolled back -- IncrementTx observation erased |
| 07:20:58.653 | 11273 | `ChainRolledBack` | Rollback applied to head state |
| ... | (ticks) | No more SnapshotConfirmed or OnIncrementTx after this | |
| 07:47:26.378 | 11874 | `DepositExpired` | Deposit expires before increment re-lands |
The `IncrementTx` was posted and initially appeared on-chain, but the chain rollback at 07:20 erased it. After the rollback:

- `OnIncrementTx` was never re-observed for this deposit
- The node does not re-post the `IncrementTx` after a rollback -- it only posts it once as a side-effect of `SnapshotConfirmed` via `maybePostIncrementTx` in `onOpenNetworkAckSn`
- `CommitFinalized` never happened, so `currentDepositTxId` was never cleared
- The deposit expired (07:47) and was eventually recovered on-chain (08:23)
- But `currentDepositTxId` remained stale, poisoning all future `ReqSn` messages
## Root Cause

Three issues in `HeadLogic.hs`:

### 1. `LocalStateCleared` does not reset `currentDepositTxId` or `decommitTx`

In the `aggregate` function (lines 1835-1857), the handler for `LocalStateCleared` resets `localUTxO`, `localTxs`, `allTxs`, and `seenSnapshot`, but leaves `currentDepositTxId` and `decommitTx` untouched:
```haskell
LocalStateCleared{snapshotNumber} ->
  case st of
    Open os@OpenState{coordinatedHeadState = coordinatedHeadState@CoordinatedHeadState{confirmedSnapshot}} ->
      Open
        os
          { coordinatedHeadState =
              case confirmedSnapshot of
                InitialSnapshot{initialUTxO} ->
                  coordinatedHeadState
                    { localUTxO = initialUTxO
                    , localTxs = mempty
                    , allTxs = mempty
                    , seenSnapshot = NoSeenSnapshot
                    }
                ConfirmedSnapshot{snapshot = Snapshot{utxo}} ->
                  coordinatedHeadState
                    { localUTxO = utxo
                    , localTxs = mempty
                    , allTxs = mempty
                    , seenSnapshot = LastSeenSnapshot snapshotNumber
                    }
                -- NOTE: currentDepositTxId and decommitTx are NOT cleared here
          }
```
After a `SideLoadSnapshot` triggers `LocalStateCleared`, the stale `currentDepositTxId` persists. The next time the node decides to request a snapshot (via `onOpenNetworkReqTx` -> `maybeRequestSnapshot` at line 337, or `onOpenChainTick` at line 987), it includes this stale deposit ID in the `ReqSn`.
### 2. No validation that `currentDepositTxId` exists in `pendingDeposits` when building `ReqSn`

In `maybeRequestSnapshot` (line 337) and `onOpenChainTick` (line 987), the `currentDepositTxId` from `CoordinatedHeadState` is included in the outgoing `ReqSn` without verifying the deposit still exists in `pendingDeposits`:

```haskell
maxTxsPerSnapshot localTxs') decommitTx currentDepositTxId)
```

When the receiving side processes this `ReqSn` in `onOpenNetworkReqSn` -> `waitForDeposit` (line 489), it looks up the deposit in `pendingDeposits`:

```haskell
case Map.lookup depositTxId pendingDeposits of
  Nothing -> wait WaitOnDepositObserved{depositTxId}
```

Since the deposit was recovered (removed from `pendingDeposits`), this returns `WaitOnDepositObserved` on every retry, creating an infinite loop.
## Log Statistics
| Event | Count | Notes |
| ----------------------- | ------ | ---------------------------------- |
| `ReqSn` | 18,075 | Dominated by retries of sn=22 |
| `WaitOnDepositObserved` | 18,003 | All for deposit `5655641e` |
| `SnapshotConfirmed` | 59 | Last one at line 50062 |
| `DepositExpired` | 1,800 | Deposit re-evaluated on every tick |
| `DepositActivated` | 385 | Before expiration |
| `PostTxOnChainFailed` | 36 | Failed `RecoverTx` attempts |
### 3. No re-posting of `IncrementTx` after chain rollback
In `onOpenNetworkAckSn`, the `IncrementTx` is posted on-chain as a one-shot
side-effect of `SnapshotConfirmed` via `maybePostIncrementTx` (line 692). If a
chain rollback erases the `IncrementTx` observation, there is no mechanism to
re-post it. The `handleChainInput` handler for `Open` state + `Rollback` only
calls `ChainRolledBack` to update the chain state -- it does not check whether
a pending `IncrementTx` needs to be resubmitted.
This is the triggering cause in this incident: the rollback at 07:20 erased the
`IncrementTx`, leaving the node in a state where `CommitApproved` had happened
but `CommitFinalized` never would.
## Fixes
1. **Re-post `IncrementTx` after chain rollback** -- when a rollback is
observed and there is a `currentDepositTxId` set with a confirmed snapshot
containing `utxoToCommit`, the node should re-post the `IncrementTx`. This is
the primary fix that would have prevented the incident.
2. **Clear `currentDepositTxId` and `decommitTx` in `LocalStateCleared`
aggregate handler** -- this prevents stale references from surviving a
side-load snapshot. Defense-in-depth.
3. **Clear `currentDepositTxId` when `DepositRecovered` is processed** -- in
`aggregateNodeState`, the `DepositRecovered` handler removes the deposit
from `pendingDeposits` but does not clear `currentDepositTxId` in the
`CoordinatedHeadState` if it matches. This would break the infinite loop as a
fallback.
4. **Validate `currentDepositTxId` against `pendingDeposits` before including
in `ReqSn`** -- if the deposit no longer exists, set it to `Nothing` in the
outgoing request. Another layer of defense.
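Fixes 3 and 4 can be sketched as pure helpers over simplified state (hypothetical names and types, not the actual Hydra code; the real `pendingDeposits` lives in the head/chain state):

```haskell
-- Sketch of fixes 3 and 4: clear or validate the stale deposit reference.
module Main where

import qualified Data.Map.Strict as Map

type TxId = String

-- Fix 3: when a deposit is recovered, also clear a matching currentDepositTxId.
onDepositRecovered :: TxId -> Maybe TxId -> Maybe TxId
onDepositRecovered recovered current
  | current == Just recovered = Nothing
  | otherwise = current

-- Fix 4: only include the deposit in an outgoing ReqSn if it is still pending.
depositForReqSn :: Maybe TxId -> Map.Map TxId utxo -> Maybe TxId
depositForReqSn current pendingDeposits =
  current >>= \d ->
    if Map.member d pendingDeposits then Just d else Nothing

main :: IO ()
main = do
  -- The incident state: deposit 5655641e was recovered but still referenced.
  let pending = Map.empty :: Map.Map TxId ()
  print (onDepositRecovered "5655641e" (Just "5655641e")) -- Nothing
  print (depositForReqSn (Just "5655641e") pending) -- Nothing: no poisoned ReqSn
```

Either helper alone would have broken the 18,000-retry loop; together they make the stale-reference state unrepresentable in outgoing `ReqSn` messages.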
## 2026-02-04
### SB on DeltaDeFi reported bugs
- Ok after some hurdles I finally got some useful logs.
- Seems the problem timeline is something like this: a deposit expires; in the
code we do not remove it from `pendingDeposits`; the deposit is recovered; then a
`SideLoadSnapshot` occurs in order to unstick the head, but the node waits forever
to see the expected deposit, since there is a `ReqSn` with a `depositTxId` which
we already recovered.
- After some time the node stops waiting on deposit and user does a new tx
which leads to ReqTx but no ReqSn.
- And the old ReqSn is still in flight with the recovered depositTxId.
- I found one line in code which is problematic - we insert into pending
deposits on DepositExpired. We should keep track of recoverable deposits
separately I think to avoid this problem.
- Off chain code/head logic is so convoluted and hard to follow when user
problems like this occur. I wish we could keep it more simple.
- I think one option could be to not keep deposits in pending deposits just so
that we can recover them. If the user hits the API to recover some deposit, we
should not keep it in pending deposits but try to recover using the provided
txid. If the deposit was already spent then they get an error and that is it -
pending deposits are reserved only for ones that are still pending, so not expired.
- Right now I need to write a test that exposes this problem (needs to be on
top of 1.2.0 since that's what DeltaDeFi are using).
- Did this in BehaviorSpec where I issued a NewTx after a DepositExpired and
saw the problem where multiple ReqSn messages are sent with populated
depositTxId.
- Applied the fix that DeltaDeFi came up with to remove `depositTx` from the
coordinated head state after seeing a deposit expire - still not green, because
we have a snapshot in flight that should now be discarded when the deposit is
expired.
- How to actually discard a snapshot that was already signed by participants? I
think something similar to SnapshotSideLoad should work out.
- Since it is tricky to capture all of the logs in BehaviorSpec I committed
just the failing test and want to reproduce the same error in e2e tests as
well. There I should be able to see more logging that is relevant.
- It might be tricky to write a test for DepositExpired - I see there are none
so far :sad-face
- We need to make sure we can run our e2e tests on different networks in order
to make sure all is working in the real world as well.
- It is hard to reproduce this bug on e2e. I need to have
`SnapshotRequestDecided` but on chain Tick (I think) since we also emit the
same message on AckSn handling. All of this is triggered on DepositActivated it
seems so very hard to come to the same setup (need to think about it).
- So there is ReqSn and AckSn in the logs for snapshot number 21 which contains
depositTxId. Then CommitApproved (for another deposit) so it seems there are
multiple deposits flying around.
- This will be very hard to reproduce since I need to have a snapshot in flight
when DepositExpired message is seen.
- Perhaps I could add some code for the node to ignore certain snapshot in
order to reproduce this?
- Seems like it is not that easy...
- Jackal mention they do IC -> ID -> IC -> ID -> IC fails, so perhaps tomorrow I
try that order (and try to run it on preprod since devnet is so fast)
- Oh, another idea: slow the devnet down by fiddling with the configuration?
## 2026-02-03

### SB on Time on L2
- We have two users reporting bugs with time validity on L2 transactions.
- One user might just be affected by this because of an offline Hydra node,
while the other claims the slots on L2 are not correct and therefore their
contract fails when doing an L2 tx.
- I wanted to include the Aiken vesting contract (since that's what one of the
users used) in our testing toolchain, but I'm running into problems when trying
to commit the script UTxO into a Head:
```
ScriptFailedInWallet {redeemerPtr = "ConwaySpending (AsIx {unAsIx = 1})", failureReason = "ValidationFailure (WrapExUnits {unWrapExUnits = ExUnits' {exUnitsMem' = 0, exUnitsSteps' = 0}}) (CekError An error has occurred:
The machine terminated because of an error, either from a built-in function or from an explicit use of 'error'.
Caused by: unBData
(Constr 0
  [ Constr 0
    [ List
      [ Constr 0
        [ Constr 0
          [ B #44d2f08229825ba3a3e260ecd2dcc9ac0af4da6462bcf5f435d0b0ccca90dbce
          , I 1 ]
        , Constr 0
          [ Constr 0
            [ Constr 1
              [ B #c8a101a5c8ac4816b0dceb59ce31fc2258e387de828f02961d2f2045 ]
            , Constr 1 [] ]
          , Map
            [ (B "))
```
- This is what the commit tx looks like before going into the wallet:
```
"0ac5466e63ef9527dfff23f485c3d1a10cc10d04f3e3f38275a2676602b9e18b"
== INPUTS (2)
- 44d2f08229825ba3a3e260ecd2dcc9ac0af4da6462bcf5f435d0b0ccca90dbce#1 ShelleyAddress Testnet (ScriptHashObj (ScriptHash "c8a101a5c8ac4816b0dceb59ce31fc2258e387de828f02961d2f2045")) StakeRefNull 1293000 lovelace 1 d0786d92892d904ae16c775e85648c6cb669bd053bfed39c746c06ab.f8a68cd18e59a6ace848155a0e967af64f4d00cf8acee8adc95a6b0d TxOutDatumInline "0xd0786d92892d904ae16c775e85648c6cb669bd053bfed39c746c06ab" ReferenceScriptNone
- 60d7f7796929f70379696147009478a6df0f1439d39b8bd2a8f5ea4df84a0101#0 ShelleyAddress Testnet (ScriptHashObj (ScriptHash "e8b53932e4a43630bed3893f186fe9e8a8391f45d951af082a726e4b")) StakeRefNull 7000000 lovelace TxOutDatumInline [0,[100,"0x69830961c6af9095b0f2648dff31fa9545d8f0b6623db865eb78fde8","0x69830961c6af9095b0f2648dff31fa9545d8f0b6623db865eb78fde8"]] ReferenceScriptNone
== COLLATERAL INPUTS (0)
== REFERENCE INPUTS (1)
- df4967ec1d8358b1a4372d73880e85ed0cfa5d10428849fe93c276fa629ea781#0 ShelleyAddress Testnet (ScriptHashObj (ScriptHash "6a09cb22defaf4a96a6be1ef6c07467ac9923d1750a79214a06c503a")) StakeRefNull 12352460 lovelace TxOutDatumNone ReferenceScript PlutusScriptLanguage PlutusScriptV3 "c8a101a5c8ac4816b0dceb59ce31fc2258e387de828f02961d2f2045"
== OUTPUTS (1) Total number of assets: 2
- ShelleyAddress Testnet (ScriptHashObj (ScriptHash "61458bc2f297fff3cc5df6ac7ab57cefd87763b0b7bd722146a1035c")) StakeRefNull 8293000 lovelace 1 d0786d92892d904ae16c775e85648c6cb669bd053bfed39c746c06ab.f8a68cd18e59a6ace848155a0e967af64f4d00cf8acee8adc95a6b0d TxOutDatumInline [0,["0xd5bf4a3fcce717b0388bcc2749ebc148ad9969b23f45ee1b605fd58778576ac4",[0,[[0,"0x60d7f7796929f70379696147009478a6df0f1439d39b8bd2a8f5ea4df84a0101",0,"0xd8799fd8799fd87a9f581ce8b53932e4a43630bed3893f186fe9e8a8391f45d951af082a726e4bffd87a80ffa140a1401a006acfc0d87b9fd8799f1864581c69830961c6af9095b0f2648dff31fa9545d8f0b6623db865eb78fde8581c69830961c6af9095b0f2648dff31fa9545d8f0b6623db865eb78fde8ffffd87a80ff"]]],"0xd0786d92892d904ae16c775e85648c6cb669bd053bfed39c746c06ab"]]
== TOTAL COLLATERAL TxTotalCollateralNone
== RETURN COLLATERAL TxReturnCollateralNone
== FEE TxFeeExplicit ShelleyBasedEraConway (Coin 0)
== VALIDITY TxValidityNoLowerBound TxValidityUpperBound ShelleyBasedEraConway Nothing
== MINT/BURN 0 lovelace
== SCRIPTS (1) Total size (bytes): 2622
- Script (ScriptHash "e8b53932e4a43630bed3893f186fe9e8a8391f45d951af082a726e4b")
== DATUMS (1)
- "d1120c6cb2453a0cc4b6fc6dd19f6bea76b779a3399450560aff181fe5f658bd" [0,[100,"0x69830961c6af9095b0f2648dff31fa9545d8f0b6623db865eb78fde8","0x69830961c6af9095b0f2648dff31fa9545d8f0b6623db865eb78fde8"]]
== REDEEMERS (2)
- ConwaySpending (AsIx {unAsIx = 0}) ( cpu = 0, mem = 0 ) [1,[[0,"0x60d7f7796929f70379696147009478a6df0f1439d39b8bd2a8f5ea4df84a0101",0]]]
- ConwaySpending (AsIx {unAsIx = 1}) ( cpu = 0, mem = 0 ) [0,[]]
== REQUIRED SIGNERS
- "f8a68cd18e59a6ace848155a0e967af64f4d00cf8acee8adc95a6b0d"
== METADATA TxMetadataInEra ShelleyBasedEraConway (TxMetadata {unTxMetadata = fromList [(55555,TxMetaText "HydraV1/CommitTx")]})
```
- So it looks like the commit tx was assembled, and then, when we want to balance it, there is an error deconstructing the `Data` for redeemer pointer 1, this one:
- ConwaySpending (AsIx {unAsIx = 1}) ( cpu = 0, mem = 0 ) [0,[]]
- I made sure the vesting validator always returns true, but perhaps there is some bug in constructing the commit and mangling the redeemers, which is what we do there.
- What is weird is that I get `unBData` errors both when using datum hashes and when using inline datums.
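For illustration, here is a minimal self-contained model of how `unBData` fails on a non-bytestring `Data`. The `Data` type below is a hypothetical simplification, not the real Plutus type; the point is only that a redeemer decoding to a `Constr` (like the `[0,[]]` above) can never satisfy `unBData`:

```haskell
-- Simplified stand-in for Plutus' Data (NOT the real PlutusCore.Data).
data Data
  = Constr Integer [Data]
  | B String
  deriving (Show, Eq)

-- unBData only succeeds on the B (bytestring) constructor; anything
-- else is an error, which matches the failure shape described above.
unBData :: Data -> Either String String
unBData (B bs) = Right bs
unBData d = Left ("unBData: expected B, got " <> show d)

main :: IO ()
main = do
  print (unBData (B "deadbeef"))   -- Right "deadbeef"
  print (unBData (Constr 0 []))    -- Left "unBData: expected B, got Constr 0 []"
```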
- I'll quickly try replacing the Aiken datum with something simple and see if that solves the issue. I don't think it should, since I am already able to create a vesting output; it's only when I want to commit it to Hydra that all of this happens. Nope.
- Leaving this for now, since I don't want to waste time on it while we don't know for sure that time reporting has a bug. Revisit later if need be.
# January 2026
## 2026-01-27
### SB on tx validity ranges on L2
- We have a user that fails to submit an L2 tx because of errors in the tx validity range (they are spending a script UTxO on L2, and the script checks the validity range of the tx).
- TinyCat also reported a similar issue, which was resolved by providing an endpoint that returns the head initialization time. They then use this info to calculate
the tx validity range, which indicates some sort of bug in Hydra.
- I'll try to follow the code path and see what happens when we receive a `NewTx` client input:
  `NewTx` → `onOpenClientNewTx` → `handleClientInput` → `updateSyncedHead` → `update`, where the current slot comes from the `NodeInSync` message.
- So it seems that when we emit the `NodeInSync` message, the reported slot somehow depends on the head initialization.
- I can't tell where the bug is just from looking at the code. It is probably better to try to write a test that exercises script UTxO spending and observes the
slot behavior.
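For reference, the client-side workaround mentioned above amounts to something like the following sketch. All names here are assumptions, and the fixed 1-second slot length is a simplification; a real conversion must go through the network's system start and era history:

```haskell
-- Sketch: derive an L2 tx validity range from the head initialization
-- time (as the workaround endpoint enables). Assumes 1-second slots.
import Data.Time (UTCTime, diffUTCTime)

type SlotNo = Integer

-- Wall-clock time to slot, relative to an assumed system start,
-- with a fixed slot length of 1 second.
timeToSlot :: UTCTime -> UTCTime -> SlotNo
timeToSlot systemStart t = floor (diffUTCTime t systemStart)

-- A validity window of n slots starting at 'now'.
validityRange :: UTCTime -> UTCTime -> Integer -> (SlotNo, SlotNo)
validityRange systemStart now n =
  let lo = timeToSlot systemStart now
   in (lo, lo + n)

main :: IO ()
main = do
  let start = read "2026-01-01 00:00:00 UTC" :: UTCTime
      now = read "2026-01-01 00:01:40 UTC" :: UTCTime
  print (validityRange start now 60)  -- (100,160)
```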
### SB on unstable incremental commit processing
- I should look at this one as the most urgent thing https://github.com/cardano-scaling/hydra/issues/2446
- This happens in some stress tests when many `NewTx`s are sent while an incremental commit happens in between. The version mismatch then prevents new snapshots from being created, and the head is stuck.
- We bump the version upon observing the increment tx; if there is a `ReqTx` before this version bump and a `ReqSn` right after, then the snapshot version is out of line and the head waits forever to create a new snapshot.
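A minimal model of that mismatch (illustrative only, not the actual hydra-node logic) shows why the head waits forever once the versions diverge:

```haskell
-- Illustrative model of the stuck condition: a ReqSn carries the
-- version the leader saw when it decided the snapshot. If an
-- increment observation bumped the local version in between, the
-- check never passes and the node waits indefinitely.
data Outcome = Process | Wait deriving (Show, Eq)

onReqSn :: Integer -> Integer -> Outcome
onReqSn localVersion reqSnVersion
  | reqSnVersion == localVersion = Process
  | otherwise = Wait  -- version mismatch: the snapshot never forms

main :: IO ()
main = do
  print (onReqSn 0 0)  -- Process: no increment observed yet
  print (onReqSn 1 0)  -- Wait: ReqSn was decided before the version bump
```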
- We can of course fix this but the real question is what would be the most elegant and secure fix?
- I think I should reproduce this first locally with the instructions NS provided, which are:
  - `nix run .#demo`
  - commit on alice, nothing on bob and carol
  - run `hydra-txn-respender` for alice
  - increment on bob
  - wait/observe until the commit is picked up
  - head stalled; mismatch around `requestedSnapshot`
- I thought I could write a test in the `BehaviourSpec`, but we don't get to control network messages there. It would be useful to have a test suite capable of saying _when you see ReqTx, do this_.
- Instead I wrote an e2e test that spams L2 with txs while doing an incremental commit; my hope is to see a stuck snapshot.
- Indeed, we don't see any confirmed snapshot after `CommitFinalized`. I'll check the logs to see whether the tx spamming continues after the deposit, so that the next snapshot should be produced.
- It seems the spamming doesn't work; I see only one `NewTx`. I would need to chain txs for this to work out.
- There is a function `respendUTxO` I could use.
- I see... the function waits for `SnapshotConfirmed`, which is not what I want.
- I am able to spam the hydra-node with `NewTx` messages, but I also want to see some of them arrive after the deposit.
- Perhaps I need to do the incremental commit concurrently?
- Trying this, I still don't see a valid tx after the increment; I probably need to find a way to update the UTxO I am trying to spend. What happens if I use different keys for the new tx and the deposit? Surely I can spend part of the head UTxO? Hmm... I think when applying the txs
we take the complete UTxO into account, so the next tx after the deposit is invalid if I am spending just part of it?
- I will try to catch the exception that comes from waiting on `TxValid` and retry, to see if I get the updated snapshot UTxO info so I can spend it.
- I still can't get what I want, which is to keep sending a bunch of txs, do just one deposit in the middle, and assert that we reach a snapshot with version 1.
- Using `forkIO` to run the two actions in background threads produces better results. The test now fails while waiting to see a snapshot with version 1, but I do see `TxValid` messages after the deposit was made.
- In the logs I see `SnapshotConfirmed` just fine, but in the test the assertion fails...
- Switched things around, since I noticed that after the deposit `getSnapshotUTxO` returns the correct UTxO, but I can't get to `TxValid`.
- My problem seems to be that after a deposit I can't see `TxValid` again, even if I try to re-spend the head UTxO (so not the confirmed snapshot UTxO).
- I think I have spent enough time trying to reliably set up a test case to reproduce this issue; it seems I am not able to do it.
- This happens (I think) because of an async exception when waiting on `TxValid` after a new deposit.
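The `forkIO` test shape described above can be sketched roughly like this. The tx spamming and the deposit are simulated with plain values here; in the real test these would be hydra-node client interactions, so everything below the comments is a placeholder:

```haskell
-- Sketch: spam txs in a background thread while a deposit runs on the
-- main thread, then assert on the result. Simulated, not the real test.
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)

-- Returns (number of txs "sent", snapshot version after the deposit).
spamAndDeposit :: IO (Int, Int)
spamAndDeposit = do
  spamDone <- newEmptyMVar
  _ <- forkIO $ do
    -- Background thread: simulated burst of NewTx submissions.
    let sent = length [1 .. 10 :: Int]
    putMVar spamDone sent
  -- Main thread: simulated incremental commit running concurrently.
  let versionAfterDeposit = 1 :: Int
  sent <- takeMVar spamDone
  pure (sent, versionAfterDeposit)

main :: IO ()
main = spamAndDeposit >>= print  -- (10,1)
```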