Logbook 2026 H1 - cardano-scaling/hydra GitHub Wiki

February 2026

2026-02-25

SB on decommit race bug

The Bug: Wrong Snapshot Number After DecommitFinalized/CommitFinalized

Root Cause

The Hydra Head state tracks two different snapshot-related pieces of state:

  1. confirmedSnapshot - The last snapshot that was fully confirmed (all signatures collected, finalized)
  2. seenSnapshot - Tracks the current snapshot activity (what's being requested/signed)

These two can get out of sync when a chain event arrives before the off-chain snapshot consensus it relates to has completed.
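
For orientation, here is a simplified sketch of those two fields and the SeenSnapshot constructors referenced throughout this walkthrough (stand-in definitions; the real records in HeadLogic carry more fields, e.g. the full snapshot and the collected signatures):

```haskell
-- Simplified sketch, not the actual HeadLogic definitions: only the pieces of
-- state this walkthrough talks about.
import Numeric.Natural (Natural)

type SnapshotNumber = Natural

data SeenSnapshot
  = NoSeenSnapshot                                                -- nothing requested yet
  | LastSeenSnapshot {lastSeen :: SnapshotNumber}                 -- last number we finished with
  | RequestedSnapshot {lastSeen :: SnapshotNumber, requested :: SnapshotNumber}
  | SeenSnapshot {seenNumber :: SnapshotNumber}                   -- snapshot currently being signed

data CoordinatedHeadState = CoordinatedHeadState
  { confirmedSnapshotNumber :: SnapshotNumber -- number of the last fully confirmed snapshot
  , seenSnapshot :: SeenSnapshot              -- current snapshot activity
  -- (the real record also carries version, decommitTx, localTxs, pendingDeposits, ...)
  }
```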

The Scenario That Triggers The Bug

Let me walk through a concrete example:

Initial State:
  confirmedSnapshot.number = 0
  seenSnapshot = LastSeenSnapshot{lastSeen = 0}

Step 1: Alice initiates a decommit.
  ReqDec arrives → decommit stored in state

Step 2: Leader requests snapshot 1 with the decommit.
  NetworkEffect: ReqSn(version=0, number=1, decommit=...)
  State change: SnapshotRequestDecided{snapshotNumber=1}

After aggregate:
  confirmedSnapshot.number = 0 (unchanged)
  seenSnapshot = RequestedSnapshot{lastSeen=0, requested=1}

Step 3: All parties start signing snapshot 1.
  Each party receives ReqSn → sends AckSn

After processing ReqSn:
  confirmedSnapshot.number = 0 (still unchanged)
  seenSnapshot = SeenSnapshot{snapshot.number=1, signatures=...}

Step 4: Snapshot 1 confirms (all AckSn received). After SnapshotConfirmed:
  confirmedSnapshot.number = 1 ✅ Updated!
  seenSnapshot = LastSeenSnapshot{lastSeen=1}

Step 5: DecrementTx posted to chain.
  Effect: PostTxOnChain DecrementTx (posts the decommit transaction to L1)

Step 6: 🔥 THE RACE CONDITION 🔥

This is where the bug happens! There are two possible orderings:

Ordering A (Normal - No Bug):

  1. All AckSn messages arrive → SnapshotConfirmed
  2. DecommitFinalized observed on chain

Final state:
  confirmedSnapshot.number = 1
  seenSnapshot = LastSeenSnapshot{lastSeen=1}
  ✅ Everything in sync

Ordering B (Race - BUG!):

  1. DecommitFinalized observed on chain BEFORE AckSn completes
  2. Chain event arrives while still in RequestedSnapshot or SeenSnapshot state

Current state when DecommitFinalized arrives:
  confirmedSnapshot.number = 0 (snapshot not confirmed yet!)
  seenSnapshot = RequestedSnapshot{lastSeen=0, requested=1}

What happens in the DecommitFinalized aggregate:

```haskell
DecommitFinalized{chainState, newVersion} ->
  coordinatedHeadState =
    coordinatedHeadState
      { decommitTx = Nothing
      , version = newVersion
      , seenSnapshot = toLastSeenSnapshot (seenSnapshot coordinatedHeadState)
        -- 👆 This is the key function
      }
```

The toLastSeenSnapshot function (exists in BOTH master and this branch):

```haskell
toLastSeenSnapshot :: SeenSnapshot tx -> SeenSnapshot tx
toLastSeenSnapshot = \case
  RequestedSnapshot{requested} ->
    LastSeenSnapshot{lastSeen = requested} -- Uses 'requested', not 'lastSeen'!
                                           -- 👆 So requested=1 becomes lastSeen=1
  SeenSnapshot{snapshot = Snapshot{number}} ->
    LastSeenSnapshot{lastSeen = number}
  -- ... other cases
```

After DecommitFinalized aggregate:
  confirmedSnapshot.number = 0 ❌ Still at old value!
  seenSnapshot = LastSeenSnapshot{lastSeen=1} ✅ Updated to the snapshot that was being processed

🔥 seenSnapshot is AHEAD of confirmedSnapshot! 🔥

Step 7: New L2 transaction arrives

An L2 transaction arrives and triggers onOpenNetworkReqTx:

MASTER CODE (BUGGY):

```haskell
onOpenNetworkReqTx ... =
  waitApplyTx $ \newLocalUTxO ->
    newState TransactionAppliedToLocalUTxO{...}
      & maybeRequestSnapshot (confirmedSn + 1)
        -- 👆 Uses confirmedSn only!  = 0 + 1 = 1
```

```haskell
maybeRequestSnapshot nextSn outcome =
  if not (snapshotInFlight seenSnapshot nextSn) && isLeader ...
    then -- Emit ReqSn for snapshot 1
      cause (NetworkEffect $ ReqSn version 1 ...)
```

Step 8: ReqSn 1 is broadcast and received

When the ReqSn arrives back at the node, it goes through validation in onOpenNetworkReqSn:

```haskell
onOpenNetworkReqSn ... =
  requireReqSn $ \continue ->
    -- Validation checks ...
 where
  seenSn = seenSnapshotNumber seenSnapshot -- = 1

  requireReqSn continue
    | sn /= seenSn + 1 =
        Error $ RequireFailed $ ReqSnNumberInvalid{requestedSn = sn, lastSeenSn = seenSn}
        --      👆 1 /= 1 + 1  →  1 /= 2  →  TRUE  →  ERROR!
```

💥 BUG MANIFESTS:

  • Trying to request snapshot 1
  • But seenSn = 1 (from LastSeenSnapshot{lastSeen=1})
  • Validation expects sn = seenSn + 1 = 2
  • Error: ReqSnNumberInvalid{requestedSn=1, lastSeenSn=1}
  • The head can't request new snapshots → STUCK!

The Fix

THIS BRANCH CODE (FIXED):

```haskell
onOpenNetworkReqTx ... =
  waitApplyTx $ \newLocalUTxO ->
    newState TransactionAppliedToLocalUTxO{...}
      & maybeRequestSnapshot (max confirmedSn (latestSeenSnapshotNumber seenSnapshot) + 1)
        -- 👆 NEW: Use max of both!
```

New helper function:

```haskell
latestSeenSnapshotNumber :: SeenSnapshot tx -> SnapshotNumber
latestSeenSnapshotNumber = \case
  NoSeenSnapshot -> 0
  LastSeenSnapshot{lastSeen} -> lastSeen
  RequestedSnapshot{lastSeen} -> lastSeen -- Use lastSeen (confirmed), not requested!
  SeenSnapshot{snapshot = Snapshot{number}} -> number - 1 -- Snapshot N is being signed, confirmed is N-1
```

With the fix:
  confirmedSn = 0
  seenSn = latestSeenSnapshotNumber (LastSeenSnapshot{lastSeen=1}) = 1

nextSn = max(0, 1) + 1 = 2 ✅ CORRECT!

Step 9: ReqSn 2 is broadcast and validated.
  requireReqSn: sn = 2, seenSn = 1
  Check: 2 /= 1 + 1 → 2 /= 2 → FALSE ✅ Validation passes!
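
For a quick sanity check of that arithmetic, here is a standalone sketch of the old and new calculations (plain Ints as a stand-in for the real SnapshotNumber type):

```haskell
-- Standalone sketch, not HeadLogic code: the master and fixed next-snapshot-number
-- calculations side by side.
masterNextSn :: Int -> Int
masterNextSn confirmedSn = confirmedSn + 1

fixedNextSn :: Int -> Int -> Int
fixedNextSn confirmedSn seenSn = max confirmedSn seenSn + 1

-- After the DecommitFinalized race (confirmedSn = 0, seenSn = 1):
--   masterNextSn 0  == 1  -- rejected, since validation expects seenSn + 1 = 2
--   fixedNextSn 0 1 == 2  -- accepted
```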

Why This Fix Works

The fix works because it correctly handles all possible states:

| Scenario | confirmedSn | seenSn | nextSn (master) | nextSn (fixed) | Result |
| --- | --- | --- | --- | --- | --- |
| Normal flow | 0 | 0 | 0+1=1 ✅ | max(0,0)+1=1 ✅ | Both work |
| After DecommitFinalized race | 0 | 1 | 0+1=1 ❌ | max(0,1)+1=2 ✅ | Fixed! |
| Normal after confirm | 1 | 1 | 1+1=2 ✅ | max(1,1)+1=2 ✅ | Both work |

Related Changes

The same fix was applied everywhere snapshot numbers are calculated:

  1. onOpenNetworkReqTx (L2 transactions)
  2. onOpenNetworkAckSn (after snapshot confirms, request next if leader + has txs)
  3. onOpenChainTick (periodic snapshot requests for deposits)

Summary

The bug: After DecommitFinalized/CommitFinalized arrives via chain before off-chain snapshot consensus completes, seenSnapshot gets ahead of confirmedSnapshot, causing wrong snapshot number calculation.

The fix: Use max(confirmedSn, latestSeenSnapshotNumber(seenSnapshot)) + 1 to always calculate the correct next snapshot number regardless of which state is ahead.

Why it matters: Under heavy L2 load, snapshots complete rapidly and chain events can easily arrive out of order, making this race condition common rather than rare.


2026-02-12

SB on Decommit Stuck Head Analysis (DeltaDeFi bug)

Bug Summary

Under heavy L2 transaction load, a decommit can permanently get the Hydra Head stuck due to a race condition between the off-chain snapshot protocol and on-chain decrement observation.

Timeline (from hydra-node-alice-logs.txt)

| Timestamp | Log Line | Event | Key Data |
| --- | --- | --- | --- |
| 10:09:47.157 | 247117 | AckSn received → Snapshot confirmed | Effects: DecrementTx (on-chain) + ReqSn(sn=2790, v=3) |
| 10:09:47.412 | 247133 | Outcome processed | SnapshotConfirmed + DecommitApproved(txId=1a87c616...) + SnapshotRequestDecided(sn=2790) |
| 10:09:47.919 | 247188 | ReqSn sent on network | sn=2790, version=3, includes decommitTx + 8 L2 txIds |
| 10:09:47.919 | 247205 | DecommitFinalized chain event | newVersion=4 (bumped from 3) |
| 10:09:48.165 | 247253 | ReqSn received from network | sn=2790, version=3 |
| 10:09:48.165 | 247254 | ERROR: ReqSvNumberInvalid | requestedSv=3, lastSeenSv=4 |
| 10:09:48+ | 247265+ | Only TransactionAppliedToLocalUTxO events | No more snapshots -- HEAD STUCK |

Root Cause

The Race

  1. onOpenNetworkAckSn (HeadLogic.hs:607) confirms a snapshot. This triggers maybeRequestNextSnapshot (line 687) which creates SnapshotRequestDecided{sn=2790} and sends ReqSn(v=3, sn=2790).

  2. Almost simultaneously, the DecommitFinalized chain event is processed. The aggregate function (HeadLogic.hs:2003-2016) updates version from 3 to 4 and clears decommitTx.

  3. When ReqSn(v=3, sn=2790) is received (even by the sender itself), onOpenNetworkReqSn (HeadLogic.hs:461) checks sv /= version → 3 /= 4 → rejects with ReqSvNumberInvalid.

  4. The error is discarded with no state change. But seenSnapshot remains RequestedSnapshot{lastSeen=2789, requested=2790}.

Why No Recovery

  • snapshotInFlight (HeadLogic.hs:360-364) returns True for RequestedSnapshot state
  • maybeRequestSnapshot (line 331) and maybeRequestNextSnapshot (line 687) both check not snapshotInFlight
  • Since seenSnapshot is stuck in RequestedSnapshot, no new snapshot can ever be requested
  • L2 transactions keep arriving but pile up without ever being snapshotted
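
In sketch form, the guard that blocks recovery (stand-in types and names, not the exact HeadLogic.hs definitions):

```haskell
{-# LANGUAGE LambdaCase #-}

-- Illustrative sketch only: why no new snapshot request can happen once
-- seenSnapshot is stuck in RequestedSnapshot.
data SeenSnapshot
  = NoSeenSnapshot
  | LastSeenSnapshot Int        -- last confirmed snapshot number
  | RequestedSnapshot Int Int   -- lastSeen, requested
  | SeenSnapshot Int            -- snapshot currently being signed

snapshotInFlight :: SeenSnapshot -> Bool
snapshotInFlight = \case
  RequestedSnapshot{} -> True   -- the rejected ReqSn left us here and nothing clears it
  SeenSnapshot{} -> True
  _ -> False

-- Both maybeRequestSnapshot and maybeRequestNextSnapshot guard on this, so with
-- seenSnapshot == RequestedSnapshot 2789 2790 it stays False forever:
mayRequestNewSnapshot :: SeenSnapshot -> Bool
mayRequestNewSnapshot seen = not (snapshotInFlight seen)
```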

The Missing Piece

DecommitFinalized aggregate (HeadLogic.hs:2003-2016) bumps version but does NOT reset seenSnapshot:

DecommitFinalized{chainState, newVersion} ->
  case st of
    Open os@OpenState{coordinatedHeadState} ->
      Open os
        { chainState
        , coordinatedHeadState =
            coordinatedHeadState
              { decommitTx = Nothing
              , version = newVersion
              -- BUG: seenSnapshot is NOT reset here
              }
        }
    _otherState -> st

The same issue exists in CommitFinalized (HeadLogic.hs:1753-1781) for the increment path.

Fix

Reset seenSnapshot to LastSeenSnapshot when DecommitFinalized or CommitFinalized bump the version:

DecommitFinalized{chainState, newVersion} ->
  case st of
    Open os@OpenState{coordinatedHeadState} ->
      Open os
        { chainState
        , coordinatedHeadState =
            coordinatedHeadState
              { decommitTx = Nothing
              , version = newVersion
              , seenSnapshot =
                  LastSeenSnapshot
                    { lastSeen = seenSnapshotNumber (seenSnapshot coordinatedHeadState)
                    }
              }
        }
    _otherState -> st

Why This Is Safe

  1. Old AckSn messages: Dropped by waitOnSeenSnapshot (a no-op if sn <= lastSeen) or put into a wait. Old signatures won't verify against a new-version snapshot anyway.
  2. All parties see chain events: DecommitFinalized is observed by all nodes simultaneously, so all reset their state.
  3. Automatic recovery: Under load, the next ReqTx triggers maybeRequestSnapshot which now sees snapshotInFlight = False and re-requests with updated version.

2026-02-06

SB on DeltaDeFi bugs

Snapshot Stuck Analysis: SnapshotConfirmed stops appearing

Summary

After a certain point in the Alice node logs, SnapshotConfirmed events stop appearing entirely. The node enters an infinite loop of ReqSn followed by WaitOnDepositObserved, processing ~18,000 retries with no progress. The root cause is a stale currentDepositTxId in CoordinatedHeadState that references a deposit already recovered on-chain.

Timeline

| Time | Line | Event | Detail |
| --- | --- | --- | --- |
| 06:39 | 10625 | OnDepositTx | Deposit 5655641e observed on-chain |
| 06:39 | 10626 | DepositRecorded | Added to pendingDeposits |
| 07:14 | 11087 | DepositActivated | Became active, triggered ReqSn sn=21 deposit=5655641e |
| 07:14 | 11121 | CommitApproved | Snapshot sn=21 confirmed with this deposit |
| 07:47 | 11874 | DepositExpired | Deposit expired |
| 07:49-08:17 | 11944-12673 | PostTxOnChainFailed | Multiple failed RecoverTx attempts |
| 08:23 | 12863 | OnRecoverTx | Deposit recovered on-chain |
| 08:23 | 12864 | DepositRecovered | Removed from pendingDeposits |
| 08:28 | 12911 | ReqSn sn=22 | Still references deposit=5655641e -- stuck begins |
| 08:37 | 29453 | SideLoadSnapshot | State reset via side-load |
| 08:37 | 29456 | LocalStateCleared | Clears localTxs, allTxs, seenSnapshot only |
| 09:45 | 50047 | SideLoadSnapshot | Another side-load |
| 09:45 | 50050 | LocalStateCleared | Same partial clear |
| 09:46 | 50062 | SnapshotConfirmed | Last ever SnapshotConfirmed (from side-loaded replay) |
| 09:46 | 50089 | SnapshotRequestDecided sn=22 | Node decides to request snapshot again |
| 09:46 | 50094 | ReqSn sn=22 deposit=5655641e | Still references the recovered deposit |
| 09:46+ | 50097+ | WaitOnDepositObserved forever | Deposit not in pendingDeposits, infinite retry loop (18,003 times) |

Deposit Lifecycle Comparison

Three deposits were referenced in ReqSn messages during this session:

| Deposit | Recorded | Activated | CommitApproved | OnIncrementTx | CommitFinalized | Recovered |
| --- | --- | --- | --- | --- | --- | --- |
| 866fdf31 | 04:34 | yes | yes | 05:08 (v=1) | 05:09 | - |
| b34d4dbb | 05:14 | yes | yes | 05:48 (v=3) | 05:49 | - |
| 5655641e | 06:39 | 07:14 | 07:14 | never | never | 08:23 |

Deposit 5655641e was approved but the IncrementTx was never observed on-chain, so CommitFinalized never fired. The normal path to clear currentDepositTxId (via CommitFinalized setting it to Nothing) never executed. Meanwhile the deposit was recovered on-chain, removing it from pendingDeposits but leaving the stale reference in CoordinatedHeadState.currentDepositTxId.

Why OnIncrementTx Was Never Observed: Chain Rollback

The key question is: if CommitApproved happened (meaning the snapshot was confirmed with the deposit), why was the IncrementTx never observed on-chain?

The detailed sequence around the deposit reveals the answer:

| Time | Line | Event | Detail |
| --- | --- | --- | --- |
| 07:14:13.482 | 11103 | ReqSn sn=21 | Snapshot request with deposit=5655641e |
| 07:14:13.503 | 11104 | AckSn sn=21 | Acknowledged |
| 07:14:13.565 | 11109 | SnapshotRequested | Snapshot seen |
| 07:14:13.606 | 11116 | IncrementTx | IncrementTx posted on-chain |
| 07:14:13.631 | 11119 | PartySignedSnapshot | Signature collected |
| 07:14:13.643 | 11121 | SnapshotConfirmed | Snapshot sn=21 confirmed |
| 07:14:13.716 | 11131 | CommitApproved | Commit approved, more IncrementTx postings follow |
| 07:14:16.751 | 11142 | SnapshotConfirmed | Last confirmed snapshot in this window |
| ... | (ticks) | | Normal tick processing, deposit still active |
| 07:20:58.651 | 11272 | Rollback | Chain rolled back -- IncrementTx observation erased |
| 07:20:58.653 | 11273 | ChainRolledBack | Rollback applied to head state |
| ... | (ticks) | | No more SnapshotConfirmed or OnIncrementTx after this |
| 07:47:26.378 | 11874 | DepositExpired | Deposit expires before increment re-lands |

The IncrementTx was posted and initially appeared on-chain, but the chain rollback at 07:20 erased it. After the rollback:

  • OnIncrementTx was never re-observed for this deposit
  • The node does not re-post the IncrementTx after a rollback -- it only posts it once as a side-effect of SnapshotConfirmed via maybePostIncrementTx in onOpenNetworkAckSn
  • CommitFinalized never happened, so currentDepositTxId was never cleared
  • The deposit expired (07:47) and was eventually recovered on-chain (08:23)
  • But currentDepositTxId remained stale, poisoning all future ReqSn messages

Root Cause

Three issues in HeadLogic.hs:

1. LocalStateCleared does not reset currentDepositTxId or decommitTx

In the aggregate function (lines 1835-1857), the handler for LocalStateCleared resets localUTxO, localTxs, allTxs, and seenSnapshot, but leaves currentDepositTxId and decommitTx untouched:

LocalStateCleared{snapshotNumber} ->
    case st of
      Open os@OpenState{coordinatedHeadState = coordinatedHeadState@CoordinatedHeadState{confirmedSnapshot}} ->
        Open
          os
            { coordinatedHeadState =
                case confirmedSnapshot of
                  InitialSnapshot{initialUTxO} ->
                    coordinatedHeadState
                      { localUTxO = initialUTxO
                      , localTxs = mempty
                      , allTxs = mempty
                      , seenSnapshot = NoSeenSnapshot
                      }
                  ConfirmedSnapshot{snapshot = Snapshot{utxo}} ->
                    coordinatedHeadState
                      { localUTxO = utxo
                      , localTxs = mempty
                      , allTxs = mempty
                      , seenSnapshot = LastSeenSnapshot snapshotNumber
                      }
                      -- NOTE: currentDepositTxId and decommitTx are NOT cleared here
            }

After a SideLoadSnapshot triggers LocalStateCleared, the stale currentDepositTxId persists. The next time the node decides to request a snapshot (via onOpenNetworkReqTx -> maybeRequestSnapshot at line 337, or onOpenChainTick at line 987), it includes this stale deposit ID in the ReqSn.
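
A toy sketch (stand-in types, not the real CoordinatedHeadState) of what additionally clearing these fields on LocalStateCleared could look like; see fix 2 in the Fixes section below:

```haskell
-- Toy illustration of fix 2 below: on LocalStateCleared, also drop the stale
-- deposit/decommit references. Types and names here are stand-ins.
data LocalState = LocalState
  { localTxs :: [String]
  , seenSnapshotNumber :: Int
  , currentDepositTxId :: Maybe String
  , decommitTx :: Maybe String
  }

onLocalStateCleared :: Int -> LocalState -> LocalState
onLocalStateCleared snapshotNumber st =
  st
    { localTxs = mempty
    , seenSnapshotNumber = snapshotNumber
    , currentDepositTxId = Nothing -- the reset that is missing today
    , decommitTx = Nothing         -- likewise for a pending decommit
    }
```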

2. No validation that currentDepositTxId exists in pendingDeposits when building ReqSn

In maybeRequestSnapshot (line 337) and onOpenChainTick (line 987), the currentDepositTxId from CoordinatedHeadState is included in the outgoing ReqSn without verifying the deposit still exists in pendingDeposits:

```haskell
-- (truncated fragment of the outgoing ReqSn construction)
... maxTxsPerSnapshot localTxs') decommitTx currentDepositTxId)
```

When the receiving side processes this `ReqSn` in `onOpenNetworkReqSn` ->
`waitForDeposit` (line 489), it looks up the deposit in `pendingDeposits`:

```haskell
case Map.lookup depositTxId pendingDeposits of
  Nothing -> wait WaitOnDepositObserved{depositTxId}
```

Since the deposit was recovered (removed from `pendingDeposits`), this returns
`WaitOnDepositObserved` on every retry, creating an infinite loop.

## Log Statistics

| Event                   | Count  | Notes                              |
| ----------------------- | ------ | ---------------------------------- |
| `ReqSn`                 | 18,075 | Dominated by retries of sn=22      |
| `WaitOnDepositObserved` | 18,003 | All for deposit `5655641e`         |
| `SnapshotConfirmed`     | 59     | Last one at line 50062             |
| `DepositExpired`        | 1,800  | Deposit re-evaluated on every tick |
| `DepositActivated`      | 385    | Before expiration                  |
| `PostTxOnChainFailed`   | 36     | Failed `RecoverTx` attempts        |

### 3. No re-posting of `IncrementTx` after chain rollback

In `onOpenNetworkAckSn`, the `IncrementTx` is posted on-chain as a one-shot
side-effect of `SnapshotConfirmed` via `maybePostIncrementTx` (line 692). If a
chain rollback erases the `IncrementTx` observation, there is no mechanism to
re-post it. The `handleChainInput` handler for `Open` state + `Rollback` only
calls `ChainRolledBack` to update the chain state -- it does not check whether
a pending `IncrementTx` needs to be resubmitted.

This is the triggering cause in this incident: the rollback at 07:20 erased the
`IncrementTx`, leaving the node in a state where `CommitApproved` had happened
but `CommitFinalized` never would.
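
As a rough sketch of the condition described in fix 1 below (names are stand-ins; the real state is richer):

```haskell
-- Toy sketch of fix 1: after a rollback, re-post the IncrementTx when a deposit
-- is still referenced and the confirmed snapshot carries something to commit.
shouldRepostIncrementTx :: Maybe depositId -> Maybe utxo -> Bool
shouldRepostIncrementTx currentDepositTxId utxoToCommit =
  case (currentDepositTxId, utxoToCommit) of
    (Just _, Just _) -> True  -- CommitApproved happened but the observation was rolled back
    _ -> False
```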

## Fixes

1. **Re-post `IncrementTx` after chain rollback** -- when a rollback is
   observed and there is a `currentDepositTxId` set with a confirmed snapshot
containing `utxoToCommit`, the node should re-post the `IncrementTx`. This is
the primary fix that would have prevented the incident.

2. **Clear `currentDepositTxId` and `decommitTx` in `LocalStateCleared`
   aggregate handler** -- this prevents stale references from surviving a
side-load snapshot. Defense-in-depth.

3. **Clear `currentDepositTxId` when `DepositRecovered` is processed** -- in
   `aggregateNodeState`, the `DepositRecovered` handler removes the deposit
from `pendingDeposits` but does not clear `currentDepositTxId` in the
`CoordinatedHeadState` if it matches. This would break the infinite loop as a
fallback.

4. **Validate `currentDepositTxId` against `pendingDeposits` before including
   in `ReqSn`** -- if the deposit no longer exists, set it to `Nothing` in the
outgoing request. Another layer of defense.
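
A toy sketch of the guard from fix 4 (pendingDeposits as a plain Map; names are stand-ins):

```haskell
import qualified Data.Map.Strict as Map

-- Toy sketch of fix 4: only reference a deposit in an outgoing ReqSn if it is
-- still pending; a recovered/removed deposit is silently dropped.
depositForReqSn :: Ord txId => Maybe txId -> Map.Map txId deposit -> Maybe txId
depositForReqSn currentDepositTxId pendingDeposits =
  case currentDepositTxId of
    Just depositTxId
      | depositTxId `Map.member` pendingDeposits -> Just depositTxId
    _ -> Nothing
```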

## 2026-02-04

### SB on DeltaDeFi reported bugs

- Ok after some hurdles I finally got some useful logs.
- Seems the problem timeline is something like this: the deposit expires, but in the code we do not remove it from pendingDeposits; the deposit gets recovered; then a SideLoadSnapshot is done to unstick the head, but the node waits forever to see the expected deposit since there is a ReqSn with a depositTxId that was already recovered.
- After some time the node stops waiting on the deposit and the user submits a new tx, which leads to ReqTx but no ReqSn.
- And the old ReqSn is still in flight with the recovered depositTxId.
- I found one problematic line in the code - we insert into pendingDeposits on DepositExpired. I think we should keep track of recoverable deposits separately to avoid this problem.
- Off chain code/head logic is so convoluted and hard to follow when user
problems like this occur. I wish we could keep it more simple.
- I think one thing could be to not keep deposits in pendingDeposits just so that we can recover them. If the user hits the API to recover some deposit we should not keep it in pending deposits but try to recover using the provided txid. If the deposit was already spent then they get an error and that is it - pending deposits are reserved only for ones that are still pending, i.e. not expired.
- Right now I need to write a test that exposes this problem (needs to be on
top of 1.2.0 since that's what DeltaDeFi are using).
- Did this in BehaviorSpec where I issued a NewTx after a DepositExpired and
saw the problem where multiple ReqSn messages are sent with populated
depositTxId.
- Applied the fix that Delta-DeFi came up with to remove the depositTx from the coordinated head state after seeing a deposit expire - still not green because we have a snapshot in flight that should now be discarded when the deposit is expired.
- How to actually discard a snapshot that was already signed by participants? I
think something similar to SnapshotSideLoad should work out.
- Since it is tricky to capture all of the logs in BehaviorSpec I committed
just the failing test and want to reproduce the same error in e2e tests as
well. There I should be able to see more logging that is relevant.
- It might be tricky to write a test for DepositExpired - I see there are none
so far :sad-face
- We need to make sure we can run our e2e tests on different networks in order
to make sure all is working in the real world as well.
- It is hard to reproduce this bug in e2e. I need to have `SnapshotRequestDecided` triggered on a chain Tick (I think), since we also emit the same message in AckSn handling. All of this seems to be triggered on DepositActivated, so it is very hard to arrive at the same setup (need to think about it).
- So there is ReqSn and AckSn in the logs for snapshot number 21 which contains
depositTxId. Then CommitApproved (for another deposit) so it seems there are
multiple deposits flying around.
- This will be very hard to reproduce since I need to have a snapshot in flight
when DepositExpired message is seen.
- Perhaps I could add some code for the node to ignore a certain snapshot in order to reproduce this?
- Seems like it is not that easy...
- Jackal mentioned they do IC -> ID -> IC -> ID -> IC and it fails, so perhaps tomorrow I try that order (and try to run it on preprod since devnet is so fast)
- Oh, another idea: slow the devnet down by fiddling with the configuration?

## 2026-02-03

### SB on Time on L2

- We have two users reporting bugs with time validity on L2 transactions.
- One user might just be affected by this because of an offline Hydra node, while another one is claiming the slots on L2 are not correct and therefore his contract fails when doing an L2 tx.
- I wanted to include the aiken vesting contract (since that's what one of the users used) into our testing toolchain but ran into problems when trying to commit the script UTxO into a Head:

ScriptFailedInWallet {redeemerPtr = "ConwaySpending (AsIx {unAsIx = 1})", failureReason = "ValidationFailure (WrapExUnits {unWrapExUnits = ExUnits' {exUnitsMem' = 0, exUnitsSteps' = 0}}) (CekError An error has occurred:
The machine terminated because of an error, either from a built-in function or from an explicit use of 'error'.
Caused by: unBData
(Constr 0
[ Constr 0
[ List
[ Constr 0
[ Constr 0
[ B #44d2f08229825ba3a3e260ecd2dcc9ac0af4da6462bcf5f435d0b0ccca90dbce
, I 1 ]
, Constr 0
[ Constr 0
[ Constr 1
[ B #c8a101a5c8ac4816b0dceb59ce31fc2258e387de828f02961d2f2045 ]
, Constr 1 [] ]
, Map
[ (B "))

- This is what the commit tx looks like before going into the wallet:

"0ac5466e63ef9527dfff23f485c3d1a10cc10d04f3e3f38275a2676602b9e18b"

== INPUTS (2)

  • 44d2f08229825ba3a3e260ecd2dcc9ac0af4da6462bcf5f435d0b0ccca90dbce#1 ShelleyAddress Testnet (ScriptHashObj (ScriptHash "c8a101a5c8ac4816b0dceb59ce31fc2258e387de828f02961d2f2045")) StakeRefNull 1293000 lovelace 1 d0786d92892d904ae16c775e85648c6cb669bd053bfed39c746c06ab.f8a68cd18e59a6ace848155a0e967af64f4d00cf8acee8adc95a6b0d TxOutDatumInline "0xd0786d92892d904ae16c775e85648c6cb669bd053bfed39c746c06ab" ReferenceScriptNone
  • 60d7f7796929f70379696147009478a6df0f1439d39b8bd2a8f5ea4df84a0101#0 ShelleyAddress Testnet (ScriptHashObj (ScriptHash "e8b53932e4a43630bed3893f186fe9e8a8391f45d951af082a726e4b")) StakeRefNull 7000000 lovelace TxOutDatumInline [0,[100,"0x69830961c6af9095b0f2648dff31fa9545d8f0b6623db865eb78fde8","0x69830961c6af9095b0f2648dff31fa9545d8f0b6623db865eb78fde8"]] ReferenceScriptNone

== COLLATERAL INPUTS (0)

== REFERENCE INPUTS (1)

  • df4967ec1d8358b1a4372d73880e85ed0cfa5d10428849fe93c276fa629ea781#0 ShelleyAddress Testnet (ScriptHashObj (ScriptHash "6a09cb22defaf4a96a6be1ef6c07467ac9923d1750a79214a06c503a")) StakeRefNull 12352460 lovelace TxOutDatumNone ReferenceScript PlutusScriptLanguage PlutusScriptV3 "c8a101a5c8ac4816b0dceb59ce31fc2258e387de828f02961d2f2045"

== OUTPUTS (1) Total number of assets: 2

  • ShelleyAddress Testnet (ScriptHashObj (ScriptHash "61458bc2f297fff3cc5df6ac7ab57cefd87763b0b7bd722146a1035c")) StakeRefNull 8293000 lovelace 1 d0786d92892d904ae16c775e85648c6cb669bd053bfed39c746c06ab.f8a68cd18e59a6ace848155a0e967af64f4d00cf8acee8adc95a6b0d TxOutDatumInline [0,["0xd5bf4a3fcce717b0388bcc2749ebc148ad9969b23f45ee1b605fd58778576ac4",[0,[[0,"0x60d7f7796929f70379696147009478a6df0f1439d39b8bd2a8f5ea4df84a0101",0,"0xd8799fd8799fd87a9f581ce8b53932e4a43630bed3893f186fe9e8a8391f45d951af082a726e4bffd87a80ffa140a1401a006acfc0d87b9fd8799f1864581c69830961c6af9095b0f2648dff31fa9545d8f0b6623db865eb78fde8581c69830961c6af9095b0f2648dff31fa9545d8f0b6623db865eb78fde8ffffd87a80ff"]]],"0xd0786d92892d904ae16c775e85648c6cb669bd053bfed39c746c06ab"]]

== TOTAL COLLATERAL TxTotalCollateralNone

== RETURN COLLATERAL TxReturnCollateralNone

== FEE TxFeeExplicit ShelleyBasedEraConway (Coin 0)

== VALIDITY TxValidityNoLowerBound TxValidityUpperBound ShelleyBasedEraConway Nothing

== MINT/BURN 0 lovelace

== SCRIPTS (1) Total size (bytes): 2622

  • Script (ScriptHash "e8b53932e4a43630bed3893f186fe9e8a8391f45d951af082a726e4b")

== DATUMS (1)

  • "d1120c6cb2453a0cc4b6fc6dd19f6bea76b779a3399450560aff181fe5f658bd" [0,[100,"0x69830961c6af9095b0f2648dff31fa9545d8f0b6623db865eb78fde8","0x69830961c6af9095b0f2648dff31fa9545d8f0b6623db865eb78fde8"]]

== REDEEMERS (2)

== REQUIRED SIGNERS

  • "f8a68cd18e59a6ace848155a0e967af64f4d00cf8acee8adc95a6b0d"

== METADATA TxMetadataInEra ShelleyBasedEraConway (TxMetadata {unTxMetadata = fromList [(55555,TxMetaText "HydraV1/CommitTx")]})

- So it looks like the commit tx was assembled and then, when we want to balance it, there is an error deconstructing the Data for redeemer pointer 1, this one:
  • ConwaySpending (AsIx {unAsIx = 1}) ( cpu = 0, mem = 0 ) [0,[]]
- I made sure the vesting validator always returns true, but perhaps there is some bug in constructing the commit and mangling the redeemers, which is what we do there.
- What is weird is that I get unBData errors both when using datum hashes and when using inline datums.
- I'll try quickly to replace the aiken datum with something simple and see if
that solves the issue. I don't think it should since I am already able to
create a vesting output and now when I want to commit it to Hydra all of this
happens. Nope.
- Leaving this for now since I don't want to waste time doing this when we
don't know for sure time reporting has a bug. Revisit later if need be.


# January 2026 

## 2026-01-27

### SB on tx validity ranges on L2

- We have a user that fails to submit an L2 tx because of errors in the tx validity (they are spending a script utxo on L2 and the script checks the validity range of the tx)
- We also have TinyCat reporting a similar issue, which was resolved by providing an endpoint that returns the head initialization time. He then uses this info to calculate the tx validity, which indicates some sort of bug in Hydra.
- I'll try to follow the code path and see what happens when we receive a `NewTx` client input:
  

on NewTx calls onOpenClientNewTx
onOpenClientNewTx is called from handleClientInput
handleClientInput is called from updateSyncedHead
updateSyncedHead is called from update, where the current slot comes from the NodeInSync message

So it seems like when we emit the NodeInSync message the reported slot somehow depends on the head initialization.
- I can't tell just from looking at the code where the bug is. It is probably better to try and write a test that exercises script utxo spending and observe the 
slot behavior.


### SB on unstable incremental commit processing 

- I should look at this one as the most urgent thing https://github.com/cardano-scaling/hydra/issues/2446 
- This happens in some stress tests when there are many `NewTx`s sent out while doing an incremental commit in between. Then a version mismatch prevents new snapshots from being created and the Head is stuck.
- We bump the version upon observing the increment tx, and if there is a `ReqTx` before this version bump and a `ReqSn` right after, then the snapshot version is out of line and the head waits forever to create a new snapshot.
- We can of course fix this but the real question is what would be the most elegant and secure fix?
- I think I should reproduce this first locally with the instructions NS provided which are:

nix run .#demo
commit on alice, nothing on bob, carol
run hydra-txn-respender for alice
increment on bob
waitobserve picked up commit
head stalled; mismatch around requestedSnapshot

- I thought I could write a test in the BehaviourSpec but we don't get to control network messages there. It would be useful to have a test suite capable of saying _when you see ReqTx do this_.
- Instead I wrote an e2e test that spams the L2 with txs, and when doing an incremental commit my hope is to see a stuck snapshot.
- Indeed we don't see any confirmed snapshot after `CommitFinalized`. I'll check the logs to see if spamming of txs continues after the deposit so that the next snapshot should be produced.
- Seems like spamming doesn't work, I see only one NewTx. I would need to chain txs for this to work out.
- There is a function `respendUTxO` I could use.
- I see...the function waits for `SnapshotConfirmed` which is not what I want.
- I am able to spam the hydra-node with `NewTx` messages, but what I also want is to see some of them after the deposit.
- Perhaps I need to do an incremental commit concurrently?
- Trying this I still don't see a valid tx after the increment, probably need to find a way to update the utxo I am trying to spend. What happens if I use different keys for new tx and deposit? Surely I can spend part of the head UTxO? Hmm...I think when applying the txs 
we take into account the complete UTxO so the next tx after deposit is invalid if I am spending just part of it?
- I will try to catch the exception that comes from waiting on `TxValid` and retry to see if I get the updated snapshot utxo info to spend it.
- I still can't get to what I want which is to keep sending bunch of txs and in the middle do just one deposit and assert we get to a snapshot with version 1.
- Using `forkIO` to run two actions in background threads produces better results. The test now fails while waiting to see a snapshot with version 1, but I see `TxValid` messages after the deposit was made.
- In the logs I see `SnapshotConfirmed` just fine but in the test the assertion fails...
- Switched things around since I noticed after the deposit `getSnapshotUTxO` returns the correct utxo but I can't get to `TxValid`.
- My problem seems to be that after a deposit I can't get to see `TxValid` again even if I try to re-spend the Head UTxO (so not confirmed snapshot utxo).
- I think I have spent enough time trying to reliably setup a test case to reproduce this issue but it seems I am not able to do it.
- This happens (I think) because of an async exception happening when waiting on TxValid after a new deposit.