Logbook 2026 H1 - cardano-scaling/hydra GitHub Wiki
# February 2026

## 2026-02-25

### SB on decommit race bug
## The Bug: Wrong Snapshot Number After DecommitFinalized/CommitFinalized

### Root Cause

The Hydra Head tracks two snapshot-related pieces of state:

- `confirmedSnapshot` - the last snapshot that was fully confirmed (all signatures collected, finalized)
- `seenSnapshot` - the current snapshot activity (what is being requested/signed)

These two can get out of sync when chain events arrive at a different time than off-chain consensus completes.
### The Scenario That Triggers The Bug

Let me walk through a concrete example.

**Initial state:** `confirmedSnapshot.number = 0`, `seenSnapshot = LastSeenSnapshot{lastSeen = 0}`

**Step 1:** Alice initiates a decommit. `ReqDec` arrives → the decommit is stored in state.

**Step 2:** The leader requests snapshot 1 with the decommit. Network effect: `ReqSn(version=0, number=1, decommit=...)`; state change: `SnapshotRequestDecided{snapshotNumber=1}`. After `aggregate`: `confirmedSnapshot.number = 0` (unchanged), `seenSnapshot = RequestedSnapshot{lastSeen=0, requested=1}`.

**Step 3:** All parties start signing snapshot 1: each party receives `ReqSn` → sends `AckSn`. After processing `ReqSn`: `confirmedSnapshot.number = 0` (still unchanged), `seenSnapshot = SeenSnapshot{snapshot.number=1, signatures=...}`.

**Step 4:** Snapshot 1 confirms (all `AckSn` received). After `SnapshotConfirmed`: `confirmedSnapshot.number = 1` ✅ updated, `seenSnapshot = LastSeenSnapshot{lastSeen=1}`.

**Step 5:** `DecrementTx` posted to chain. Effect: `PostTxOnChain DecrementTx` (posts the decommit transaction to L1).
**Step 6:** 🔥 The race condition 🔥

This is where the bug happens. There are two possible orderings:

**Ordering A (normal - no bug):**

- All `AckSn` messages arrive → `SnapshotConfirmed`
- `DecommitFinalized` observed on chain

Final state: `confirmedSnapshot.number = 1`, `seenSnapshot = LastSeenSnapshot{lastSeen=1}` ✅ everything in sync.

**Ordering B (race - bug!):**

- `DecommitFinalized` observed on chain BEFORE `AckSn` completes
- The chain event arrives while still in the `RequestedSnapshot` or `SeenSnapshot` state

Current state when `DecommitFinalized` arrives: `confirmedSnapshot.number = 0` (snapshot not confirmed yet!), `seenSnapshot = RequestedSnapshot{lastSeen=0, requested=1}`.

What happens in the `DecommitFinalized` aggregate:
```haskell
DecommitFinalized{chainState, newVersion} ->
  coordinatedHeadState =
    coordinatedHeadState
      { decommitTx = Nothing
      , version = newVersion
      , seenSnapshot = toLastSeenSnapshot (seenSnapshot coordinatedHeadState)
        -- 👆 this is the key function
      }
```
The `toLastSeenSnapshot` function (exists in BOTH master and this branch):

```haskell
toLastSeenSnapshot :: SeenSnapshot tx -> SeenSnapshot tx
toLastSeenSnapshot = \case
  RequestedSnapshot{requested} ->
    LastSeenSnapshot{lastSeen = requested}
    -- 👆 uses 'requested', not 'lastSeen': requested=1 becomes lastSeen=1
  SeenSnapshot{snapshot = Snapshot{number}} ->
    LastSeenSnapshot{lastSeen = number}
  -- ... other cases
```
After the `DecommitFinalized` aggregate: `confirmedSnapshot.number = 0` ❌ still at the old value, `seenSnapshot = LastSeenSnapshot{lastSeen=1}` ✅ updated to the snapshot that was being processed.

🔥 `seenSnapshot` is AHEAD of `confirmedSnapshot`! 🔥
**Step 7:** A new L2 transaction arrives and triggers `onOpenNetworkReqTx`.

Master code (buggy):

```haskell
onOpenNetworkReqTx ... =
  waitApplyTx $ \newLocalUTxO ->
    newState TransactionAppliedToLocalUTxO{...}
      & maybeRequestSnapshot (confirmedSn + 1)
        -- 👆 uses confirmedSn only! = 0 + 1 = 1

maybeRequestSnapshot nextSn outcome =
  if not (snapshotInFlight seenSnapshot nextSn) && isLeader ...
    then -- emit ReqSn for snapshot 1
      cause (NetworkEffect $ ReqSn version 1 ...)
```
**Step 8:** `ReqSn 1` is broadcast and received.

When the `ReqSn` arrives back at the node, it goes through validation in `onOpenNetworkReqSn`:

```haskell
onOpenNetworkReqSn ... =
  requireReqSn $ \continue ->
    -- validation checks, where seenSn = seenSnapshotNumber seenSnapshot = 1

requireReqSn continue
  | sn /= seenSn + 1 =
      Error $ RequireFailed $ ReqSnNumberInvalid{requestedSn = sn, lastSeenSn = seenSn}
      -- 👆 1 /= 1 + 1 → 1 /= 2 → TRUE → error!
```
💥 The bug manifests:

- Trying to request snapshot 1
- But `seenSn = 1` (from `LastSeenSnapshot{lastSeen=1}`)
- Validation expects `sn = seenSn + 1 = 2`
- Error: `ReqSnNumberInvalid{requestedSn=1, lastSeenSn=1}`
- The head can't request new snapshots → STUCK!
### The Fix

This branch's code (fixed):

```haskell
onOpenNetworkReqTx ... =
  waitApplyTx $ \newLocalUTxO ->
    newState TransactionAppliedToLocalUTxO{...}
      & maybeRequestSnapshot (max confirmedSn (latestSeenSnapshotNumber seenSnapshot) + 1)
        -- 👆 NEW: use the max of both!
```

New helper function:

```haskell
latestSeenSnapshotNumber :: SeenSnapshot tx -> SnapshotNumber
latestSeenSnapshotNumber = \case
  NoSeenSnapshot -> 0
  LastSeenSnapshot{lastSeen} -> lastSeen
  RequestedSnapshot{lastSeen} -> lastSeen -- use lastSeen (confirmed), not requested!
  SeenSnapshot{snapshot = Snapshot{number}} -> number - 1 -- snapshot N is being signed, confirmed is N-1
```
With the fix: `confirmedSn = 0`, `seenSn = latestSeenSnapshotNumber (LastSeenSnapshot{lastSeen=1}) = 1`, so `nextSn = max(0, 1) + 1 = 2` ✅ correct!

**Step 9:** `ReqSn 2` is broadcast and validated. In `requireReqSn`: `sn = 2`, `seenSn = 1`; check `2 /= 1 + 1` → `2 /= 2` → `FALSE` ✅ validation passes!
### Why This Fix Works

The fix works because it correctly handles all possible states:

| Scenario | confirmedSn | seenSn | nextSn (master) | nextSn (fixed) | Result |
|---|---|---|---|---|---|
| Normal flow | 0 | 0 | 0+1=1 ✅ | max(0,0)+1=1 ✅ | Both work |
| After DecommitFinalized race | 0 | 1 | 0+1=1 ❌ | max(0,1)+1=2 ✅ | Fixed! |
| Normal after confirm | 1 | 1 | 1+1=2 ✅ | max(1,1)+1=2 ✅ | Both work |

### Related Changes
The same fix was applied everywhere snapshot numbers are calculated:
- `onOpenNetworkReqTx` (L2 transactions)
- `onOpenNetworkAckSn` (after a snapshot confirms, request the next one if leader + has txs)
- `onOpenChainTick` (periodic snapshot requests for deposits)
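The comparison above can be reproduced with a small standalone model. This is a sketch with simplified, hypothetical types (the real `SeenSnapshot` in `Hydra.HeadLogic` is parameterised over `tx` and carries more fields); `nextSnMaster` and `nextSnFixed` are made-up names for the two calculations:

```haskell
{-# LANGUAGE LambdaCase #-}
{-# LANGUAGE NamedFieldPuns #-}

-- Standalone model of the next-snapshot-number calculation.
-- Names mirror HeadLogic, but this is a simplified sketch, not the real code.
module Main where

type SnapshotNumber = Int

data SeenSnapshot
  = NoSeenSnapshot
  | LastSeenSnapshot {lastSeen :: SnapshotNumber}
  | RequestedSnapshot {lastSeen :: SnapshotNumber, requested :: SnapshotNumber}
  | SeenSnapshot {number :: SnapshotNumber} -- snapshot currently being signed
  deriving (Show)

latestSeenSnapshotNumber :: SeenSnapshot -> SnapshotNumber
latestSeenSnapshotNumber = \case
  NoSeenSnapshot -> 0
  LastSeenSnapshot{lastSeen} -> lastSeen
  RequestedSnapshot{lastSeen} -> lastSeen -- lastSeen (confirmed), not requested
  SeenSnapshot{number} -> number - 1 -- N is being signed, so N-1 is confirmed

-- Master behaviour: looks at confirmedSn only.
nextSnMaster :: SnapshotNumber -> SeenSnapshot -> SnapshotNumber
nextSnMaster confirmedSn _ = confirmedSn + 1

-- Fixed behaviour: take the max of both sources of truth.
nextSnFixed :: SnapshotNumber -> SeenSnapshot -> SnapshotNumber
nextSnFixed confirmedSn seen =
  max confirmedSn (latestSeenSnapshotNumber seen) + 1

main :: IO ()
main = do
  -- After the DecommitFinalized race: confirmedSn = 0, lastSeen = 1.
  print (nextSnMaster 0 (LastSeenSnapshot 1)) -- 1, rejected (seenSn + 1 = 2 expected)
  print (nextSnFixed 0 (LastSeenSnapshot 1)) -- 2, accepted
```

Running `main` shows the race row of the table: the master calculation yields 1 (which the `requireReqSn` check rejects), the fixed one yields 2.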
### Summary

**The bug:** After `DecommitFinalized`/`CommitFinalized` arrives via chain before off-chain snapshot consensus completes, `seenSnapshot` gets ahead of `confirmedSnapshot`, causing a wrong snapshot number calculation.

**The fix:** Use `max confirmedSn (latestSeenSnapshotNumber seenSnapshot) + 1` to always calculate the correct next snapshot number regardless of which state is ahead.
**Why it matters:** Under heavy L2 load, snapshots complete rapidly and chain events can easily arrive out of order, making this race condition common rather than rare.
## 2026-02-12

### SB on Decommit Stuck Head Analysis (DeltaDeFi bug)

## Bug Summary

Under heavy L2 transaction load, a decommit can permanently stall the Hydra Head due to a race condition between the off-chain snapshot protocol and on-chain decrement observation.

## Timeline (from hydra-node-alice-logs.txt)
| Timestamp | Log Line | Event | Key Data |
|---|---|---|---|
| 10:09:47.157 | 247117 | AckSn received → Snapshot confirmed | Effects: DecrementTx (on-chain) + ReqSn(sn=2790, v=3) |
| 10:09:47.412 | 247133 | Outcome processed | SnapshotConfirmed + DecommitApproved(txId=1a87c616...) + SnapshotRequestDecided(sn=2790) |
| 10:09:47.919 | 247188 | ReqSn sent on network | sn=2790, version=3, includes decommitTx + 8 L2 txIds |
| 10:09:47.919 | 247205 | DecommitFinalized chain event | newVersion=4 (bumped from 3) |
| 10:09:48.165 | 247253 | ReqSn received from network | sn=2790, version=3 |
| 10:09:48.165 | 247254 | ERROR: ReqSvNumberInvalid | requestedSv=3, lastSeenSv=4 |
| 10:09:48+ | 247265+ | Only TransactionAppliedToLocalUTxO events | No more snapshots — HEAD STUCK |
## Root Cause

### The Race

1. `onOpenNetworkAckSn` (HeadLogic.hs:607) confirms a snapshot. This triggers `maybeRequestNextSnapshot` (line 687), which creates `SnapshotRequestDecided{sn=2790}` and sends `ReqSn(v=3, sn=2790)`.
2. Almost simultaneously, the `DecommitFinalized` chain event is processed. The `aggregate` function (HeadLogic.hs:2003-2016) updates `version` from 3 to 4 and clears `decommitTx`.
3. When `ReqSn(v=3, sn=2790)` is received (even by the sender itself), `onOpenNetworkReqSn` (HeadLogic.hs:461) checks `sv /= version` → `3 /= 4` → rejects with `ReqSvNumberInvalid`.
4. The error is discarded with no state change, but `seenSnapshot` remains `RequestedSnapshot{lastSeen=2789, requested=2790}`.
### Why No Recovery

- `snapshotInFlight` (HeadLogic.hs:360-364) returns `True` for the `RequestedSnapshot` state
- `maybeRequestSnapshot` (line 331) and `maybeRequestNextSnapshot` (line 687) both check `not snapshotInFlight`
- Since `seenSnapshot` is stuck in `RequestedSnapshot`, no new snapshot can ever be requested
- L2 transactions keep arriving but pile up without ever being snapshotted
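The deadlock can be seen in a tiny model of this guard (a simplification: the real `snapshotInFlight` also takes the candidate snapshot number, and the leader check has more inputs; names here are illustrative):

```haskell
{-# LANGUAGE LambdaCase #-}

-- Simplified model: while seenSnapshot stays in RequestedSnapshot,
-- the leader's guard never allows a new ReqSn to be emitted.
module Main where

data SeenSnapshot
  = NoSeenSnapshot
  | LastSeenSnapshot Int -- lastSeen
  | RequestedSnapshot Int Int -- lastSeen, requested
  | SeenSnapshot Int -- number being signed
  deriving (Show)

snapshotInFlight :: SeenSnapshot -> Bool
snapshotInFlight = \case
  NoSeenSnapshot -> False
  LastSeenSnapshot _ -> False
  RequestedSnapshot _ _ -> True
  SeenSnapshot _ -> True

-- The guard applied by maybeRequestSnapshot / maybeRequestNextSnapshot.
mayRequestSnapshot :: Bool -> SeenSnapshot -> Bool
mayRequestSnapshot isLeader seen = isLeader && not (snapshotInFlight seen)

main :: IO ()
main = do
  -- Stuck: ReqSn 2790 was rejected, but the state still says "requested".
  print (mayRequestSnapshot True (RequestedSnapshot 2789 2790)) -- False, forever
  print (mayRequestSnapshot True (LastSeenSnapshot 2790)) -- True, after a reset
```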
### The Missing Piece

The `DecommitFinalized` aggregate (HeadLogic.hs:2003-2016) bumps the version but does NOT reset `seenSnapshot`:
```haskell
DecommitFinalized{chainState, newVersion} ->
  case st of
    Open os@OpenState{coordinatedHeadState} ->
      Open
        os
          { chainState
          , coordinatedHeadState =
              coordinatedHeadState
                { decommitTx = Nothing
                , version = newVersion
                -- BUG: seenSnapshot is NOT reset here
                }
          }
    _otherState -> st
```
The same issue exists in `CommitFinalized` (HeadLogic.hs:1753-1781) for the increment path.

## Fix

Reset `seenSnapshot` to `LastSeenSnapshot` when `DecommitFinalized` or `CommitFinalized` bumps the version:
```haskell
DecommitFinalized{chainState, newVersion} ->
  case st of
    Open os@OpenState{coordinatedHeadState} ->
      Open
        os
          { chainState
          , coordinatedHeadState =
              coordinatedHeadState
                { decommitTx = Nothing
                , version = newVersion
                , seenSnapshot =
                    LastSeenSnapshot
                      { lastSeen = seenSnapshotNumber (seenSnapshot coordinatedHeadState)
                      }
                }
          }
    _otherState -> st
```
### Why This Is Safe

- **Old AckSn messages:** dropped by `waitOnSeenSnapshot` (no-op if `sn <= lastSeen`) or put in wait. Old signatures won't verify against a new-version snapshot anyway.
- **All parties see chain events:** `DecommitFinalized` is observed by all nodes simultaneously, so all reset their state.
- **Automatic recovery:** under load, the next `ReqTx` triggers `maybeRequestSnapshot`, which now sees `snapshotInFlight = False` and re-requests with the updated version.
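The reset can be sketched end-to-end with hypothetical simplified types (here `seenSnapshotNumber` is assumed to return the requested number for an in-flight snapshot, matching the fix above):

```haskell
{-# LANGUAGE LambdaCase #-}

-- Sketch: DecommitFinalized bumps the version AND resets seenSnapshot,
-- so the snapshot-in-flight guard opens up again. Simplified types.
module Main where

data SeenSnapshot
  = NoSeenSnapshot
  | LastSeenSnapshot Int -- lastSeen
  | RequestedSnapshot Int Int -- lastSeen, requested
  deriving (Show, Eq)

data CoordState = CoordState
  { version :: Int
  , seenSnapshot :: SeenSnapshot
  }
  deriving (Show, Eq)

seenSnapshotNumber :: SeenSnapshot -> Int
seenSnapshotNumber = \case
  NoSeenSnapshot -> 0
  LastSeenSnapshot n -> n
  RequestedSnapshot _ requested -> requested

snapshotInFlight :: SeenSnapshot -> Bool
snapshotInFlight = \case
  RequestedSnapshot _ _ -> True
  _ -> False

-- Fixed aggregate step for DecommitFinalized.
onDecommitFinalized :: Int -> CoordState -> CoordState
onDecommitFinalized newVersion st =
  st
    { version = newVersion
    , seenSnapshot = LastSeenSnapshot (seenSnapshotNumber (seenSnapshot st))
    }

main :: IO ()
main = do
  let stuck = CoordState 3 (RequestedSnapshot 2789 2790)
      after = onDecommitFinalized 4 stuck
  print after
  print (snapshotInFlight (seenSnapshot after)) -- False: head can progress again
```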
## 2026-02-06

### SB on DeltaDeFi bugs

## Snapshot Stuck Analysis: `SnapshotConfirmed` stops appearing
## Summary

After a certain point in the Alice node logs, `SnapshotConfirmed` events stop appearing entirely. The node enters an infinite loop of `ReqSn` followed by `WaitOnDepositObserved`, processing ~18,000 retries with no progress. The root cause is a stale `currentDepositTxId` in `CoordinatedHeadState` that references a deposit already recovered on-chain.

## Timeline
| Time | Line | Event | Detail |
|---|---|---|---|
| 06:39 | 10625 | `OnDepositTx` | Deposit `5655641e` observed on-chain |
| 06:39 | 10626 | `DepositRecorded` | Added to `pendingDeposits` |
| 07:14 | 11087 | `DepositActivated` | Became active, triggered `ReqSn sn=21 deposit=5655641e` |
| 07:14 | 11121 | `CommitApproved` | Snapshot sn=21 confirmed with this deposit |
| 07:47 | 11874 | `DepositExpired` | Deposit expired |
| 07:49-08:17 | 11944-12673 | `PostTxOnChainFailed` | Multiple failed `RecoverTx` attempts |
| 08:23 | 12863 | `OnRecoverTx` | Deposit recovered on-chain |
| 08:23 | 12864 | `DepositRecovered` | Removed from `pendingDeposits` |
| 08:28 | 12911 | `ReqSn sn=22` | Still references deposit=`5655641e` -- stuck begins |
| 08:37 | 29453 | `SideLoadSnapshot` | State reset via side-load |
| 08:37 | 29456 | `LocalStateCleared` | Clears localTxs, allTxs, seenSnapshot only |
| 09:45 | 50047 | `SideLoadSnapshot` | Another side-load |
| 09:45 | 50050 | `LocalStateCleared` | Same partial clear |
| 09:46 | 50062 | `SnapshotConfirmed` | Last ever SnapshotConfirmed (from side-loaded replay) |
| 09:46 | 50089 | `SnapshotRequestDecided sn=22` | Node decides to request snapshot again |
| 09:46 | 50094 | `ReqSn sn=22 deposit=5655641e` | Still references the recovered deposit |
| 09:46+ | 50097+ | `WaitOnDepositObserved` forever | Deposit not in `pendingDeposits`, infinite retry loop (18,003 times) |
## Deposit Lifecycle Comparison
Three deposits were referenced in ReqSn messages during this session:
| Deposit | Recorded | Activated | CommitApproved | OnIncrementTx | CommitFinalized | Recovered |
|---|---|---|---|---|---|---|
| `866fdf31` | 04:34 | yes | yes | 05:08 (v=1) | 05:09 | - |
| `b34d4dbb` | 05:14 | yes | yes | 05:48 (v=3) | 05:49 | - |
| `5655641e` | 06:39 | 07:14 | 07:14 | never | never | 08:23 |
Deposit `5655641e` was approved, but the `IncrementTx` was never observed on-chain, so `CommitFinalized` never fired. The normal path to clear `currentDepositTxId` (via `CommitFinalized` setting it to `Nothing`) never executed. Meanwhile the deposit was recovered on-chain, removing it from `pendingDeposits` but leaving the stale reference in `CoordinatedHeadState.currentDepositTxId`.
## Why `OnIncrementTx` Was Never Observed: Chain Rollback

The key question is: if `CommitApproved` happened (meaning the snapshot was confirmed with the deposit), why was the `IncrementTx` never observed on-chain? The detailed sequence around the deposit reveals the answer:
| Time | Line | Event | Detail |
|---|---|---|---|
| 07:14:13.482 | 11103 | `ReqSn sn=21` | Snapshot request with deposit=`5655641e` |
| 07:14:13.503 | 11104 | `AckSn sn=21` | Acknowledged |
| 07:14:13.565 | 11109 | `SnapshotRequested` | Snapshot seen |
| 07:14:13.606 | 11116 | `IncrementTx` | IncrementTx posted on-chain |
| 07:14:13.631 | 11119 | `PartySignedSnapshot` | Signature collected |
| 07:14:13.643 | 11121 | `SnapshotConfirmed` | Snapshot sn=21 confirmed |
| 07:14:13.716 | 11131 | `CommitApproved` | Commit approved, more IncrementTx postings follow |
| 07:14:16.751 | 11142 | `SnapshotConfirmed` | Last confirmed snapshot in this window |
| ... | (ticks) | Normal tick processing, deposit still active | |
| 07:20:58.651 | 11272 | `Rollback` | Chain rolled back -- IncrementTx observation erased |
| 07:20:58.653 | 11273 | `ChainRolledBack` | Rollback applied to head state |
| ... | (ticks) | No more SnapshotConfirmed or OnIncrementTx after this | |
| 07:47:26.378 | 11874 | `DepositExpired` | Deposit expires before increment re-lands |
The `IncrementTx` was posted and initially appeared on-chain, but the chain rollback at 07:20 erased it. After the rollback:

- `OnIncrementTx` was never re-observed for this deposit
- The node does not re-post the `IncrementTx` after a rollback -- it only posts it once as a side-effect of `SnapshotConfirmed` via `maybePostIncrementTx` in `onOpenNetworkAckSn`
- `CommitFinalized` never happened, so `currentDepositTxId` was never cleared
- The deposit expired (07:47) and was eventually recovered on-chain (08:23)
- But `currentDepositTxId` remained stale, poisoning all future `ReqSn` messages
## Root Cause

Three issues in `HeadLogic.hs`:

### 1. `LocalStateCleared` does not reset `currentDepositTxId` or `decommitTx`

In the `aggregate` function (lines 1835-1857), the handler for `LocalStateCleared` resets `localUTxO`, `localTxs`, `allTxs`, and `seenSnapshot`, but leaves `currentDepositTxId` and `decommitTx` untouched:
```haskell
LocalStateCleared{snapshotNumber} ->
  case st of
    Open os@OpenState{coordinatedHeadState = coordinatedHeadState@CoordinatedHeadState{confirmedSnapshot}} ->
      Open
        os
          { coordinatedHeadState =
              case confirmedSnapshot of
                InitialSnapshot{initialUTxO} ->
                  coordinatedHeadState
                    { localUTxO = initialUTxO
                    , localTxs = mempty
                    , allTxs = mempty
                    , seenSnapshot = NoSeenSnapshot
                    }
                ConfirmedSnapshot{snapshot = Snapshot{utxo}} ->
                  coordinatedHeadState
                    { localUTxO = utxo
                    , localTxs = mempty
                    , allTxs = mempty
                    , seenSnapshot = LastSeenSnapshot snapshotNumber
                    }
                -- NOTE: currentDepositTxId and decommitTx are NOT cleared here
          }
```
After a `SideLoadSnapshot` triggers `LocalStateCleared`, the stale `currentDepositTxId` persists. The next time the node decides to request a snapshot (via `onOpenNetworkReqTx` -> `maybeRequestSnapshot` at line 337, or `onOpenChainTick` at line 987), it includes this stale deposit ID in the `ReqSn`.
### 2. No validation that `currentDepositTxId` exists in `pendingDeposits` when building `ReqSn`

In `maybeRequestSnapshot` (line 337) and `onOpenChainTick` (line 987), the `currentDepositTxId` from `CoordinatedHeadState` is included in the outgoing `ReqSn` without verifying the deposit still exists in `pendingDeposits`:

```haskell
maxTxsPerSnapshot localTxs') decommitTx currentDepositTxId)
```

When the receiving side processes this `ReqSn` in `onOpenNetworkReqSn` -> `waitForDeposit` (line 489), it looks up the deposit in `pendingDeposits`:

```haskell
case Map.lookup depositTxId pendingDeposits of
  Nothing -> wait WaitOnDepositObserved{depositTxId}
```

Since the deposit was recovered (removed from `pendingDeposits`), this returns `WaitOnDepositObserved` on every retry, creating an infinite loop.
## Log Statistics
| Event | Count | Notes |
| ----------------------- | ------ | ---------------------------------- |
| `ReqSn` | 18,075 | Dominated by retries of sn=22 |
| `WaitOnDepositObserved` | 18,003 | All for deposit `5655641e` |
| `SnapshotConfirmed` | 59 | Last one at line 50062 |
| `DepositExpired` | 1,800 | Deposit re-evaluated on every tick |
| `DepositActivated` | 385 | Before expiration |
| `PostTxOnChainFailed` | 36 | Failed `RecoverTx` attempts |
### 3. No re-posting of `IncrementTx` after chain rollback
In `onOpenNetworkAckSn`, the `IncrementTx` is posted on-chain as a one-shot
side-effect of `SnapshotConfirmed` via `maybePostIncrementTx` (line 692). If a
chain rollback erases the `IncrementTx` observation, there is no mechanism to
re-post it. The `handleChainInput` handler for `Open` state + `Rollback` only
calls `ChainRolledBack` to update the chain state -- it does not check whether
a pending `IncrementTx` needs to be resubmitted.
This is the triggering cause in this incident: the rollback at 07:20 erased the
`IncrementTx`, leaving the node in a state where `CommitApproved` had happened
but `CommitFinalized` never would.
## Fixes
1. **Re-post `IncrementTx` after chain rollback** -- when a rollback is
observed and there is a `currentDepositTxId` set with a confirmed snapshot
containing `utxoToCommit`, the node should re-post the `IncrementTx`. This is
the primary fix that would have prevented the incident.
2. **Clear `currentDepositTxId` and `decommitTx` in `LocalStateCleared`
aggregate handler** -- this prevents stale references from surviving a
side-load snapshot. Defense-in-depth.
3. **Clear `currentDepositTxId` when `DepositRecovered` is processed** -- in
`aggregateNodeState`, the `DepositRecovered` handler removes the deposit
from `pendingDeposits` but does not clear `currentDepositTxId` in the
`CoordinatedHeadState` if it matches. This would break the infinite loop as a
fallback.
4. **Validate `currentDepositTxId` against `pendingDeposits` before including
in `ReqSn`** -- if the deposit no longer exists, set it to `Nothing` in the
outgoing request. Another layer of defense.
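Fixes 3 and 4 can be sketched as pure helpers over simplified state (hypothetical names and types, not the actual Hydra code; the real `pendingDeposits` lives in the head/chain state):

```haskell
-- Sketch of fixes 3 and 4: clear or validate the stale deposit reference.
module Main where

import qualified Data.Map.Strict as Map

type TxId = String

-- Fix 3: when a deposit is recovered, also clear a matching currentDepositTxId.
onDepositRecovered :: TxId -> Maybe TxId -> Maybe TxId
onDepositRecovered recovered current
  | current == Just recovered = Nothing
  | otherwise = current

-- Fix 4: only include the deposit in an outgoing ReqSn if it is still pending.
depositForReqSn :: Maybe TxId -> Map.Map TxId utxo -> Maybe TxId
depositForReqSn current pendingDeposits =
  current >>= \d ->
    if Map.member d pendingDeposits then Just d else Nothing

main :: IO ()
main = do
  -- The incident state: deposit 5655641e was recovered but still referenced.
  let pending = Map.empty :: Map.Map TxId ()
  print (onDepositRecovered "5655641e" (Just "5655641e")) -- Nothing
  print (depositForReqSn (Just "5655641e") pending) -- Nothing: no poisoned ReqSn
```

Either helper alone would have broken the 18,000-retry loop; together they make the stale-reference state unrepresentable in outgoing `ReqSn` messages.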
## 2026-02-04
### SB on DeltaDeFi reported bugs
- Ok after some hurdles I finally got some useful logs.
- Seems the problem timeline is something like this: a deposit expires; in the
code we do not remove it from `pendingDeposits`; the deposit is recovered; then a
`SideLoadSnapshot` occurs in order to unstick the head, but the node waits forever
to see the expected deposit, since there is a `ReqSn` with a `depositTxId` which
we already recovered.
- After some time the node stops waiting on deposit and user does a new tx
which leads to ReqTx but no ReqSn.
- And the old ReqSn is still in flight with the recovered depositTxId.
- I found one line in code which is problematic - we insert into pending
deposits on DepositExpired. We should keep track of recoverable deposits
separately I think to avoid this problem.
- Off chain code/head logic is so convoluted and hard to follow when user
problems like this occur. I wish we could keep it more simple.
- I think one option could be to not keep deposits in pending deposits just so
that we can recover them. If the user hits the API to recover some deposit, we
should not keep it in pending deposits but try to recover using the provided
txid. If the deposit was already spent then they get an error and that is it -
pending deposits are reserved only for ones that are still pending, so not expired.
- Right now I need to write a test that exposes this problem (needs to be on
top of 1.2.0 since that's what DeltaDeFi are using).
- Did this in BehaviorSpec where I issued a NewTx after a DepositExpired and
saw the problem where multiple ReqSn messages are sent with populated
depositTxId.
- Applied the fix that DeltaDeFi came up with to remove `depositTx` from the
coordinated head state after seeing a deposit expire - still not green, because
we have a snapshot in flight that should now be discarded when the deposit is
expired.
- How to actually discard a snapshot that was already signed by participants? I
think something similar to SnapshotSideLoad should work out.
- Since it is tricky to capture all of the logs in BehaviorSpec I committed
just the failing test and want to reproduce the same error in e2e tests as
well. There I should be able to see more logging that is relevant.
- It might be tricky to write a test for DepositExpired - I see there are none
so far :sad-face
- We need to make sure we can run our e2e tests on different networks in order
to make sure all is working in the real world as well.
- It is hard to reproduce this bug on e2e. I need to have
`SnapshotRequestDecided` but on chain Tick (I think) since we also emit the
same message on AckSn handling. All of this is triggered on DepositActivated it
seems so very hard to come to the same setup (need to think about it).
- So there is ReqSn and AckSn in the logs for snapshot number 21 which contains
depositTxId. Then CommitApproved (for another deposit) so it seems there are
multiple deposits flying around.
- This will be very hard to reproduce since I need to have a snapshot in flight
when DepositExpired message is seen.
- Perhaps I could add some code for the node to ignore certain snapshot in
order to reproduce this?
- Seems like it is not that easy...
- Jackal mention they do IC -> ID -> IC -> ID -> IC fails, so perhaps tomorrow I
try that order (and try to run it on preprod since devnet is so fast)
- Oh, another idea: slow the devnet down by fiddling with the configuration?
## 2026-02-03

### SB on Time on L2
- We have two users reporting bugs with time validity on L2 transactions.
- One user might just be affected by this because of an offline Hydra node,
while the other claims the slots on L2 are not correct and therefore their
contract fails when doing an L2 tx.
- I wanted to include the Aiken vesting contract (since that's what one of the
users used) in our testing toolchain, but I'm running into problems when trying
to commit the script UTxO into a Head:
```
ScriptFailedInWallet {redeemerPtr = "ConwaySpending (AsIx {unAsIx = 1})", failureReason = "ValidationFailure (WrapExUnits {unWrapExUnits = ExUnits' {exUnitsMem' = 0, exUnitsSteps' = 0}}) (CekError An error has occurred:
The machine terminated because of an error, either from a built-in function or from an explicit use of 'error'.
Caused by: unBData
(Constr 0
  [ Constr 0
    [ List
      [ Constr 0
        [ Constr 0
          [ B #44d2f08229825ba3a3e260ecd2dcc9ac0af4da6462bcf5f435d0b0ccca90dbce
          , I 1 ]
        , Constr 0
          [ Constr 0
            [ Constr 1
              [ B #c8a101a5c8ac4816b0dceb59ce31fc2258e387de828f02961d2f2045 ]
            , Constr 1 [] ]
          , Map
            [ (B "))
```
- This is what the commit tx looks like before going into the wallet:
```
"0ac5466e63ef9527dfff23f485c3d1a10cc10d04f3e3f38275a2676602b9e18b"
== INPUTS (2)
- 44d2f08229825ba3a3e260ecd2dcc9ac0af4da6462bcf5f435d0b0ccca90dbce#1 ShelleyAddress Testnet (ScriptHashObj (ScriptHash "c8a101a5c8ac4816b0dceb59ce31fc2258e387de828f02961d2f2045")) StakeRefNull 1293000 lovelace 1 d0786d92892d904ae16c775e85648c6cb669bd053bfed39c746c06ab.f8a68cd18e59a6ace848155a0e967af64f4d00cf8acee8adc95a6b0d TxOutDatumInline "0xd0786d92892d904ae16c775e85648c6cb669bd053bfed39c746c06ab" ReferenceScriptNone
- 60d7f7796929f70379696147009478a6df0f1439d39b8bd2a8f5ea4df84a0101#0 ShelleyAddress Testnet (ScriptHashObj (ScriptHash "e8b53932e4a43630bed3893f186fe9e8a8391f45d951af082a726e4b")) StakeRefNull 7000000 lovelace TxOutDatumInline [0,[100,"0x69830961c6af9095b0f2648dff31fa9545d8f0b6623db865eb78fde8","0x69830961c6af9095b0f2648dff31fa9545d8f0b6623db865eb78fde8"]] ReferenceScriptNone
== COLLATERAL INPUTS (0)
== REFERENCE INPUTS (1)
- df4967ec1d8358b1a4372d73880e85ed0cfa5d10428849fe93c276fa629ea781#0 ShelleyAddress Testnet (ScriptHashObj (ScriptHash "6a09cb22defaf4a96a6be1ef6c07467ac9923d1750a79214a06c503a")) StakeRefNull 12352460 lovelace TxOutDatumNone ReferenceScript PlutusScriptLanguage PlutusScriptV3 "c8a101a5c8ac4816b0dceb59ce31fc2258e387de828f02961d2f2045"
== OUTPUTS (1) Total number of assets: 2
- ShelleyAddress Testnet (ScriptHashObj (ScriptHash "61458bc2f297fff3cc5df6ac7ab57cefd87763b0b7bd722146a1035c")) StakeRefNull 8293000 lovelace 1 d0786d92892d904ae16c775e85648c6cb669bd053bfed39c746c06ab.f8a68cd18e59a6ace848155a0e967af64f4d00cf8acee8adc95a6b0d TxOutDatumInline [0,["0xd5bf4a3fcce717b0388bcc2749ebc148ad9969b23f45ee1b605fd58778576ac4",[0,[[0,"0x60d7f7796929f70379696147009478a6df0f1439d39b8bd2a8f5ea4df84a0101",0,"0xd8799fd8799fd87a9f581ce8b53932e4a43630bed3893f186fe9e8a8391f45d951af082a726e4bffd87a80ffa140a1401a006acfc0d87b9fd8799f1864581c69830961c6af9095b0f2648dff31fa9545d8f0b6623db865eb78fde8581c69830961c6af9095b0f2648dff31fa9545d8f0b6623db865eb78fde8ffffd87a80ff"]]],"0xd0786d92892d904ae16c775e85648c6cb669bd053bfed39c746c06ab"]]
== TOTAL COLLATERAL TxTotalCollateralNone
== RETURN COLLATERAL TxReturnCollateralNone
== FEE TxFeeExplicit ShelleyBasedEraConway (Coin 0)
== VALIDITY TxValidityNoLowerBound TxValidityUpperBound ShelleyBasedEraConway Nothing
== MINT/BURN 0 lovelace
== SCRIPTS (1) Total size (bytes): 2622
- Script (ScriptHash "e8b53932e4a43630bed3893f186fe9e8a8391f45d951af082a726e4b")
== DATUMS (1)
- "d1120c6cb2453a0cc4b6fc6dd19f6bea76b779a3399450560aff181fe5f658bd" [0,[100,"0x69830961c6af9095b0f2648dff31fa9545d8f0b6623db865eb78fde8","0x69830961c6af9095b0f2648dff31fa9545d8f0b6623db865eb78fde8"]]
== REDEEMERS (2)
- ConwaySpending (AsIx {unAsIx = 0}) ( cpu = 0, mem = 0 ) [1,[[0,"0x60d7f7796929f70379696147009478a6df0f1439d39b8bd2a8f5ea4df84a0101",0]]]
- ConwaySpending (AsIx {unAsIx = 1}) ( cpu = 0, mem = 0 ) [0,[]]
== REQUIRED SIGNERS
- "f8a68cd18e59a6ace848155a0e967af64f4d00cf8acee8adc95a6b0d"
== METADATA TxMetadataInEra ShelleyBasedEraConway (TxMetadata {unTxMetadata = fromList [(55555,TxMetaText "HydraV1/CommitTx")]})
```
- So it looks like the commit tx was assembled, and then, when we want to balance it, there is an error deconstructing the `Data` for redeemer pointer 1, this one:
- ConwaySpending (AsIx {unAsIx = 1}) ( cpu = 0, mem = 0 ) [0,[]]
- I made sure the vesting validator always returns true, but perhaps there is some bug in constructing the commit and mangling the redeemers, which is what we do there.
- What is weird is that I get `unBData` errors both when using datum hashes and when using inline datums.
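For illustration, here is a minimal self-contained model of how `unBData` fails on a non-bytestring `Data`. The `Data` type below is a hypothetical simplification, not the real Plutus type; the point is only that a redeemer decoding to a `Constr` (like the `[0,[]]` above) can never satisfy `unBData`:

```haskell
-- Simplified stand-in for Plutus' Data (NOT the real PlutusCore.Data).
data Data
  = Constr Integer [Data]
  | B String
  deriving (Show, Eq)

-- unBData only succeeds on the B (bytestring) constructor; anything
-- else is an error, which matches the failure shape described above.
unBData :: Data -> Either String String
unBData (B bs) = Right bs
unBData d = Left ("unBData: expected B, got " <> show d)

main :: IO ()
main = do
  print (unBData (B "deadbeef"))   -- Right "deadbeef"
  print (unBData (Constr 0 []))    -- Left "unBData: expected B, got Constr 0 []"
```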
- I'll quickly try replacing the Aiken datum with something simple and see if that solves the issue. I don't think it should, since I am already able to create a vesting output; it's only when I want to commit it to Hydra that all of this happens. Nope.
- Leaving this for now, since I don't want to waste time on it while we don't know for sure that time reporting has a bug. Revisit later if need be.
# January 2026
## 2026-01-27
### SB on tx validity ranges on L2
- We have a user that fails to submit an L2 tx because of errors in the tx validity range (they are spending a script UTxO on L2, and the script checks the validity range of the tx).
- TinyCat also reported a similar issue, which was resolved by providing an endpoint that returns the head initialization time. They then use this info to calculate
the tx validity range, which indicates some sort of bug in Hydra.
- I'll try to follow the code path and see what happens when we receive a `NewTx` client input:
  `NewTx` → `onOpenClientNewTx` → `handleClientInput` → `updateSyncedHead` → `update`, where the current slot comes from the `NodeInSync` message.
- So it seems that when we emit the `NodeInSync` message, the reported slot somehow depends on the head initialization.
- I can't tell where the bug is just from looking at the code. It is probably better to try to write a test that exercises script UTxO spending and observes the
slot behavior.
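For reference, the client-side workaround mentioned above amounts to something like the following sketch. All names here are assumptions, and the fixed 1-second slot length is a simplification; a real conversion must go through the network's system start and era history:

```haskell
-- Sketch: derive an L2 tx validity range from the head initialization
-- time (as the workaround endpoint enables). Assumes 1-second slots.
import Data.Time (UTCTime, diffUTCTime)

type SlotNo = Integer

-- Wall-clock time to slot, relative to an assumed system start,
-- with a fixed slot length of 1 second.
timeToSlot :: UTCTime -> UTCTime -> SlotNo
timeToSlot systemStart t = floor (diffUTCTime t systemStart)

-- A validity window of n slots starting at 'now'.
validityRange :: UTCTime -> UTCTime -> Integer -> (SlotNo, SlotNo)
validityRange systemStart now n =
  let lo = timeToSlot systemStart now
   in (lo, lo + n)

main :: IO ()
main = do
  let start = read "2026-01-01 00:00:00 UTC" :: UTCTime
      now = read "2026-01-01 00:01:40 UTC" :: UTCTime
  print (validityRange start now 60)  -- (100,160)
```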
### SB on unstable incremental commit processing
- I should look at this one as the most urgent thing https://github.com/cardano-scaling/hydra/issues/2446
- This happens in some stress tests when many `NewTx`s are sent while an incremental commit happens in between. The version mismatch then prevents new snapshots from being created, and the head is stuck.
- We bump the version upon observing the increment tx; if there is a `ReqTx` before this version bump and a `ReqSn` right after, then the snapshot version is out of line and the head waits forever to create a new snapshot.
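A minimal model of that mismatch (illustrative only, not the actual hydra-node logic) shows why the head waits forever once the versions diverge:

```haskell
-- Illustrative model of the stuck condition: a ReqSn carries the
-- version the leader saw when it decided the snapshot. If an
-- increment observation bumped the local version in between, the
-- check never passes and the node waits indefinitely.
data Outcome = Process | Wait deriving (Show, Eq)

onReqSn :: Integer -> Integer -> Outcome
onReqSn localVersion reqSnVersion
  | reqSnVersion == localVersion = Process
  | otherwise = Wait  -- version mismatch: the snapshot never forms

main :: IO ()
main = do
  print (onReqSn 0 0)  -- Process: no increment observed yet
  print (onReqSn 1 0)  -- Wait: ReqSn was decided before the version bump
```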
- We can of course fix this but the real question is what would be the most elegant and secure fix?
- I think I should reproduce this first locally with the instructions NS provided, which are:
  - `nix run .#demo`
  - commit on alice, nothing on bob and carol
  - run `hydra-txn-respender` for alice
  - increment on bob
  - wait/observe until the commit is picked up
  - head stalled; mismatch around `requestedSnapshot`
- I thought I could write a test in the `BehaviourSpec`, but we don't get to control network messages there. It would be useful to have a test suite capable of saying _when you see ReqTx, do this_.
- Instead I wrote an e2e test that spams L2 with txs while doing an incremental commit; my hope is to see a stuck snapshot.
- Indeed, we don't see any confirmed snapshot after `CommitFinalized`. I'll check the logs to see whether the tx spamming continues after the deposit, so that the next snapshot should be produced.
- It seems the spamming doesn't work; I see only one `NewTx`. I would need to chain txs for this to work out.
- There is a function `respendUTxO` I could use.
- I see... the function waits for `SnapshotConfirmed`, which is not what I want.
- I am able to spam the hydra-node with `NewTx` messages, but I also want to see some of them arrive after the deposit.
- Perhaps I need to do the incremental commit concurrently?
- Trying this, I still don't see a valid tx after the increment; I probably need to find a way to update the UTxO I am trying to spend. What happens if I use different keys for the new tx and the deposit? Surely I can spend part of the head UTxO? Hmm... I think when applying the txs
we take the complete UTxO into account, so the next tx after the deposit is invalid if I am spending just part of it?
- I will try to catch the exception that comes from waiting on `TxValid` and retry, to see if I get the updated snapshot UTxO info so I can spend it.
- I still can't get what I want, which is to keep sending a bunch of txs, do just one deposit in the middle, and assert that we reach a snapshot with version 1.
- Using `forkIO` to run the two actions in background threads produces better results. The test now fails while waiting to see a snapshot with version 1, but I do see `TxValid` messages after the deposit was made.
- In the logs I see `SnapshotConfirmed` just fine, but in the test the assertion fails...
- Switched things around, since I noticed that after the deposit `getSnapshotUTxO` returns the correct UTxO, but I can't get to `TxValid`.
- My problem seems to be that after a deposit I can't see `TxValid` again, even if I try to re-spend the head UTxO (so not the confirmed snapshot UTxO).
- I think I have spent enough time trying to reliably set up a test case to reproduce this issue; it seems I am not able to do it.
- This happens (I think) because of an async exception when waiting on `TxValid` after a new deposit.
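The `forkIO` test shape described above can be sketched roughly like this. The tx spamming and the deposit are simulated with plain values here; in the real test these would be hydra-node client interactions, so everything below the comments is a placeholder:

```haskell
-- Sketch: spam txs in a background thread while a deposit runs on the
-- main thread, then assert on the result. Simulated, not the real test.
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)

-- Returns (number of txs "sent", snapshot version after the deposit).
spamAndDeposit :: IO (Int, Int)
spamAndDeposit = do
  spamDone <- newEmptyMVar
  _ <- forkIO $ do
    -- Background thread: simulated burst of NewTx submissions.
    let sent = length [1 .. 10 :: Int]
    putMVar spamDone sent
  -- Main thread: simulated incremental commit running concurrently.
  let versionAfterDeposit = 1 :: Int
  sent <- takeMVar spamDone
  pure (sent, versionAfterDeposit)

main :: IO ()
main = spamAndDeposit >>= print  -- (10,1)
```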