Logbook 2025 H1 - cardano-scaling/hydra GitHub Wiki

April 2025

2025-04-29

SB on hydra-node using blockfrost

  • I am at the point where I wrote a test in the DirectChainSpec to open, close and fanout a head using blockfrost.
  • The test is not entirely using blockfrost since we also rely on a cardano-node on preview, e.g. for querying UTxO.
  • The problem is that sometimes the faucet UTxO comes back empty, which is weird since I see at least one UTxO with a large amount of ada and also some NFTs.
  • Why does queryUTxO fail to return the correct UTxO? This problem feels unrelated to what I am trying to solve, but I definitely need a green test, otherwise I can't tell whether the chain-following logic for blockfrost works.
  • Maybe I should test this in an e2e scenario instead and see how things behave there? Is it possible that the cardano-node is not completely in sync before the test is run?
  • One thing I noticed - I am using blockfrost with the faucet key to publish scripts at the beginning, but I am not waiting for the local cardano-node to see these transactions!
  • Then when I query the faucet UTxO I either see an empty UTxO or get a BadInputs error, so I think I definitely need to wait for the script-publishing transactions.
  • This is far from optimal - perhaps I need to create equivalent functions that work with the blockfrost api instead?
  • After re-mapping all needed functions to blockfrost versions and not using the local cardano-node for anything, I still get the BadInputs error.. hmm. At least I see the correct UTxO picked up so I'll work my way from there. This is probably some logic in the UTxO seeding...
  • After adding a blockfrost equivalent of awaitForTransaction I am still in the same place - which makes sense: the problem is not the produced output but the transaction itself.
  • I pretty-printed the faucet UTxO and the tx and I don't see anything weird.
  • Important note: we get a valid tx when building, but Blockfrost returns an error when submitting!
  • Decided to find just a single UTxO responsible for seeding from the faucet, just to reduce the clutter, but the error is the same:
Faucet UTxO: 815e52d1ee#0 ↦ 54829439 lovelace
62fb023528#1 ↦ 9261435223 lovelace
70e6c21881#1 ↦ 129433058019 lovelace + 1 13d1f7feab83ff4db444bf96b8677949c5bf9c709671f30ff8f33ab3.487964726120446f6f6d202d2033726420506c6163652054726f706879 + 1 19c98d04cdb6e1e782a73e693697d4a46ca9820d5d490a3bf6470a07.487964726120446f6f6d202d20326e6420506c6163652054726f706879 + 1 1a22028742629f3cf38b3d1036a088fea59eb30237a675420fb25c11.2331 + 1 6d92350897706b14832c62c5b5644e918f0b6b3b63ffc00a1a463828.487964726120446f6f6d202d2031737420506c6163652054726f706879 + 1 ad39d849181dc206488fd726240c00b55547153ffdca8c079e1e34d9.487964726120446f6f6d202d2031737420506c6163652054726f706879 + 1 bfe4ab531fd625ef33ea355fd85953eb944bffa401af767666ff411c.487964726120446f6f6d202d2031737420506c6163652054726f706879 + 1 c953682b6eb5891c0bda35718c5261587d57e5e408079cbeb8cf881a.2331 + 1 cd6076d9d0098da4c7670c08f230e4efe31d666263c9db5196805d6e.487964726120446f6f6d202d2031737420506c6163652054726f706879 + 1 d0c91707d75011026193c0fce742443dde66fa790936981ece5d9f8b.2331 + 69918000000 d8906ca5c7ba124a0407a32dab37b2c82b13b3dcd9111e42940dcea4.0014df105553444d + 1 dd7e36888a487f8b27687f65abd93e6825b4eb3ce592ee5f504862df.487964726120446f6f6d202d2031737420506c6163652054726f706879 + 1 fa10c5203512eeeb92bf79547b09f5cdb2e008689864b0175cca6fee.487964726120446f6f6d202d2034746820506c6163652054726f706879
Found UTxO: 62fb023528#1 ↦ 9261435223 lovelace
"f99907e0b4e3c9d554a68e76c3a72b4090cffb5c12d0cd471e29e1d0fa7184d2"

== INPUTS (1)
- cd62585298998cd809f6fe08a4af3087dab8f73ed67132b8c8fd4162fb023528#1
      ShelleyAddress Testnet (KeyHashObj (KeyHash {unKeyHash = "9783be7d3c54f11377966dfabc9284cd6c32fca1cd42ef0a4f1cc45b"})) StakeRefNull
      9261435223 lovelace
      TxOutDatumNone
      ReferenceScriptNone

== COLLATERAL INPUTS (0)

== REFERENCE INPUTS (0)

== OUTPUTS (2)
Total number of assets: 1
- ShelleyAddress Testnet (KeyHashObj (KeyHash {unKeyHash = "f8a68cd18e59a6ace848155a0e967af64f4d00cf8acee8adc95a6b0d"})) StakeRefNull
      100000000 lovelace
      TxOutDatumNone
- ShelleyAddress Testnet (KeyHashObj (KeyHash {unKeyHash = "9783be7d3c54f11377966dfabc9284cd6c32fca1cd42ef0a4f1cc45b"})) StakeRefNull
      9161262858 lovelace
      TxOutDatumNone

== TOTAL COLLATERAL
TxTotalCollateralNone

== RETURN COLLATERAL
TxReturnCollateralNone

== FEE
TxFeeExplicit ShelleyBasedEraConway (Coin 172365)

== VALIDITY
TxValidityNoLowerBound
TxValidityUpperBound ShelleyBasedEraConway Nothing

== MINT/BURN
0 lovelace

== SCRIPTS (0)
Total size (bytes):  0

== DATUMS (0)

== REDEEMERS (0)

== REQUIRED SIGNERS
[]

== METADATA
TxMetadataNone

  can open, close & fanout a Head using Blockfrost [✘]

Failures:

  test/Test/DirectChainSpec.hs:385:3:
  1) Test.DirectChain can open, close & fanout a Head using Blockfrost
       uncaught exception: APIBlockfrostError
       BlockfrostError "BlockfrostBadRequest \"{\\\"contents\\\":{\\\"contents\\\":{\\\"contents\\\":{\\\"era\\\":\\\"ShelleyBasedEraConway\\\",\\\"error\\\":[\\\"ConwayUtxowFailure (UtxoFailure (ValueNotConservedUTxO (MaryValue (Coin 0) (MultiAsset (fromList []))) (MaryValue (Coin 9261435223) (MultiAsset (fromList [])))))\\\",\\\"ConwayUtxowFailure (UtxoFailure (BadInputsUTxO (fromList [TxIn (TxId {unTxId = SafeHash \\\\\\\"cd62585298998cd809f6fe08a4af3087dab8f73ed67132b8c8fd4162fb023528\\\\\\\"}) (TxIx {unTxIx = 1})])))\\\"],\\\"kind\\\":\\\"ShelleyTxValidationError\\\"},\\\"tag\\\":\\\"TxValidationErrorInCardanoMode\\\"},\\\"tag\\\":\\\"TxCmdTxSubmitValidationError\\\"},\\\"tag\\\":\\\"TxSubmitFail\\\"}\""

  • I don't see anything wrong with the tx, but blockfrost seems to think the input is invalid for whatever reason. I'll explore the tx endpoints to see if I can find something useful.
  • Checked the mapping between blockfrost/cardano when creating the UTxO and all looks good.
  • Added a blockfrost variant of awaitForTransaction, which didn't help with the error (polling sketch below).
  • Made all functions in Blockfrost.Client work in BlockfrostClientT IO so I can run them all from outside (I suspected that opening multiple blockfrost connections could cause problems) and this didn't help either.
  • I think all these changes are good to keep, but I still don't see why submitting the seeding tx fails.
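  • A minimal sketch of the polling idea behind that awaitForTransaction variant (the helper name and the lookup action are assumptions, not the actual Hydra.Chain.Blockfrost code): keep asking the API until the submitted transaction is visible, so later queries such as the faucet UTxO can see its outputs.

    import Control.Concurrent (threadDelay)

    -- Poll a lookup action (e.g. backed by Blockfrost's transaction endpoint)
    -- until the given transaction shows up on-chain.
    awaitTransactionVia :: (txId -> IO Bool) -> txId -> IO ()
    awaitTransactionVia isOnChain txId = go
     where
      go = do
        found <- isOnChain txId
        if found
          then pure ()
          else threadDelay 1000000 >> go -- retry after one second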

2025-04-23

SN on fixing deposits

  • How to deal with incompatible deposits? We do observe them, but should the head logic track them?
  • When introducing a currentTime to the open state (in order to determine deadline being good or not) I realize that Tick contents would be useful to have on the Observation chain event, which would be easily possible. How is the contestation deadline done?
    • Aha! The need for tracking a currentTime : UTCTime in the HeadState can be worked around by tracking all deposits and discard them on tick.
    • Hm.. but that would move the decision whether to snapshot a pending deposit only to the Tick handling. Which means that it may only happen on the next block..
    • But this is where the deposit tracking needs to go anyways .. we will never issue a snapshot directly when observing the deposit (that's why we are here) and if we decouple the issuance from the observation, the logic needs to go to the Tick handling anyways!
  • When changing CommitRecorded I stumble over newLocalUTxO = localUTxO <> deposited .. why would we want to update our local ledger already when recording the deposit!? This was likely a bug too..
  • Specification will be quite different than what we need to implement: there are no deposits tracked and only a wait for any previous pending deposits. To what level do we need to specify the logic of delaying deposits and checking deadlines?
  • Why was the increment snapshotting only waiting for an "unresolved Decommit" before requesting a snapshot?
  • Why do we need to wait at all (for other decommits or commits) if there is no snapshot in flight and we are the leader.. why not just snapshot what we want?
  • After moving the incremental commit snapshot decision to Tick handling, the model fails because a NewTx can't spend a UTxO that was added through a Deposit before -> interesting!
  • After bringing back a Uα equivalent to the HeadLogic the model spec consistently finds an empty utxoToCommit which fails to submit an incrementTx -> good!
  • Interestingly, the model allows to do action $ Deposit {headIdVar = var2, utxoToDeposit = [], deadline = 1864-06-16 04:36:38.606749385646 UTC} which obviously results in an empty utxo to commit.. this can happen in the wild too!
    • Unclear where exactly we want to deal with empty deposits.
  • Back to where we started with a very old Deposit and the node trying to do an increment with deadline already passed. This should be easy to fix by just not trying to snapshot it. However, what if a dishonest hydra-node would do just that? Would we approve that snapshot? Certainly the on-chain script would forbid it, but this could stall the head.
    • This is similar to the empty utxo thing. While we can make our honest hydra-node do funky stuff, we must ensure that we do not sign snapshots that are funky!
    • Which tests would best capture this? The ModelSpec won't see these issues once our honest implementation stops requesting funky snapshots!
  • To determine whether a deposit is still (or already) fine, we are back to needing a notion of UTCTime when making that decision? We could do that update in the Tick handling and keep information about a deposit being Outdated or so. Then, the snapshot acknowledgment code can tell whether a deposit is valid and only sign if it is.
    • Tracking a full Deposit type in pendingDeposits which has a DepositStatus.
    • With the new Deposit type I can easily mark deposits as Expired and need to fix several behavior tests to put realistic deadlines. However, the observability in tests is lacking and I definitely need a DepositExpired server output to fix all tests.
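  • A rough sketch of what that tracking could look like (type and field names here are my guesses, not necessarily what ends up in HeadLogic):

    import Data.Time (UTCTime)

    data DepositStatus = Active | Expired
      deriving (Eq, Show)

    data Deposit utxo = Deposit
      { depositedUTxO :: utxo
      , depositDeadline :: UTCTime
      , depositStatus :: DepositStatus
      }
      deriving (Eq, Show)

    -- Re-evaluated on every Tick; snapshot acknowledgement can then refuse to
    -- sign snapshots that include an Expired (or empty) deposit.
    updateStatus :: UTCTime -> Deposit utxo -> Deposit utxo
    updateStatus now d
      | now > depositDeadline d = d{depositStatus = Expired}
      | otherwise = d{depositStatus = Active}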

2025-04-22

SN on fixing deposits

  • Deposit fixes: How to test this situation? I need a test suite that includes the off-chain logic, but also allows control over rollbacks and spending inputs.
    • Model based tests are not including incremental commits :(
    • TxTraceSpec contains deposit/increment, but does only exercise the L1 related code
    • The behavior tests do cover deposit/increment behavior, but deposit observations are only injected! So rollbacks would not cover them.
  • Let's bite the bullet.. at least the model-based MockChain could be easily adapted to do deposits in simulateCommit?
  • Ran into the same issue as we had on CI when shrinking was failing on a partial !. Guarding shrinkAction to only include actions if their party is still in the seed seems to fix this.. but now shrinking does not terminate?
    • Detour on improving shrinking and counterexamples of that checkModel problem .. shifting back to fixing deposits.
  • After adding Deposit actions, implementing a simulateDeposit and adjusting some generators/preconditions, I consistently run into test failures with deadline <- arbitrary. This is already interesting! The hydra-node seems to still try to increment deposits whose deadlines are far in the past (year 1864) -> first bug found and reproducible!

2025-04-09

SB on blockfrost wallet queries

  • After using the blockfrost query to get all eras and trying to construct an EraHistory, I was surprised to discover that nonEmptyFromList fails.

  • I know for sure that I am not constructing an empty list here, so this is confusing.

  • Found an example in the atlas repo (https://atlas-app.io/) but that was also failing, which is even more surprising.

  • When looking at the blockfrost query results I noticed there are multiple NetworkEraSummary entries that start and end at slot 0, which is surprising:

eras: [ NetworkEraSummary
    { _networkEraStart = NetworkEraBound
        { _boundEpoch = Epoch 0
        , _boundSlot = Slot 0
        , _boundTime = 0s
        }
    , _networkEraEnd = NetworkEraBound
        { _boundEpoch = Epoch 0
        , _boundSlot = Slot 0
        , _boundTime = 0s
        }
    , _networkEraParameters = NetworkEraParameters
        { _parametersEpochLength = EpochLength 4320
        , _parametersSlotLength = 20s
        , _parametersSafeZone = 864
        }
    }
, NetworkEraSummary
    { _networkEraStart = NetworkEraBound
        { _boundEpoch = Epoch 0
        , _boundSlot = Slot 0
        , _boundTime = 0s
        }
    , _networkEraEnd = NetworkEraBound
        { _boundEpoch = Epoch 0
        , _boundSlot = Slot 0
        , _boundTime = 0s
        }
    , _networkEraParameters = NetworkEraParameters
        { _parametersEpochLength = EpochLength 86400
        , _parametersSlotLength = 1s
        , _parametersSafeZone = 25920
        }
    }

  • After removing them I can parse the EraHistory successfully, but the question is how to filter out those values from blockfrost - which eras are valid?

  • I'll try filtering out all eras that start and end at slot 0.

  • This worked - I reported what I found to the blockfrost guys
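  • A minimal sketch of that filtering, assuming the NetworkEraSummary / NetworkEraBound record fields shown in the dump above (imports from blockfrost-api omitted; not necessarily the final code):

    -- Drop era summaries that both start and end at slot 0 before building
    -- the EraHistory; those look like placeholder entries.
    dropEmptyEras :: [NetworkEraSummary] -> [NetworkEraSummary]
    dropEmptyEras = filter (not . isEmptyEra)
     where
      isEmptyEra era =
        _boundSlot (_networkEraStart era) == Slot 0
          && _boundSlot (_networkEraEnd era) == Slot 0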

  • Now it is time to move forward and test if the wallet queries actually work

  • I picked one DirectChainTest and decided to alter it so it runs on preview using withCardanoNodeOnKnownNetwork, but I get:

  test/Test/DirectChainSpec.hs:124:3:
  1) Test.DirectChain can init and abort a 2-parties head after one party has committed
       uncaught exception: QueryException
       QueryProtocolParamsEncodingFailureOnEra (AnyCardanoEra AlonzoEra) "Error in $: key \"poolVotingThresholds\" not found"
  • It seems like re-mapping the protocol params from blockfrost fails on poolVotingThresholds.

  • This happens immediately when cardano-node reports MsgSocketIsReady

cardano-node --version
cardano-node 10.1.4 - linux-x86_64 - ghc-8.10
git rev 1f63dbf2ab39e0b32bf6901dc203866d3e37de08

  • I can see that this field exists in the conway-genesis.json in the tmp folder of a test run

SB on finalizing recover/decrement observations should not be conditional

  • After PR review comments from FT I wanted to add one suggestion and that is to see the Head closed and finalized after initially committing and then decommitting some UTxO.

  • This leads to an H28 error on close, which means we tried to close with the initial snapshot while we in fact already have a confirmed snapshot.

  • When inspecting the logs I found out that the node, after a restart, does not observe any SnapshotConfirmed and therefore tries to close with initial one which fails.

  • The question is: why did the restarted node fail to re-observe the confirmed snapshot event?

  • Added some test code to wait and see SnapshotConfirmed in the restarted node to confirm it actually sees this event happening and the test fails exactly at this point.

  • When both nodes are running I can see that the SnapshotConfirmed message is there, but after a restart the node fails to see the SnapshotConfirmed message again.

  • In the logs for both node 1 and 2 before restart I see two SnapshotConfirmed messages but in the restarted node these events are gone.

  • I realized the close works if I close from the node that was not restarted, but what I want is to wait for the restarted node to catch up and then close.

  • I removed the fiddling with recover and wanted to get this basic test working, but closing with the restarted node, even after re-observing the last decommit, fails with H28 FailedCloseInitial.

  • This means the restarted node tried to close with the initial snapshot but one of the values doesn't match. We expect the version to be 0, snapshot number to be 0 and utxo hash should match the initial one.

  • The last-known-revision for both nodes before I shut down one of them is 11, but the restarted node, after removing the last-known-revision file, ends up with value 13. How come it received more messages?

  • When comparing the state files I see discrepancies in eventId and the restarted node has a DecommitRecorded as the last event (other than ticks)

  • Regular node decommit recorded:

{"eventId":44,"stateChanged":{"decommitTx":{"cborHex":"84a300d9010281825820ad7458781dc19e427fca77c8c7b2db1b56c81c11590e2ae3999f2f13db8c51c200018182581d60f8a68cd18e59a6ace848155a0e967af64f4d00cf8acee8adc95a6b0d1a004c4b400200a100d9010281825820eb94e8236e2099357fa499bfbc415968691573f25ec77435b7949f5fdfaa5da0584071b6c5956083ff7ac7ad49d5a75c77967b5ad2e7fd756c1de226f71cdf89e5d383bc88975c9ca7deab135f4ea9014666aa0e257f26bdd94dda2df60c922e9306f5f6","description":"","txId":"3095040e42ed9b193f8a66699b1631c17a85f670aee3c4d77fb3cfb195ea6bcb","type":"Tx ConwayEra"},"headId":"654b2b0e5ff3e0a902a12918b63628cdd478364caa4f0c758e6f7490","newLocalUTxO":{},"tag":"DecommitRecorded","utxoToDecommit":{"3095040e42ed9b193f8a66699b1631c17a85f670aee3c4d77fb3cfb195ea6bcb#0":{"address":"addr_test1vru2drx33ev6dt8gfq245r5k0tmy7ngqe79va69de9dxkrg09c7d3","datum":null,"datumhash":null,"inlineDatum":null,"inlineDatumRaw":null,"referenceScript":null,"value":{"lovelace":5000000}}}},"time":"2025-04-10T07:30:58.882632162Z"}
  • Restarted node decommit recorded
{"eventId":76,"stateChanged":{"decommitTx":{"cborHex":"84a300d9010281825820ad7458781dc19e427fca77c8c7b2db1b56c81c11590e2ae3999f2f13db8c51c200018182581d60f8a68cd18e59a6ace848155a0e967af64f4d00cf8acee8adc95a6b0d1a004c4b400200a100d9010281825820eb94e8236e2099357fa499bfbc415968691573f25ec77435b7949f5fdfaa5da0584071b6c5956083ff7ac7ad49d5a75c77967b5ad2e7fd756c1de226f71cdf89e5d383bc88975c9ca7deab135f4ea9014666aa0e257f26bdd94dda2df60c922e9306f5f6","description":"","txId":"3095040e42ed9b193f8a66699b1631c17a85f670aee3c4d77fb3cfb195ea6bcb","type":"Tx ConwayEra"},"headId":"654b2b0e5ff3e0a902a12918b63628cdd478364caa4f0c758e6f7490","newLocalUTxO":{},"tag":"DecommitRecorded","utxoToDecommit":{"3095040e42ed9b193f8a66699b1631c17a85f670aee3c4d77fb3cfb195ea6bcb#0":{"address":"addr_test1vru2drx33ev6dt8gfq245r5k0tmy7ngqe79va69de9dxkrg09c7d3","datum":null,"datumhash":null,"inlineDatum":null,"inlineDatumRaw":null,"referenceScript":null,"value":{"lovelace":5000000}}}},"time":"2025-04-10T07:31:02.301798566Z"} 
  • Let's try to see the decommit timeline between the two states (I am aware these events do not need to be in order, but I think etcd should deliver in order after a restart).

  • So let's track this decommit between two nodes


| Event             | Running node                   | Restarted node                 |
|-------------------|--------------------------------|--------------------------------|
| DecommitRecorded  | 2025-04-10T07:30:58.882632162Z | 2025-04-10T07:31:02.301798566Z |
| DecommitApproved  | 2025-04-10T07:30:58.894604418Z | missing event                  |
| DecommitFinalized | 2025-04-10T07:30:59.007515339Z | 2025-04-10T07:31:02.300503374Z |

  • So it seems the restarted node is a couple of seconds late, but how can it be that the test waits to see DecommitFinalized and yet, when we try to close afterwards, the restarted node still thinks it is at version 0?

2025-04-04

SN on exploring dingo

  • Trying out dingo and whether I could hook it up to hydra-node

  • When synchronizing preview with dingo the memory footprint was growing as the sync progressed, but it did not increase to the same level when restarting the chain sync (although it picked up the starting slot etc.)

  • The system was swapping a lot of memory too (probably reached max of my 32GB)

  • Querying the address of the latest hydra head shows two heads on preview, but our explorer only shows one?

  • Querying the dingo node seems to work, but I get a hydra scripts discovery error?

    MissingScript {scriptName = "\957Initial", scriptHash = "c8a101a5c8ac4816b0dceb59ce31fc2258e387de828f02961d2f2045", discoveredScripts = fromList ["0e35115a2c7c13c68ecd8d74e4987c04d4539e337643be20bb3274bd"]}
    
  • Indeed dingo behaves slightly differently on the queryUTxOByTxIn local state query: when requesting three tx-ins, it only responds with one UTxO

    [ TxIn "b7b88533de303beefae2d8bb93fe1a1cd5e4fa3c4439c8198c83addfe79ecbdc" ( TxIx 0 ) , TxIn "da1cc0eef366031e96323b6620f57bc166cf743c74ce76b6c3a02c8f634a7d20" ( TxIx 0 ) , TxIn "6665f1dfdf9b9eb72a0dd6bb73e9e15567e188132b011e7cf6914c39907ac484" ( TxIx 0 ) ] returned utxo: 1
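
  • A sketch of the per-input workaround mentioned in the next bullet, written generically (the real local state query client in hydra-node looks different and its types are not shown here):

    -- Query each input on its own and merge the results, since a single
    -- three-input query only returned one UTxO against dingo.
    queryOneByOne :: Monoid utxo => ([txIn] -> IO utxo) -> [txIn] -> IO utxo
    queryOneByOne queryByTxIn =
      fmap mconcat . mapM (\txIn -> queryByTxIn [txIn])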
    
  • After fixing that to query three times, the next blocker seems to come from the chain sync:

    bearer closed: "<socket: 23> closed when reading data, waiting on next header True"
    
  • Maybe something on the n2c handshake does not work? On dingo side I see:

    {"time":"2025-04-05T13:47:05.495636842+02:00","level":"INFO","msg":"listener: accepted connection from unix@629","component":"connmanager"} {"time":"2025-04-05T13:47:05.4957064+02:00","level":"ERROR","msg":"listener: failed to setup connection: could not register protocol with muxer","component":"connmanager"}
    
  • When debugging how far we get on the handshake protocol I learn how gouroboros implements the state transitions of the miniprotocols using StateMap.

  • I realize that now the query for scripts does not even work.. maybe my instrumentation broke something? Also.. all my instrumentation happened on vendored code in vendor/ of the dingo repo. I wonder how developers edit this most conveniently in this setup?

  • The Chain.Direct switch to connectToLocalNodeWithVersions was problematic, now it fetches the scripts correctly and the chain sync starts

  • It's definitely flaky in how "far" we get.. maybe the dingo node is only accepting n2c connections while connected upstream on n2n (I have been in a train with flaky connection).

  • Once it progressed onto a RollForward, the queryTimeHandle would query the EraHistory and fail the time conversion with this error:

    TimeConversionException {slotNo = SlotNo 77202345, reason = "PastHorizon {pastHorizonCallStack = [(\"runQuery\",SrcLoc {srcLocPackage = \"ouroboros-consensus-0.22.0.0-f90d7bc7c4431d706016c293a932800b9c1e28c3b268597acc5b945a9be83125\", srcLocModule = \"Ouroboros.Consensus.HardFork.History.Qry\", srcLocFile = \"src/ouroboros-consensus/Ouroboros/Consensus/HardFork/History/Qry.hs\", srcLocStartLine = 439, srcLocStartCol = 44, srcLocEndLine = 439, srcLocEndCol = 52}),(\"interpretQuery\",SrcLoc {srcLocPackage = \"hydra-node-0.21.0-inplace\", srcLocModule = \"Hydra.Chain.Direct.TimeHandle\", srcLocFile = \"src/Hydra/Chain/Direct/TimeHandle.hs\", srcLocStartLine = 91, srcLocStartCol = 10, srcLocEndLine = 91, srcLocEndCol = 24}),(\"slotToUTCTime\",SrcLoc {srcLocPackage = \"hydra-node-0.21.0-inplace\", srcLocModule = \"Hydra.Chain.Direct.TimeHandle\", srcLocFile = \"src/Hydra/Chain/Direct/TimeHandle.hs\", srcLocStartLine = 86, srcLocStartCol = 7, srcLocEndLine = 86, srcLocEndCol = 20}),(\"mkTimeHandle\",SrcLoc {srcLocPackage = \"hydra-node-0.21.0-inplace\", srcLocModule = \"Hydra.Chain.Direct.TimeHandle\", srcLocFile = \"src/Hydra/Chain/Direct/TimeHandle.hs\", srcLocStartLine = 116, srcLocStartCol = 10, srcLocEndLine = 116, srcLocEndCol = 22})], pastHorizonExpression = Some (EPair (ERelToAbsTime (ERelSlotToTime (EAbsToRelSlot (ELit (SlotNo 77202345))))) (ESlotLength (ELit (SlotNo 77202345)))), pastHorizonSummary = [EraSummary {eraStart = Bound {boundTime = RelativeTime 0s, boundSlot = SlotNo 0, boundEpoch = EpochNo 0}, eraEnd = EraEnd (Bound {boundTime = RelativeTime 0s, boundSlot = SlotNo 0, boundEpoch = EpochNo 0}), eraParams = EraParams {eraEpochSize = EpochSize 4320, eraSlotLength = SlotLength 20s, eraSafeZone = StandardSafeZone 0, eraGenesisWin = GenesisWindow {unGenesisWindow = 0}}},EraSummary {eraStart = Bound {boundTime = RelativeTime 0s, boundSlot = SlotNo 0, boundEpoch = EpochNo 0}, eraEnd = EraEnd (Bound {boundTime = RelativeTime 0s, boundSlot = SlotNo 0, boundEpoch = EpochNo 0}), eraParams = EraParams {eraEpochSize = EpochSize 86400, eraSlotLength = SlotLength 1s, eraSafeZone = StandardSafeZone 0, eraGenesisWin = GenesisWindow {unGenesisWindow = 0}}},EraSummary {eraStart = Bound {boundTime = RelativeTime 0s, boundSlot = SlotNo 0, boundEpoch = EpochNo 0}, eraEnd = EraEnd (Bound {boundTime = RelativeTime 0s, boundSlot = SlotNo 0, boundEpoch = EpochNo 0}), eraParams = EraParams {eraEpochSize = EpochSize 86400, eraSlotLength = SlotLength 1s, eraSafeZone = StandardSafeZone 0, eraGenesisWin = GenesisWindow {unGenesisWindow = 0}}},EraSummary {eraStart = Bound {boundTime = RelativeTime 0s, boundSlot = SlotNo 0, boundEpoch = EpochNo 0}, eraEnd = EraEnd (Bound {boundTime = RelativeTime 0s, boundSlot = SlotNo 0, boundEpoch = EpochNo 0}), eraParams = EraParams {eraEpochSize = EpochSize 86400, eraSlotLength = SlotLength 1s, eraSafeZone = StandardSafeZone 0, eraGenesisWin = GenesisWindow {unGenesisWindow = 0}}},EraSummary {eraStart = Bound {boundTime = RelativeTime 0s, boundSlot = SlotNo 172800, boundEpoch = EpochNo 2}, eraEnd = EraEnd (Bound {boundTime = RelativeTime 259200s, boundSlot = SlotNo 86400, boundEpoch = EpochNo 1}), eraParams = EraParams {eraEpochSize = EpochSize 86400, eraSlotLength = SlotLength 1s, eraSafeZone = StandardSafeZone 0, eraGenesisWin = GenesisWindow {unGenesisWindow = 0}}},EraSummary {eraStart = Bound {boundTime = RelativeTime 259200s, boundSlot = SlotNo 55728000, boundEpoch = EpochNo 645}, eraEnd = EraEnd (Bound {boundTime = RelativeTime 55814400s, boundSlot = SlotNo 
345600, boundEpoch = EpochNo 4}), eraParams = EraParams {eraEpochSize = EpochSize 86400, eraSlotLength = SlotLength 1s, eraSafeZone = StandardSafeZone 0, eraGenesisWin = GenesisWindow {unGenesisWindow = 0}}},EraSummary {eraStart = Bound {boundTime = RelativeTime 55814400s, boundSlot = SlotNo 77155200, boundEpoch = EpochNo 893}, eraEnd = EraEnd (Bound {boundTime = RelativeTime 77241600s, boundSlot = SlotNo 55900800, boundEpoch = EpochNo 647}), eraParams = EraParams {eraEpochSize = EpochSize 86400, eraSlotLength = SlotLength 1s, eraSafeZone = StandardSafeZone 0, eraGenesisWin = GenesisWindow {unGenesisWindow = 0}}}]}"}
    
  • I saw that same error when using cardano-cli query tip .. seems like the era history local state query is not accurately reporting epoch bounds.

  • I conclude that dingo is easy to use and navigate around, but the N2C API is not complete yet. Maybe my work on the LocalStateQuery API in cardano-blueprint could benefit the project and make gouroboros more conformant (at least from a message serialization point of view).

March 2025

2025-03-28

SN on memory leak

  • Non-profiled Haskell binaries can be inspected using -s and -hT RTS arguments

  • Running the hydra-node with a 2GB state file as provided by GD, the node will load the state and then fail on mismatched keys (as we don't have the right ones):

     151,712,666,608 bytes allocated in the heap
      14,411,335,656 bytes copied during GC
         973,747,296 bytes maximum residency (53 sample(s))
          24,460,192 bytes maximum slop
                2033 MiB total memory in use (0 MiB lost due to fragmentation)
    
  • The peekForeverE in https://github.com/cardano-scaling/hydra/pull/1919 seems not to make any difference:

     151,712,692,632 bytes allocated in the heap
      14,409,258,352 bytes copied during GC
         973,732,032 bytes maximum residency (53 sample(s))
          24,545,088 bytes maximum slop
                2033 MiB total memory in use (0 MiB lost due to fragmentation)
    
  • Using -hT, a linear growth of memory can be seen quite easily.

  • First idea: lastEventId conduit was using foldMapC which might be building thunks via mappend

    • Nope, that was not the issue.
  • That was not the issue.. next: disabling aggregation of chainStateHistory and only loading headState.

    • Still linear growth.. so the culprit most likely is inside the main loading of headState (besides other issues?)
  • Let's turn on StrictData on all of HeadLogic as a first stab at getting stricter usage of HeadState et al.

  • This works! Making HeadLogic{.State, .Outcome} all StrictData already pushes the heap usage down ~5MB!

  • Possible explanation: With gigabytes of state updates we have almost exclusively TransactionReceived et al state changes. In the aggregate we usually build up thunks like allTxs = allTxs <> fromList [(txId tx, tx)] which will leak memory until forced into one concrete list when showing the HeadState first (which will probably collapse the memory usage again).
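  • To illustrate the mechanism in isolation (a stand-alone toy, not the actual HeadLogic code): with a lazy field every aggregation step stacks another (<>) thunk on top of the map, while StrictData (or a bang on the field) forces it step by step.

    {-# LANGUAGE StrictData #-}

    import Data.Map (Map)
    import qualified Data.Map as Map

    type TxId = Int -- stand-in for the real transaction id type

    -- With StrictData the field below is strict, so the record update in
    -- recordReceived evaluates the map on every event instead of building a
    -- chain of unevaluated (<>) thunks.
    data SeenTxs tx = SeenTxs {allTxs :: Map TxId tx}

    recordReceived :: TxId -> tx -> SeenTxs tx -> SeenTxs tx
    recordReceived txId tx st =
      st{allTxs = allTxs st <> Map.fromList [(txId, tx)]}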

  • With StrictData we have a maximum residency of 10MB after loading 2GB of state events:

     152,176,815,256 bytes allocated in the heap
      16,702,572,088 bytes copied during GC
           9,967,848 bytes maximum residency (2387 sample(s))
             215,600 bytes maximum slop
                  43 MiB total memory in use (0 MiB lost due to fragmentation)
    
  • Trying to narrow in on the exact source of the memory leak so I do not need to put bangs everywhere

    • allTxs and localTxs assignments are not the source of it .. maybe the coordinatedHeadState record update?

    • No .. also not really. Maybe it's time to recompile with profiling enabled and make some coffee (this will take a while).

    • When using profiling: True with the haskell.nix managed dependencies, I ran into an error.

    • Setting enableProfiling = true in the haskell.nix project modules rebuilds the whole world, but that is expected.

    • Hard to spot where exactly we are creating the space leak / thunks. This blog post is helpful still: http://blog.ezyang.com/2011/06/pinpointing-space-leaks-in-big-programs/

    • I am a bit confused why so many of the cost centres point to parsing and decoding code .. maybe the transactions themselves (which make up the majority of the data) are not forced for a long time? This would make sense because the HeadLogic does not inspect the transactions themselves (much).

    • Strictness annotations on the transaction alone (!tx) did not help, but let's try StrictData on StateChanged

    • StrictData on HeadLogic.Outcome does not fix it … so it must be something related to the HeadState.

    • The retainer profile actually points quite clearly to aggregate. (image: heap-retainers)

    • The biggest things on the heap are bytes, thunks and types related to a cardano transaction body. (image: hydra-node heap profile)

  • Going back to zeroing in on branches of aggregate via exclusion

    • Disabling all CoordinatedHeadState modifications makes memory usage minimal again
    • Enabling SnapshotConfirmed -> still bounded
    • Enabling PartySignedSnapshot -> still bounded
    • Enabling SnapshotRequested -> growing!
    • Without allTxs update -> bounded!
    • This line creates thunks!? allTxs = foldr Map.delete allTxs requestedTxIds
    • Neither forcing allTxs nor requestedTxIds helped
    • Is it really only this line? Enabling all other aggregate updates to CoordinatedHeadState to check
    • It's both allTxs usages
    • If we only make allTxs field strict? -> Bounded!
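  • Sketch of the targeted fix found above (names follow the logbook; the real CoordinatedHeadState has more fields): a single strictness annotation on allTxs keeps residency bounded without turning on StrictData everywhere.

    import Data.Map (Map)

    type TxId = Int -- stand-in for the real transaction id type

    data CoordinatedHeadState tx = CoordinatedHeadState
      { allTxs :: !(Map TxId tx) -- the bang forces the map on every aggregate step
      , localTxs :: [tx]
      }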

2025-03-27

SB on fanout utxo bug

  • After easy changes to FanoutTx to include observed UTxO instead of using the confirmed snapshot there are problems in the DirectChainSpec and Model.

  • Let's look at DirectChainSpec first - I need to come up with a utxo value for this line here:

aliceChain `observesInTime` OnFanoutTx headId mempty
  • Failed test looks like this:
  test/Test/DirectChainSpec.hs:578:35:
  1) Test.DirectChain can open, close & fanout a Head
       expected: OnFanoutTx {headId = UnsafeHeadId "eK+\SO_\243\224\169\STX\161)\CAN\182\&6(\205\212x6L\170O\fu\142ot\144", fanoutUTxO = fromList [(TxIn "0762c8de902abe1e292e691066328c932d95e29c9a564d466e8bc791527e359f" (TxIx 0),TxOut (AddressInEra (ShelleyAddressInEra ShelleyBasedEraConway) (ShelleyAddress Testnet (KeyHashObj (KeyHash {unKeyHash = "8163bc1d679f90d073784efdc761288dbc2dc21a352f69238070fc45"})) StakeRefNull)) (TxOutValueShelleyBased ShelleyBasedEraConway (MaryValue (Coin 2000000) (MultiAsset (fromList [])))) TxOutDatumNone ReferenceScriptNone),(TxIn "c9a733c945fdb7819648a58d7d6b9a30af2ac458a27f5bb7e9c41f92da82ba2c" (TxIx 0),TxOut (AddressInEra (ShelleyAddressInEra ShelleyBasedEraConway) (ShelleyAddress Testnet (KeyHashObj (KeyHash {unKeyHash = "8163bc1d679f90d073784efdc761288dbc2dc21a352f69238070fc45"})) StakeRefNull)) (TxOutValueShelleyBased ShelleyBasedEraConway (MaryValue (Coin 2000000) (MultiAsset (fromList [])))) TxOutDatumNone ReferenceScriptNone)]}
        but got: OnFanoutTx {headId = UnsafeHeadId "eK+\SO_\243\224\169\STX\161)\CAN\182\&6(\205\212x6L\170O\fu\142ot\144", fanoutUTxO = fromList [(TxIn "880c3d807a48d432788158f879a81a5ddc6c1ad6527fe70922175e621ea08092" (TxIx 0),TxOut (AddressInEra (ShelleyAddressInEra ShelleyBasedEraConway) (ShelleyAddress Testnet (ScriptHashObj (ScriptHash "0e35115a2c7c13c68ecd8d74e4987c04d4539e337643be20bb3274bd")) StakeRefNull)) (TxOutValueShelleyBased ShelleyBasedEraConway (MaryValue (Coin 4879080) (MultiAsset (fromList [(PolicyID {policyID = ScriptHash "654b2b0e5ff3e0a902a12918b63628cdd478364caa4f0c758e6f7490"},fromList [("4879647261486561645631",1),("f8a68cd18e59a6ace848155a0e967af64f4d00cf8acee8adc95a6b0d",1)])])))) (TxOutDatumInline BabbageEraOnwardsConway (HashableScriptData "\216{\159\216y\159X\FSeK+\SO_\243\224\169\STX\161)\CAN\182\&6(\205\212x6L\170O\fu\142ot\144\159X \213\191J?\204\231\ETB\176\&8\139\204'I\235\193H\173\153i\178?E\238\ESC`_\213\135xWj\196\255\216y\159\EM'\DLE\255\NUL\SOHX \193\211\DC4E\234\252\152\157\239\186\RSmVF\141\208\218\135\141\160{\fYFq\245\SOH\148\nOS\DC1X \227\176\196B\152\252\FS\DC4\154\251\244\200\153o\185$'\174A\228d\155\147L\164\149\153\ESCxR\184UX \227\176\196B\152\252\FS\DC4\154\251\244\200\153o\185$'\174A\228d\155\147L\164\149\153\ESCxR\184U\128\ESC\NUL\NUL\SOH\149\214\218\152\136\255\255" (ScriptDataConstructor 2 [ScriptDataConstructor 0 [ScriptDataBytes "eK+\SO_\243\224\169\STX\161)\CAN\182\&6(\205\212x6L\170O\fu\142ot\144",ScriptDataList [ScriptDataBytes "\213\191J?\204\231\ETB\176\&8\139\204'I\235\193H\173\153i\178?E\238\ESC`_\213\135xWj\196"],ScriptDataConstructor 0 [ScriptDataNumber 10000],ScriptDataNumber 0,ScriptDataNumber 1,ScriptDataBytes "\193\211\DC4E\234\252\152\157\239\186\RSmVF\141\208\218\135\141\160{\fYFq\245\SOH\148\nOS\DC1",ScriptDataBytes "\227\176\196B\152\252\FS\DC4\154\251\244\200\153o\185$'\174A\228d\155\147L\164\149\153\ESCxR\184U",ScriptDataBytes "\227\176\196B\152\252\FS\DC4\154\251\244\200\153o\185$'\174A\228d\155\147L\164\149\153\ESCxR\184U",ScriptDataList [],ScriptDataNumber 1743066405000]]))) ReferenceScriptNone)]}

  • So it seems there is a script output in the observed UTxO with 4879080 lovelace and some tokens - this looks like the head output, while what we expect are the distributed outputs to the hydra-node parties containing the fanned-out amounts.

  • Shouldn't the head assets that I see have been burned already? We get this UTxO in the observation using let inputUTxO = resolveInputsUTxO utxo tx

  • If I use

  (headInput, headOutput) <- findTxOutByScript inputUTxO Head.validatorScript
  UTxO.singleton (headInput, headOutput)

then the utxo is the same which is expected.

  • How come the fanout tx does not contain pub key outputs?

  • If I use utxoFromTx fanoutTx then I get the expected pub key outputs:

  1) Test.DirectChain can open, close & fanout a Head
       expected: OnFanoutTx {headId = UnsafeHeadId "eK+\SO_\243\224\169\STX\161)\CAN\182\&6(\205\212x6L\170O\fu\142ot\144", fanoutUTxO = fromList []}
        but got: OnFanoutTx {headId = UnsafeHeadId "eK+\SO_\243\224\169\STX\161)\CAN\182\&6(\205\212x6L\170O\fu\142ot\144", fanoutUTxO = fromList [(TxIn "431e45c0048e0aa104deaca1e8aca454c85efd71c52948e418d9119fd8cdf7b3" (TxIx 0),TxOut (AddressInEra (ShelleyAddressInEra ShelleyBasedEraConway) (ShelleyAddress Testnet (KeyHashObj (KeyHash {unKeyHash = "4e932840c5d2d3664237149fd3e9ba09c531581126fbdbab073c31ce"})) StakeRefNull)) (TxOutValueShelleyBased ShelleyBasedEraConway (MaryValue (Coin 2000000) (MultiAsset (fromList [])))) TxOutDatumNone ReferenceScriptNone),(TxIn "431e45c0048e0aa104deaca1e8aca454c85efd71c52948e418d9119fd8cdf7b3" (TxIx 1),TxOut (AddressInEra (ShelleyAddressInEra ShelleyBasedEraConway) (ShelleyAddress Testnet (KeyHashObj (KeyHash {unKeyHash = "f8a68cd18e59a6ace848155a0e967af64f4d00cf8acee8adc95a6b0d"})) StakeRefNull)) (TxOutValueShelleyBased ShelleyBasedEraConway (MaryValue (Coin 90165992) (MultiAsset (fromList [])))) TxOutDatumNone ReferenceScriptNone)]}

but the overall test is red since we construct artificial TxIns in utxoFromTx

  • I created findPubKeyOutputs to match on all pub key outputs and then I see the expected outputs, but they also contain the change output that returns some ada to the hydra-node wallet. Life is not simple.

  • In the end I changed all tests that matched exactly on the final UTxO to instead make sure that a subset of the final UTxO is there (disregarding the change output).

  • Changes in fanout observation boiled down to findPubKeyOutputs $ utxoFromTx tx
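  • Roughly, the new observation then looks like the following sketch (UTxO.filter, txOutAddress and isKeyAddress stand for whatever helpers the codebase actually provides and are assumptions here; utxoFromTx is the function mentioned above):

    -- Take the fanout transaction's own outputs and keep only those paying to
    -- a public key, so script outputs are ignored. The wallet change output
    -- still qualifies, hence the tests now match on a subset.
    observeFanoutUTxO :: Tx -> UTxO
    observeFanoutUTxO tx = findPubKeyOutputs (utxoFromTx tx)

    findPubKeyOutputs :: UTxO -> UTxO
    findPubKeyOutputs = UTxO.filter (isKeyAddress . txOutAddress)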

SB on memory leak on loading events from disk

  • Midnight people have reported that they still see some memory issues when loading a huge state file from disk.

  • The main problem is making sure the fix works; I still don't have a good idea of how to verify that my changes actually reduce the memory consumption.

  • The problem lies in this piece of code:

 (lastEventId, (headState, chainStateHistory)) <-
    runConduitRes $
      sourceEvents eventSource
        .| getZipSink
          ( (,)
              <$> ZipSink (foldMapC (Last . pure . getEventId))
              <*> ZipSink recoverHeadStateC
          )
...

recoverHeadStateC =
    mapC stateChanged
      .| getZipSink
        ( (,)
            <$> ZipSink (foldlC aggregate initialState)
            <*> ZipSink (foldlC aggregateChainStateHistory $ initHistory initialChainState)
        )

and of course the way we create PersistenceIncremental which is responsible for reading the file (sourceEvents eventSource part).

 sourceFileBS fp
          .| linesUnboundedAsciiC
          .| mapMC
            ( \bs ->
                case Aeson.eitherDecodeStrict' bs of
                  Left e -> ...
                  Right decoded -> ...
            )
  • Initially I noticed the usage of foldlC, which is strict, and thought perhaps this was the problem, but I could not find a lazy alternative and in general I don't believe this is the real issue.

  • I am more keen to investigate this code:

 sourceFileBS fp
          .| linesUnboundedAsciiC
          .| mapMC ...
  • linesUnboundedAsciiC could be the cause since I believe it is converting the whole stream. Its documentation says:

    Convert a stream of arbitrarily-chunked textual data into a stream of data
    where each chunk represents a single line. Note that, if you have
    unknown/untrusted input, this function is unsafe, since it would allow an
    attacker to form lines of massive length and exhaust memory.

  • I also found an interesting function, peekForeverE, which should "Run a consuming conduit repeatedly, only stopping when there is no more data available from upstream."
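
  • For reference, this is roughly how peekForeverE would wrap the loading pipeline (combinators as named in the Conduit module; decodeEvent is a placeholder for the JSON decoding step, and this only mirrors the idea in pull#1919 rather than its exact code):

    sourceEvents fp =
      sourceFileBS fp
        .| peekForeverE
          ( linesUnboundedAsciiC
              .| mapMC decodeEvent
          )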

  • Could I use benchmarks to simulate heavy load from disk?

  • I just tried running the benchmarks with and without a one-line change and it seems like the memory consumption is reduced

    • BEFORE ->
Average confirmation time (ms): 57.599974154
P99: 76.48237684999998ms
P95: 67.55752405ms
P50: 56.9354805ms
Invalid txs: 0

### Memory data

 | Time | Used | Free |
|------|------|------|
 | 2025-03-27 15:39:59.474482067 UTC | 14.2G | 35.1G |
 | 2025-03-27 15:40:04.474412824 UTC | 14.4G | 34.9G |
 | 2025-03-27 15:40:09.474406479 UTC | 14.4G | 34.9G |
 | 2025-03-27 15:40:14.474403701 UTC | 14.4G | 34.8G |
 | 2025-03-27 15:40:19.47445777 UTC | 14.4G | 34.8G |
 | 2025-03-27 15:40:24.474392458 UTC | 14.4G | 34.8G |
 | 2025-03-27 15:40:29.474439923 UTC | 14.4G | 34.8G |
 | 2025-03-27 15:40:34.474408859 UTC | 14.5G | 34.7G |
 | 2025-03-27 15:40:39.474436556 UTC | 14.4G | 34.7G |
 | 2025-03-27 15:40:44.474414945 UTC | 14.5G | 34.7G |

Confirmed txs/Total expected txs: 300/300 (100.00 %)
Average confirmation time (ms): 9.364033643
P99: 19.919154109999997ms
P95: 15.478096ms
P50: 7.7630015ms
Invalid txs: 0

### Memory data

 | Time | Used | Free |
|------|------|------|
 | 2025-03-27 15:40:55.995225272 UTC | 14.2G | 35.1G |
 | 2025-03-27 15:41:00.995294779 UTC | 14.2G | 35.1G |
 | 2025-03-27 15:41:05.995309124 UTC | 14.2G | 35.1G |
 | 2025-03-27 15:41:10.995299687 UTC | 14.3G | 35.0G |
 | 2025-03-27 15:41:15.995284362 UTC | 14.3G | 35.0G |
 | 2025-03-27 15:41:20.995281122 UTC | 14.3G | 35.0G |


  • AFTER ->
Average confirmation time (ms): 57.095020378
P99: 72.8903286ms
P95: 66.89188805ms
P50: 57.172249ms
Invalid txs: 0

### Memory data

 | Time | Used | Free |
|------|------|------|
 | 2025-03-27 15:37:47.726878831 UTC | 13.7G | 35.6G |
 | 2025-03-27 15:37:52.726824668 UTC | 13.9G | 35.5G |
 | 2025-03-27 15:37:57.726768654 UTC | 14.0G | 35.3G |
 | 2025-03-27 15:38:02.72675874 UTC | 14.0G | 35.3G |
 | 2025-03-27 15:38:07.726756126 UTC | 14.0G | 35.3G |
 | 2025-03-27 15:38:12.726795633 UTC | 14.0G | 35.2G |
 | 2025-03-27 15:38:17.726793141 UTC | 14.1G | 35.2G |
 | 2025-03-27 15:38:22.726757309 UTC | 14.1G | 35.1G |
 | 2025-03-27 15:38:27.726764279 UTC | 14.1G | 35.1G |
 | 2025-03-27 15:38:32.726781991 UTC | 14.1G | 35.1G |

Confirmed txs/Total expected txs: 300/300 (100.00 %)
Average confirmation time (ms): 9.418157436
P99: 19.8506584ms
P95: 15.841609050000002ms
P50: 7.821248000000001ms
Invalid txs: 0

### Memory data

 | Time | Used | Free |
|------|------|------|
 | 2025-03-27 15:38:45.195815881 UTC | 13.8G | 35.6G |
 | 2025-03-27 15:38:50.195894922 UTC | 14.0G | 35.4G |
 | 2025-03-27 15:38:55.19592388 UTC | 13.8G | 35.5G |
 | 2025-03-27 15:39:00.195971592 UTC | 14.1G | 35.2G |
 | 2025-03-27 15:39:05.195891924 UTC | 14.3G | 35.0G |
 | 2025-03-27 15:39:10.195897911 UTC | 14.3G | 35.0G |


  • I think I could try to keep the state file after running the benchmarks and then try to start a hydra-node using this (hopefully huge) state file and then peek into prometheus metrics to observe reduced memory usage.

  • What I find weird is that the same persistence functions are used in the api server but there is no reported leakage there - perhaps it boils down to how we consume this stream?

  • Managed to get a state file with over 300k events so let's see if we can measure reduced usage.

  • This is my invocation of hydra-node so I can copy paste it when needed:

./result/bin/hydra-node \
  --node-id 1 --api-host 127.0.0.1  \
  --monitoring-port 6000 \
  --hydra-signing-key /home/v0d1ch/code/hydra/memory/state-0/me.sk \
  --hydra-scripts-tx-id "8f46dbf87bd7eb849c62241335fb83b27e9b618ea4d341ffc1b2ad291c2ad416,25f236fa65036617306a0aaf0572ddc1568cee0bc14aee14238b1196243ecddd,59a236ac22eb1aa273c4bcd7849d43baddd8fcbc5c5052f2eb074cdccbe39ff4" \
  --cardano-signing-key /home/v0d1ch/code/hydra/memory/1.sk \
  --ledger-protocol-parameters /home/v0d1ch/code/hydra/memory/state-0/protocol-parameters.json \
  --testnet-magic 42 \
  --contestation-period 10 \
  --deposit-deadline 10 \
  --node-socket /home/v0d1ch/code/hydra/memory/node.socket \
  --persistence-dir /home/v0d1ch/code/hydra/memory/state-0 
 
  • Didn't find the time to properly connect some tool to measure the memory, but by looking at the timestamps between the LoadingState and LoadedState traces I can see that the new changes give MUCH better performance:
With the current master:

start loading timestamp":"2025-03-27T17:12:08.57862623Z
loaded        timestamp":"2025-03-27T17:12:28.991870713Z

With one-liner change:   

start loading timestamp":"2025-03-27T16:58:54.055623085Z
loaded        timestamp":"2025-03-27T16:59:15.05648201Z
  • It looks like it took us 20 seconds to load the ~335 MB state file and the new change reduces this to around a second!

2025-03-24

SB on pending deposits bug

  • It seems like we have a bug in displaying our pending deposits, where deposits that are already incremented or recovered still show up when requesting the hydra-node api /commits endpoint.

  • I extended one e2e test we had related to pending deposits and added one check after all the others, where I spin up two hydra-nodes again and call the endpoint to see if all pending deposits are cleared.

✦ ➜ cabal test hydra-cluster --test-options='--match="can see pending deposits" --seed 278123554'

  • The test seems flaky but in general it almost always fails.

  • From just looking at the code I couldn't see anything weird

  • Found one weird thing: I asserted that in the node-1 state file there are three CommitRecorded and three CommitRecovered, but in the node-2 state file two CommitRecovered are missing.

  • Is the whole bug related to who does the recovering/recording?

  • The test outcome, although red, shows the correct txids for the non-recovered txs

  • We only assert that node-1 sees all CommitRecovered messages but don't do it for node-2, since that node is shut down at this point (in order to prevent deposits from kicking in).

  • Is this a non-issue after all? I think so, since we stop one node and then try to assert that after restart it sees some other commits being recovered, but those were never recorded in that node's local state. What is weird is that the test was flaky, yet using a constant seed always yields the same results.

  • If a node fails to see OnIncrementTx then the deposit is stuck in pending local state forever.

2025-03-12

SB on Model tests weird behavior

  • Currently I had to sprinkle threadDelay here and there in the model tests, since otherwise they hang for a long time and eventually (I think) report the shrunk values that fail the test.

  • This problem is visible mainly in CI where the available resources are limited; locally the same tests pass.

  • If I remove the threadDelay the memory grows really big and I need to kill the process.

  • This started happening when I had to replace GetUTxO (which no longer exists) with queryState

  • I looked at this with NS and found out that we were not waiting for all nodes to see a DecommitFinalized - we were only waiting for our own node to see it. This seems to have fixed the model test, which was a bit surprising to be this easy since I expected a lot of trouble finding out what went wrong.

SB on figuring out what happened in our Head

  • The situation is that we are unable to close because of H13 MustNotChangeVersion

  • This happens because the version in the input datum (open datum) does not match with the version in the output (close datum).

  • Local state says I am on version 3 and onchain it seems the situation is the same - 3! But this can't be since the onchain check would pass then. This is how the datum looks https://preview.cexplorer.io/datum/8e4bd7ac38838098fbf23e5702653df2624bcfa4cf0c5236498deeede1fdca78

  • Looking at the state, it seems we try to close and the snapshot contains the correct version (3), but openVersion is still at 2:

  ...
                  "utxoToCommit": null,
                  "utxoToDecommit": null,
                  "version": 3
                },
                "tag": "ConfirmedSnapshot"
              },
              "headId": "50bb0874ae28515a2cff9c074916ffe05500a3b4eddea4178d1bed0b",
              "headParameters": {
                "contestationPeriod": 300,
                "parties": [
...
              "openVersion": 2,
              "tag": "CloseTx"
            },
            "tag": "OnChainEffect"
          }
		  
  • The question is: how did we get to this place? It must be that my node didn't observe and emit one CommitFinalized, which is when we do the version update - upon increment observation.

  • There are 24 lines with a CommitFinalized message - these only go up to version 2 - while there are 36 lines with CommitRecorded. It seems like one recorded commit was not finalized for whatever reason.

  • OnIncrementTx shows up 8 times in the logs but in reality it is tied to only two increments so the third one was never observed.

  • OnDepositTx shows up 12 times in the logs but they are related to only two deposits.

  • Could it be that the decommit failed instead?

  • There is one DecommitRecorded and one DecommitFinalized so it seems good.

  • Seems like we have CommitRecorded for:

    • "utxoToCommit":{"4b31dd7db92bde4359868911c1680ea28c0a38287a4e5b9f3c07086eca1ac26a#0"
    • "utxoToCommit":{"4b31dd7db92bde4359868911c1680ea28c0a38287a4e5b9f3c07086eca1ac26a#1"
    • "utxoToCommit":{"22cb19c790cd09391adf2a68541eb00638b8011593b3867206d2a12a97f4bf0d#0"
  • We received CommitFinalized for:

    • "theDeposit":"44fa1bc9b04d2ffee50fd84088517c3f7b530353834e7c678fdd05073881cb40"
    • "theDeposit":"5b93f95068148482a1e27979517e8ab467f85e72551cfc9baaa2086a60e7353a"
  • So one commit was never finalized but it is a bit hard to connect recorded and finalized commits.

  • OnDepositTx was seen for txids:

    • 44fa1bc9b04d2ffee50fd84088517c3f7b530353834e7c678fdd05073881cb40
    • 5b93f95068148482a1e27979517e8ab467f85e72551cfc9baaa2086a60e7353a
    • 83e7c36a9d4727e00169409f869d0f94737672c7e87850632b9efe1637f8ef8f

  • OnIncrementTx was seen for:

  • Question is what to do with this Head? Can it be closed somehow?

  • We should query the deposit address to see what kind of UTxOs are available there.

2025-03-10

FT on SideLoad-Snapshot

  • Added an endpoint to GET the latest confirmed snapshot, which is needed to construct the side-load request, but it does not include information about the latest seen snapshot. Waiting on pull#1860 to enhance it.

  • In our scenario, the head got stuck on InitialSnapshot. This means that during side-loading, we must act similarly to clear pending transactions (pull#1840).

  • Wonder if the side-loaded snapshot version should be exactly the same as the current one, given that version bumping requires L1 interaction.

  • Also unclear if we should validate utxoToCommit and utxoToDecommit on the provided snapshot to match the last known state.

  • Concerned that a head can become stuck during a Recover or Decommit client input.

  • SideLoadSnapshot is the first ClientInput that contains a headId and must be verified when received by the node.

  • Uncertain whether WaitOnNotApplicableTx for localTxs not present in the side-loaded confirmed snapshot would trigger automatic re-submission.

  • I think this feature should not be added to TUI since it is not part of the core protocol or user journey.

2025-03-07

FT on SideLoad-Snapshot

  • Now that we have a head stuck on the initial snapshot, I want to explore how we can introspect the node state from the client side, as this will be necessary to create the side-load request.

  • Projecting the latest SnapshotConfirmed seems straightforward, but projecting the latest SeenSnapshot introduces code duplication in HeadLogic.aggregate and the ServerOutput projection.

  • These projections currently conflict heavily with pull#1860. For that reason, we are postponing these changes until it is merged.

2025-03-06

FT on SideLoad-Snapshot

  • We need to break down withHydraNode into several pieces to allow starting a node with incorrect ledger-protocol-params in its running configuration.

  • In this e2e scenario, we exercise a three-party network where two nodes (node-1 and node-2) are healthy, and one (node-3) is misconfigured. In this setup, node-1 attempts to submit a NewTx which is accepted by both healthy members but rejected by node-3. Then, when node-3 goes offline and comes back online using healthy pparams, it is expected to stop cooperating and cause the head to become stuck.

  • It seems that after node-3 comes back online, it only sees a PeerConnected message within 20s. Adding a delay for it to catch up does not help. From its logs, we don’t see messages for WaitOnNotApplicableTx, WaitOnSeenSnapshot, or DroppedFromQueue.

  • If node-3 tries to re-submit the same transaction, it is now accepted by node-3 but rejected by node-1 and node-2 due to ValueNotConservedUTxO (because it was already applied). Since node-3 is not the leader, we don’t see any new SnapshotRequested round being signed.

  • Node-1 and node-2 have already signed and observed each other signing for snapshot number 1, while node-3 has not seen anything. This means node-1 and node-2 are waiting for node-3 to sign in order to proceed. Now the head is stuck and won’t make any progress because node-3 has stopped cooperating.

  • New issue raised for head getting stuck issue#1773, which proposes to forcibly sync the snapshots of the hydra-nodes in order to align local ledger states.

  • Updating the sequence diagram for a head getting stuck using latest findings.

  • Now thinking about how we could "Allow introspection of the current snapshot in a particular node", as we want to be able to notice if the head has become stuck. We want to be able to observe who has not yet signed the current snapshot in flight (which is preventing it from getting confirmed).

  • Noticed that in onOpenNetworkReqTx we keep TransactionReceived even if not applicable, resulting in a list with potentially duplicate elements (in case of resubmission).

  • Given that a head becoming stuck is an L2 issue due to network connectivity, I’m considering whether we could send more information about the local ledger state as part of PeerConnected to trigger auto-sync recovery based on discrepancies. Or perhaps we should broadcast InvalidTx instead?

  • Valid idea to explore after side-load.

2025-03-05

FT on SideLoad-Snapshot

  • Trying to reproduce a head becoming stuck in BehaviorSpec when a node starts with an invalid ledger.

  • Having oneMonth in BehaviorSpec's waitUntilMatch makes debugging harder. Reduced it to (6 * 24 * 3), allowing full output visibility.

  • After Bob reconnects using a valid ledger, we expected him to accept the transaction if re-submitted by him, but he rejects it instead.

  • It's uncertain whether Bob is rejecting the resubmission or something else, so I need to wait until all transactions are dropped from the queue.

  • Found that when Bob is resubmitting, he is in Idle state when he is expected to restart in Initial state.

  • This is interesting, as if a party suffers a disk error and loses persistence, side-loading may allow it to resume up to a certain point in time.

  • The idea is valid, but we should not accept a side-load when in Idle state—only when in Open state.

  • It seems this is the first time we attempt to restart a node in BehaviorSpec. Now checking if this is the right place or if I should design the scenario differently.

  • When trying to restart the node from existing sources, we noticed the need to use the hydrate function. This suggests we should not force reproducing this scenario in BehaviorSpec.

  • NodeSpec does not seem to be the right place either, as we don't have multiple peers connected to each other.

  • Trying to reproduce the scenario at the E2E level, now running on top of an etcd network.

2025-03-04

SB on fixing the persistence bug

  • Continuing where I left off yesterday - fixing a single test that should throw IncorrectAccessException but instead, as seen yesterday, throws:
 uncaught exception: IOException of type ResourceBusy
  • When I sprinkle some spy' calls to see the actual thread ids I don't get this exception anymore; the test just fails. So the exception is tightly coupled with how we check for threads in the PersistenceIncremental handle.
  • I tried labeling the threads and using throwTo from MonadFork but the result is the same.
  • Tried using withBinaryFile in both source and append and using conduit to stream from/to the file, but that didn't help.
  • Tried using bracket with openBinaryFile and then sink/source handle in the callback but the results are the same.
  • What is happening here?

2025-03-03

SB on api server as the event sink

  • There are only two problems left to solve here. The first is the IncorrectAccessException from persistence in the cluster tests - for this one I have a plan (have a way to register a thread that will append). The other problem is that some cluster tests fail since the appropriate message was not observed.

  • One example test is persistence can load with empty commit.

  • I wanted to verify whether the messages are coming through, since the test fails at waitFor, and I see the messages propagated (but I don't see HeadIsOpened twice!)

  • Looking at the messages, the Greetings message does not contain the correct HeadStatus anymore! There was a projection that made sure to update this field in the Greetings message, but now we shuffled things around and I don't think this projection works anymore.

  • I see all messages correct (except headStatus in Greetings) but only propagated once (and we do restart the node in our test).

  • I see the api server being spun up twice, but the second time I don't see the message replay for some reason.

  • One funny thing is I see ChainRollback - perhaps something around this is broken?

  • I see one rebase mistake in Monitoring module that I reverted.

  • After some debugging I notice that the history loaded from the conduit is always an empty list. This is the cause of our problems here!

  • Still digging around the code to try and figure out what is happening. I see HeadOpened saved in the persistence file and can't for the life of me figure out why it is not loaded on restart. I even tried passing in the complete, intact event source conduit to make sure I am not consuming the conduit in the Server and leaving it empty for the WSServer, but this is not the problem I am having.

  • I remapped all projections to work with StateChanged instead of ServerOutput since it makes no sense to remap to ServerOutput just for that.

  • Suspecting that mapWhileC is the problem since it would stop each time it can't convert some StateEvent to ServerOutput from disk!

  • This was it - mapWhileC stops when it encounters Nothing, so it was not processing the complete list of events! So happy to fix this (see the first sketch at the end of this entry).

  • Next is to tackle the IncorrectAccessException from persistence. I know why this happens (obviously we try to append from a different thread), and sourcing the contents of a persistence file should not be guarded by a thread id check. In fact, we should allow all possible clients to consume the (streamed) persistence contents and make sure to only append from one thread - the one in which the hydra-node process is actually running.

  • I added another field to PersistenceIncremental called registerThread whose sole purpose is to register the thread we run in - so that we are able to append (I also removed the check for thread id from source and moved it to append).

  • Ok, this was not the fix I was looking for. The registerThread is hidden in the persistence handle, so if you don't have access to it from the outside, how would you register a thread (for example in our tests)?

  • I ended up registering a thread id on append if it doesn't exist and checking against it on subsequent appends (see the second sketch at the end of this entry), but I see one failure:


  test/Hydra/PersistenceSpec.hs:59:5:
  1) Hydra.Persistence.PersistenceIncremental it cannot load from a different thread once having started appending
       uncaught exception: IOException of type ResourceBusy
       /tmp/hydra-persistence-33802a411f862b7a/data: openBinaryFile: resource busy (file is locked)
       (after 1 test)
         []
         [String "WT",Null,Null]

I still need to investigate.
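
To illustrate the mapWhileC pitfall (first sketch; a self-contained toy, not the code that landed): mapWhileC terminates at the first element it cannot convert, while e.g. Data.Conduit.List.mapMaybe simply skips it.

import Conduit (mapWhileC, runConduitPure, sinkList, yieldMany, (.|))
import qualified Data.Conduit.List as CL

-- Stand-in for "this StateEvent has no ServerOutput representation": odd numbers fail.
convert :: Int -> Maybe Int
convert n = if even n then Just n else Nothing

stopsEarly :: [Int]
stopsEarly = runConduitPure $ yieldMany [2, 4, 5, 6, 8] .| mapWhileC convert .| sinkList
-- [2,4]: terminates at the first Nothing, dropping everything after it

skipsInstead :: [Int]
skipsInstead = runConduitPure $ yieldMany [2, 4, 5, 6, 8] .| CL.mapMaybe convert .| sinkList
-- [2,4,6,8]: unconvertible events are skipped, the rest still flows through

And a minimal sketch of the append-side thread guard (second sketch; simplified to plain IO, not the real PersistenceIncremental handle): the first append records the calling thread, later appends from any other thread throw IncorrectAccessException, and source stays unguarded.

import Control.Concurrent (ThreadId, myThreadId)
import Control.Concurrent.STM (TVar, atomically, readTVar, writeTVar)
import Control.Exception (Exception, throwIO)

newtype IncorrectAccessException = IncorrectAccessException String
  deriving (Show)

instance Exception IncorrectAccessException

-- Simplified handle: only the append side is thread-guarded, sourcing is open to anyone.
data PersistenceIncremental a m = PersistenceIncremental
  { append :: a -> m ()
  , source :: m [a]
  }

-- Wrap a raw append function such that only the first-registered thread may append.
guardedAppend :: TVar (Maybe ThreadId) -> (a -> IO ()) -> a -> IO ()
guardedAppend registeredTV doAppend a = do
  tid <- myThreadId
  ok <- atomically $ do
    registered <- readTVar registeredTV
    case registered of
      Nothing -> writeTVar registeredTV (Just tid) >> pure True
      Just tid' -> pure (tid' == tid)
  if ok
    then doAppend a
    else throwIO $ IncorrectAccessException "append called from an unexpected thread"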

February 2025

2025-02-27

SB on state of things regarding api server memory consumption

  • There is no CommandFailed and no ClientEffect anymore

  • We don't have the GetUTxO client input anymore, therefore I had to call the api using a GET /snapshot/utxo request to obtain this information (in cluster tests)

  • For the tests that don't spin up the api server I used TestHydraClient and its queryState function to obtain the HeadState, which in turn contains the head UTxO.

  • One important thing to note is that I had to add utxoToCommit in the snapshot projection in order to get the expected UTxO. This was a bug we had and nobody noticed.

  • We return the Greetings and InvalidInput types from the api server without wrapping them into TimedServerOutput, which is a bit annoying since now we need to double-parse json values in tests: if decoding as TimedServerOutput fails, we try to parse just the ServerOutput.
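
The double parsing in the tests boils down to something like this (a sketch; the helper name is made up and the concrete types are whatever the test expects):

import Data.Aeson (FromJSON, Value)
import qualified Data.Aeson as Aeson

-- Try to decode a wrapped TimedServerOutput first and fall back to the bare
-- ServerOutput, since Greetings and InvalidInput are sent unwrapped.
decodeOutput ::
  (FromJSON timed, FromJSON bare) =>
  Value ->
  Either String (Either timed bare)
decodeOutput v =
  case Aeson.fromJSON v of
    Aeson.Success timed -> Right (Left timed)
    Aeson.Error _ ->
      case Aeson.fromJSON v of
        Aeson.Success bare -> Right (Right bare)
        Aeson.Error e -> Left e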

Current problems:

  • After adding /?history=yes to hydra-cluster tests api client I started seeing IncorrectAccessException from the persistence. This is weird to me since all we do is read from the persistence event sink.

  • Querying the hydra node state in our Model tests to get the Head UTxO (instead of using GetUTxO client input) hangs sometimes and I don't see why. I suspect this has something to do with threads spawned in the model tests:

This is the diff, it looks benign:


 waitForUTxOToSpend ::
   forall m.
-  (MonadTimer m, MonadDelay m) =>
+  MonadDelay m =>
   UTxO ->
   CardanoSigningKey ->
   Value ->
   TestHydraClient Tx m ->
   m (Either UTxO (TxIn, TxOut CtxUTxO))
-waitForUTxOToSpend utxo key value node = go 100
+waitForUTxOToSpend utxo key value node = do
+  u <- headUTxO node
+  threadDelay 1
+  if u /= mempty
+    then case find matchPayment (UTxO.pairs u) of
+      Nothing -> pure $ Left utxo
+      Just (txIn, txOut) -> pure $ Right (txIn, txOut)
+    else pure $ Left utxo
  where
-  go :: Int -> m (Either UTxO (TxIn, TxOut CtxUTxO))
-  go = \case
-    0 ->
-      pure $ Left utxo
-    n -> do
-      node `send` Input.GetUTxO
-      threadDelay 5
-      timeout 10 (waitForNext node) >>= \case
-        Just (GetUTxOResponse _ u)
-          | u /= mempty ->
-              maybe
-                (go (n - 1))
-                (pure . Right)
-                (find matchPayment (UTxO.pairs u))
-        _ -> go (n - 1)
-
   matchPayment p@(_, txOut) =
     isOwned key p && value == txOutValue txOut

Model tests sometimes succeed but this is not good enough and we don't want any more flaky tests.

2025-02-26

SN troubleshooting unclean restarts on etcd branch

4) Test.EndToEnd, End-to-end on Cardano devnet, restarting nodes, close of an initial snapshot from re-initialized node is contested
    Process "hydra-node (2)" exited with failure code: 1
    Process stderr: RunServerException {ioException = Network.Socket.bind: resource busy (Address already in use), host = 0.0.0.0, port = 4002}
  • Seems like the hydra-node is not shutting down cleanly, which leads to failures like the one above
  • Isolated test scenarios where we simply expect withHydraNode to start/stop and restart within a certain time and not fail
  • Running these tests on master worked fine?! Seems to have something to do with etcd?
  • When debugging withHydraNode and trying to port it to typed-process, I noticed that we don't need the withHydraNode' variant really -> merged them
  • Back to the tests.. why are they failing while the hydra-node binary seems to behave just fine interactively?
  • With several threadDelay and prints all over the place I saw that the hydra-node spawns etcd as a sub-process, but when withProcess (any of its variants) results in stopProcess, the etcd child stays alive!
  • Issuing a ctrl+c in ghci has the etcd process log a signal detected and it shuts down
  • We are not sending SIGINT to the etcd process? Tried interruptProcessGroupOf in the Etcd module
  • My handlers (finally or bracket) are not called!? WTF moment
  • Found this issue which mentions that withProcess sends SIGTERM that is not handled by default
    • Some familiar faces on this one
    • This is also an interesting paragraph about how ctrl+c can be delegated to sub-process (not what we needed)
  • So the solution is two-fold:
    • First, we need to make sure to send SIGINT to the etcd process whenever we are asked to shut down too (in the Etcd module)
    • Also, we should initiate a graceful shutdown when the hydra-node receives SIGTERM
      • This is a better approach than making withHydraNode send a SIGINT to hydra-node
      • While that would work too, handling SIGTERM in hydra-node is more generally useful
      • For example, a docker stop sends SIGTERM to the main process in a container
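
A rough sketch of both halves (assuming the unix package's installHandler and typed-process' getPid; the real Etcd module and hydra-node wiring will differ):

import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import System.Posix.Signals (Handler (Catch), installHandler, sigINT, sigTERM, signalProcess)
import System.Process.Typed (Process, getPid)

-- Half one: when the Etcd component shuts down, send SIGINT to the etcd
-- sub-process ourselves, since the default SIGTERM from stopProcess is not handled.
stopEtcd :: Process stdin stdout stderr -> IO ()
stopEtcd p = do
  mpid <- getPid p
  case mpid of
    Nothing -> pure () -- already exited
    Just pid -> signalProcess sigINT pid

-- Half two: make hydra-node react to SIGTERM (e.g. from docker stop) by running
-- its normal graceful shutdown action; blocks until SIGTERM arrives.
onSigTerm :: IO () -> IO ()
onSigTerm shutdown = do
  done <- newEmptyMVar
  _ <- installHandler sigTERM (Catch $ putMVar done ()) Nothing
  takeMVar done
  shutdown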

2025-02-19

SN working on etcd grpc client integration

  • When starting to use grapesy I had a conflict of ouroboros-network needing an older network than grapesy. Made me drop the ouroboros modules first.
  • Turns out we still depend transitively on the ouroboros-network packages (via cardano-api), but the cabal resolver errors are even worse.
  • Adding an allow-newer: network still works
  • Is it fine to just use a newer version of network in the ouroboros-network?
  • The commits that bumped the upper bound do not indicate otherwise
  • Explicitly listed all packages in allow-newer and moved on with life
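
For reference, the resulting cabal.project stanza has this shape (the package names are illustrative, not the exact list the solver complained about):

-- cabal.project (sketch)
allow-newer:
    ouroboros-network:network
  , ouroboros-network-framework:network
  , ouroboros-network-api:network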

2025-02-17

SN working on etcd network connectivity

  • Working on PeerConnected (or an equivalent) for etcd network.
  • Changing the inbound type to Either Connectivity msg does not work well with the Authentication layer?
  • The composition using components (ADR7: https://hydra.family/head-protocol/adr/7) is quite complicated and only allows for an all-or-nothing interface out of a component, without much support for optional parts.
  • In particular, an Etcd component that delivers Either Connectivity msg as inbound messages cannot be composed easily with the Authenticate component that verifies signatures of incoming messages (it would need to understand that this is an Either and only do it for Right msg).
  • Instead, I explore expanding NetworkCallback to not only deliver, but also provide an onConnectivity callback.
  • After designing a more composable onConnectivity handling, I wondered how the Etcd component would be determining connectivity.
  • The etcdctl command line tool offers a member list command which returns a list of members if on a majority cluster, e.g.
{"header":{"cluster<sub>id</sub>":8903038213291328342,"member<sub>id</sub>":1564273230663938083,"raft<sub>term</sub>":2},"members":\[{"ID":1564273230663938083,"name":"127.0.0.1:5001","peerURLs":\["<http://127.0.0.1:5001>"\],"clientURLs":\["<http://127.0.0.1:2379>"\]},{"ID":3728543818779710175,"name":"127.0.0.1:5002","peerURLs":\["<http://127.0.0.1:5002>"\],"clientURLs":\["<http://127.0.0.1:2380>"\]}\]}
  • But when invoked on a minority cluster it returns
  {"level":"warn","ts":"2025-02-17T22:49:48.211708+0100","logger":"etcd-client","caller":"[email protected]/retry<sub>interceptor</sub>.<go:63>","msg":"retrying
  of unary invoker
  failed","target":"etcd-endpoints://0xc000026000/127.0.0.1:2379","attempt":0,"error":"rpc
  error: code = DeadlineExceeded desc = context deadline exceeded"}
  Error: context deadline exceeded
  • When it cannot connect to an etcd instance it returns
  {"level":"warn","ts":"2025-02-17T22:49:32.583103+0100","logger":"etcd-client","caller":"[email protected]/retry<sub>interceptor</sub>.<go:63>","msg":"retrying
  of unary invoker
  failed","target":"etcd-endpoints://0xc0004b81e0/127.0.0.1:2379","attempt":0,"error":"rpc
  error: code = DeadlineExceeded desc = latest balancer error: last
  connection error: connection error: desc = \\transport: Error while
  dialing: dial tcp 127.0.0.1:2379: connect: connection refused\\"}
  Error: context deadline exceeded
  • When implementing pollMembers, suddenly the waitMessages was not blocked anymore?
  • While a little crude, polling member list works nicely to get a full list of members (if we are connected to the majority cluster).
  • All this will change when we switch to a proper grpc client anyways
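
A sketch of the polling idea (hedged: the field names follow the etcdctl -w json output above, readProcess comes from typed-process, and the real Etcd module most likely differs):

{-# LANGUAGE DeriveGeneric #-}

import Control.Concurrent (threadDelay)
import Control.Monad (forever)
import Data.Aeson (FromJSON, eitherDecode)
import GHC.Generics (Generic)
import System.Process.Typed (proc, readProcess)

newtype MemberList = MemberList {members :: [Member]}
  deriving (Generic)

data Member = Member {name :: String, peerURLs :: [String]}
  deriving (Generic)

instance FromJSON MemberList
instance FromJSON Member

-- Poll `etcdctl member list -w json` and hand the member names to a callback; if
-- the output cannot be decoded (e.g. minority cluster, command failed), report none.
pollMembers :: ([String] -> IO ()) -> IO ()
pollMembers onMembers = forever $ do
  (_exitCode, out, _err) <- readProcess (proc "etcdctl" ["member", "list", "-w", "json"])
  case eitherDecode out of
    Right (MemberList ms) -> onMembers (map name ms)
    Left _ -> onMembers []
  threadDelay 1000000 -- poll once per second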

2025-02-05

SB on running conduit only once for projections

  • The current problem we want to solve: instead of passing a conduit to the mkProjection function and running it inside, we would like to stream data to all of the projections we have.

  • Seems like this is easier said than done since we also rely on the projection result - a Projection handle that is used to update the TVar inside.

  • I thought it might be a good idea to alter mkProjection and make it run in ConduitT so it can receive events and propagate them further and then, in the end return the Projection handle.

  • I made changes to mkProjection that compile:

mkProjection ::
-  (MonadSTM m, MonadUnliftIO m) =>
+  MonadSTM m =>
   model ->
   -- | Projection function
   (model -> event -> model) ->
-  ConduitT () event (ResourceT m) () -> 
-  m (Projection (STM m) event model)
-mkProjection startingModel project eventSource = do
-  tv <- newTVarIO startingModel
-  runConduitRes $
-    eventSource .| mapM_C (lift . atomically . update tv)
-  pure
+  ConduitT event (Projection (STM m) event model) m ()
+mkProjection startingModel project = do
+  tv <- lift $ newTVarIO startingModel
+  meventSource <- await
+  _ <- case meventSource of
+    Nothing -> pure ()
+    Just eventSource ->
+      void $ yield eventSource .| mapM_C (atomically . update tv)
+  yield $
     Projection
       { getLatest = readTVar tv
       , update = update tv

but the main issue is that I can't get the results of all the projections we need in the end that easily.

-- does not compile
headStatusP <- runConduitRes $ yield outputsC .| mkProjection Idle projectHeadStatus
  • We need to be able to process streamed data from disk and also output like 5 of these projections that do different things.
  • I discovered sequenceConduits which allows collection of the conduit result values.
  • The idea was to collect all projections which have the capability of receiving events as the conduit input.
[headStatusP] <- runConduit $ sequenceConduits [mkProjection Idle projectHeadStatus] >> sinkList
  • Oh, just realized sequenceConduits needs all conduits to have exactly the same result type, so my plan just failed

I think I need to revisit our approach and start from scratch.
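
One conduit feature that could sidestep the same-result-type restriction of sequenceConduits is ZipConduit: its Applicative instance feeds every input value to all wrapped conduits and combines their results, so projections with different model types could share a single pass over the stream. A rough sketch with toy "projections" (not the real ones, and not necessarily the design we will end up with):

import Data.Conduit (ConduitT, ZipConduit (..), runConduitPure, (.|))
import qualified Data.Conduit.Combinators as C

-- Two toy projections over the same event stream, with different result types.
sumEvents :: Monad m => ConduitT Int o m Int
sumEvents = C.foldl (+) 0

describeLast :: Monad m => ConduitT Int o m String
describeLast = maybe "no events" show <$> C.last

-- Feed one stream of events to both, collecting both results in a single pass.
runBoth :: (Int, String)
runBoth =
  runConduitPure $
    C.yieldMany [1 .. 10 :: Int]
      .| getZipConduit ((,) <$> ZipConduit sumEvents <*> ZipConduit describeLast)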

January 2025

2025-01-23

SB on state events streaming

  • So what we want to do, as the final outcome, is to reduce the memory footprint of the hydra-node

  • There are a couple of ADRs related to persisting a stream of events and having different sinks that can read from the streams

  • Our API needs to become one of these event sinks

  • The first step is to prevent history output by default as history can grow pretty large and it is all kept in memory

  • We need to remove the ServerOutput type and map all missing fields to the StateChange type, since that is what we will use to persist the changes to disk

  • I understand that we will keep existing projections but they will work on the StateChange type and each change will be forwarded to any existing sinks as the state changes over time

  • We already have PersistenceIncremental type that appends to disk, can we use similar handle? Most probably yes - but we need to pick the most performant function to write/read to/from disk.

  • Seems like we currently use eventPairFromPersistenceIncremental to set up the event stream/sink. What we do is load all events from disk. We also have a TVar holding the event id. Ideally we would like to output every new event in our api server. I should take a look at our projections to see how we output individual messages.

  • Ok, yeah, projections are displaying the last message but looking at this code I am realizing how complex everything is. We should strive for simplicity here.

  • Another thought - would it help us to use Servant at least to separate the routing and handlers? I think it could help but otoh Servant can get crazy complex really fast.

  • So after looking at the relevant code and the issue https://github.com/cardano-scaling/hydra/issues/1618 I believe the most complex thing would be this: "Websocket needs to emit this information on new state changes". But even this is not hard I believe, since we have control over what we need to do when setting up the event source/sink pair.
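
To make the "emit on new state changes" part concrete, the api-side event sink could be as small as this (a sketch following the handle pattern from the ADRs; StateChanged is a stand-in for the real type and the broadcast channel is an assumption about how websocket clients would subscribe):

import Control.Concurrent.STM (TChan, atomically, dupTChan, newBroadcastTChanIO, writeTChan)

-- Stand-in for the real hydra-node state change type.
data StateChanged = SomeStateChange
  deriving (Show)

-- Simplified event sink handle: just a way to put an event somewhere.
newtype EventSink e m = EventSink {putEvent :: e -> m ()}

-- An event sink that forwards every new state change to a broadcast channel;
-- each websocket connection duplicates the channel to get its own read end.
mkApiEventSink :: IO (EventSink StateChanged IO, IO (TChan StateChanged))
mkApiEventSink = do
  chan <- newBroadcastTChanIO
  let sink = EventSink $ \e -> atomically (writeTChan chan e)
      subscribe = atomically (dupTChan chan)
  pure (sink, subscribe)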

SN on streaming events

  • Streaming events using conduit makes us buy into the unliftio and resourcet environment. Does this go well with our MonadThrow et al classes?
  • When using conduits in createHydraNode, the runConduitRes requires a MonadUnliftIO context. We have an IOSim usage of this though, and it's not clear if there can even be a MonadUnliftIO (IOSim s) instance?
  • We are not only loading [StateEvent] fully into memory, but also [ServerOutput].
  • Made mkProjection take a conduit, but then we are running it for each projection (3 times). Should do something with fuseBoth or a zip-like conduit combination.

2025-01-22

SN on multi version explorer

  • Started simplifying the hydra-explorer and wanted to get rid of all hydra-node, hydra-tx etc. dependencies because they include most of the cardano ecosystem. However, on the observer api we will need to refer to cardano specifics like UTxO and some hydra entities like Party or HeadId. So a dependency on hydra-tx is most likely needed.
  • Shouldn't these hydra specific types be in an actual hydra-api package? The hydra-tx or a future hydra-client could depend on that then.
  • When defining the observer API I was reaching for the OnChainTx data type as it has json instances and enumerates the things we need to observe. However, this would mean we need to depend on hydra-node in the hydra-explorer.
  • Could use the HeadObservation type, but that one is maybe a bit too low level and does not have JSON instances?
  • OnChainTx is really the level of detail we want (instantiated for cardano transactions, but not corrupted by cardano internal specifics)
  • Logging in the main entry point of Hydra.Explorer depends on hydra-node anyway. We could explore something different to get rid of this? Got https://hackage.haskell.org/package/Blammo recommended to me.
  • Got everything to compile (with a cut-off hydra-chain-observer). Now I want to have an end-to-end integration test for hydra-explorer that does not concern itself with individual observations, but rather checks that the (latest) hydra-chain-observer can be used with hydra-explorer. That, plus some (golden) testing against the openapi schemas, should be enough test coverage.
  • Modifying hydra and hydra-explorer repositories to integration test new http-based reporting.
    • Doing so offline from a plane is a bit annoying as both nix and cabal would be pulling dependencies from the internet.
    • Working around using an alias to the cabal built binary:
        alias hydra-chain-observer=../../hydra/dist-newstyle/build/x86_64-linux/ghc-9.6.6/hydra-chain-observer-0.19.0/x/hydra-chain-observer/build/hydra-chain-observer/hydra-chain-observer
  • cabal repl is not picking up the alias, maybe need to add it to PATH?
  • Adding an export PATH=<path to binary>:$PATH to .envrc is quite convenient
  • After connecting the two servers via a bounded queue, the test passes but the sub-processes are not gracefully stopped.
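
The bounded-queue glue between the two processes is essentially this (a sketch with made-up names and a placeholder Observation type; the real wiring lives in the integration test):

import Control.Concurrent.Async (concurrently_)
import Control.Concurrent.STM (atomically)
import Control.Concurrent.STM.TBQueue (newTBQueueIO, readTBQueue, writeTBQueue)
import Control.Monad (forever)

-- Placeholder for whatever hydra-chain-observer reports to hydra-explorer.
data Observation = Observation {headId :: String, summary :: String}

-- Connect an observer (producer) to the explorer (consumer) via a bounded queue,
-- so a slow consumer applies back-pressure instead of growing memory without bound.
connect ::
  ((Observation -> IO ()) -> IO ()) ->
  (Observation -> IO ()) ->
  IO ()
connect runObserverWith handleObservation = do
  queue <- newTBQueueIO 100
  concurrently_
    (runObserverWith $ \obs -> atomically (writeTBQueue queue obs))
    (forever $ atomically (readTBQueue queue) >>= handleObservation)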

2025-01-21

SB on stake certificate registration

  • I created a relevant issue to track this new feature request to enable stake certificates on L2 ledger.
  • Didn't plan on working on this right away but wanted to explore a problem with PPViewHashesDontMatch when trying to submit a new tx on L2.
  • This happens both when obtaining the protocol-parameters from the hydra-node and when I query them from the cardano-node (the latter is expected to fail on L2 since we reduce the fees to zero)
  • I added a line to print the protocol-parameters in our tx printer and it seems like changePParams is not setting them correctly for whatever reason:
changePParams :: PParams (ShelleyLedgerEra Era) -> TxBodyContent BuildTx -> TxBodyContent BuildTx
changePParams pparams tx =
  tx{txProtocolParams = BuildTxWith $ Just $ LedgerProtocolParameters pparams}
 
  • There is setTxProtocolParams I should probably use instead (see the sketch at the end of this entry).
  • No luck, how come this didn't work? I don't see why setting the protocol-parameters like this doesn't work....
  • I even compared the protocol-parameters loaded into the hydra-node and the ones I get back from hitting the hydra-node api and they are the same as expected
  • Running out of ideas
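
The setter-based variant would look roughly like this (a sketch; I am assuming setTxProtocolParams has the usual cardano-api setter shape, which I have not verified here):

-- Assumed setter shape; should be equivalent to the record update above.
changePParams :: PParams (ShelleyLedgerEra Era) -> TxBodyContent BuildTx -> TxBodyContent BuildTx
changePParams pparams =
  setTxProtocolParams (BuildTxWith (Just (LedgerProtocolParameters pparams)))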

2025-01-20

SB on looking at withdraw zero problem

  • I want to know why I get mismatch between pparams on L2?
  • It is because we start the hydra-node in a separate temp directory from the test driver so I got rid of the problem by querying hydra-node to obtain L2 protocol-parameters
  • The weird issue I get is that the budget is overspent and it seems bumping the ExecutionUnits doesn't help at all.
  • When pretty-printing the L2 tx I noticed that cpu and memory for the cert redeemer are both zero, so that must be the culprit
  • Adding the cert redeemer separately fixed the issue but I am now back to PPViewHashesDontMatch.
  • Not sure why this happens since I am doing a query to obtain hydra-node protocol parameters and using those to construct the transaction.
  • Note that even if I don't change protocol-parameters the error is the same
  • This whole chunk of work is to register a script address as a stake certificate and I still need to try to withdraw zero after this is working.
  • One thing I wanted to do is to use the dummy script as the provided Data in the Cert Redeemers - is this even possible?

2025-01-08

SN on aiken pinning & cleanup

  • When trying to align aiken version in our repository with what is generated into plutus.json, I encountered errors in hydra-tx tests even with the same aiken version as claimed.

  • Error: Expected the B constructor but got a different one

  • Seems to originate from plutus-core when it tries to run the builtin unBData on data that is not a B (bytestring)

  • The full error in hydra-tx tests actually includes what it tried to unBData: Caused by: unBData (Constr 0 [ Constr 0 [ List [ Constr 0 [ Constr 0 [ B #7db6c8edf4227f62e1233880981eb1d4d89c14c3c92b63b2e130ede21c128c61 , I 21 ] , Constr 0 [ Constr 0 [ Constr 0 [ B #b0e9c25d9abdfc5867b9c0879b66aa60abbc7722ed56f833a3e2ad94 ] , Constr 1 [] ] , Map [(B #, Map [(B #, I 231)])] , Constr 0 [] , Constr 1 [] ] ] , Constr 0 .... This looks a lot like a script context. Maybe something off with validator arguments?

  • How can I inspect the uplc of an aiken script?

  • It must be the "compile-time" parameter of the initial script, which expects the commit script hash. If we use that unapplied on the transaction, the script context trips the validator code.

  • How was the initialValidatorScript used on master such that these tests / usages pass?

  • Ahh .. someone applied the commit script parameter and stored the resulting script in the plutus.json! Most likely using aiken blueprint apply -v initial and then passing the aiken blueprint hash -v commit into that.

  • Realized that the plutus.json blueprint would have said that a script has parameters.
