Horcrux Signer Migration Runbook - Cumulo-pro/Horcrux-Architecture GitHub Wiki

Generic Procedure -- 3 Cosigners (Threshold 2/3)

This document describes the safe migration of a single Horcrux signer to a new server (new IP / new provider / new region) without resharing keys and without modifying the threshold.

The procedure is intentionally generic and does not reference any specific infrastructure.


Scope

✅ Replace one cosigner with a new server
✅ Keep same shard and ECIES keys
✅ Rebuild Raft cluster cleanly
✅ Maintain threshold signing (2/3)

❌ Not covered:

  • Changing threshold
  • Adding or removing cosigners
  • Re-sharding validator key
  • Rotating validator key


Cluster Assumptions

  • 3 cosigners
  • threshold = 2
  • Cosigner P2P port: 2222/tcp
  • Sentry priv_validator_laddr port: 1234/tcp
  • All cosigners can reach all sentries
  • SSH access to all signer nodes
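The reachability assumptions can be checked from the new server before any production change. The following is a minimal sketch, assuming a netcat build that supports `-z` and `-w`; the function names and the host placeholders in the usage note are illustrative, not part of Horcrux.

```shell
# Sketch: report whether a TCP port accepts connections.
# Assumes a netcat (nc) build that supports -z (connect only) and -w (timeout in seconds).
port_open() (
  host=$1
  port=$2
  nc -z -w 3 "$host" "$port" >/dev/null 2>&1
)

# Print one line per target given as host:port; exit non-zero if any target fails.
check_targets() (
  status=0
  for target in "$@"; do
    host=${target%:*}
    port=${target##*:}
    if port_open "$host" "$port"; then
      echo "ok      $target"
    else
      echo "FAILED  $target"
      status=1
    fi
  done
  exit "$status"
)
```

Example: `check_targets COSIGNER_1:2222 COSIGNER_2:2222 SENTRY_1:1234`, where the host names are placeholders. A cosigner port only answers while Horcrux is running on that host, so run this against the existing cosigners and the sentries while the cluster is still up.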

Critical Directory Structure

~/.horcrux/
├── config.yaml
├── ecies_keys/    (must match shardID)
├── shards/        (must match shardID + chain_id)
├── state/
│   ├── <chain>_priv_validator_state.json
│   └── <chain>_share_sign_state.json
└── raft/          (MUST NOT be copied; MUST be rebuilt)


Files to Copy

✅ Copy:

  • config.yaml
  • ecies_keys/
  • shards/
  • state/ (ONLY after cluster is stopped)

❌ Never copy:

  • raft/
  • horcrux.pid


PHASE 0 -- Prepare the New Signer (No Production Impact)

0.1 Provision New Server

  • Install OS updates
  • Create user
  • Configure firewall (open 2222/tcp)
  • DO NOT start Horcrux

0.2 Install Exact Same Horcrux Version

Verify:

horcrux version
which horcrux

Version must match existing production cluster.
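One way to make the comparison explicit is to capture the version output on both hosts and compare the strings. This is a small sketch; `same_version` is a hypothetical helper and `EXISTING_SIGNER` is a placeholder host.

```shell
# Sketch: succeed only if both captured version strings are non-empty and identical.
same_version() (
  [ -n "$1" ] && [ "$1" = "$2" ]
)

# Usage (EXISTING_SIGNER is a placeholder):
#   existing=$(ssh EXISTING_SIGNER horcrux version)
#   new=$(horcrux version)
#   same_version "$existing" "$new" || echo "version mismatch: do not proceed"
```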

0.3 Prepare Horcrux Home

mkdir -p ~/.horcrux
chmod 700 ~/.horcrux

PHASE 1 -- Copy Static Materials

From the signer being replaced:

Copy ONLY:

~/.horcrux/config.yaml
~/.horcrux/ecies_keys/
~/.horcrux/shards/

DO NOT copy:

~/.horcrux/raft
~/.horcrux/state (yet)
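A possible way to avoid copying the wrong directories is to build an archive that names only the three allowed paths. This is a sketch under the assumption that `tar` with gzip support is available; `pack_static` and the archive path are illustrative.

```shell
# Sketch: archive only the static materials from a Horcrux home directory.
# raft/, state/ and horcrux.pid are excluded simply because they are never named.
pack_static() (
  home_dir=$1
  archive=$2
  umask 077   # the archive holds key material; keep it private to the owner
  tar -C "$home_dir" -czf "$archive" config.yaml ecies_keys shards
)
```

On the signer being replaced, `pack_static ~/.horcrux /tmp/horcrux-static.tgz` creates the archive. After transferring it to the new server (for example with `scp`), `tar -C ~/.horcrux -xzpf /tmp/horcrux-static.tgz` unpacks it. Delete the archive on both hosts afterwards, since it contains the shard and ECIES keys.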

Verify structure on new server before proceeding.
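The structure check can be scripted. The sketch below assumes the layout described under Critical Directory Structure; `check_new_home` is a hypothetical helper, not a Horcrux command.

```shell
# Sketch: check the new signer's Horcrux home after Phase 1.
# Expects config.yaml, ecies_keys/ and shards/; flags raft/ and state/ if present.
check_new_home() (
  home_dir=$1
  status=0
  [ -f "$home_dir/config.yaml" ] || { echo "missing: config.yaml"; status=1; }
  [ -d "$home_dir/ecies_keys" ]  || { echo "missing: ecies_keys/"; status=1; }
  [ -d "$home_dir/shards" ]      || { echo "missing: shards/"; status=1; }
  [ ! -e "$home_dir/raft" ]      || { echo "unexpected: raft/"; status=1; }
  [ ! -e "$home_dir/state" ]     || { echo "unexpected: state/ (copied in Phase 3)"; status=1; }
  exit "$status"
)
```

Running `check_new_home ~/.horcrux && echo "structure ok"` on the new server prints any problem it finds and returns non-zero if something is missing or was copied too early.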


PHASE 2 -- Coordinated Stop (Production Impact Begins)

On ALL existing signers:

sudo systemctl stop horcrux

Verify stopped:

pgrep horcrux || echo "stopped"

This prevents:

  • leader instability
  • partial raft writes
  • inconsistent state propagation


PHASE 3 -- Copy State Files

After all signers are stopped:

Copy:

~/.horcrux/state/

to the new signer.

Verify:

cat ~/.horcrux/state/<chain>_priv_validator_state.json
cat ~/.horcrux/state/<chain>_share_sign_state.json

Each file must contain a non-zero height that is close to the height at which the signers were stopped.
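The height check can be automated without extra tools. This sketch assumes each state file is JSON with a "height" field written either as a number or as a quoted number; that layout is an assumption about the file format, and both function names are illustrative.

```shell
# Sketch: print the "height" value found in a state file.
# Assumes a "height" field holding a number or a quoted number.
# If the field appears more than once, the last occurrence is printed.
state_height() (
  tr -d ' \n\t' < "$1" | sed -n 's/.*"height":"\{0,1\}\([0-9][0-9]*\).*/\1/p'
)

# Succeed only if a height was found and it is greater than zero.
height_ok() (
  h=$(state_height "$1")
  [ -n "$h" ] && [ "$h" -gt 0 ]
)
```

For example, `for f in ~/.horcrux/state/*_state.json; do height_ok "$f" || echo "check $f"; done` lists any file whose height is missing or zero.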


PHASE 4 -- Update config.yaml on All Signers

On ALL remaining signers AND the new signer:

Update only the p2pAddr for the shard being replaced:

thresholdMode:
  threshold: 2
  cosigners:
  - shardID: X
    p2pAddr: tcp://NEW_IP:2222

DO NOT modify:

  • shardID
  • threshold
  • keys
  • chainNodes
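Because the same one-line change must be made on every signer, a scripted substitution reduces the chance of touching other fields. The sketch below assumes the old address appears in config.yaml as `tcp://OLD_IP:2222`; `update_p2p_addr` is a hypothetical helper.

```shell
# Sketch: replace the old cosigner IP in p2pAddr entries, keeping a .bak copy.
# Only the exact string tcp://OLD_IP:2222 is rewritten; nothing else changes.
update_p2p_addr() (
  file=$1
  old_ip=$2
  new_ip=$3
  # Escape dots so the old IP is matched literally, not as a regex wildcard.
  old_re=$(printf '%s' "$old_ip" | sed 's/\./\\./g')
  cp -p "$file" "$file.bak" || exit 1
  sed "s#tcp://${old_re}:2222#tcp://${new_ip}:2222#" "$file.bak" > "$file"
)
```

After running `update_p2p_addr ~/.horcrux/config.yaml OLD_IP NEW_IP`, `diff ~/.horcrux/config.yaml.bak ~/.horcrux/config.yaml` should show exactly one changed line.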


PHASE 5 -- Delete Raft Directory (Critical Step)

On ALL signers (old and new cluster members):

rm -rf ~/.horcrux/raft

Verify deletion:

test ! -d ~/.horcrux/raft && echo "raft removed"

Skipping this step may cause:

  • height regression errors
  • failed shard signing
  • chain id cannot be empty
  • persistent leader instability


PHASE 6 -- Decommission Old Signer

Ensure the old signer does NOT rejoin:

sudo systemctl disable --now horcrux

Optionally shut down or destroy the old server.


PHASE 7 -- Start the New Cluster

Start Horcrux on all active signers:

sudo systemctl start horcrux

Check logs:

journalctl -u horcrux -f

Expected patterns:

  • I am the leader
  • Signed chain_id=...
  • No repeated shard errors
  • No height regression loops
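The patterns above can be summarized from a captured log instead of watching it live. This is a rough sketch; the match strings are taken from the list above and the exact log wording may differ between Horcrux versions, so treat them as assumptions. `scan_log` is a hypothetical helper.

```shell
# Sketch: summarize Horcrux log text read from stdin.
# Counts lines containing "Signed" and lines mentioning "height regression".
# Succeeds only if at least one signature was logged and no regression was seen.
scan_log() (
  log=$(cat)
  signed=$(printf '%s\n' "$log" | grep -c 'Signed')
  regress=$(printf '%s\n' "$log" | grep -ci 'height regression')
  echo "signed=$signed height_regression=$regress"
  [ "$signed" -gt 0 ] && [ "$regress" -eq 0 ]
)
```

Example: `journalctl -u horcrux --since "5 minutes ago" --no-pager | scan_log`.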

PHASE 8 -- Validate Cluster Health

Check Leader

horcrux leader

If Leader Election Appears Inconsistent

  1. Stop all signers

  2. Delete raft on all

  3. Restart all signers

    sudo systemctl stop horcrux
    rm -rf ~/.horcrux/raft
    sudo systemctl start horcrux


Operational Notes

  • Always copy state AFTER stopping all signers.
  • Never copy raft between servers.
  • Minor block misses during migration are normal.
  • If you observe continuous height regression, repeat raft cleanup.

End of Document
