dist note - modrpc/info GitHub Wiki
- WHAT: 1, 2, ... executions
- HOW: wait for response; if no response, re-send request N times
- ISSUE: e.g. "deduct $10 from bank account"
- Multiple execution is harmful
- Could be OK if execution is idempotent (e.g. read-only), or if the application handles duplicates itself
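A minimal sketch of why blind re-sending is harmful for the "deduct $10" case. `FlakyAccount` and `call_at_least_once` are hypothetical names made up for this illustration; the "network" is simulated by a server that executes the request but drops its first reply.

```python
class FlakyAccount:
    """Hypothetical server: executes every request, but loses the first `drop` replies."""

    def __init__(self, balance, drop):
        self.balance = balance
        self.drop = drop

    def deduct(self, amount):
        self.balance -= amount      # the operation itself always executes
        if self.drop > 0:
            self.drop -= 1
            return None             # simulate a reply lost in the network
        return self.balance


def call_at_least_once(send, max_tries):
    """Re-send the request until a reply arrives, up to max_tries attempts."""
    for _ in range(max_tries):
        reply = send()
        if reply is not None:
            return reply
    raise TimeoutError("no response after %d tries" % max_tries)


account = FlakyAccount(balance=100, drop=1)
reply = call_at_least_once(lambda: account.deduct(10), max_tries=3)
# The client asked for one $10 deduction, but the server executed it twice.
```

The client sees one successful call, yet the balance has dropped by $20, because the first request executed even though its reply was lost.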
- WHAT: 0 or 1 execution
- HOW: to avoid executing twice
- CLIENT: Each request contains XID (unique ID) -- same XID for re-send
- SERVER: store response for each XID-request; if same XID-request, send stored response
- HOW: when to discard saved responses
- CLIENT sends "seen all replies <= X (you can discard responses for <= X)" with every RPC
- CLIENT can have at most one outstanding call at a time (no overlapping calls)
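The XID scheme above can be sketched as follows. This is only an illustration under assumptions: `AtMostOnceServer`, `handle`, and `seen_upto` are hypothetical names, XIDs are assumed to be increasing integers from a single client, and refusing requests whose XID has already been discarded is a design choice of this sketch rather than something the notes specify.

```python
class AtMostOnceServer:
    """Hypothetical server that executes each XID at most once."""

    def __init__(self, balance):
        self.balance = balance
        self.replies = {}           # xid -> saved reply
        self.discarded_upto = 0     # highest XID the client said it has seen

    def handle(self, xid, amount, seen_upto):
        # The client piggybacks "seen all replies <= X" on every RPC,
        # so saved replies up to X can be dropped.
        self.discarded_upto = max(self.discarded_upto, seen_upto)
        for old in [x for x in self.replies if x <= self.discarded_upto]:
            del self.replies[old]

        if xid <= self.discarded_upto:
            return None             # stale duplicate: refuse rather than re-execute
        if xid in self.replies:
            return self.replies[xid]    # re-sent request: return the saved reply

        self.balance -= amount      # first time this XID is seen: execute
        self.replies[xid] = self.balance
        return self.balance


server = AtMostOnceServer(balance=100)
first = server.handle(xid=1, amount=10, seen_upto=0)   # executes
retry = server.handle(xid=1, amount=10, seen_upto=0)   # duplicate, not re-executed
second = server.handle(xid=2, amount=5, seen_upto=1)   # lets the server drop xid 1
```

The re-sent request with XID 1 gets the stored reply instead of a second deduction, and the saved reply for XID 1 is freed once the client reports it has seen it.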
- at-most-once + unbounded retries + fault-tolerant services
- we'd like a service that continues despite failures!
- available: still useable despite [some] failures
- correct: act just like a single server to clients
- very hard! but very useful!
- Independent fail-stop computer failure
- Remus further assumes only one failure at a time
- Site-wide power failure (and eventual reboot)
- (Network partition)
- No bugs, no malice
- Two servers (or more)
- Each replica keeps state needed for the service
- If one replica fails, others can continue
- What state to replicate?
- How does replica get state?
- When to cut over to backup?
- Are anomalies visible at cut-over?
- How to repair / re-integrate?
- "Primary" replica executes the service
- Primary sends [new] state to backups
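A toy sketch of the state-transfer idea: the primary executes each operation and then pushes its new state to the backups. The `Primary` and `Backup` classes and their methods are hypothetical, and the whole state is copied on every operation purely for simplicity; a real system would send only what changed.

```python
import copy


class Backup:
    """Hypothetical backup: passively installs whatever state the primary sends."""

    def __init__(self):
        self.state = {}

    def install(self, state):
        self.state = copy.deepcopy(state)


class Primary:
    """Hypothetical primary: executes the operation, then transfers new state."""

    def __init__(self, backups):
        self.state = {}
        self.backups = backups

    def execute(self, key, value):
        self.state[key] = value
        for backup in self.backups:
            backup.install(self.state)   # send the new state before replying
        return "ok"


backup = Backup()
primary = Primary([backup])
primary.execute("x", 1)
primary.execute("y", 2)
# If the primary fails now, the backup already holds the same state.
```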
- All replicas execute all operations
- If same start state,
- same operations,
- same order,
- deterministic,
- then same end state
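The determinism argument above can be checked with a small sketch. The operation format (`"put"`, `"add"`) and the helper names `apply_op` and `run_replica` are assumptions made for this example only.

```python
def apply_op(state, op):
    """Deterministically apply one operation and return the new state."""
    kind, key, value = op
    new_state = dict(state)
    if kind == "put":
        new_state[key] = value
    elif kind == "add":
        new_state[key] = new_state.get(key, 0) + value
    return new_state


def run_replica(start, log):
    """Execute every operation in the log, in order, from the start state."""
    state = start
    for op in log:
        state = apply_op(state, op)
    return state


log = [("put", "x", 1), ("add", "x", 5), ("put", "y", 2)]

# Same start state, same operations, same order: the replicas agree.
replica_a = run_replica({}, log)
replica_b = run_replica({}, log)

# Same operations in a different order: the end state can diverge.
reordered = run_replica({}, [log[1], log[0], log[2]])
```

Here `replica_a` and `replica_b` end identical, while `reordered` ends with a different value for `x`, which is why all replicas must agree on the order of operations.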
- State transfer is simpler, but the state may be large and slow to transfer
- Remus: High Availability via Asynchronous Virtual Machine Replication (NSDI'08)
- uses state transfer
- Failure model
- independent hardware faults
- site-wide power failure