dist note - modrpc/info GitHub Wiki

Table of Contents
- Lecture 2: RPC, Threads
  - RPC
    - at-least-once behavior
    - at-most-once behavior
    - exactly-once
- Lecture 3: Primary/Backup Replication
  - Fault Tolerance
  - Failure Model: What will we try to cope with?
  - Core idea: Replication
  - Big Questions re: Replication
  - Two Main Replication Approaches
    - State transfer
    - Replicated state machine
    - Comparison
  - Case Study: Remus
- Testing Distributed Systems
- Tanenbaum: Distributed Operating Systems
  - Ch #2: Communication in Distributed Systems
    - 2.3.4 Blocking versus nonblocking primitives

Lecture 2: RPC, Threads

RPC

at-least-once behavior

WHAT: 1, 2, ... executions
HOW: wait for response; if no response, re-send request N times
ISSUE: e.g. "deduct $10 from bank account"
- Multiple execution is harmful
- Could be ok if the operation is idempotent (e.g. read-only), or if the application handles duplicates itself
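
A minimal sketch of why at-least-once is risky for the bank example. `FlakyServer` and `call_at_least_once` are hypothetical names invented for illustration; the "network" is just a method call that sometimes returns no reply.

```python
class FlakyServer:
    """Hypothetical server: executes every request, but its first reply is lost."""

    def __init__(self, balance):
        self.balance = balance
        self.replies_to_drop = 1

    def deduct(self, amount):
        self.balance -= amount          # the side effect always happens
        if self.replies_to_drop > 0:
            self.replies_to_drop -= 1
            return None                 # models a reply lost in the network
        return self.balance


def call_at_least_once(server, amount, max_retries=3):
    """Re-send the same request until a reply arrives or retries run out."""
    for _ in range(1 + max_retries):
        reply = server.deduct(amount)
        if reply is not None:
            return reply
    raise TimeoutError("no reply from server")


server = FlakyServer(balance=100)
reply = call_at_least_once(server, 10)
print(reply)  # the client asked for one $10 deduction, but $20 was taken
```

The client sees a single successful call, yet the server executed the deduction twice because the first reply was lost and the request was re-sent.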

at-most-once behavior

WHAT: 0 or 1 execution
HOW: to avoid executing twice
- CLIENT: Each request contains XID (unique ID) -- same XID for re-send
- SERVER: store response for each XID-request; if same XID-request, send stored response
HOW: when to discard saved responses
- CLIENT sends "seen all replies <= X" (you can discard responses for <= X) with every RPC
- Alternatively, CLIENT has at most one outstanding call at a time (no overlapping calls), so a new request implies every earlier reply was received
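
The server-side bookkeeping above can be sketched as follows. This is an illustration under stated assumptions, not a real RPC library: `AtMostOnceServer`, `handle`, and `seen_up_to` are hypothetical names, and XIDs are assumed to be increasing sequence numbers from a single client.

```python
class AtMostOnceServer:
    """Hypothetical server that remembers replies by XID to filter duplicates."""

    def __init__(self, balance):
        self.balance = balance
        self.saved = {}                 # xid -> stored reply

    def handle(self, xid, amount, seen_up_to=0):
        # The client has acknowledged every reply <= seen_up_to, so it will
        # never re-send those requests and their replies can be discarded.
        for old in [x for x in self.saved if x <= seen_up_to]:
            del self.saved[old]

        if xid in self.saved:           # duplicate: do not execute again
            return self.saved[xid]

        self.balance -= amount          # first time: execute and remember
        self.saved[xid] = self.balance
        return self.balance


s = AtMostOnceServer(balance=100)
r1 = s.handle(xid=1, amount=10)
r1_dup = s.handle(xid=1, amount=10)             # re-send with the same XID
r2 = s.handle(xid=2, amount=10, seen_up_to=1)   # also acknowledges reply 1
```

The re-sent request returns the stored reply without deducting again, and the acknowledgement piggybacked on the second call lets the server drop the saved reply for XID 1.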

exactly-once

at-most-once + unbounded retries + fault-tolerant service

Lecture 3: Primary/Backup Replication

Fault Tolerance

we'd like a service that continues despite failures!
available: still useable despite [some] failures
correct: act just like a single server to clients
very hard! but very useful!

Failure Model: What will we try to cope with?

Independent fail-stop computer failure
- Remus further assumes only one failure at a time
Site-wide power failure (and eventual reboot)
(Network partition)
No bugs, no malice

Core idea: Replication

Two servers (or more)
Each replica keeps state needed for the service
If one replica fails, others can continue

Big Questions re: Replication

What state to replicate?
How does replica get state?
When to cut over to backup?
Are anomalies visible at cut-over?
How to repair / re-integrate?

Two Main Replication Approaches

State transfer

"Primary" replica executes the service
Primary sends [new] state to backups
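
A toy sketch of state transfer, assuming a key/value service and treating "send" as a direct method call. `Primary`, `Backup`, and `receive` are hypothetical names; only the changed keys are shipped, matching the "[new] state" idea above.

```python
class Backup:
    """Hypothetical backup: never executes operations, only installs state."""

    def __init__(self):
        self.state = {}

    def receive(self, delta):
        self.state.update(delta)


class Primary:
    """Hypothetical primary: executes the operation, then ships the new state."""

    def __init__(self, backups):
        self.state = {}
        self.backups = backups

    def put(self, key, value):
        self.state[key] = value         # execute on the primary only
        delta = {key: value}            # the state that changed
        for backup in self.backups:
            backup.receive(delta)       # transfer before replying to the client
        return "ok"


backup = Backup()
primary = Primary([backup])
primary.put("x", 1)
primary.put("y", 2)
```

After both calls the backup holds the same state as the primary, so it could take over without having executed anything itself.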

Replicated state machine

All replicas execute all operations
If same start state,
- same operations,
- same order,
- deterministic,
- then same end state
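
The conditions above can be illustrated with a small deterministic state machine. This is a sketch only; `apply_op` and `run` are hypothetical helpers, and the operation format is an assumption made for the example.

```python
def apply_op(state, op):
    """Deterministic transition: the result depends only on state and op."""
    kind, key, value = op
    new_state = dict(state)
    if kind == "put":
        new_state[key] = value
    elif kind == "append":
        new_state[key] = new_state.get(key, "") + value
    return new_state


def run(start, log):
    """Apply every operation in the log, in order, from the start state."""
    state = start
    for op in log:
        state = apply_op(state, op)
    return state


log = [("put", "x", "a"), ("append", "x", "b")]

replica_a = run({}, log)                    # same start, same ops, same order
replica_b = run({}, log)
reordered = run({}, list(reversed(log)))    # same ops, different order
```

The two replicas that apply the same log end in the same state, while the replica that applies the operations in a different order diverges, which is why agreeing on the order matters.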

Comparison

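State transfer is simpler, but the state can be large and slow to transfer
Replicated state machine sends only operations, which is cheaper when operations are small compared to the state, but it is harder to get right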

Case Study: Remus

Remus: High Availability via Asynchronous Virtual Machine Replication (NSDI'08)
uses state transfer
Failure model
- independent hardware faults
- site-wide power failure

Testing Distributed Systems

Tanenbaum: Distributed Operating Systems

Ch #2: Communication in Distributed Systems

2.3.4 Blocking versus nonblocking primitives
