replication algo - openacid/celeritasdb GitHub Wiki

There is a mistake: FastAccept has to check final_deps, because execution-linearizability requires new instance have to include all preceding instance.

Goals

  • Remove infinite strongly-connected-components and livelock.
  • Remove seq.
  • Remove defer during recovery.

Major changes from epaxos:

  • When updating deps, set deps to accumulated deps, which includes all reachable instances by walking through the dep-graph.

  • When updating deps for FastAccept of a, add x into a.deps only when x does not know a: x < a

  • initial_deps is useless and removed: recovery does not need initial_deps. only deps.

  • Instances by a same leader has a strong depends-on relation. A later instance always depends on a former one. This is guaranteed by handling FastAccept request sequentially.

  • Use the all-committed constrain. all-initial-value and only-to-quorum is not used.

Terminology

  • R0, R1 ... or R[0], R[1]... : replica.

  • a, b ... x, y... : instance.

  • La, Lb: is the leader replica of an instance a or b.

  • F: number of max allowed failed replica that.

  • n: number of replicas, n = 2F+1.

  • Qc: classic quorum: Qc = F+1.

  • Qf: fast quorum: Qf = F+โŒŠQc/2โŒ‹ = F + โŒŠ(F+1)/2โŒ‹.

  • aโ‚€: initial value of instance a.

  • aโ‚โฑ: updated instance a by R[i] when it is forwarded to replica R[i].

  • aโ‚‚: value of instance a some relica believes to be safe.

  • โ†’: depends-on: a โ†’ b means a knows of b.

  • <: do-not-know: a < b means a does not know of b.

Def-instance

An instance is an internal representation of a client request.

Two essential fields are:

  • cmds the commands a client wants to execute.
  • deps what other instance it sees before being committed. This field is to determine the instance execution order.
type InstanceID(ReplicaID, i64)

type Instance {

    deps:          Vec<InstanceID>;
    committed:     bool;
    executed:      bool;

    cmds: Vec<Commands>;
    ballot: BallotNum;
}

Def-deps

a.deps: is instance id set of all instances that directly or indirectly interfering with a. E.g., a ~ b ~ c but a โ‰ c, a.deps has c in it because c indirectly interfering with a through b.

Describe deps: a.direct_deps are the highest instances a directcly interferes. TODO proof that we only need to update from directly interfering instances.

On implementation, a.deps is split into N subset, where N is number of replicas. Every subset contains only instances from leader Ri: a.deps[Ri] = {x | x.replicaID == Ri and a โ†’ x}.

And a.deps[Ri] records only the max instance id.

Do-not-need-bidirection-knows

When updating deps for FastAccept of a, add x into a.deps only when x does not know a: x < a. i.e., if x โ†’ a, then a < x.

On all replicas that there are both a and x, FastAccept status a, x can not have a < x and x < a. If a < x because of an Accepted x, which means there is another replica on which x โ†’ a, that conflict with our assumption that intersection.

โˆด a < b always hold if a, b is initiated by a same leader and a ~ b.

Definition instance space

The entire instance space is a 3d array:

R[i][j][idx]

Fields:

  • i: replicaID.
  • j: replicaID.
  • idx: index of a instance.

Explain:

  • R[i]: all data on replica i
  • R[i][j]: instances initiated by replica j those are stored on replica i.
  • R[i][j][idx]: idx-th instance initiated by replica j.

instance space layout

|                                                                        |
|                                                                        |
|                                                                        |
|                    c     f             c     f              c    f     |
|              a     b     e       a     b     e        a     b    e     |
|              ----------------    ----------------    ----------------  |
| leader:      [0]   [1]   [2]     [0]   [1]   [2]     [0]   [1]   [2]   |
|              ================    ================    ================  |
| replica:     R[0]                R[1]                R[2]              |

We may write R[0] as R0 for short.

Examples of relation depends-on

  • Initially, there are 3 instances x, y, z.

    When a is initiated on R0, R0 believes it knows of all others: aโ‚€ โ†’ {x, y, z}.

    When b is initiated on R1, R1 believes it knows of all others: bโ‚€ โ†’ {x, y, z}.

    When c is initiated on R2, R2 believes it knows of all others: cโ‚€ โ†’ {x, y, z}.

    When d is initiated on R0, R0 believes it knows of all others: dโ‚€ โ†’ {a, x, y, z}.

    d
    โ†“
    a            b            c
    x y z      x y z      x y z
    -----      -----      -----
    R0         R1         R2
    

Simple case:

When d is replicated to R1, R1 believes that dโ‚ยน โ†’ {a, b, x, y, z}.

dโ‚ยน got a new relation dโ‚ยน โ†’ b:

d          d
โ†“           โ†˜
a            b            c
x y z      x y z      x y z
-----      -----      -----
R0         R1         R2

Transitivity:

Then c is replicated to R1, R1 believes that cโ‚ยน โ†’ {d, a, b, x, y, z}.

cโ‚ยน got three new relations cโ‚ยน โ†’ {b, d, a}( because R1 believes d โ†’ a thus cโ‚ยน โ†’ a):

              .c
            โ†™  |
d          d   |
โ†“           โ†˜ โ†™
a            b            c
x y z      x y z      x y z
-----      -----      -----
R0         R1         R2

Not to override existent replation:

Then a is replicated to R1, R1 believes that aโ‚ยน โ†’ {b, x, y, z}.

aโ‚ยน got only one new relation aโ‚ยน โ†’ b: R1 already believes dโ‚€ โ†’ a because it had received dโ‚€ from R0. cโ‚ยน โ†’ d thus cโ‚ยน โ†’ a.

              .c
            โ†™  |
d          d   |
โ†“          โ†“โ†˜ โ†™
a          aโ†’b            c
x y z      x y z      x y z
-----      -----      -----
R0         R1         R2

We see that different replicas have their own view of instance relations.

Definition: commit

The action commit is to broadcast to all replica about what value is safe.

Definition: safe

Some value(e.g. an instance or a relation or something else) is safe if: it has been forwarded to enough replicas and constituted a quorum(Qf or Qc) so that no process(command leader or recovery process) would never choose other value for it to commit.

Example: safe

a is safe if every a.deps[Ri] is safe.

       aโ‚ยน    aโ‚ยฒ    aโ‚ยณ
       |      |      โ†“
       |      |      cโ‚ยณ
       โ†“      โ†“      โ†“
aโ‚€     bโ‚€     cโ‚€     bโ‚€
---    ---    ---    ---    ---
R0     R1     R2     R3     R4
aโ‚ยน.deps = {b}
aโ‚ยฒ.deps = {c}
aโ‚ยณ.deps = {b, c}

Thus a.deps = {b, c} can be committed.

Commit an instance

In this algorithm we need to ensure two things to be safe, before committing it:

  • What to execute: a.cmds.

    To commit a.cmds, forwards it to Qc=F+1 replicas, because a.cmds never changes.

  • and when to execute: a.deps.

    a.deps have different values on different replicas. Thus it requires Qf replicas to have the identical value to be safe.

Commit "a.deps"

Since a.deps has n indepedent fields:

a.deps = {
    0: x,
    1: y,
    ...
}
  • If all a.deps[Ri] is safe, a is safe. Then leader commit it on fast-path.

  • Otherwise if any of a.deps[Ri] is not safe, run another round of Accept to make it safe(slow path).

FP-condition

Conditions must be sastisified to commit on fast-path:

  • For every updated a.deps[i] == x, the leader received at least one reply with committed x

    TODO remove this: and x < a.

  • a.deps[i] == x constitutes a fast-quorum.

These two condition guarantees that x will never depends on a. This is necessary to recover a fast-committed instance.

Proof: all replica have the same view of committed instance

TODO obviously

There is only one value could be chosen to be safe.

โˆด finally an instance is committed the same value on all replias.

โˆด All replicas have the same set of instances.

Fast path

Leader:

  1. Initiate instance a

    build a.deps:

    a.deps = {a}
    
    for x in local_instances:
        if a ~ x:
            a.deps โˆช= x.deps
    
    
  2. FastAccept: forward a to other replicas.

  3. Handle-FastAcceptReply

    Update a.deps:

    a.commitDeps = [];
    for i in 0..n:
        values = {a.deps[i] for a in all_replies}
        for v in values:
            if (count(v, all_replies) >= fast_quorum
                    and v is committed):
    
                a.commitDeps[i] = v
                break
        if a.commitDeps[i] is NULL:
            quit fast-path.
    
    commit(a)
    

Non-leader replicas:

  1. Handle-FastAccept

    TODO need proof of linearizability with this. TODO explain why this is efficient reducing conflict.

    update a.deps'

    committed flag are ignored in this pseudo code for clarity

    La = leaderOf(a)
    
    // on other replica there may be more deps
    for x in a.deps:
        if x < a:
            a.deps โˆช= x.deps
    
    // check instances not included in a.deps
    for x in local_instances - a.deps:
        if a ~ x:
            if x < a:
                a.deps โˆช= x.deps
    
    // update committed flag
    for x in a.deps:
        committed[leaderOf(x)] = x.isCommitted
    
    reply(a, committed)
    

Slow path

Leader:

  1. Choose a.deps

  2. Send Accept to replicas

  3. Handle AcceptReply

Non-leader replicas:

  1. Handle Accept

Commit

Just commit.

Messages

  • All request messages have 3 common fields:

    • req_type identify type: FastAccept, Accept, Commit or Prepare.

    • ballot is the ballot number,

      • For FastAccept it is always 0.
      • Fast path Accept ballot is 1.
      • Slow path Accept ballot is 2 or greater.
      • ballot in Commit message is useless.
      • ballot in a Prepare is chosen by recovery process and should be >2.
    • instance_id is the instance id this request for.

  • All reply messages have 3 common fields:

    • req_type.
    • last_ballot is the ballot number before processing the request.
    • instance_id.

TODO Changes:

deps_committed is useless in fast-accept: Without deps_committed no fast-commit will be delayed.

To fast-commit a > x: If x is slow-committed, an Accept status x will be seen. Thus a can be fast-committed.

If x is fast-committed:

  • If a reached Lx, then a know if x is committed, because Lx is the first to commit. Although there is chance x is committed after a reaches Lx, Lx broadcasts x is committed very likely earlier than another instance brings x is committed through its fast-accept request.

  • If a did not reach Lx, then a must have reached g - {La, Lx}, this prevent other value of a > y to commit. โˆด a > x is safe to fast commit.

FastAccept request

  • cmds: the commands to run.
  • deps: the deps when leader initiate the instance.

FastAccept reply

  • deps: udpated deps by a replica.
  • deps_committed: a vector of committed flag of every instance in deps TODO: to fast-commit a > x, x is accepted is also an acceptable condition. use another field to describe status of x. maybe x.ballot.

Accept request

  • final_deps: the deps chosen by leader or recovery process.

Accept reply

Nothing except the common fileds.

Commit request

  • cmds: the commands to run.
  • final_deps: the deps chosen by leader or recovery process.

Commit reply

Nothing except the common fileds.

Prepare request

Nothing except the common fileds.

Prepare reply

  • committed is the committed flag of the instance on a replica.

Execution

Execution order

On a replica, a.deps represents local relation between a and other instances on this replia. There is not circle, e.g.: a > b > c > a on a replica.

But a.final_deps could have.

E.g. initially:

aโ‚€        bโ‚€
-----  -----
R0     R1

R0 and R1 forward a and b to the other:

   bโ‚โฐ  aโ‚ยน
 โ†™       โ†˜
aโ‚€        bโ‚€
-----  -----
R0     R1

Finaly a.final_deps = {b}, b.final_deps = {a}.

For this reason execution order is determined not by relation >, but by comparing the vecotr a.final_deps.

TODO

Guarantees:

G-exec-consistency

Execution consistency: If two interfering commands a and b are successfully committed, they will be executed in the same order by every replica.

G-exec-finite

Every instance must be executed in finite time, e.g. no livelock with a SCC.

G-exec-linear

Execution linearizability:

If two instances are serialized by client(a is proposed only after b is committed by any replica), then every replica will execute b before a.

TODO Def-after

Order is defined as:

  • a.deps โŠƒ b.deps : exec a after b. From Def-after, if a.deps โŠƒ b.deps, execute a after b guarantees linearizability.

  • Otherwise: exec a and b in instance id order.

Proof

If a is initiated after b became safe, b will never see a, and a will definitely see b. Then b will be added into a.deps.

Thus a.deps โŠƒ b.deps TODO what if see a accepted

โˆด execute a after b will never break guarantees.

Execution algorithm

Guarantees:

In the following digram, a.deps โŠƒ d.deps thus a should be executed after d. b and e interferes but b.deps โŠ… e.deps or e.deps โŠ… b.deps. They could be executed in any order that is identical on every replica.

        a      b
          โ†˜ โ†™   ~
     c  ~  d  ~  e
      โ†˜   โ†™ โ†˜
        f    g

Algo-1

One general exec algo is by walking the depends-on graph, remove some edges to reduce the graph to a DAG and instances have determined order to execute.

See other doc TODO

Algo-2

Another exec algo is much simpler to proof correctness but requires additional constrains: for instances on a leader, a newer instance must be executed after an older instance.

Our replication algo One of the constrain must be applied to replication:

  • A replica must handle older instance before handling a newer one.
  • Or the deps of an older instance must not include deps of the newer instance, when handling FastAccept.

Either one of the above guarantees newer.deps โŠƒ older.deps.

See other doc TODO

Recover

Assumes:

  • The instance to recover is a.
  • The leader of a La is R0
  • The recovery process is P(P != R0).

Cases not need to recover:

After Preparing on a quorum(F+1):

  • If P saw R0, exit and wait for R0 to commit a.

  • If P saw a committed a: broadcast and quit.

  • If P saw a with ballot>0: run classic paxos with this value and quit.

    TODO explain ballot

โˆด P only need to recover if all of a it saw are in FastAccept phase.

Recover FastAccept instance

Recovery is to choose a value of a.deps that could have been committed on fast-path.

P tries to choose a value for a.deps[0], a.deps[1] ... one by one.

First we start to recover a.deps[1].

a.deps[La] is will never change thus do not need to recover it.

Recover one relation

After Prepare on a quorum, P could see different values of a.deps[1](x, y...) from different replicas.

Assumes that x > y and leader of x, y is R1.

  • Define Nx to be the the number of PrepareReply with a.deps[1] == x.
  • Define Ny to be the the number of PrepareReply with a.deps[1] == y.
  • ...

As the following diagram shows:

       x     ...    a.deps[1]=x    a.deps[1]=y
       y
a      z
---    ---   ...    ---           ---
R0     R1    ...    R2

Case-1: R1 is unreachable, there could be two possibly committed value of a.deps[1].

E.g.:

        x | aโ†’x   a     a
        โ†“ |   โ†“   |     |
a       y |   y   `โ†’y   `โ†’y
---   --- | ---   ---   ---
R0    R1  | R2    R3    R4
La    Lb
down  down

x is not SlowCommit-ed. If x is FastCommit-ed, x โ†’ a. โˆด From FP-condition, a โ†’ x is not FastCommit-ed.

In this situation, both a.deps[1] == x and a.deps[1] == y could have been committed on fast-path.

Lemma-1: R1 does not have a < x on it

โˆต a.deps[1] = x has been seen.

โˆด Initially, a < x is not on R1 and it will never be.

But R1 could have a โ†’ x on it.

Lemma-2: R0 does not have a > x on it.

โˆต a.deps[1] = y has been seen.

โˆด Initially, a โ†’ x is not on R0 and it will never be.

But R0 could have a โ†’ y on it.

There are F + Nx replicas accepted or could have accepted a.deps[1] == x. F unreached replicas plus Nx.

There are F -1 + Ny replicas accepted or could have accepted a.deps[1] == y. F unreached replicas except R1 plus Ny.

If both of these two constituted fast-quorum(F + โŒŠ(F+1)/2โŒ‹)), it requires:

  • Nx >= โŒŠ(F+1)/2โŒ‹
  • Ny >= โŒŠ(F+1)/2โŒ‹ + 1
  • Nx + Ny <= F + 1

Thus if F=2k there could be two value possibly committed on fast-path:

Nx = โŒŠ(F+1)/2โŒ‹ = k
Ny = โŒŠ(F+1/2)โŒ‹ + 1 = k + 1

E.g.:

  • R0, R1, R2 could have constituted a fast-quorum for a.deps[1] = x.
  • R0, R3, R4 could have constituted a fast-quorum for a.deps[1] = y.

P needs to eliminate one of them

  • a did not see a fast-committed x

If a.deps[1] == x is committed, by fast-commit requirements, the committed x.deps must not contain a(x < a)

Lemma-x-fast-gt-a

x could only have been committed on fast path with x > a:

โˆต Nx = โŒŠ(F+1)/2โŒ‹.

โˆด There are at most F - 1 + Nx = F + โŒŠ(F+1)/2โŒ‹ - 1 < F + โŒŠ(F+1)/2โŒ‹ replicas accepts x < a in FastAccept phase:

F unreached replicas except R0, plus Nx.

โˆด x < a can not constitute a fast-quorum.

โˆด If x is fast-committed, it must be committed with x > a.

slow-committed value of x

After Prepare on x.

Get the value of x that is accepted with the latest ballot, or the value of committed x:

Choose a.deps[1] == x if: the value of x is NOT nil and x < a. Otherwise continue try checking a.deps[1] == y.

Proof
  • If the value is NOT nil, x could have been committed on slow-path.

    • If x > a, From FP-condition, a.deps[1] == x could not have been committed on fast-path.

      Discard a.deps[1] == x, try other value of a.deps[1].

    • If x < a, there is a at least classic-quorum on which x came before a.

      โˆด x must be in a.deps when a is committed.

      โˆด a.deps[1] == x is the only possible value to commit.

  • If the value is nil, x is not committed on slow-path.

    Assumes x is committed on fast-path:

    From Lemma-x-fast-gt-a, x > a must be committed.

    โˆด From FP-condition, a.deps[1] == x can NOT be committed on fast-path.

    โˆด Discard a.deps[1] == x, try other value of a.deps[1].

Continue checking if a.deps[1] == y can be committed on fast-path, and so on. If no value of a.deps[1] could have been committed, use the highest instnace id.

Case-2: R1 is unreachable, only one possibly committed value of a.deps[1].

Choose the only value to commit a.deps[1].

Case-3: R1 is reached.

E.g.:

     |   x  aโ†’x   a   |
     |   โ†“    โ†“   |   |
a    |   y    y   'โ†’y |
---  | ---  ---   --- |  ---
R0   | R1   R2    R3  |  R4
La     Lb
down                     down

If x is committed,

  • If x < a, then only a.deps[1] == x can be committed. Because if a.deps[1] == y is committed, a and x has no relation, which is impossible.

  • If x > a, then a.deps[1] == x can not be committed. Because on R2 x does not have x > a, this is not a committed value of x.

If x is not committed, wait until it is committed.

Continue repeat these step on a.deps[1] == y to choose a value.

Recover algorithm

Prepare for a

retrieve instance a on every replicas in a classic-quorum, along with the dependent instance. :

from R0: a.deps = [x, y, z, ...]; x = (accepted, x.deps=...), y = ...
from R1: a.deps = [u, v, w, ...]; w = (fast, u.deps=...), v = ...
...

collect all values of a.dpes[1]: [x, y, z]

sort them in top-down order, check if there is one that could have been fast-committed by La: for x,

  • if Lx is in quorum, wait for Lx to commit x. TODO does not need to wait.
  • otherwise, choose a.deps[1] == x if: the value of x is NOT nil and x < a.

If there is such an x, choose this a.deps[1] = x. Otherwise choose the first.

Send Accept requests and commit.

โš ๏ธ **GitHub.com Fallback** โš ๏ธ