-
Notifications
You must be signed in to change notification settings - Fork 2
replication acc dep
2020 Aug 11
Major simplifications with comparison to epaxos:
-
Removed infinite strongly-connected-components and livelock from execution algo.
-
Removed
seqthus recovery is significantly simplified. -
Removed unecessary backward depends-on, thus no livelock exists.
-
Removed
defered recoveryfrom recovery algo(If using per-dep commit algo). -
Simplify proof of correctness.
-
depsnow contains instance id and the instancever. -
When updating
depsfor FastAccept ofa, addxintoa.depsonly whenxdoes not knowa:x < a -
Instances by a same leader has a strong depends-on relation. A later instance always depends on a former one. This is guaranteed by handling FastAccept request sequentially.
-
Use the all-committed constrain. all-initial-value and only-to-quorum is not used.
-
R0,R1... orR[0],R[1]... : replica. -
a,b...x,y... : instance. -
La,Lb: is the leader replica of an instanceaorb. -
F: number of max allowed failed replica that. -
n: number of replicas,n = 2F+1. -
Qc: classic quorum:Qc = F+1. -
Qf: fast quorum:Qf = F+⌊Qc/2⌋ = F + ⌊(F+1)/2⌋. -
a₀: initial value of instancea. -
a₁ⁱ: updated instanceabyR[i]when it is forwarded to replicaR[i]. -
a₂: value of instanceasome relica believes to be safe. -
a ~ b: interfere:aandbcan not exchange execution order in any instance(command) sequence. -
a → b: depends-on:ainterferes withband has seenb. -
a ..→ b: indirect depends-on:ahas seenbalong a depends-on path:a → x → y ... → b. -
a < b: do-not-know:ahas not yet seenb. -
a ↦ b: exec-depends-on:ashould execute afterb.
An instance is an internal representation of a client request.
Two essential fields are:
-
cmdsthe commands a client wants to execute.
type InstanceID(ReplicaID, i64)
type Seq
type Instance {
seq: Seq;
deps: Vec<(InstanceID, Seq)>;
committed: bool;
cmds: Vec<Commands>;
ballot: BallotNum;
}
seq is something to determine order between instances.
seq must be a partial-order relation so that a after b in time ⇒ a.seq > b.seq.
In practice, seq can be simply a number and depending instance seq is
defined to be 1 plus max of dependent instance seq.
Or seq can be the set of all instance id an instance has seen, plus itself,
i.e., an instance knows of all instances its dependent instance knows.
seq is a virtual attribute and is not necessary to persist on disk.
Because it can be calculated dynamically from deps.
a.adeps: is instance id set of all instances that directly or indirectly interfering with a.
adeps is a virtual attribute and does not need to persist or recvoer it.
Because a.adeps = {a} ∪ a.deps[0].adeps ∪ a.deps[1].adeps ....
a.deps is set max intefering instances adeps of these instances.
E.g., a ~ b ~ c but a ≁ c,
a.adeps = {a, b, c}
a.deps = {b}
On implementation, a.deps is split into N subset, where N is number of replicas.
Every subset contains only instances from leader Ri:
a.deps[Ri] = {x | x.replicaID == Ri and a → x}.
And a.deps[Ri] records only the max instance id.
only works for interfering
a, x,accumulated deps may produce a fake dep:
x→a:a←b ==> a←bₐ b←xₐ <== b←x R0 R1 R2
When updating deps for FastAccept of a,
add x into a.deps only when x does not know a: x < a.
i.e., if x → a, then a < x.
Proof:
| a a<xₐ a<xₐ a←x | x x x |
| x x | |
| Qc | F |
If a is committed with a < x, there are at least Qc replicas
a < x. x commit:
- Any higher ballot must commit with
x→a. - All seen ballot can only commit with
x→a. -
x < ais not fast committed, because some process enters Accept phase withx→a, indicates that FP-condition does not hold withx < a. FP-condition always choose fast value if it could be FastCommit-ed.
∴ Without bidirection knows, a, x has at least one relation committed.
∴ a < b always hold if a, b is initiated by a same leader and a ~ b.
When handling FastAccept, need to exclude instance known but not on local replica. This guarantees only need to check the highest interfering instance when recovery.
No way. a after b requires a knows of all b knows of.
Prepare and FastAccept can be done together Because Accept only check ballot. FastAccept does not change existent FastAccept.
The entire instance space is a 3d array:
R[i][j][idx]
Fields:
- i: replicaID.
- j: replicaID.
- idx: index of a instance.
Explain:
-
R[i]: all data on replicai -
R[i][j]: instances initiated by replicajthose are stored on replicai. -
R[i][j][idx]:idx-th instance initiated by replicaj.
| |
| |
| |
| c f c f c f |
| a b e a b e a b e |
| ---------------- ---------------- ---------------- |
| leader: [0] [1] [2] [0] [1] [2] [0] [1] [2] |
| ================ ================ ================ |
| replica: R[0] R[1] R[2] |
We may write R[0] as R0 for short.
The action commit is to broadcast to all replica about what value is safe.
Some value(e.g. an instance or a relation or something else)
is safe if:
it has been forwarded to enough replicas and constituted a quorum(Qf or Qc)
so that no process(command leader or recovery process) would never choose other
value for it to commit.
In this algorithm we need to ensure two things to be safe, before committing it:
-
What to execute:
a.cmds.To commit
a.cmds, forwards it toQc=F+1replicas, becausea.cmdsnever changes. -
and when to execute:
a.deps.a.depshave different values on different replicas. This is identical to the problem fast-paxos solved: multi-value-one-round. Thus it requiresQfreplicas to have the identical value to be safe.
Since a.deps has n indepedent fields:
a.deps = {
0: x,
1: y,
...
}
-
If all
a.deps[Ri]is safe,ais safe. Then leader commit it on fast-path. -
Otherwise if any of
a.deps[Ri]is not safe, run another round of Accept to make it safe(slow-path).
Conditions must be sastisified to commit on fast-path:
-
For every updated
a.deps[i] == x, the leader received at least one reply with committeduwitha → uandu ..→ x. -
a.deps[i] == xconstitutes a fast-quorum.
TODO proof:
These two condition guarantees that x will never depends on a.
This is necessary to recover a fast-committed instance.
TODO obviously
There is only one value could be chosen to be safe.
∴ finally an instance is committed the same value on all replias.
∴ All replicas have the same set of instances.
Leader:
- Leader: Initiate instance
a: builda.deps: - Leader: FastAccept: forward
ato other replicas. - NonLeader: FastAccept: update a with local instances
- Leader: Handle-FastAcceptReply
Leader:
-
Choose
a.deps -
Send Accept to replicas
-
Handle AcceptReply
Non-leader replicas:
- Handle Accept
Just commit.
| Leader | Non Leader
| --- init | ---
| a.deps = {} |
| |
| for x in local_instances: |
| // 1. interferes |
| // 2. TODO |
| if a ~ x and x < a: |
| Lx = leaderOf(x) |
| a.deps[Lx] = max(x, a.deps[Lx]) |
| |
| forward(a) |
| --- | --- handle-fast-accept-request
| |
| | for x in local_instances:
| | if not a ~ x:
| | continue
| |
| | if x < a:
| | Lx = leaderOf(x)
| | a.deps[Lx] = max(x, a.deps[Lx])
| |
| | if x == a.deps[Lx]:
| | a.deps[Lx].adeps ∪= x.adeps
| | a.deps[Lx].committed = x.committed
| |
| | reply(a)
| --- handle-fast-accept-replies | ---
| |
| committed = []; |
| same = true |
| for repl in replies: |
| same = same and repl == replies[0] |
| for i in 0..n: |
| for repl in replies: |
| if repl.deps[i].committed |
| committed[i] = true |
| if a.deps[i] != replies[0] |
| and not committed[i]: |
| return slow_path(a) |
| |
| commit(a) |
| |
| --- accept | ---
| |
| for repl in replies: |
| for i in 0..n: |
| d = repl.deps[i] |
| if a.deps[i].instance_id < d: |
| a.deps[i].instance_id = d |
| a.deps[i].adeps ∪= d.adeps |
| |
| accept(a) |
| |
| --- | --- handle-accept
| |
| | save(a)
| |
| --- handle-accept-replies | ---
| |
| commit(a) |
| |
-
All request messages have 3 common fields:
-
req_typeidentify type: FastAccept, Accept, Commit or Prepare. -
ballotis the ballot number,- For FastAccept it is always
0. - Fast path Accept ballot is
1. - Slow path Accept ballot is
2or greater. TODO : -
ballotin Commit message is useless. -
ballotin a Prepare is chosen by recovery process and should be>2.
- For FastAccept it is always
-
instance_idis the instance id this request for.
-
-
All reply messages have 3 common fields:
-
req_type. -
last_ballotis the ballot number before processing the request. -
instance_id.
-
TODO Changes:
To fast-commit a > x:
If x is slow-committed, an Accept status x will be seen.
Thus a can be fast-committed.
If x is fast-committed:
-
If
areachedLx, thenaknow ifxis committed, becauseLxis the first to commit. Although there is chancexis committed afterareachesLx,Lxbroadcastsx is committedvery likely earlier than another instance bringsx is committedthrough its fast-accept request. -
If
adid not reachLx, thenamust have reachedg - {La, Lx}, this prevent other value ofa > yto commit. ∴a > xis safe to fast commit.
-
cmds: the commands to run. -
deps: the deps when leader initiate the instance.
-
deps: udpated deps by a replica.
-
cmds: the commands to run. -
deps: the deps chosen by leader or recovery process.
Nothing except the common fileds.
-
cmds: the commands to run. -
deps: the deps chosen by leader or recovery process.
Nothing except the common fileds.
Nothing except the common fileds.
-
committedis the committed flag of the instance on a replica. TODO
Order is defined as:
-
a.deps ⊃ b.deps: execaafterb. From Def-after, ifa.deps ⊃ b.deps, executeaafterbguarantees linearizability. -
Otherwise: exec
aandbin instance id order.
See exec-update-accumulated.md
Assumes:
- The instance to recover is
a. - The leader of
aLaisR0 - The recovery process is
P(P != R0).
After Preparing on a quorum(Qc):
-
If
PsawR0, exit and wait forR0to commita. -
If
Psaw a committeda: broadcast and quit. -
If
Psawawithballot>0: run classic paxos with this value and quit.TODO explain ballot
∴ P only need to recover if all of a it saw are in FastAccept phase.
Recovery is to choose a value of a.deps that could have been committed on
fast-path.
P tries to choose a value for a.deps[0], a.deps[1] ... one by one.
First we start to recover a.deps[1].
a.deps[La]is will never change thus do not need to recover it. TODO
After Prepare on a quorum,
P could see different values of a.deps[1](x, y...) from different replicas.
Assumes that x > y and leader of x, y is R1.
- Define
Nxto be the the number of PrepareReply witha.deps[1] == x. - Define
Nyto be the the number of PrepareReply witha.deps[1] == y. - ...
As the following diagram shows:
x ... a.deps[1]=x a.deps[1]=y
y
a z
--- --- ... --- ---
R0 R1 ... R2
R1 is unreachable, there could be two possibly committed value of a.deps[1].
E.g.:
x | a→x a a
↓ | ↓ | |
a y | y `→y `→y
--- --- | --- --- ---
R0 R1 | R2 R3 R4
La Lb
down down
x | a→u..→x a a
| ↘ ↘
a y | y v..→y w..→y
--- --- | ------- -------- --------
R0 R1 | R2 R3 R4
La Lb
down down
FastCommit requires F+Qc/2 identical FastAccept.
∴ if u did not reach La, only F+Qc/2-1 a → u ..→ x can be committed. Thus
a ..→ x can not be FastCommit-ed.
∴ There is at most one value of a that may have been FastCommit-ed.
∴ choose the value that has at least Qc/2+1.
x | a→u..→x a a
| ↘ ↘
a y | y v..→y w..→y
--- --- | ------- -------- --------
R0 R1 | R2 R3 R4
La Lb
down down
When rerun FastAccept, on R2 it found a new relation a → u ..→ z
-
If there is a
uso thatu → a, and there is no Accept-ed or Commit-eduhas been seen,a → u ..→ zcan not be committed, because this recovery has to commit withu → a. -
If all FastAccept-ed
u < a, And on replicas(R3, R4) withouta ..→ zis seen,u ..→ zcan not be committed.
-
If no value of
a.deps[1]could have FastCommit-ed: Use any value as initial value to re-run replication algo. -
Otherwise, recover all instances interfering with
aand reachable froma.deps.Then re-run replication algo from FastAccept phase.
Prepare with a new ballot and FastAccept can be combined.
There is no defer-cycle, because defering requires that:
a has > Qc/2 identical values, which requires >Qc/2 z still on these
replicas, otherwise z does not have identical value thus can not defer.
And when defering a, z, .... u thus u < a does not hold.
when found a u so that L(u) == L(a) , u always knows-of a.
And on other < Qc/2 replicas, defer chain a, z, u requires u<z<a, this
does not hold.
thus when defer chain goes to a replica twice, defer is ends.
defer chain(old interfering version):
if L(w) == L(a) w > a, then no matter what u is committed it always knows a.
∴ z must knows a
w
u
z
↙
a
↘ y