-
Notifications
You must be signed in to change notification settings - Fork 2
replication 3
- Remove infinite strongly-connected-components and livelock.
- Remove
seq. - Remove
deferduring recovery.
Major changes from epaxos:
-
When updating
deps, setdepsto accumulated deps, which includes all reachable instances by walking through the dep-graph. -
Removed fast-quorum, use only one quorum
Q=F+1 -
A leader(default leader or recovery leader) forwards only F messages.
-
deps_committedis removed, fast-commit does not relies on committed flag. e.g. FP-condition is removed too. -
initial_depsis useless and removed: recovery does not needinitial_deps. onlydeps. -
final_depsis useless and removed.
-
R0,R1... orR[0],R[1]... : replica. -
a,b...x,y... : instance. -
La,Lb: is the leader replica of an instanceaorb. -
F: number of max allowed failed replica that. -
N: number of replicas,N = 2F+1. -
Q: quorum:Q = F+1. -
a₀: initial value of instancea. -
a₁ⁱ: updated instanceabyR[i]when it is forwarded to replicaR[i]. -
a₂: value of instanceasome relica believes to be safe. -
→: depends-on:a → bmeansadepends-onb.
An instance is an internal representation of a client request.
Two essential fields are:
-
cmdsthe commands a client wants to execute. -
depswhat other instance it sees before being committed. This field is to determine the instance execution order.
type InstanceID(ReplicaID, i64)
type Instance {
deps: Vec<InstanceID>;
committed: bool;
executed: bool;
cmds: Vec<Commands>;
ballot: BallotNum;
}
-
a.deps: is instance id setadepends-on, whenais created on leader or forwarded to other replica.
Given two interfering instances a and b:
a depends-on b(or a → b):
if a knows the existence of b.
There could be a cycle along several committed instances:
a → b → c → a.
Because an instance may update its own deps:
e.g., initially, a → b → c → ø, then c updates its deps with c → a, but
a and b is not updated.
TODO move to where?
Our algo must meet following guarantees to make consensus:
Execution consistency:
If two interfering commands a and b are successfully committed,
they will be executed in the same order by every replica.
Every instance must be executed in finite time, e.g. no livelock with a SCC.
Execution linearizability:
If two instances are serialized by client(a is proposed only after b is
committed by any replica), then every replica will execute b before a.
For two instances a, b,
a ~ b if there is an instance sequence x, ... a, ... b, ...y, that has
different execution results if exchange position of a, b.
An indirect graph Gi of interfering relations of instances.
Commit is to broadcast a value to every replica. A replica always accept a value in a Commit request.
To satisfy G-exec-consistency, two conflicting values must not be both committed.
A value is safe if no other conflicting value will be chosen for Commit.
∴ two processes choosing conflicting values must perceive the other's choice.
∴ before Commit a value, a process have to broadcast its chosen value. And retrieve others chosen value.
-
A chosen value must be seen.
- A chosen value must constitute a quorum
Q.
- A chosen value must constitute a quorum
-
To prevent multiple values to be chosen: If two values have been seen, none of them could have been chosen.
Two interfering instance must depends-on other: → an instance must be replicate to a quorum to be seen.
TODO only execute committed instance.
To satisfy G-exec-linear, a is proposed only after b is
committed by any replica
When b commit with the interfering-graph Gi it saws.
From Def-safe, a committed value b constitutes a quorum.
∴ a is able to see the safe value of Gi of b: b.Gi.
When a is committed, it is committed with a bigger a.Gi | a.Gi ⊃ b.Gi.
∴ That if a.Gi ⊃ b.Gi then execute a after b, satisfies
G-exec-linear.
∴ Define after to a knows all b knows.
a.Gi ⊃ b.Gi: a after b.
And by comparing x.Gi of instances the order satisfies G-exec-linear.
a.deps: is defined as: all instances a directly or indirectly depends-on, including a itself.
-
depswhat other instance it sees before being committed. This field is to determine the instance execution order. -
a.deps: is instance id setadepends-on, whenais created on leader or forwarded to other replica.
To satisfy G-exec-consistency, two interfering instances must have consistent order on every replica.
This implies the one to execute after should be aware of the early one.
∴ for a, b, before commit, there must be at least one of them knows of the
other, i.e., a dont know b(a < b) and b dont know a(b < a) must not both committed.
Round-1:
∴ Round-1: the first round
a broadcast a < b,
b broadcast b < a,
to a quorum.
When a replica receives an instance, save it locally,
unless there is b exists.
If b exists, save a < b+1 and respond negative.
If the leader receives quorum of postive response, Fast-Commit can be done.
In order to recover,
recovery process must be able to identify if a < b is committed.
a < b and b < a are conflicting values.
∴ there wont be two quorum: Q₁: a < b, Q₂: b < a: Q₁ ∩ Q and Q₂ ∩ Q
∴ If La ∉ Q and Lb ⊄ Q, |Q₁ - {a} ∩ Q| + |Q₂ - {b} ∩ Q| > |Q|
Q = 5, F = 4, fq = 6
a | a<x a<x a<x a>x a>x | x
--- --- | --- --- --- --- --- | --- ---
La Lx
down down
∴ fq = F + ⌊(F+1) / 2⌋
If La ⊄ Q but Lb ∈ Q, Lb does not provide useful info about which is committed.
We must wait for b to commit.
If b is committed with b < a, then only a → b can be committed.
If b is committed with b → a, then we can not tell which is committed.
| a→x a←x a←x |
| |
a | |
--- --- | --- --- --- |
La Lx
down down
Ballot:
Since there could be multiple leaders operating on this instance a, leaders
need Ballot to identify itself.
And at any time, there must be at most only 1 leader can proceed.
-
The initial leader, i.e., that proposed instance
a, usesballot=0. -
A recovery leader, when the initial leader failed to commit
a, takes leadership by increment Ballot on a quorum of replicas. A replica must reject request from an old leader.
As received replies from quorum, leader of a union them into the Gi to
commit.
Because any seen instance may not know of a yet.
Round-2: Next, La broadcast the union Gi to a quorum to make it safe.
Round-3: Then commit.
If Round-1 received identical replies from quorum, it could choose to run Round-3 without Round-2.
Another leader for recovery must commit the same Gi if previous leader already
committed.
Recovery-1 A recovery leader first broadcast new ballot to a quorum to take leadership so that old leader can not proceed.
Within this step, a replica saves the new ballot locally if its ballot is smaller. Then it responds to the new leader with the instance.
If the recovery leader sees a Committed instance, it just broadcast the instance.
If it sees an instance received Round-2, it choose the one with greatest ballot and re-run Round-2 then commit.
If it sees only Round-1 instance, the system must guarantees that committed instance can be see.
Rcv1-msg:
This requires that in the quorum, if two different instances are seen, no
Fast-Commit can be done.
∴ Round-1 sends Round-1 messages to at most Q replicas, including the
leader.
∴ the recovery leader can see only one value that could have done Fast-Commit.
Recovery is similar to initial leader replication:
Because it also need to check against others Gi, it should
run Round-1 again on recovery quorum.
After Round-1, recovery leader may discover different value of a.Gi₁,
In this case recovery leader have to decide whether the previous value is
committed.
For an instance z ∈ a.Gi₁ and z ∈ a.Gi, assumes leader of z is Lz.
If there are more than one a.Gi,
Run Round-2 and then commit,
because no other value z could
commit without knowing a.
If z ↛ a, a.Gi₁ must be committed, because from TODO, two interfering
instance must have on relation.
If Lz is not seen:
∴ z ↛ a, a.Gi must not be committed. Because Lz does not know a.
In this case, commit a → z.
∴ disallow Lz to accept z → a
∴ In Round-1, if x → a, forbid a → x. TODO elaborate it.
If z → a, a does not need to have z in its Gi,
in this case to commit a.Gi is safe.
∴ for a replica: if z → a, should not update with a → z.
If Lz is seen:
Replicate:
a replicates that a does not know x: a < x.
x replicates that x does not know a: x < a.
Since the order is determined by a.deps.
∴ a.deps must be safe to be committed.
From Def-after, interfering a b must see each other
a proposed after b must reaches a quorum to perceive the safe value b.
∴ first round to contact a to a quorum, to see what instances there are and
what they knows.
Then choose value a knows to be union of the response.
Broadcast the value to a quorum to make it safe.
But two process may
record what a knows of, and
respond it to leader.
From Def-itf-order, two interfering
FastAccept requires a → x and a ↛ x to be exclusive
∴ FastAccept request can not be handled twice, or two process may believes their
value constituted a quorum.
∴ Since a.deps use only the max id to describe deps, FastAccept of older instance must be handled before newer ones.
Otherwise, for a replica it feels like handled an older instance twice.
If x → a,
TODO use baohai's exec-algo requires newer instance depends-on older one? even if they do not interfere.
Two interfering instances by a same leader must be handled in FIFO order on a replica.
E.g. if b is replciated before a, a and x finally has the same deps,
thus there is no way to tell which is earlier:
b
a x
--- --- ---
R0 R1 R2
b----> b
a x
--- --- ---
R0 R1 R2
b b
a x <----x // R1: x→{b, x}
--- --- ---
R0 R1 R2
b b
a----> a x x // R1: x→{b, x}, a→{b, x}
--- --- ---
R0 R1 R2
Processing instances in FIFO order is simple in impl.
Two interfering instances a and b has one of two relation:
a→b or a↛b.
The Same for b.
Thus there 3 relation between a and b:
a→b, b→a, a↔b.
The entire instance space is a 3d array:
R[i][j][idx]
Fields:
- i: replicaID.
- j: replicaID.
- idx: index of a instance.
Explain:
-
R[i]: all data on replicai -
R[i][j]: instances initiated by replicajthose are stored on replicai. -
R[i][j][idx]:idx-th instance initiated by replicaj.
| |
| |
| |
| c f c f c f |
| a b e a b e a b e |
| ---------------- ---------------- ---------------- |
| leader: [0] [1] [2] [0] [1] [2] [0] [1] [2] |
| ================ ================ ================ |
| replica: R[0] R[1] R[2] |
We may write R[0] as R0 for short.
-
Initially, there are 3 instances
x, y, z.When
ais initiated onR0,adepends-on all others:a₀ → {x, y, z}.When
bis initiated onR1,bdepends-on all others:b₀ → {x, y, z}.When
cis initiated onR2,cdepends-on all others:c₀ → {x, y, z}.When
dis initiated onR0,ddepends-on all others:d₀ → {a, x, y, z}.d ↓ a b c x y z x y z x y z ----- ----- ----- R0 R1 R2
When d is replicated to R1,
R1 believes that d₁¹ → {a, b, x, y, z}.
d₁¹ got a new relation d₁¹ → b:
d d
↓ ↘
a b c
x y z x y z x y z
----- ----- -----
R0 R1 R2
Then c is replicated to R1,
R1 believes that c₁¹ → {d, a, b, x, y, z}.
c₁¹ got three new relations c₁¹ → {b, d, a}(
because R1 believes d → a thus c₁¹ → a):
.c
↙ |
d d |
↓ ↘ ↙
a b c
x y z x y z x y z
----- ----- -----
R0 R1 R2
Then a is replicated to R1,
R1 believes that a₁¹ → {b, x, y, z}.
a₁¹ got only one new relation a₁¹ → b:
R1 already believes d₀ → a because it had received d₀ from R0.
c₁¹ → d thus c₁¹ → a.
.c
↙ |
d d |
↓ ↓↘ ↙
a a→b c
x y z x y z x y z
----- ----- -----
R0 R1 R2
Starts with a new initial setup:
d
↓↘
a c b
x y z x y z x y z
----- ----- -----
R0 R1 R2
After forwarding d to R1:
d₁¹ = d₀ → {a, c, z}
d d
↓↘
a c b
x y z x y z x y z
----- ----- -----
R0 R1 R2
Then b is forwarded to R1:
b did not see a and c,
but b still updates with three new relations:
ḇ₁¹ → {d, a, c}.
Because d → {a, c} and deps is transitive.
b
↙
d d
↓↘
a c b
x y z x y z x y z
----- ----- -----
R0 R1 R2
We see that different replicas have their own view of instance relations.
TODO
-
On a replica, If
a → bhas been seen, thenb → adoes not hold. -
On a replica,
a > anever holds.
On a replica,
a → b and b → c implies a → c.
-
a.deps: is instance id set whenais created on leader. whenais forwarded to other replica, it is updated instnce id set.
On a replica:
a.deps is all instances that a depends-on:
a.deps = {x | a → x}.
On implementation,
a.deps is split into N subset,
where N is number of replicas.
Every subset contains only instances from leader Ri:
a.deps[Ri] = {x | x.replicaID == Ri and a → x}.
TODO replace this section with definition of after
On a replica:
-
a → bimpliesa.deps ⊃ b.deps. -
Thus
a.deps ⊂ b.depsthena < bdoes not hold.
a.deps[i] stores only the max instance id in it(that is why FastAccept must
be handled in instance id order, otherwise recording only the max instance id
includes more instances),
because an instance is after all preceding instances by the same leader.
The action commit is to broadcast to all replica about what value is safe.
a is safe if every a.deps[Ri] is safe.
a₁¹ a₁² a₁³
| | |↘
| | | c₁³
↓ ↓ ↓
a₀ b₀ c₀ b₀
--- --- --- --- ---
R0 R1 R2 R3 R4
a₁¹.deps = {b}
a₁².deps = {c}
a₁³.deps = {b, c}
Thus a.deps = {b, c} can be committed.
In this algorithm we need to ensure two things to be safe, before committing it:
-
What to execute:
a.cmds.To commit
a.cmds, forwards it toQ=F+1replicas, becausea.cmdsnever changes. -
and when to execute:
a.deps.a.depshave different values on different replicas. Thus it requiresQreplicas to have the identical value to be safe.
Since a.deps has N indepedent fields:
a.deps = {
0: x,
1: y,
...
}
-
If every
a.deps[Ri]is safe,ais safe. Then leader commit it on fast-path. -
Otherwise if any of
a.deps[Ri]is not safe, run another round of Accept to make it safe(slow path).
the value of two interfering instance a→b, can only be commit when it is safe.
A safe value requires a quorum.
Two quorums must have at least one common replica.
There is only one value could be chosen to be safe.
∴ no two different value, e.g., a→b and a↛b could be both committed.
∴ finally an instance is committed the same value on all replias.
∴ All replicas have the same set of instances.
Leader:
-
Initiate instance
abuild
a.deps:max_known_instance_id[leaderOf(a)] = a // N is the number of all leaders for l in (0..N): a.deps[l] = max_known_instance_id[l]; -
FastAccept: forward
ato other replicas. -
Handle-FastAcceptReply
Update
a.deps:for i in 0..N: values = {a.deps[i] for a in all_replies} // received different values. if (count(v[0], values) != Q-1): return quit_fast_path() commit(a)
Non-leader replicas:
-
Handle-FastAccept
If FastAccept is already handled, ignore all future FastAccept request.
TODO need proof of linearizability with this. TODO explain why this is efficient reducing conflict.
TODO allow or not allow backward depends-on is an option. By disallowing backward depends-on, baohais's exec algo would work.
update
a.deps'.committed flag are ignored in this pseudo code for clarity
for x in all_instances_on_this_repilca: if (x ~ a): l = leaderOf(x) a.deps[l] = max(x, a.deps[l]) reply(a)
Leader:
-
Choose
a.deps -
Send Accept to replicas
-
Handle AcceptReply
Non-leader replicas:
- Handle Accept
Just commit.
-
All request messages have 3 common fields:
-
ballotis the ballot number,-
the leader of an instance use
ballot=0. -
the recovery process use
ballot > 0.A recovery process is actually another leader that takes leadership by increment ballot.
-
ballotin Commit message is useless.
-
-
instance_idis the instance id this request for.
-
-
All reply messages have 3 common fields:
-
last_ballotis the ballot number before processing the request. -
instance_id.
-
-
cmds: the commands to run. -
deps: the deps when leader initiate the instance.
-
deps: udpated deps by a replica.
-
cmds: the commands to run. -
deps: the deps chosen by leader or recovery process.
Nothing except the common fileds.
-
cmds: the commands to run. -
deps: the deps chosen by leader or recovery process.
Nothing except the common fileds.
Nothing except the common fileds.
-
an instance.
TODO
Order is defined as:
-
a.deps ⊃ b.deps: execaafterb. From Def-after, ifa.deps ⊃ b.deps, executeaafterbguarantees linearizability. -
Otherwise: exec
aandbin instance id order.
In the following digram, a.deps ⊃ d.deps thus a should be executed after
d.
b and e interferes but b.deps ⊅ e.deps or e.deps ⊅ b.deps.
They could be executed in any order that is identical on every replica.
a b
↘ ↙ ~
c ~ d ~ e
↘ ↙ ↘
f g
One general exec algo is by walking the depends-on graph, remove some edges to reduce the graph to a DAG and instances have determined order to execute.
See other doc TODO
Another exec algo is much simpler to proof correctness but requires additional constrains: for instances on a leader, a newer instance must be executed after an older instance.
Our replication algo One of the constrain must be applied to replication:
- A replica must handle older instance before handling a newer one.
- Or the
depsof an older instance must not includedepsof the newer instance, when handling FastAccept.
Either one of the above guarantees newer.deps ⊃ older.deps.
See other doc TODO
Assumes:
- The instance to recover is
a. - The leader of
aLaisR0 - The recovery process is
P(P != R0).
After Preparing on a quorum(Q=F+1):
-
If
PsawR0, exit and wait forR0to commita. -
If
Psaw a committeda: broadcast and quit. -
If
Psawawithballot>0: run classic paxos with this value and quit.TODO explain ballot
∴ P only need to recover if all of a it saw are in FastAccept phase.
Recovery is to choose a value of a.deps that could have been committed on
fast-path.
Choose only the values with highest ballot seen.
P tries to choose a value for a.deps[0], a.deps[1] ... one by one.
Assumes we start to recover a.deps[1].
After Prepare on a quorum,
P may see different values of a.deps[1], e.g., x, y... on different replicas.
As the following diagram shows:
x ... a.deps[1]=x a.deps[1]=y
a y
--- --- ... --- ---
R0 R1 ... R2
∵ Leader sends exactly F FastAccept message(including the leader, at most Q=F+1 replica deps this instance).
∴ If there are two different value of a.deps[1], a can not fast-commit.
∴ P choose the only value seen or the highest value.
-
If
Lxis not reached:x | a→x | ↓ | ↓ | a y | y | --- --- | --- --- --- | R0 R1 | R2 R3 R4 | La Lx down downUse the the instance id
Pchosen as an initialdepsto run FastAccept with the recovery ballot, e.g.,ballot=1to recovery quorum.If new
depszby leaderLxis found:If there is
z→a, then Lx hasz→athen there wont bez↛aexists. Commita→x.Otherwise, on
Lxthere must bez↛a, which meansa→xis not committed because there is not enough quorum. Then run Accept and Commit witha→z.This is the same as the replication procedure on the leader, except ballot is not 0.
TODO proof recovery from a recovery.
-
If
Lxis reached:| x a→x | | ↓ ↓ | a | y y | --- | --- --- --- | --- R0 | R1 R2 R3 | R4 La Lx down downIf new
depszby leaderLxis found:wait z to commit, if z is committed with
z→a, commita→x. if z is committed withz↛a, there is an unreachable replica does not havea, which meansa→xis not fast-committed. Accept and Commita→z.
Collect all recovered values of a.deps[i] and run Accept and Commit.