Content deleted Content added
m v2.05b - Bot T20 CW#61 - Fix errors for CW project (Reference before punctuation) |
|||
(97 intermediate revisions by 73 users not shown) | |||
Line 1:
{{Short description|Consensus algorithm}}
'''Raft''' is a [[Consensus (computer science)|consensus]] algorithm designed as an alternative to [[Paxos (computer science)|Paxos]]. It was meant to be more understandable than Paxos by means of separation of logic, but it is also formally proven safe and offers some additional features.<ref name="paper"/> Raft offers a generic way to distribute a [[Finite-state machine|state machine]] across a [[Computer cluster|cluster]] of computing systems, ensuring that each node in the cluster agrees upon the same series of state transitions. It has a number of open-source reference implementations, with full-spec implementations in [[Go (programming language)|Go]], [[C++]], [[Java (programming language)|Java]], and [[Scala (programming language)|Scala]].<ref name="website" />▼
{{Infobox algorithm
|class = [[Consensus algorithm]]
|image = Raft Consensus Algorithm Mascot on transparent background.svg
|imagesize =
|caption = The Raft consensus algorithm mascot.
}}
▲'''Raft''' is a [[Consensus (computer science)|consensus]] algorithm designed as an alternative to the [[Paxos (computer science)|Paxos]] family of algorithms. It was meant to be more understandable than Paxos by means of separation of logic, but it is also formally proven safe and offers some additional features.<ref name="paper"/> Raft offers a generic way to distribute a [[Finite-state machine|state machine]] across a [[Computer cluster|cluster]] of computing systems, ensuring that each node in the cluster agrees upon the same series of state transitions. It has a number of open-source reference implementations, with full-
Raft is not a [[Byzantine fault]] tolerant (BFT) algorithm; the nodes trust the elected leader.<ref name="paper"/>
== Basics ==
Raft achieves consensus via an elected leader. A server in a raft cluster is either a ''leader'' or a ''follower'', and can be a ''candidate'' in the precise case of an election (leader unavailable)
=== Approach of the consensus problem in Raft ===
Raft implements consensus by a leader approach. The
The consensus problem is decomposed in Raft into
==== Leader
When the existing leader fails or when
In this case, a new ''term'' starts in the cluster. A term is an arbitrary period of time on the server
A leader election is started by a ''candidate'' server. A server becomes a candidate if it receives no communication by the leader over a period called the ''election timeout'', so it assumes there is no acting leader anymore. It starts the election by increasing the term counter,
Raft uses a randomized election timeout to ensure that split
==== Log
The leader is responsible for the log replication. It accepts client requests.
Once the leader receives confirmation from half or more of its followers that the entry has been replicated, the leader applies the entry to its local state machine, and the request is considered ''committed''.<ref name="paper" /><ref name="secretlives" /> This event also commits all previous entries in the leader's log. Once a follower learns that a log entry is committed, it applies the entry to its local state machine. This ensures consistency of the logs between all the servers through the cluster, ensuring that the safety rule of Log Matching is respected.
==== Safety ====▼
Raft guarantees each of these safety properties :▼
In the case of a leader crash, the logs can be left inconsistent, with some logs from the old leader not being fully replicated through the cluster. The new leader will then handle inconsistency by forcing the followers to duplicate its own log. To do so, for each of its followers, the leader will compare its log with the log from the follower, find the last entry where they agree, then delete all the entries coming after this critical entry in the follower log and replace it with its own log entries. This mechanism will restore log consistency in a cluster subject to failures.
* '''Election safety:''' at most on leader can be elected in a given term.▼
* '''Leader Append-Only:''' a leader can only appends new entries to its logs (it can't overwrite neither delete entries).▼
* '''Log Matching:''' if two logs contain an entry with the same index and term, then the logs are identical in all entries up through the given index.▼
* '''Leader Completeness:''' if a log entry is committed in a given term then it will be present in the logs of the leaders since this term▼
==== Safety rules in Raft ====
* '''State Machine Safety:''' if a server has applied a particular log entry to its state machine, then no other server may apply a different command for the same log. ▼
▲* '''Leader
▲* '''Log
▲* '''Leader
▲* '''State
The first four rules are guaranteed by the details of the algorithm described in the previous section. The State Machine Safety is guaranteed by a restriction on the election process.
==== State machine safety ====
This rule is ensured by a simple restriction: a candidate can't win an election unless its log contains all committed entries. In order to be elected, a candidate has to contact a majority of the cluster, and given the rules for logs to be committed, it means that every committed entry is going to be present on at least one of the servers the candidates contact.
Raft determines which of two logs (carried by two distinct servers) is more up-to-date by comparing the index term of the last entries in the logs. If the logs have a last entry with different terms, then the log with the later term is more up-to-date. If the logs end with the same term, then whichever log is longer is more up-to-date.
In Raft, the request from a candidate to a voter includes information about the candidate's log. If its own log is more up-to-date than the candidate's log, the voter denies its vote to the candidate. This implementation ensures the State Machine Safety rule.
==== Follower crashes ====
If a follower crashes, ''AppendEntries'' and ''vote'' requests sent by other servers will fail. Such failures are handled by the servers trying indefinitely to reach the downed follower. If the follower restarts, the pending requests will complete. If the request has already been taken into account before the failure, the restarted follower will just ignore it.
==== Timing and availability ====
Timing is critical in Raft to elect and maintain a steady leader over time, in order to have a perfect availability of the cluster. Stability is ensured by respecting the ''timing requirement'' of the algorithm:
<blockquote>''broadcastTime << electionTimeout << MTBF''</blockquote>
* ''broadcastTime'' is the average time it takes a server to send a request to every server in the cluster and receive responses. It is relative to the infrastructure used.
* ''MTBF (Mean Time Between Failures)'' is the average time between failures for a server. It is also relative to the infrastructure.
* ''electionTimeout'' is the same as described in the Leader Election section. It is something the programmer must choose.
Typical numbers for these values can be 0.5 ms to 20 ms for ''broadcastTime'', which implies that the programmer sets the ''electionTimeout'' somewhere between 10 ms and 500 ms. It can take several weeks or months between single server failures, which means the values are sufficient for a stable cluster.
== Extensions ==
The dissertation “Consensus: Bridging Theory and Practice” by one of the co-authors of the original paper describes extensions to the original algorithm:<ref>{{Cite web |title=CONSENSUS: BRIDGING THEORY AND PRACTICE|url=https://web.stanford.edu/~ouster/cgi-bin/papers/OngaroPhD.pdf}}</ref>
* Pre-Vote: when a member rejoins the cluster, it can depending on timing trigger an election although there is already a leader. To avoid this, pre-vote will first check in with the other members. Avoiding the unnecessary election improves the availability of cluster, therefore this extension is usually present in production implementations.
* Leadership transfer: a leader that is shutting down orderly can explicitly transfer the leadership to another member. This can be faster than waiting for a timeout. Also, a leader can step down when another member would be a better leader, for example when that member is on a faster machine.
== Production use of Raft ==
* [[CockroachDB]] uses Raft in the Replication Layer.<ref>{{Cite web |title=Replication Layer {{!}} CockroachDB Docs |url=https://www.cockroachlabs.com/docs/stable/architecture/replication-layer.html |access-date=2022-06-21 |website=www.cockroachlabs.com}}</ref>
* [[Etcd]] uses Raft to manage a highly-available replicated log <ref>{{Cite web |title=Raft README |url= https://github.com/etcd-io/raft/blob/main/README.md |access-date=2022-08-25|website=github.com}}</ref>
* [[Hazelcast]] uses Raft to provide its CP Subsystem, a strongly consistent layer for distributed data structures. <ref>{{Cite web |title=CP Subsystem |url=https://docs.hazelcast.com/imdg/4.2/cp-subsystem/cp-subsystem |access-date=2022-12-24 |website=docs.hazelcast.com}}</ref>
* [[MongoDB]] uses a variant of Raft in the replication set.
* [[Neo4j]] uses Raft to ensure consistency and safety. <ref>{{Cite web |title=Leadership, routing and load balancing - Operations Manual |url=https://neo4j.com/docs/operations-manual/5/clustering/setup/routing/ |access-date=2022-11-30 |website=Neo4j Graph Data Platform |language=en}}</ref>
* [[RabbitMQ]] uses Raft to implement durable, replicated FIFO queues. <ref>{{Cite web |title=Quorum Queues |url=https://www.rabbitmq.com/quorum-queues.html |access-date=2022-12-14 |website=RabbitMQ |language=en}}</ref>
* [[ScyllaDB]] uses Raft for metadata (schema and topology changes) <ref>{{Cite web |title=ScyllaDB's Path to Strong Consistency: A New Milestone |date=4 May 2023 |url=https://www.scylladb.com/2023/05/04/scylladbs-path-to-strong-consistency-a-new-milestone/}}</ref>
* [[Splunk]] Enterprise uses Raft in a Search Head Cluster (SHC) <ref>{{Cite web |date=2022-08-24| title= Handle Raft issues|url=https://docs.splunk.com/Documentation/Splunk/9.0.0/DistSearch/Handleraftissues| access-date=2022-08-24|website=Splunk |language=en-US}}</ref>
* [[TiDB]] uses Raft with the storage engine TiKV.<ref>{{Cite web |date=2021-09-01 |title=Raft and High Availability |url=https://en.pingcap.com/blog/raft-and-high-availability/ |access-date=2022-06-21 |website=PingCAP |language=en-US}}</ref>
* [[YugabyteDB]] uses Raft in the DocDB Replication <ref>{{Cite web |title=Replication {{!}} YugabyteDB Docs |url=https://docs.yugabyte.com/preview/architecture/docdb-replication/replication/ |access-date=2022-08-19|website=www.yugabyte.com}}</ref>
* [[ClickHouse]] uses Raft for in-house implementation of [[ZooKeeper]]-like service <ref>{{Cite web |title=ClickHouse Keeper |url=https://clickhouse.com/docs/en/guides/sre/keeper/clickhouse-keeper |access-date=2023-04-26|website=clickhouse.com}}</ref>
* Redpanda uses the Raft consensus algorithm for data replication <ref>{{Cite web |title=Raft consensus algorithm |url=https://docs.redpanda.com/docs/get-started/architecture/#raft-consensus-algorithm}}</ref>
* [[Apache Kafka]] Raft (KRaft) uses Raft for metadata management.<ref>{{Cite web |title=KRaft Overview {{!}} Confluent Documentation |url=https://docs.confluent.io/platform/current/kafka-metadata/kraft.html |access-date=2024-04-13 |website=docs.confluent.io}}</ref>
* [[NATS Messaging]] uses the Raft consensus algorithm for Jetstream cluster management and data replication <ref>{{Cite web |title=JetStream Clustering |url=https://docs.nats.io/running-a-nats-service/configuration/clustering/jetstream_clustering}}</ref>
* [[Camunda]] uses the Raft consensus algorithm for data replication <ref>{{Cite web |title=Raft consensus and replication protocol |url=https://docs.camunda.io/docs/components/zeebe/technical-concepts/clustering/#raft-consensus-and-replication-protocol}}</ref>
== References ==
{{Reflist|refs=
<ref name="paper">{{cite web
|
|
|
| last2 = Ousterhout
| first2 = John
| year = 2013
| title = In Search of an Understandable Consensus Algorithm
| url = https://
}}</ref>
<ref name="website">{{cite web
Line 48 ⟶ 111:
| title = Raft: Understandable Distributed Consensus
| url = http://thesecretlivesofdata.com/raft/
| accessdate=
| work= The Secret Lives of Data website
| author = Ben B. Johnson
}}</ref>
}}
Line 54 ⟶ 119:
==External links==
*{{Official website}}
{{DEFAULTSORT:Raft Algorithm}}
[[Category:Distributed algorithms]]
[[Category:Fault-tolerant computer systems]]
|