Content deleted Content added
→External links: Pruned external links to meet WP:EL. |
m v2.05b - Bot T20 CW#61 - Fix errors for CW project (Reference before punctuation) |
||
(7 intermediate revisions by 6 users not shown) | |||
Line 32:
The leader is responsible for the log replication. It accepts client requests. Each client request consists of a command to be executed by the replicated state machines in the cluster. After being appended to the leader's log as a new entry, each of the requests is forwarded to the followers as AppendEntries messages. In case of unavailability of the followers, the leader retries AppendEntries messages indefinitely, until the log entry is eventually stored by all of the followers.
Once the leader receives confirmation from
In the case of a leader crash, the logs can be left inconsistent, with some logs from the old leader not being fully replicated through the cluster. The new leader will then handle inconsistency by forcing the followers to duplicate its own log. To do so, for each of its followers, the leader will compare its log with the log from the follower, find the last entry where they agree, then delete all the entries coming after this critical entry in the follower log and replace it with its own log entries. This mechanism will restore log consistency in a cluster subject to failures.
Line 66:
* ''electionTimeout'' is the same as described in the Leader Election section. It is something the programmer must choose.
Typical numbers for these values can be 0.5 ms to 20 ms for ''broadcastTime'', which implies that the programmer sets the ''electionTimeout'' somewhere between 10 ms and 500 ms. It can take several weeks or months between single server failures, which means the values are sufficient for a stable cluster.
== Extensions ==
The dissertation “Consensus: Bridging Theory and Practice” by one of the co-authors of the original paper describes extensions to the original algorithm:<ref>{{Cite web |title=CONSENSUS: BRIDGING THEORY AND PRACTICE|url=https://web.stanford.edu/~ouster/cgi-bin/papers/OngaroPhD.pdf}}</ref>
* Pre-Vote: when a member rejoins the cluster, it can depending on timing trigger an election although there is already a leader. To avoid this, pre-vote will first check in with the other members. Avoiding the unnecessary election improves the availability of cluster, therefore this extension is usually present in production implementations.
* Leadership transfer: a leader that is shutting down orderly can explicitly transfer the leadership to another member. This can be faster than waiting for a timeout. Also, a leader can step down when another member would be a better leader, for example when that member is on a faster machine.
== Production use of Raft ==
* [[CockroachDB]] uses Raft in the Replication Layer.<ref>{{Cite web |title=Replication Layer {{!}} CockroachDB Docs |url=https://www.cockroachlabs.com/docs/stable/architecture/replication-layer.html |access-date=2022-06-21 |website=www.cockroachlabs.com}}</ref>
* [[Etcd]] uses Raft to manage a highly-available replicated log <ref>{{Cite web |title=Raft README |url= https://github.com/etcd-io/
* [[Hazelcast]] uses Raft to provide its CP Subsystem, a strongly consistent layer for distributed data structures. <ref>{{Cite web |title=CP Subsystem |url=https://docs.hazelcast.com/imdg/4.2/cp-subsystem/cp-subsystem |access-date=2022-12-24 |website=docs.hazelcast.com}}</ref>
* [[MongoDB]] uses a variant of Raft in the replication set.
* [[Neo4j]] uses Raft to ensure consistency and safety. <ref>{{Cite web |title=Leadership, routing and load balancing - Operations Manual |url=https://neo4j.com/docs/operations-manual/5/clustering/setup/routing/ |access-date=2022-11-30 |website=Neo4j Graph Data Platform |language=en}}</ref>
* [[RabbitMQ]] uses Raft to implement durable, replicated FIFO queues. <ref>{{Cite web |title=Quorum Queues |url=https://www.rabbitmq.com/quorum-queues.html |access-date=2022-12-14 |website=RabbitMQ |language=en}}</ref>
* [[ScyllaDB]] uses Raft for metadata (schema and topology changes) <ref>{{Cite web |title=
* [[Splunk]] Enterprise uses Raft in a Search Head Cluster (SHC) <ref>{{Cite web |date=2022-08-24| title= Handle Raft issues|url=https://docs.splunk.com/Documentation/Splunk/9.0.0/DistSearch/Handleraftissues| access-date=2022-08-24|website=Splunk |language=en-US}}</ref>
* [[TiDB]] uses Raft with the storage engine TiKV.<ref>{{Cite web |date=2021-09-01 |title=Raft and High Availability |url=https://en.pingcap.com/blog/raft-and-high-availability/ |access-date=2022-06-21 |website=PingCAP |language=en-US}}</ref>
* [[YugabyteDB]] uses Raft in the DocDB Replication <ref>{{Cite web |title=Replication {{!}} YugabyteDB Docs |url=https://docs.yugabyte.com/preview/architecture/docdb-replication/replication/ |access-date=2022-08-19|website=www.yugabyte.com}}</ref>
* [[ClickHouse]] uses Raft for in-house implementation of [[ZooKeeper]]-like service <ref>{{Cite web |title=ClickHouse Keeper |url=https://clickhouse.com/docs/en/guides/sre/keeper/clickhouse-keeper |access-date=2023-04-26|website=clickhouse.com}}</ref>
*
* [[Apache Kafka]] Raft (KRaft) uses Raft for metadata management.<ref>{{Cite web |title=KRaft Overview {{!}} Confluent Documentation |url=https://docs.confluent.io/platform/current/kafka-metadata/kraft.html |access-date=2024-04-13 |website=docs.confluent.io}}</ref>
* [[NATS Messaging]] uses the Raft consensus algorithm for Jetstream cluster management and data replication <ref>{{Cite web |title=JetStream Clustering |url=https://docs.nats.io/running-a-nats-service/configuration/clustering/jetstream_clustering}}</ref>
Line 89 ⟶ 96:
| last1 = Ongaro
| first1 = Diego
|
| last2 = Ousterhout
| first2 = John
| year = 2013
|