Categorygithub.com/hashicorp/raft
modulepackage
1.7.2
Repository: https://github.com/hashicorp/raft.git
Documentation: pkg.go.dev

# README

raft Build Status

raft is a Go library that manages a replicated log and can be used with an FSM to manage replicated state machines. It is a library for providing consensus.

The use cases for such a library are far-reaching, such as replicated state machines which are a key component of many distributed systems. They enable building Consistent, Partition Tolerant (CP) systems, with limited fault tolerance as well.

Building

If you wish to build raft you'll need Go version 1.16+ installed.

Please check your installation with:

go version

Documentation

For complete documentation, see the associated Godoc.

To prevent complications with cgo, the primary backend MDBStore is in a separate repository, called raft-mdb. That is the recommended implementation for the LogStore and StableStore.

A pure Go backend using Bbolt is also available called raft-boltdb. It can also be used as a LogStore and StableStore.

Community Contributed Examples

Tagged Releases

As of September 2017, HashiCorp will start using tags for this library to clearly indicate major version updates. We recommend you vendor your application's dependency on this library.

  • v0.1.0 is the original stable version of the library that was in main and has been maintained with no breaking API changes. This was in use by Consul prior to version 0.7.0.

  • v1.0.0 takes the changes that were staged in the library-v2-stage-one branch. This version manages server identities using a UUID, so introduces some breaking API changes. It also versions the Raft protocol, and requires some special steps when interoperating with Raft servers running older versions of the library (see the detailed comment in config.go about version compatibility). You can reference https://github.com/hashicorp/consul/pull/2222 for an idea of what was required to port Consul to these new interfaces.

    This version includes some new features as well, including non voting servers, a new address provider abstraction in the transport layer, and more resilient snapshots.

Protocol

raft is based on "Raft: In Search of an Understandable Consensus Algorithm"

A high level overview of the Raft protocol is described below, but for details please read the full Raft paper followed by the raft source. Any questions about the raft protocol should be sent to the raft-dev mailing list.

Protocol Description

Raft nodes are always in one of three states: follower, candidate or leader. All nodes initially start out as a follower. In this state, nodes can accept log entries from a leader and cast votes. If no entries are received for some time, nodes self-promote to the candidate state. In the candidate state nodes request votes from their peers. If a candidate receives a quorum of votes, then it is promoted to a leader. The leader must accept new log entries and replicate to all the other followers. In addition, if stale reads are not acceptable, all queries must also be performed on the leader.

Once a cluster has a leader, it is able to accept new log entries. A client can request that a leader append a new log entry, which is an opaque binary blob to Raft. The leader then writes the entry to durable storage and attempts to replicate to a quorum of followers. Once the log entry is considered committed, it can be applied to a finite state machine. The finite state machine is application specific, and is implemented using an interface.

An obvious question relates to the unbounded nature of a replicated log. Raft provides a mechanism by which the current state is snapshotted, and the log is compacted. Because of the FSM abstraction, restoring the state of the FSM must result in the same state as a replay of old logs. This allows Raft to capture the FSM state at a point in time, and then remove all the logs that were used to reach that state. This is performed automatically without user intervention, and prevents unbounded disk usage as well as minimizing time spent replaying logs.

Lastly, there is the issue of updating the peer set when new servers are joining or existing servers are leaving. As long as a quorum of nodes is available, this is not an issue as Raft provides mechanisms to dynamically update the peer set. If a quorum of nodes is unavailable, then this becomes a very challenging issue. For example, suppose there are only 2 peers, A and B. The quorum size is also 2, meaning both nodes must agree to commit a log entry. If either A or B fails, it is now impossible to reach quorum. This means the cluster is unable to add, or remove a node, or commit any additional log entries. This results in unavailability. At this point, manual intervention would be required to remove either A or B, and to restart the remaining node in bootstrap mode.

A Raft cluster of 3 nodes can tolerate a single node failure, while a cluster of 5 can tolerate 2 node failures. The recommended configuration is to either run 3 or 5 raft servers. This maximizes availability without greatly sacrificing performance.

In terms of performance, Raft is comparable to Paxos. Assuming stable leadership, committing a log entry requires a single round trip to half of the cluster. Thus performance is bound by disk I/O and network latency.

Metrics Emission and Compatibility

This library can emit metrics using either github.com/armon/go-metrics or github.com/hashicorp/go-metrics. Choosing between the libraries is controlled via build tags.

Build Tags

  • armonmetrics - Using this tag will cause metrics to be routed to armon/go-metrics
  • hashicorpmetrics - Using this tag will cause all metrics to be routed to hashicorp/go-metrics

If no build tag is specified, the default behavior is to use armon/go-metrics.

Deprecating armon/go-metrics

Emitting metrics to armon/go-metrics is officially deprecated. Usage of armon/go-metrics will remain the default until mid-2025 with opt-in support continuing to the end of 2025.

Migration To migrate an application currently using the older armon/go-metrics to instead use hashicorp/go-metrics the following should be done.

  1. Upgrade libraries using armon/go-metrics to consume hashicorp/go-metrics/compat instead. This should involve only changing import statements. All repositories in the hashicorp namespace
  2. Update an applications library dependencies to those that have the compatibility layer configured.
  3. Update the application to use hashicorp/go-metrics for configuring metrics export instead of armon/go-metrics
    • Replace all application imports of github.com/armon/go-metrics with github.com/hashicorp/go-metrics
    • Instrument your build system to build with the hashicorpmetrics tag.

Eventually once the default behavior changes to use hashicorp/go-metrics by default (mid-2025), you can drop the hashicorpmetrics build tag.

# Packages

# Functions

BootstrapCluster initializes a server's storage with the given cluster configuration.
DecodeConfiguration deserializes a Configuration using MsgPack, or panics on errors.
DefaultConfig returns a Config with usable defaults.
EncodeConfiguration serializes a Configuration using MsgPack, or panics on errors.
NOTE: This is exposed for middleware testing purposes and is not a stable API.
GetConfiguration returns the persisted configuration of the Raft cluster without starting a Raft instance or connecting to the cluster.
HasExistingState returns true if the server has any existing state (logs, knowledge of a current term, or any snapshots).
NOTE: This is exposed for middleware testing purposes and is not a stable API.
NOTE: This is exposed for middleware testing purposes and is not a stable API.
NOTE: This is exposed for middleware testing purposes and is not a stable API.
NewDiscardSnapshotStore is used to create a new DiscardSnapshotStore.
NewFileSnapshotStore creates a new FileSnapshotStore based on a base directory.
NewFileSnapshotStoreWithLogger creates a new FileSnapshotStore based on a base directory.
NewInmemAddr returns a new in-memory addr with a randomly generate UUID as the ID.
NewInmemSnapshotStore creates a blank new InmemSnapshotStore.
NewInmemStore returns a new in-memory backend.
NewInmemTransport is used to initialize a new transport and generates a random local address if none is specified.
NewInmemTransportWithTimeout is used to initialize a new transport and generates a random local address if none is specified.
NewLogCache is used to create a new LogCache with the given capacity and backend store.
NewNetworkTransport creates a new network transport with the given dialer and listener.
NewNetworkTransportWithConfig creates a new network transport with the given config struct.
NewNetworkTransportWithLogger creates a new network transport with the given logger, dialer and listener.
NewObserver creates a new observer that can be registered to make observations on a Raft instance.
NewRaft is used to construct a new Raft node.
NewTCPTransport returns a NetworkTransport that is built on top of a TCP streaming transport layer.
NewTCPTransportWithConfig returns a NetworkTransport that is built on top of a TCP streaming transport layer, using the given config struct.
NewTCPTransportWithLogger returns a NetworkTransport that is built on top of a TCP streaming transport layer, with log output going to the supplied Logger.
ReadConfigJSON reads a new-style peers.json and returns a configuration structure.
ReadPeersJSON consumes a legacy peers.json file in the format of the old JSON peer store and creates a new-style configuration structure.
RecoverCluster is used to manually force a new configuration in order to recover from a loss of quorum where the current configuration cannot be restored, such as when several servers die at the same time.
ValidateConfig is used to validate a sane configuration.

# Constants

AddNonvoter makes a server Nonvoter unless its Staging or Voter.
explicit 0 to preserve the old value.
AddVoter adds a server with Suffrage of Voter.
Candidate is one of the valid states of a Raft node.
DefaultMaxRPCsInFlight is the default value used for pipelining configuration if a zero value is passed.
256KB.
DemoteVoter makes a server Nonvoter unless its absent.
Follower is the initial state of a Raft node.
Leader is one of the valid states of a Raft node.
LogAddPeerDeprecated is used to add a new peer.
LogBarrier is used to ensure all preceding operations have been applied to the FSM.
LogCommand is applied to a user FSM.
LogConfiguration establishes a membership change configuration.
LogNoop is used to assert leadership.
LogRemovePeerDeprecated is used to remove an existing peer.
Nonvoter is a server that receives log entries but is not considered for elections or commitment purposes.
Promote changes a server from Staging to Voter.
ProtocolVersionMax is the maximum protocol version.
ProtocolVersionMin is the minimum protocol version.
RemoveServer removes a server entirely from the cluster membership.
Shutdown is the terminal state of a Raft node.
SnapshotVersionMax is the maximum snapshot version.
SnapshotVersionMin is the minimum snapshot version.
Staging is a server that acts like a Nonvoter.
SuggestedMaxDataSize of the data in a raft log entry, in bytes.
Voter is a server whose vote is counted in elections and whose match index is used in advancing the leader's commit index.

# Variables

ErrAbortedByRestore is returned when a leader fails to commit a log entry because it's been superseded by a user snapshot restore.
ErrCantBootstrap is returned when attempt is made to bootstrap a cluster that already has state present.
ErrEnqueueTimeout is returned when a command fails due to a timeout.
ErrLeader is returned when an operation can't be completed on a leader node.
ErrLeadershipLost is returned when a leader fails to commit a log entry because it's been deposed in the process.
ErrLeadershipTransferInProgress is returned when the leader is rejecting client requests because it is attempting to transfer leadership.
ErrLogNotFound indicates a given log entry is not available.
ErrNothingNewToSnapshot is returned when trying to create a snapshot but there's nothing new committed to the FSM since we started.
ErrNotLeader is returned when an operation can't be completed on a follower or candidate node.
ErrNotVoter is returned when an operation can't be completed on a non-voter node.
ErrPipelineReplicationNotSupported can be returned by the transport to signal that pipeline replication is not supported in general, and that no error message should be produced.
ErrPipelineShutdown is returned when the pipeline is closed.
ErrRaftShutdown is returned when operations are requested against an inactive Raft.
ErrTransportShutdown is returned when operations on a transport are invoked after it's been terminated.
ErrUnsupportedProtocol is returned when an operation is attempted that's not supported by the current protocol version.

# Structs

AppendEntriesRequest is the command used to append entries to the replicated log.
AppendEntriesResponse is the response returned from an AppendEntriesRequest.
Config provides any necessary configuration for the Raft server.
Configuration tracks which servers are in the cluster, and whether they have votes.
DiscardSnapshotSink is used to fulfill the SnapshotSink interface while always discarding the .
DiscardSnapshotStore is used to successfully snapshot while always discarding the snapshot.
FailedHeartbeatObservation is sent when a node fails to heartbeat with the leader.
FileSnapshotSink implements SnapshotSink with a file.
FileSnapshotStore implements the SnapshotStore interface and allows snapshots to be made on the local disk.
InmemSnapshotSink implements SnapshotSink in memory.
InmemSnapshotStore implements the SnapshotStore interface and retains only the most recent snapshot.
InmemStore implements the LogStore and StableStore interface.
InmemTransport Implements the Transport interface, to allow Raft to be tested in-memory without going over a network.
InstallSnapshotRequest is the command sent to a Raft peer to bootstrap its log (and state machine) from a snapshot on another peer.
InstallSnapshotResponse is the response returned from an InstallSnapshotRequest.
LeaderObservation is used for the data when leadership changes.
Log entries are replicated to all members of the Raft cluster and form the heart of the replicated state machine.
LogCache wraps any LogStore implementation to provide an in-memory ring buffer.
NOTE: This is exposed for middleware testing purposes and is not a stable API.
MockFSM is an implementation of the FSM interface, and just stores the logs sequentially.
NOTE: This is exposed for middleware testing purposes and is not a stable API.
MockMonotonicLogStore is a LogStore wrapper for testing the MonotonicLogStore interface.
NOTE: This is exposed for middleware testing purposes and is not a stable API.
NetworkTransport provides a network based transport that can be used to communicate with Raft on remote machines.
NetworkTransportConfig encapsulates configuration for the network transport layer.
Observation is sent along the given channel to observers when an event occurs.
Observer describes what to do with a given observation.
PeerObservation is sent to observers when peers change.
Raft implements a Raft node.
ReloadableConfig is the subset of Config that may be reconfigured during runtime using raft.ReloadConfig.
RequestPreVoteRequest is the command used by a candidate to ask a Raft peer for a vote in an election.
RequestPreVoteResponse is the response returned from a RequestPreVoteRequest.
RequestVoteRequest is the command used by a candidate to ask a Raft peer for a vote in an election.
RequestVoteResponse is the response returned from a RequestVoteRequest.
ResumedHeartbeatObservation is sent when a node resumes to heartbeat with the leader following failures.
RPC has a command, and provides a response mechanism.
RPCHeader is a common sub-structure used to pass along protocol version and other information about the cluster.
RPCResponse captures both a response and a potential error.
Server tracks the information about a single server in a configuration.
SnapshotMeta is for metadata of a snapshot.
TCPStreamLayer implements StreamLayer interface for plain TCP.
TimeoutNowRequest is the command used by a leader to signal another server to start an election.
TimeoutNowResponse is the response to TimeoutNowRequest.

# Interfaces

AppendFuture is used to return information about a pipelined AppendEntries request.
AppendPipeline is used for pipelining AppendEntries requests.
ApplyFuture is used for Apply and can return the FSM response.
BatchingFSM extends the FSM interface to add an ApplyBatch function.
ConfigurationFuture is used for GetConfiguration and can return the latest configuration in use by Raft.
ConfigurationStore provides an interface that can optionally be implemented by FSMs to store configuration updates made in the replicated log.
FSM is implemented by clients to make use of the replicated log.
FSMSnapshot is returned by an FSM in response to a Snapshot It must be safe to invoke FSMSnapshot methods with concurrent calls to Apply.
Future is used to represent an action that may occur in the future.
IndexFuture is used for future actions that can result in a raft log entry being created.
LeadershipTransferFuture is used for waiting on a user-triggered leadership transfer to complete.
LogStore is used to provide an interface for storing and retrieving logs in a durable fashion.
LoopbackTransport is an interface that provides a loopback transport suitable for testing e.g.
MonotonicLogStore is an optional interface for LogStore implementations that cannot tolerate gaps in between the Index values of consecutive log entries.
ReadCloserWrapper allows access to an underlying ReadCloser from a wrapper.
ServerAddressProvider is a target address to which we invoke an RPC when establishing a connection.
SnapshotFuture is used for waiting on a user-triggered snapshot to complete.
SnapshotSink is returned by StartSnapshot.
SnapshotStore interface is used to allow for flexible implementations of snapshot storage and retrieval.
StableStore is used to provide stable storage of key configurations to ensure safety.
StreamLayer is used with the NetworkTransport to provide the low level stream abstraction.
Transport provides an interface for network transports to allow Raft to communicate with other nodes.
WithClose is an interface that a transport may provide which allows a transport to be shut down cleanly when a Raft instance shuts down.
WithPeers is an interface that a transport may provide which allows for connection and disconnection.
WithPreVote is an interface that a transport may provide which allows a transport to support a PreVote request.
WithRPCHeader is an interface that exposes the RPC header.
NOTE: This is exposed for middleware testing purposes and is not a stable API.

# Type aliases

ConfigurationChangeCommand is the different ways to change the cluster configuration.
FilterFn is a function that can be registered in order to filter observations.
LogType describes various types of log entries.
ProtocolVersion is the version of the protocol (which includes RPC messages as well as Raft-specific log entries) that this server can _understand_.
RaftState captures the state of a Raft node: Follower, Candidate, Leader, or Shutdown.
ServerAddress is a network address for a server that a transport can contact.
ServerID is a unique string identifying a server for all time.
ServerSuffrage determines whether a Server in a Configuration gets a vote.
SnapshotVersion is the version of snapshots that this server can understand.