Single Leader Replication- Failover

As discussed in the last section (Replication), there are three popular algorithm for replication. One of them is single leader replication.

Single leader replication have only one leader and followers.

Clients connect to only leader for write, however they can connect to followers for read.

On its local disk, each of the follower keeps a log of the data changes it has received from the leader.
If a follower crashes and is restarted or if the network between follower and leader is interrupted, the follower can recover quite easily from it's log.
Log tells the last transaction that was processed before the fault occurred. Thus the follower can connect to the leader and request all the data changes that occurred during the time when the follower was disconnected.
When it has applied the changes, it caught up to the leader and can continue receiving a stream of data changes as before.

Time out technique
Nodes frequently bounce messages back and fourth between each other , if node does not respond for some period of time (say 30 secs), it is assumed to be dead.
Once it is confirmed that leader is failed, then its time to chose new leader

This could be done through an election process.
The best candidate for leadership is usually the replica with the most up to date data changes from old leader.
Getting all the node to agree on a new leader is a consensus problem.

Once new leader is selected, now client needs to send write request to the new leader.
If the old leader comes back, it still believe that it is the leader. The system needs to ensure that the old leader becomes the follower and recognize the new leader.

If Asynchronous replication is used, the new leader may not have up to date data.
If old leader rejoin to the cluster, what should happen to those writes (data which are up to date on older leader).

The new leader may have received conflicting writes in the meantime.
The most common solution is that old leader's un-replicated data to simply be discarded, which will violate data durability expectation.

In certain fault scenario, it could happen that two nodes both believe that they are leader. This situation is called split-brain.
This is dangerous, if both leaders accept writes and there is no process to resolving conflicts, data is likely to be lost or corrupted.
As a safety catch, some system have a mechanism to shut down one node if two leaders are detected.
However, if this mechanism is not carefully designed, you can end-up with both nodes being shut down.

A longer timeout means a longer time to recovery in the case where the leader fails.
If timeout is too short, there could be unnecessary failover. e,g, a temporary load spike could cause a node's response time to increase above the timeout or network glitch could cause delayed packets.

Most of the concepts explained here are common for leader based replication(single or Multi leaders)

System Design