In medicine, Split-Brain refers to a communication breakdown inside the brain, where one half of the brain is unaware of the other half’s behavior. In distributed computing, Split-Brain refers to a loss of communication between the active servers of a cluster. When this happens, the cluster breaks into sub-clusters that lose all synchronization and heartbeat connections with one another.
Just as it can happen in a brain, Split-Brain can happen in your distributed system. If such a calamity befalls your system, it will be a real horror for your system administrator, and recovering from it by hand is hard. If you are using NCache as your distributed cache, though, you have hope.
Split-Brain in NCache Cluster
NCache creates self-healing dynamic clusters whose servers are interconnected for intra-cluster communication. But like any distributed system, an NCache cluster can also face the Split-Brain problem, where one or more cache servers get disconnected from the rest of the cluster and form sub-clusters. Just like the brain, your cluster gets divided into halves, and each half knows nothing about the other’s existence.
Let’s take a cluster of 5 nodes as an example. The cluster works fine, caching, communicating, and processing, but then out of nowhere comes a network glitch that divides the perfectly running cluster into two.
When this happens, both halves of the cluster assume that the other half has gone down and start acting independently, which results in independent sub-clusters.
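To make the 5-node example concrete, here is a minimal Python sketch that groups nodes into sub-clusters based on which links still work. It illustrates the idea of a partition only and is not NCache's implementation; the node names and the `sub_clusters` helper are hypothetical.

```python
from collections import defaultdict

def sub_clusters(nodes, links):
    """Group nodes into sub-clusters: sets of nodes that can still reach each other."""
    neighbours = defaultdict(set)
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen, groups = set(), []
    for start in nodes:
        if start in seen:
            continue
        # Walk every node reachable from `start` over the surviving links.
        group, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbours[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups

nodes = ["n1", "n2", "n3", "n4", "n5"]
# Hypothetical link state after the glitch: n1..n3 can talk, n4 and n5 can talk.
links_after_glitch = [("n1", "n2"), ("n2", "n3"), ("n1", "n3"), ("n4", "n5")]
print(sub_clusters(nodes, links_after_glitch))
# prints [['n1', 'n2', 'n3'], ['n4', 'n5']]
```

Each group only sees its own members, which is why each one concludes that the other nodes have gone down.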
This behavior leads to both halves holding their own copy of the data, which clients keep updating without any synchronization. The resulting cache operation failures and data integrity problems in your application defeat the purpose of using a distributed cache.
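A small Python sketch shows how quickly the two copies drift apart. The key names and values are made up for illustration, and plain dictionaries stand in for the two sub-clusters.

```python
import copy

# Both halves start from the same cached state.
initial = {"stock:book": 10, "stock:pen": 5}
half_a = copy.deepcopy(initial)
half_b = copy.deepcopy(initial)

# Clients connected to different halves update the same key independently.
half_a["stock:book"] -= 3   # client 1 sells 3 books through half A
half_b["stock:book"] -= 8   # client 2 sells 8 books through half B

# Keys on which the two halves now disagree.
conflicting = sorted(k for k in initial if half_a[k] != half_b[k])
print(conflicting)  # prints ['stock:book']
```

Neither half is aware that, between them, 11 books were sold out of a stock of 10. This is the kind of data integrity problem the unsynchronized updates create.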
How does NCache Recover from Split-Brain?
The first step in recovering from Split-Brain is to detect it in the cluster. Luckily for you, NCache can automatically detect the occurrence of Split-Brain. Here’s how.
NCache maintains cluster membership on all cache servers that make up a cluster, so whenever the connection between servers breaks, the entire cluster is notified. Both halves (sub-clusters) assume that they are the surviving cluster and start working independently with the data they hold. While serving clients on their own so that performance is not hindered, the sub-clusters also keep trying to reconnect with the “lost cluster” to bring the original cluster back together. In the meantime, both sub-clusters log events to the Windows Event Log indicating the state of the cluster. The sub-clusters can also notify the cache admin through Email Notifications that the connection with certain servers has been lost.
Up to this point, neither half actually realizes that it has encountered a Split-Brain. It’s only when the network connection is restored that they finally understand the cause of the cluster division.
When the connection is restored and the servers start communicating with one another again, a decision has to be made about which sub-cluster gets to be the “winner” cluster. The winner is chosen by the following criteria, applied in order:
- The sub-cluster containing the larger number of nodes wins. This is done to ensure minimal data loss.
- If both sub-clusters are the same size, the sub-cluster whose coordinator node has the lower IP address is considered the winner cluster.
Once decided, it is the winner cluster’s responsibility to restart the “loser” cluster and redistribute data among the newly rejoined nodes. In this process the loser cluster loses its data, but on the bright side, the winner cluster keeps its data.
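The two criteria above can be sketched in Python as a single ranking function. This is an illustration under stated assumptions, not NCache's code: the `SubCluster` type and `pick_winner` function are hypothetical, and the sketch assumes that the “lower IP address” is determined by comparing addresses numerically.

```python
import ipaddress
from dataclasses import dataclass

@dataclass(frozen=True)
class SubCluster:
    coordinator_ip: str   # IP address of the sub-cluster's coordinator node
    nodes: tuple          # IP addresses of all nodes in the sub-cluster

def pick_winner(a: SubCluster, b: SubCluster) -> SubCluster:
    """Larger sub-cluster wins; on a tie, the lower coordinator IP wins."""
    def rank(c: SubCluster):
        # Assumption: IPs are compared numerically, so 10.0.0.9 < 10.0.0.10.
        return (-len(c.nodes), int(ipaddress.ip_address(c.coordinator_ip)))
    return min(a, b, key=rank)

big = SubCluster("10.0.0.20", ("10.0.0.20", "10.0.0.21", "10.0.0.22"))
small = SubCluster("10.0.0.5", ("10.0.0.5", "10.0.0.6"))
print(pick_winner(big, small).coordinator_ip)  # prints 10.0.0.20 (more nodes)

x = SubCluster("10.0.0.9", ("10.0.0.9", "10.0.0.11"))
y = SubCluster("10.0.0.10", ("10.0.0.10", "10.0.0.12"))
print(pick_winner(x, y).coordinator_ip)        # prints 10.0.0.9 (tie on size)
```

Converting the address to an integer avoids the pitfall of comparing IPs as strings, where "10.0.0.10" would sort before "10.0.0.9".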
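The effect of this step on the data can be sketched as follows. This is a simplified Python model, not NCache's recovery code: sub-clusters are plain dictionaries, the `recover` function is hypothetical, and the actual redistribution of data across nodes is not modeled.

```python
def recover(winner, loser):
    """Rejoin the loser's nodes to the winner; only the winner's data survives.

    Each argument is a dict with 'nodes' (list of names) and 'data' (dict).
    Returns the merged cluster and the loser's entries that were discarded.
    """
    discarded = {
        key: value
        for key, value in loser["data"].items()
        if key not in winner["data"] or winner["data"][key] != value
    }
    merged = {
        "nodes": sorted(winner["nodes"] + loser["nodes"]),
        "data": dict(winner["data"]),   # the loser's copy is dropped
    }
    return merged, discarded

winner = {"nodes": ["n1", "n2", "n3"], "data": {"a": 1, "b": 2}}
loser = {"nodes": ["n4", "n5"], "data": {"a": 1, "b": 5, "c": 9}}

merged, discarded = recover(winner, loser)
print(merged["nodes"])  # prints ['n1', 'n2', 'n3', 'n4', 'n5']
print(merged["data"])   # prints {'a': 1, 'b': 2}
print(discarded)        # prints {'b': 5, 'c': 9}
```

Updates that reached only the loser during the split (`b` and `c` here) are gone after recovery, which is why picking the larger sub-cluster as the winner helps keep data loss small.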
Enabling Split-Brain Auto Recovery
By default, the Split-Brain Auto Recovery feature of NCache is disabled. You should enable it if your application cannot afford to lose its cached data entirely. You can enable Split-Brain Auto Recovery for your cluster in the following ways.
Using NCache Web Manager
You can easily enable Split-Brain Recovery for your cache cluster using the NCache Web Manager. Follow the help provided in Enable Split-Brain Auto Recovery to enable this feature.
Using Cache Config File
Split-Brain Recovery can also be enabled through the NCache configuration files. Manually edit the cache config file by following the steps in Manually Edit NCache Configuration for Split-Brain Recovery. The relevant tag looks like this:
<split-brain-recovery enable="True" detection-interval="60"/>
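If you want to double-check the setting from a script, the following minimal Python sketch reads the two attributes from the tag shown above. It parses only this fragment, not a full NCache config file, whose overall layout is not covered here. Treating `enable` case-insensitively is an assumption of the sketch, and the unit of `detection-interval` is not stated in this article, so the value is read as a plain number.

```python
import xml.etree.ElementTree as ET

# The tag exactly as shown in this article.
fragment = '<split-brain-recovery enable="True" detection-interval="60"/>'

element = ET.fromstring(fragment)
# Assumption: "True"/"true" both mean enabled.
enabled = element.get("enable", "False").strip().lower() == "true"
interval = int(element.get("detection-interval"))

print(enabled, interval)  # prints True 60
```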
In a Nutshell…
Sometimes, in the middle of processing data, your cache cluster encounters a network glitch that divides it into sub-clusters. This division poses a threat to your cached data, and the whole scenario resembles the medical condition known as Split-Brain Syndrome. To limit the damage this syndrome can inflict on your cluster, NCache offers a remedy in the form of its Split-Brain Auto Recovery feature. With this feature enabled, you don’t need to worry about manually repairing your cluster once it has been broken into halves. NCache saves the day.