Applications that need fast, reliable processing increasingly rely on distributed caching to get the best performance. If your .NET application attracts high traffic and handles a very large number of transactions, NCache is a strong fit.
NCache provides fast, linearly scalable distributed caching for your .NET applications. Of the topologies NCache offers, the most popular is the Partitioned-Replica Cache, which gives you the best of both worlds: linear scalability and high availability.
In NCache’s Partitioned-Replica cluster, data is partitioned across the server nodes, and each partition is replicated on another node. This way, if a server node goes down, its clients can continue their operations by interacting with its replica. To restore balance afterwards, state transfer is triggered to redistribute the orphaned data across the remaining nodes of the cluster.
A Partitioned-Replica cache gives you multiple benefits, from scalability to availability to reliability. But what if you need to stop one of the nodes for maintenance, knowing that you will restart it shortly afterwards? Let us see how the cluster behaves in that case and why it can be a challenge for you.
Auto-Rebalancing Challenges During Maintenance
Every time a node is stopped in a Partitioned-Replica cluster, a state transfer is triggered. State transfer means that data is rebalanced throughout the cluster. This process can take much longer than anticipated and affects your application’s performance, especially when the nodes hold tens of gigabytes of data.
The following figure explains what changes occur in the Partitioned-Replica cluster when a node is stopped.
In this figure, you have an NCache Partitioned-Replica cluster with three nodes. Here, five stages have been introduced to explain what happens when a node is stopped in a cluster.
- Stage 1: NCache’s Partitioned-Replica cluster with three nodes, A, B and C, each hosting an active partition and a replica.
- Stage 2: Node C is briefly stopped for maintenance. Its active partition and replica are no longer part of the cluster, and state transfer is triggered.
- Stage 3: Data redistribution in the cluster (unnecessary). Here, the orphaned data of node C is divided between the remaining nodes A and B once state transfer has completed, and their replicas are updated to match. This state transfer is unnecessary because the stopped node will be restarted soon.
- Stage 4: Node C restarts. Up to this point, the cluster behaves as if node C has left for good. After the data redistribution, node C is started again.
- Stage 5: Node C rejoins the cluster, and data is redistributed again. Because its data had already been distributed between A and B, state transfer is triggered throughout the cluster once more to assign data to node C.
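The five stages above can be illustrated with a toy model. This is only a sketch of the general idea, not NCache's actual distribution algorithm: the bucket count, the `rebalance` helper, and the "move as few buckets as possible" policy are all assumptions made for illustration.

```python
def rebalance(owners, nodes):
    """Spread buckets evenly over `nodes`, moving as few buckets as possible."""
    nodes = list(nodes)
    base, extra = divmod(len(owners), len(nodes))
    # Each node's quota: an even share, with any remainder going to the first nodes.
    quota = {n: base + (1 if i < extra else 0) for i, n in enumerate(nodes)}
    new_owners, orphaned = {}, []
    for bucket, node in owners.items():
        if quota.get(node, 0) > 0:      # owner is still present and under quota
            new_owners[bucket] = node
            quota[node] -= 1
        else:                           # owner is gone or over quota: bucket must move
            orphaned.append(bucket)
    for bucket in orphaned:
        target = next(n for n in nodes if quota[n] > 0)
        new_owners[bucket] = target
        quota[target] -= 1
    return new_owners


def moved(before, after):
    """Count buckets whose owner changed between two layouts."""
    return sum(1 for bucket in before if before[bucket] != after[bucket])


BUCKETS = 900  # hypothetical bucket count, chosen so it divides evenly by 3
stage1 = {b: "ABC"[b % 3] for b in range(BUCKETS)}   # A, B and C own 300 each
stage3 = rebalance(stage1, ["A", "B"])               # C stopped, data redistributed
stage5 = rebalance(stage3, ["A", "B", "C"])          # C rejoined, redistributed again

moved_on_stop = moved(stage1, stage3)
moved_on_rejoin = moved(stage3, stage5)
print(moved_on_stop, moved_on_rejoin)
```

In this toy run, 300 of the 900 buckets change owner when C stops, and another 300 change owner when it rejoins, even though the cluster ends up with the same even split it started with. Replica updates, which the model ignores, would add further traffic on top.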
At first glance, this seems like the perfect solution. You stop a node, and state transfer occurs. You fix whatever you wanted to fix and start the node again. State transfer is triggered again to balance all buckets.
But why isn’t this the ideal solution? What goes wrong here?
Triggering state transfer unnecessarily has several drawbacks:
- State transfer is a costly process: each state transfer involves multiple network calls and considerable processing overhead.
- The state transfer thread takes time to transfer buckets. It may take only seconds, but when the nodes hold a large amount of data, state transfer can last for several minutes.
- State transfer may take longer than you anticipated, so the node may be started again before it finishes. In that case, a second state transfer begins while the first is still in progress, which makes matters considerably worse.
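To get a feel for how state transfer stretches from seconds to minutes, a back-of-the-envelope estimate helps. The sketch below assumes the transfer is limited only by a sustained network throughput; the 100 MB/s figure and the data sizes are hypothetical, and real transfers also pay the processing overhead mentioned above.

```python
def transfer_seconds(data_gb, throughput_mb_per_s):
    """Rough lower bound on the time to move `data_gb` gigabytes of cached data."""
    return data_gb * 1024 / throughput_mb_per_s


# Hypothetical figures: orphaned data sizes at an assumed sustained 100 MB/s.
for size_gb in (1, 10, 30):
    seconds = transfer_seconds(size_gb, 100)
    print(f"{size_gb} GB -> about {seconds / 60:.1f} minutes")
```

Under these assumptions, 30 GB of orphaned data takes roughly five minutes to move, and a node stopped for a short maintenance window could easily be restarted before that finishes.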
The Solution: Maintenance Mode Aware Partitioned-Replica
To avoid the unnecessary state transfer that occurs whenever a node leaves and rejoins a cluster, NCache provides Maintenance Mode.
Maintenance Mode allows you to stop a node for a specified amount of time and start it again when its maintenance is over. While the node is under maintenance, state transfer is not triggered within the cluster. This is especially beneficial when the cluster holds a large amount of data.
The following figure explains how Maintenance Mode differs from stopping a node normally.
- Stage 1: The structure of NCache’s Partitioned-Replica topology is shown, where a Partitioned-Replica (POR) cluster comprises three server nodes that hold a large amount of data.
- Stage 2: Node C stopped. Node C is stopped for maintenance through Maintenance Mode.
- Stage 3: Data redistribution. Unlike a normal stop, no data is redistributed here. Instead, the replica of C becomes active and starts serving node C’s clients. This eliminates the need for state transfer, so the state transfer thread is halted for as long as the cluster is under maintenance. It solves the problem of unnecessary state transfer rebalancing the data across the cluster after node C is stopped.
- Stage 4: Node C restarted. Once its maintenance is complete, node C is started again and the cluster exits Maintenance Mode.
- Stage 5: Data transfer. In this phase, node C receives all of its data back from its replica and brings the entire node (i.e., Active C and Replica B) up to date through state transfer.
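The Maintenance Mode stages above can be sketched as a small model of which node serves each partition. This is an illustration only, not NCache's implementation: the ring-style replica layout in `REPLICA_HOST` and both helper functions are assumptions (the text only tells us that node C hosts Active C and Replica B).

```python
# Assumed layout: each partition's replica lives on the "next" node in a ring.
REPLICA_HOST = {"A": "B", "B": "C", "C": "A"}


def stop_for_maintenance(serving, node):
    """Promote the stopped node's replica; every other partition stays put."""
    updated = dict(serving)
    updated[node] = REPLICA_HOST[node]
    return updated


def restart(serving, node):
    """The restarted node takes its own partition back from the replica."""
    updated = dict(serving)
    updated[node] = node
    return updated


serving = {"A": "A", "B": "B", "C": "C"}     # partition -> node serving it
during = stop_for_maintenance(serving, "C")  # stages 2 and 3
after = restart(during, "C")                 # stages 4 and 5

untouched = [p for p in serving if serving[p] == during[p]]
print(during, untouched, after == serving)
```

In this model, partitions A and B are never touched while C is down. Only partition C changes hands, once to its replica and once back again.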
How to stop a node for maintenance
You can stop a node for maintenance from the NCache Web Manager. When you stop the node, you are asked to specify an estimated time for which you want to keep it under maintenance. This timeout is the period during which no state transfer can be triggered.
The following steps allow you to stop any node for maintenance.
- Access your NCache Web Manager.
- Go to Clustered Caches and select the cluster that needs maintenance.
- Among its various nodes, select the one that requires maintenance.
- Go to its settings and select the option Stop for Maintenance.
For more detailed instructions on how to stop a node for maintenance, please refer to Stop Node for Maintenance.
How to exit a node from Maintenance Mode
Once a cluster has entered Maintenance Mode, you can use the Web Manager to take it out of Maintenance Mode. Follow these steps:
- From your NCache Web Manager, go to Clustered Caches.
- Select the cluster that is under maintenance.
- Go to its Settings and select Exit Maintenance Mode.
For further information about how to exit a node from Maintenance Mode, refer to our documentation at Exit Maintenance Mode.
Besides the Web Manager, there are several other ways in which a node can exit Maintenance Mode. Keep these scenarios in mind, as some of them might affect your application’s performance.
A node can exit Maintenance Mode in the following cases:
- When the node under maintenance is started: If the node that is under maintenance is started manually, either through the Web Manager or through a PowerShell command, that node exits Maintenance Mode.
- When the timeout expires: When the timeout provided for maintenance expires, state transfer is triggered and the cluster automatically exits from Maintenance Mode.
- When a node leaves the cluster: No node can leave the cluster gracefully while the cluster is under maintenance. But if one of the nodes leaves forcefully, the cluster inevitably falls out of Maintenance Mode, even though maintenance is still in progress. Pay particular attention here: if the very node that was under maintenance leaves, there is a high chance of data loss.
No matter how Maintenance Mode is exited, the exit itself is the cue for the state transfer thread to trigger state transfer throughout the cluster.
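The exit rules above amount to a small state machine: several different events end Maintenance Mode, and each of them triggers state transfer. The sketch below models that behaviour. The class, the event names, and the tick-based timer are all hypothetical and do not correspond to any NCache API.

```python
from enum import Enum, auto


class Event(Enum):
    MANUAL_EXIT = auto()              # Exit Maintenance Mode from the Web Manager
    MAINTAINED_NODE_STARTED = auto()  # node started via manager or PowerShell
    TIMEOUT_EXPIRED = auto()          # the maintenance timeout ran out
    NODE_LEFT_FORCEFULLY = auto()     # some node left the cluster ungracefully
    CLIENT_REQUEST = auto()           # unrelated event, included for contrast


EXIT_EVENTS = {
    Event.MANUAL_EXIT,
    Event.MAINTAINED_NODE_STARTED,
    Event.TIMEOUT_EXPIRED,
    Event.NODE_LEFT_FORCEFULLY,
}


class MaintenanceWindow:
    """Toy model of a cluster that is currently in Maintenance Mode."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.elapsed_s = 0
        self.in_maintenance = True
        self.state_transfers = 0

    def tick(self, seconds):
        """Advance the clock; expire the window once the timeout is reached."""
        if not self.in_maintenance:
            return
        self.elapsed_s += seconds
        if self.elapsed_s >= self.timeout_s:
            self.handle(Event.TIMEOUT_EXPIRED)

    def handle(self, event):
        """Any exit event ends maintenance and triggers one state transfer."""
        if self.in_maintenance and event in EXIT_EVENTS:
            self.in_maintenance = False
            self.state_transfers += 1


window = MaintenanceWindow(timeout_s=600)   # hypothetical 10-minute estimate
window.tick(300)                            # halfway through: still in maintenance
window.handle(Event.CLIENT_REQUEST)         # ordinary traffic changes nothing
still_in_maintenance = window.in_maintenance
window.tick(300)                            # timeout reached: mode exits
print(still_in_maintenance, window.in_maintenance, window.state_transfers)
```

The practical point of the model is the timeout: if the estimate you enter is shorter than the maintenance actually takes, the window expires on its own and the state transfer you were trying to avoid runs anyway.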
If you want a way to accommodate patching in your Partitioned-Replica clustered cache without compromising your application’s performance, check out NCache’s Maintenance Mode. It allows you to fix a bug, apply a patch, or upgrade software or hardware without introducing any application downtime. All you have to do is follow the steps above and see for yourself what NCache’s Maintenance Mode can do.