Alachisoft.com

How to Recover from a Partially Connected Cache Cluster Without Any Downtime

Recovering from a partially connected cluster:

Partial connectivity means that the servers in a cache cluster can still reach some of their peers, but the cluster as a whole is not fully connected. For example, the active partition on one cache server may have lost its connection to its replica on another cache server, even though the active partition on that other server is still connected to its replica on the original server. Alternatively, one of the cache servers may be totally disconnected from the other servers in the cluster.

Additionally, in a Partition-Replica Cache, each cache server contains one active partition and one replica partition. The replica is passive and is accessed only by its active partition. At the cache cluster layer, however, the active partition and the replica are seen as independent “nodes”. So, a 3-server Partition-Replica Cache forms a “6-node” cluster.
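The node arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name cluster_nodes and the role labels "active" and "replica" are made up for this example and are not part of any NCache API.

```python
from typing import List, Tuple

def cluster_nodes(servers: List[str]) -> List[Tuple[str, str]]:
    """List every cluster-level node as a (server address, role) pair."""
    # Each server contributes two independent nodes: its active partition
    # and the replica partition it hosts.
    return [(ip, role) for ip in servers for role in ("active", "replica")]

nodes = cluster_nodes(["20.200.20.100", "20.200.20.101", "20.200.20.102"])
print(len(nodes))      # number of nodes in a 3-server cluster
print(len(nodes) - 1)  # number of peers each node should be connected to
```

For the 3-server example this yields 6 nodes, so a healthy node is connected to 5 others.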

How to detect partial connectivity

Use the View Cluster Connectivity tab in NCache Manager

  • Right-click your cache name in NCache Manager and choose the View Cluster Connectivity option.
  • This opens another window showing the cluster connectivity status. You can use this tab to verify whether your cache cluster is fully connected or partially connected.

Fully connected cache cluster:

The example below shows a fully connected (healthy) cache cluster. There are 3 servers in the cluster and therefore 6 “nodes”, so each “node” is supposed to be connected to 5 other “nodes”, as shown in the “Connected to Nodes” column.


Node Address     Connected to Nodes                                                           Status
20.200.20.100    20.200.20.100, 20.200.20.101, 20.200.20.101, 20.200.20.102, 20.200.20.102    Fully Connected
20.200.20.101    20.200.20.101, 20.200.20.100, 20.200.20.100, 20.200.20.102, 20.200.20.102    Fully Connected
20.200.20.102    20.200.20.102, 20.200.20.100, 20.200.20.100, 20.200.20.101, 20.200.20.101    Fully Connected

Figure 1: Fully connected cache cluster

Partially connected cache cluster

The example below shows a partially connected cache cluster in which 20.200.20.101 has lost connectivity with its replica on 20.200.20.102 and is therefore missing one connection to 20.200.20.102. Hence, it has fewer nodes listed next to it in the “Connected to Nodes” column.


Node Address     Connected to Nodes                                                           Status
20.200.20.100    20.200.20.100, 20.200.20.101, 20.200.20.101, 20.200.20.102, 20.200.20.102    Partially Connected
20.200.20.101    20.200.20.101, 20.200.20.100, 20.200.20.100, 20.200.20.102                   Partially Connected
20.200.20.102    20.200.20.102, 20.200.20.100, 20.200.20.100, 20.200.20.101, 20.200.20.101    Partially Connected

Figure 2: Partially connected cache cluster


Partially connected cluster with split brain

The example below shows another partially connected cache cluster, this time with a split brain: 20.200.20.102 has completely lost connectivity to the other two servers and therefore shows the Single Node cache Cluster status. In addition, 20.200.20.100 and 20.200.20.101 show the Partially Connected status and are missing 20.200.20.102 in the “Connected to Nodes” column.


Node Address     Connected to Nodes                             Status
20.200.20.100    20.200.20.100, 20.200.20.101, 20.200.20.101    Partially Connected
20.200.20.101    20.200.20.101, 20.200.20.100, 20.200.20.100    Partially Connected
20.200.20.102    ---                                            Single Node cache Cluster

Figure 3: Split brain in partially connected cache cluster
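The three situations shown in Figures 1 to 3 can be told apart by counting the entries in the “Connected to Nodes” column. The Python sketch below is one possible reading of those figures, not NCache's own logic: it assumes the table has been copied by hand into a dictionary, that a healthy server lists 2 × servers − 1 entries, and that an empty list ("---") marks a split brain. The name classify_cluster is hypothetical.

```python
from typing import Dict, List

def classify_cluster(table: Dict[str, List[str]]) -> str:
    """Classify a hand-copied connectivity table (server -> connected entries)."""
    # Assumption drawn from Figure 1: with n servers, a healthy row lists
    # 2n - 1 entries (its own replica plus two nodes per other server).
    expected = 2 * len(table) - 1
    if any(len(peers) == 0 for peers in table.values()):
        return "split brain"            # a row shows "---", as in Figure 3
    if all(len(peers) >= expected for peers in table.values()):
        return "fully connected"        # every row is complete, as in Figure 1
    return "partially connected"        # at least one row is short, as in Figure 2

A, B, C = "20.200.20.100", "20.200.20.101", "20.200.20.102"
figure_1 = {A: [A, B, B, C, C], B: [B, A, A, C, C], C: [C, A, A, B, B]}
figure_2 = {A: [A, B, B, C, C], B: [B, A, A, C], C: [C, A, A, B, B]}
figure_3 = {A: [A, B, B], B: [B, A, A], C: []}

for name, table in (("Figure 1", figure_1), ("Figure 2", figure_2), ("Figure 3", figure_3)):
    print(name, "->", classify_cluster(table))
```

The split brain check runs first because a cluster with an isolated server also has short rows on the remaining servers.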


How to fix partial connectivity

You have to restart one or more cache servers to fix partial connectivity. In a 2-server cluster, you only need to restart one of the cache servers. In a 3-server cluster, you may have to restart 2 cache servers.


Identify problem node

  • If the cache cluster is in a partially connected state and one cache server shows the Single Node cache Cluster status, pick that server as the problem node. This is the split brain scenario shown above in Figure 3.

  OR

  • If no server shows the Single Node cache Cluster status, pick the server with the fewest IP addresses listed next to it in the “Connected to Nodes” column of the cluster connectivity window. This is the partially connected scenario shown above in Figure 2.

  AND/OR

  • Open the cluster health window in the NCache Monitor tool and pick the node with the fewest clients in the Clients column.

  AND/OR

  • Pick the node whose Request/sec counter value is lower than that of the other nodes.
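The selection rules above can be written down as a small Python sketch. It assumes the connectivity table and the monitor counters have been copied by hand into records; the NodeRow type, its field names, and the pick_problem_node function are hypothetical and do not correspond to any NCache API. Using the client count and Request/sec value only as tie-breakers is also an assumption, since the text treats them as alternative or additional hints.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class NodeRow:
    address: str
    connected: List[str]           # entries from the "Connected to Nodes" column
    status: str                    # text from the Status column
    clients: int = 0               # optional: Clients column in NCache Monitor
    requests_per_sec: float = 0.0  # optional: Request/sec counter

def pick_problem_node(rows: List[NodeRow]) -> str:
    """Return the address of the server to restart first."""
    # Rule 1: a server reporting Single Node status is the problem node.
    for row in rows:
        if "single node" in row.status.lower():
            return row.address
    # Rule 2: otherwise take the server with the fewest connected entries,
    # breaking ties by fewest clients and then lowest request rate.
    worst = min(rows, key=lambda r: (len(r.connected), r.clients, r.requests_per_sec))
    return worst.address

A, B, C = "20.200.20.100", "20.200.20.101", "20.200.20.102"
figure_2 = [
    NodeRow(A, [A, B, B, C, C], "Partially Connected"),
    NodeRow(B, [B, A, A, C], "Partially Connected"),
    NodeRow(C, [C, A, A, B, B], "Partially Connected"),
]
figure_3 = [
    NodeRow(A, [A, B, B], "Partially Connected"),
    NodeRow(B, [B, A, A], "Partially Connected"),
    NodeRow(C, [], "Single Node cache Cluster"),
]
print(pick_problem_node(figure_2))
print(pick_problem_node(figure_3))
```

With the data from Figure 2 the sketch selects 20.200.20.101, and with the data from Figure 3 it selects 20.200.20.102.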

Stopping cache on that node only

Once a cache cluster is in a partially connected state, it requires manual intervention to recover. Here are the steps to resolve this problem:

  • Once the problem node is identified, right-click that node's IP address under your cache name in NCache Manager and choose Stop. This stops the cache on this node only.
  • You can also use our command line tool stopcache to do the same, using the node's IP address:

       C:\Program Files\NCache\bin\tools>stopcache CacheName /s 20.200.20.102

  • Start your cache again. You can do this in NCache Manager by right-clicking the node's IP address under your cache name and choosing the Start option. You can also use our command line tool startcache by running the following command with the node's IP address:

       C:\Program Files\NCache\bin\tools>startcache CacheName /s 20.200.20.102

  • Verify cluster connectivity again and check whether the cluster has formed in a healthy state.
  • If more than one cache server was found in a partially connected state, repeat the above steps for each of those cache servers, one by one.
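When several servers need the stop/start treatment, it can help to write out the command lines in order before running them. The Python sketch below only builds the command strings in the form shown above (stopcache CacheName /s address, then startcache CacheName /s address) and prints them; it does not execute anything. The function name recovery_commands is hypothetical, and the sketch assumes the tools are run from the NCache tools folder.

```python
from typing import List

def recovery_commands(cache_name: str, problem_nodes: List[str]) -> List[str]:
    """Build the stop/start command lines, one problem node at a time."""
    commands = []
    for ip in problem_nodes:
        # Stop the cache on this node only, then start it again, before
        # moving on to the next node. Cluster connectivity should be
        # verified by hand after each start.
        commands.append(f"stopcache {cache_name} /s {ip}")
        commands.append(f"startcache {cache_name} /s {ip}")
    return commands

for line in recovery_commands("CacheName", ["20.200.20.102"]):
    print(line)
```

Keeping the commands grouped per node preserves the one-by-one order described in the steps, so that only one server is stopped at any time.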

Stopping NCache service

  • Stop all caches on the problem node once again, one by one.
  • Restart the NCache service on the problem node.
  • Start all caches on the problem node again, one by one, if they are not set to start automatically through the NCache Auto-start cache feature.
  • Verify cluster connectivity again and check whether the cluster has formed in a healthy state.

What to Do Next?