Bridge For WAN Replication
This feature is only available in NCache Enterprise Edition.
For large-scale applications, distributed caches are used to improve performance, reliability and runtime scalability. This makes the distributed cache an important part of your disaster recovery plan, where a disaster can range from a natural disaster to an internal hardware or software failure.
The most widely used disaster recovery approach is live data replication to one or more backup sites, so that, when needed, you can redirect your live users to a backup site without errors. However, this requires that your active and backup caches are synchronized at all times. If they are not, the cache clients of your applications can be affected.
NCache provides you with WAN replication through the bridge feature. A bridge is created between multiple cluster caches and data is replicated from the source to the other site through that bridge.
Data replication also helps if, in addition to disaster recovery, you want to deploy your application in geographically separated regions to serve widely distributed customers. In this case you can have two or more active sites, each of which serves the users of its own region and can also act as a backup for the sites of other regions.
Remote data replication is a critical component for any plan to ensure effective and efficient protection of data and rapid recovery from a major interruption. Synchronous replication of data is good for the cluster internally, but its impact on performance becomes a significant consideration when clusters of caches are geographically separated. The bridge is designed for the scenarios that involve replication of data from on-site cache(s) to other on-site/off-site cache(s) across the WAN for disaster recovery. Due to asynchronous replication, all clients connected to the active cache(s) get an impression that the operations are being performed on the active cache while a complete backup is taken to the other cache(s) seamlessly.
When an operation is performed on the source cache, it is asynchronously handed over to the bridge and placed in a queue maintained by the bridge. Operations from the queue are transferred to the target cache when the bridge finds the target cache available and ready to accept operations. The bridge ensures that:
- There is no performance degradation.
- Operations are performed in the same sequence as they were on the original cache.
- Operations are not lost in case of connection failure.
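The guarantees above can be illustrated with a small conceptual model. The following Python sketch is not the NCache API; `BridgeQueue`, `TargetCache` and their methods are hypothetical names chosen for illustration. It assumes a single in-memory FIFO queue that removes an operation only after the target has accepted it, so a connection failure neither loses nor reorders operations.

```python
from collections import deque


class TargetCache:
    """Hypothetical stand-in for the remote cache behind the bridge."""

    def __init__(self):
        self.data = {}
        self.available = True
        self.applied = []  # operations in the order they were applied

    def apply(self, op):
        if not self.available:
            raise ConnectionError("target cache unreachable")
        kind, key, value = op
        if kind == "put":
            self.data[key] = value
        elif kind == "remove":
            self.data.pop(key, None)
        self.applied.append(op)


class BridgeQueue:
    """Toy bridge: buffers operations in FIFO order and forwards them later."""

    def __init__(self):
        self._pending = deque()

    def enqueue(self, op):
        # The source cache only appends and returns, so it never waits on the WAN.
        self._pending.append(op)

    def drain(self, target):
        """Forward queued operations in order; return how many were sent."""
        sent = 0
        while self._pending:
            op = self._pending[0]  # peek: remove only after a successful apply
            try:
                target.apply(op)
            except ConnectionError:
                break  # keep the operation queued and retry on a later drain
            self._pending.popleft()
            sent += 1
        return sent

    def __len__(self):
        return len(self._pending)
```

In this model, calling `drain` again after the target becomes available delivers the operations in exactly the order in which they were enqueued.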
Pluggable Caching Architecture: Caches are not aware of each other; they only know about the bridge and replicate their data to the bridge. Because of this loose coupling, you can configure a bridge between multiple caches irrespective of their cache topology, and you can freely remove caches that are configured with a bridge.
Data Integrity: Operations performed at source cache are enqueued by the bridge maintaining the actual order in which they were performed at the source cache. The bridge performs operations on target cache in the same order. Conflicts are resolved on the target cache. In this way, caches eventually become consistent.
Dedicated Bridge Service: The bridge is also a stand-alone and dedicated service like the cache service so your cache operations will not be affected if bridge operations are delayed due to latency in the network.
Configuring Bridge: You can configure your bridge on the same server where your cluster cache resides, or you can create it on a separate server node. You can then add any of your cluster caches to the bridge, and data will be replicated between them.
Disaster Recovery: You can configure a bridge between an active and a passive data center for disaster recovery.
Serving Geographically Distributed Customers: You can have two or more active sites, each of which serves the users of its own region and can also be used as a backup for the sites of other regions.
Asynchronous Replication: For WAN replication, asynchronous replication is used so that cache operations will not suffer in case of delays in bridge operations.
Queue Backup: The bridge is essentially a multi-node clustered queue in which one node is active and the other is passive. The passive node holds a backup of the active queue to avoid data loss on the bridge.
Connection Retries: The bridge also tries to replicate all operations by retrying when any connection failure occurs.
Bridge Replicator Queue: The bridge replicator queue size is included in the cache size. If the cache is unable to connect to the bridge, operations are queued on the cache until the cache is full. Once the cache is full, cached items are evicted to make space for the growing bridge replicator queue.
Caches: You can use any topology for cluster caches that will be a part of the bridge. You can also use different topology caches on each site in one bridge. However, it is highly recommended that the same topology on both sides is used.
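The Bridge Replicator Queue behavior described above (queue space counted against the cache size, with cached items evicted to make room) can be sketched as follows. This is a simplified model and not NCache code: `SourceCacheModel` is a hypothetical name, capacity is counted in slots rather than bytes, the oldest item is evicted first, and an error is raised once only queued operations remain. All of these are assumptions made for the example.

```python
from collections import OrderedDict, deque


class SourceCacheModel:
    """Toy cache whose capacity is shared by items and the replicator queue."""

    def __init__(self, capacity):
        self.capacity = capacity         # total slots: items + queued operations
        self.items = OrderedDict()       # oldest entry first (assumed eviction order)
        self.replicator_queue = deque()  # operations waiting for the bridge

    def _used(self):
        return len(self.items) + len(self.replicator_queue)

    def _make_room(self, needed):
        # Evict cached items only; queued operations are never dropped.
        while self._used() + needed > self.capacity:
            if not self.items:
                # Assumption: the model simply refuses once nothing is evictable.
                raise RuntimeError("cache full: only queued operations remain")
            self.items.popitem(last=False)

    def put(self, key, value, bridge_connected):
        if not bridge_connected:
            # The bridge is unreachable, so hold the operation locally.
            self._make_room(1)
            self.replicator_queue.append(("put", key, value))
        if key not in self.items:
            self._make_room(1)
        self.items[key] = value
        self.items.move_to_end(key)
```

With a capacity of four slots, two items stored while connected and two more stored while disconnected, the model evicts the two oldest items and keeps both queued operations.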
Cache Synchronization Modes
The NCache bridge can have multiple caches connected to it, and you can provide any of the following sync modes to the cache for disaster recovery:
An active cache can be of any topology, where all clients connect and perform read and write operations that are replicated via bridge to the other connected caches.
A passive cache can be of any topology, but it is recommended that it is the same as the active cache. All operations performed on the active cache are replicated to the passive cache. Clients can connect to the passive cache and perform both read and write operations, so modifications can be made at the passive cache if required, but those operations are not replicated to the active cache.
If the active site goes down for any reason, you can redirect your requests to your passive site by making it active. Your passive site will then behave as the active site and serve all requests without failures.
When your old active site has been restarted and is ready, you can restore your original configuration. Make both sites active, which transfers all data from the old passive site to the old active site. When all data has been transferred, you can reconfigure the old passive site as passive again and redirect all requests to the active site.
When two active caches communicate, the same cached data may be updated on both caches almost simultaneously, which produces a conflict. To resolve this, NCache provides a configurable Conflict Resolver, which resolves operation conflicts when operations are replicated through the bridge. By default, the latest operation "wins" and is applied on the cache in case of a conflict.
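To make the default "latest operation wins" rule concrete, here is a minimal Python sketch. It is not the NCache Conflict Resolver interface. `VersionedEntry`, `resolve_latest_wins` and `apply_replicated` are hypothetical names, and the example assumes that every update carries a timestamp that is comparable across sites and that ties keep the existing entry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VersionedEntry:
    value: str
    timestamp: float  # assumed: update time, comparable across both sites


def resolve_latest_wins(existing, incoming):
    """Return the entry to keep; on a tie the existing entry stays (assumption)."""
    if existing is None or incoming.timestamp > existing.timestamp:
        return incoming
    return existing


def apply_replicated(cache, key, incoming, resolver=resolve_latest_wins):
    """Apply a replicated update to a dict-based cache through the resolver."""
    cache[key] = resolver(cache.get(key), incoming)
```

Passing a different function as `resolver` models the idea of a configurable resolver: the replication path stays the same and only the decision rule changes.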
If any site goes down, you can redirect all of its requests to the other active sites. When the downed site is up again, data is transferred to it from the site that kept running, and you can then redirect that region's requests back to it.
It is recommended that the caches have the same configurations other than topology to avoid issues. For example, if the data source is configured on one cache, you should configure it on the other cache too. This is because the same operation specifications from one site cache will be replicated to the other site.
A state transfer can be manually triggered between two caches, if you want to synchronize the state of the source cache to the target cache. This happens through a bridge.
The state transfer operations are queued at the cache end. As soon as the cache connects to the bridge, it starts sending its operations to the bridge, which relays them to the target cache. Once a state transfer between the caches has been initiated, you cannot initiate another one; this ensures the stability of the system.
Case 1: A cache goes down during state transfer:
Suppose Cache A and Cache B have a state transfer in progress, and Cache A goes down and does not come back. Because the bridge still regards the state transfer as in progress, any new state transfer request would be rejected. To handle this, the bridge detects when it has not received any operations from Cache A for a specified interval. That state transfer is then considered corrupted, and the bridge allows a new state transfer request.
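The stalled-transfer check in Case 1 can be modeled with a simple idle timer. The sketch below is an illustration under assumptions, not the bridge's actual logic: `StateTransferTracker` and its methods are hypothetical names, and time is passed in explicitly (in seconds) so the behavior is deterministic.

```python
class StateTransferTracker:
    """Toy model: allow one state transfer at a time, unless it has stalled."""

    def __init__(self, idle_timeout_s):
        self.idle_timeout_s = idle_timeout_s  # the "specified interval"
        self.source = None                    # cache currently transferring state
        self.last_activity = None             # time of the last received operation

    def _is_stalled(self, now):
        return (self.source is not None
                and now - self.last_activity >= self.idle_timeout_s)

    def request(self, source, now):
        """Return True if the new state transfer request is accepted."""
        if self.source is not None and not self._is_stalled(now):
            return False  # a healthy transfer is still in progress
        # Either the bridge is idle or the old transfer is treated as corrupted.
        self.source = source
        self.last_activity = now
        return True

    def record_operation(self, now):
        # Called whenever a state transfer operation arrives from the source.
        self.last_activity = now

    def complete(self):
        self.source = None
        self.last_activity = None
```

For example, with a 30 second interval, a request that arrives 25 seconds after the last operation is rejected, while one that arrives 30 seconds after it is accepted.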
Case 2: Cache and Bridge have network glitch but do not disconnect:
In case the cache and bridge connection experiences a network glitch, such that it is partially connected, the operation and state transfer queues are still intact. Hence, the state transfer can be resumed as no operations have been lost.
Cache A is the active cache and Cache B is passive. Cache A sends operations to Cache B through the bridge, but Cache B goes down for some time. The bridge queue therefore fills up with operations sent from Cache A, but nothing is dequeued because Cache B is down. Eventually the bridge queue fills up completely and has no space for more operations, so the bridge tells Cache A to hold operations at its end until space is available. From then on, operations are queued only at the cache end. Suppose Cache B comes back up: the bridge replicator queue starts sending operations to it, and the queue drains. Once a configurable amount of space (20MB by default) has been freed, the bridge tells Cache A that it can send its queued operations.
However, in case the cache queue fills up and eviction is enabled, the cache will evict its cached items, but not the queued operations. This prevents loss of operations.
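The hold-and-resume behavior between the bridge and Cache A can be sketched as a bounded queue with a resume threshold. This is a conceptual model, not NCache code: `Bridge`, `ActiveCache` and their methods are hypothetical, and space is counted in operations rather than megabytes to keep the example small.

```python
from collections import deque


class Bridge:
    """Toy bridge queue that stops accepting when full and resumes later."""

    def __init__(self, capacity, resume_threshold):
        self.capacity = capacity                  # maximum queued operations
        self.resume_threshold = resume_threshold  # free slots needed to resume
        self.queue = deque()
        self.accepting = True

    def offer(self, op):
        """Return True if accepted; False tells the source to hold the operation."""
        if not self.accepting:
            return False
        if len(self.queue) >= self.capacity:
            self.accepting = False
            return False
        self.queue.append(op)
        return True

    def deliver_one(self):
        """Dequeue one operation for the target; reopen once enough space is free."""
        op = self.queue.popleft()
        free = self.capacity - len(self.queue)
        if not self.accepting and free >= self.resume_threshold:
            self.accepting = True
        return op


class ActiveCache:
    """Toy source cache that holds operations locally while the bridge is full."""

    def __init__(self, bridge):
        self.bridge = bridge
        self.held = deque()

    def replicate(self, op):
        # New operations go behind any held ones so the order is preserved.
        self.held.append(op)
        self.flush()

    def flush(self):
        while self.held and self.bridge.offer(self.held[0]):
            self.held.popleft()
```

The resume threshold keeps the bridge from reopening as soon as a single slot frees up, which mirrors the "configurable amount of space" described above.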