Let’s suppose you have an e-commerce business that uses a distributed cache such as NCache for faster response times. During the holiday season, your cache cluster is expected to serve thousands of connected clients, but instead, your Customer Support team is bombarded with complaints about website downtime and a slow user experience. What went wrong? Nobody was monitoring the cache under peak load.
Monitoring your cache during production helps you identify the warning signs before they become troublesome. This prevents your business from experiencing potential network interruption, memory overheads and more.
Rich Set of Monitoring Tools in NCache
NCache comes packed with various tools to help you monitor your caches. These include:
- NCache Web Manager: This is a web-based management tool to configure your caches and view their statistics. It ships with NCache and lets you perform management tasks such as adding or removing nodes from the cache, configuring security and more.
- NCache Web Monitor: This is a web-based monitoring tool that gives you a real-time assessment of how your distributed caches and remote clients are performing. It includes a default dashboard with simple drag-and-drop counters that can be monitored per node. You can also design custom dashboards according to your metrics of interest.
- NCache Server Logs: Logging all cache activity helps you detect problems before they get serious, or simply observe cache behavior in certain environments. By default, all your cache/bridge activity is logged in files on each server node. NCache also provides a sophisticated Log Viewer that organizes your logs for better readability.
- NCache Perfmon Counters: NCache provides numerous counters to Windows Performance Monitor so you can monitor cache performance using PerfMon compatible tools as well. The counter information can help to determine process limitations and fine-tune the environment and applications if needed.
- NCache Event Logs: NCache also logs its events, by severity level, in the Windows Event Logs, which keep a detailed record of security, application and system events. This allows quick diagnosis of any errors occurring in the cache cluster. These event logs are also displayed within the NCache Web Monitor along with other metrics.
Conduct Baseline Performance Test Before Production
Before you begin monitoring your caches in production, it is recommended that you perform a pre-production baseline test with your live environment configuration to determine the acceptable performance threshold for your cache. If production is already live, you can perform this test in staging.
This baseline performance test enables you to monitor your cache performance against this threshold and helps diagnose a particular problem if you know what the optimum performance should be. For example, you can choose to add more servers if the memory utilization is consistently higher than the baseline mark.
1. Application Performance Baseline
You need to test your environment from the following aspects:
- Application Tier Testing: This testing is independent of NCache and is solely the performance of your application. For example, if it’s a web app, you need to test the response time of the page requests.
- Database Tier Testing: This is also independent of NCache and involves the database response times for queries, network overhead, and performance for large database sets.
2. NCache Performance Baseline
So, where does NCache fit into this? You need to note the performance numbers of NCache at the time you’re getting an acceptable performance from your end application and the database. These performance numbers become your baseline for NCache and include requests/sec, average time/operation, object size, memory/CPU.
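As a minimal sketch of how such a baseline could be recorded and compared later, the snippet below stores one snapshot of the figures listed above and reports which ones have drifted. The `CacheMetrics` class, its field names, the sample values and the 25% tolerance are all illustrative assumptions, not part of NCache; the values would come from whatever monitoring tool you use.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class CacheMetrics:
    """One snapshot of baseline figures (hypothetical field names)."""
    requests_per_sec: float
    avg_us_per_operation: float
    avg_object_size_bytes: float
    memory_percent: float
    cpu_percent: float


def deviations(baseline: CacheMetrics, current: CacheMetrics,
               tolerance: float = 0.25) -> dict[str, float]:
    """Return metrics whose relative change from baseline exceeds `tolerance`."""
    flagged = {}
    for f in fields(CacheMetrics):
        base = getattr(baseline, f.name)
        now = getattr(current, f.name)
        if base == 0:
            continue  # no meaningful ratio against a zero baseline
        change = (now - base) / base
        if abs(change) > tolerance:
            flagged[f.name] = round(change, 2)
    return flagged


# Illustrative numbers only: latency has grown from 10 to 50 microseconds.
baseline = CacheMetrics(5000, 10, 2048, 40, 35)
current = CacheMetrics(5200, 50, 2048, 41, 36)
print(deviations(baseline, current))  # {'avg_us_per_operation': 4.0}
```

A relative tolerance is used here so that one setting works across metrics with very different units; in practice each metric may deserve its own threshold.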
You can also review the event logs while the cluster is in a healthy, working state and keep them as a reference for comparison once the cache goes into production. Share these baseline details with your monitoring teams so that they have something to compare against when monitoring the production environment and can spot anomalies quickly.
NCache performance can impact your application’s performance. An anomaly could be that one of the nodes within the cluster is taking more load than it should. Compare to the baseline and identify if this is an actual anomaly.
Some anomalies are expected. For example, when the load increases, CPU consumption rises on all the server nodes in the cluster. That is perfectly normal, as nothing is failing at this point. You just need to add another server to the cluster to share the load among all of them.
Anomalies can also point to bugs because, just like any other software, NCache has bugs that are constantly fixed by our teams. These can be detected through the behavior of NCache, your application, NCache logs, etc.
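The difference between a node carrying more than its share and a uniform rise in load can be sketched as follows. This is an illustration only: the function name, the node names and the 1.5x ratio are assumptions, and the per-node figures would be read from your monitoring tool.

```python
from statistics import mean


def overloaded_nodes(requests_per_sec: dict[str, float],
                     max_ratio: float = 1.5) -> list[str]:
    """Return nodes whose load exceeds `max_ratio` times the cluster mean."""
    if not requests_per_sec:
        return []
    average = mean(requests_per_sec.values())
    if average == 0:
        return []
    return sorted(node for node, load in requests_per_sec.items()
                  if load / average > max_ratio)


# Uniform growth: every node is busier, but none stands out.
print(overloaded_nodes({"node1": 9000, "node2": 9200, "node3": 8800}))   # []
# Skewed load: one node carries far more than its share.
print(overloaded_nodes({"node1": 15000, "node2": 3000, "node3": 3000}))  # ['node1']
```

Comparing each node against the cluster mean, rather than against a fixed number, keeps the check quiet when the whole cluster gets busier together.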
Monitor NCache Performance in Production
One of the metrics you need to monitor in NCache is the latency or the response time for an operation performed on the cache. In any environment, a little latency can be expected, owing to multiple factors such as network speed or complex computational processes. However, if a bulk operation took 10 μs on the baseline test, but is now taking 50 μs regularly, this is a red flag for your cluster performance. You may also run into regular Operation timeout exceptions within your application. This is why monitoring latency spikes is crucial.
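To separate an occasional spike from a regression that happens "regularly", one option is to look at a rolling window of samples. The sketch below is an assumption-laden illustration: `LatencyWatch`, the 3x factor and the five-sample window are hypothetical choices, and the readings stand in for values taken from a latency counter.

```python
from collections import deque


class LatencyWatch:
    """Flags a sustained latency regression rather than a single spike."""

    def __init__(self, baseline_us: float, factor: float = 3.0, window: int = 5):
        self.limit_us = baseline_us * factor
        self.samples = deque(maxlen=window)

    def observe(self, latency_us: float) -> bool:
        """Record a sample; return True once every sample in a full window exceeds the limit."""
        self.samples.append(latency_us)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and all(s > self.limit_us for s in self.samples)


# Baseline of 10 microseconds; isolated spikes early on, then a sustained rise.
watch = LatencyWatch(baseline_us=10)
readings = [11, 48, 12, 50, 52, 49, 51, 50]
alerts = [watch.observe(r) for r in readings]
print(alerts)  # only the last reading triggers an alert
```

Requiring a full window above the limit trades a short detection delay for fewer false alarms.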
If your application shows a performance anomaly, always check whether NCache shows a performance-related anomaly at the same point. If NCache is performing normally, the anomaly is not caused by NCache and is specific to your application or environment.
For example, during production, the network cards on the servers may start to get overwhelmed. This can happen either because both the server-server and client-server communications are on the same card, or because the objects are so large that they drive up network card usage. Here, you can take advantage of NCache’s dual network card feature to separate the server-server and client-server communication. Similarly, on the client side, you can separate the NCache and client machine communication. This divides the load so that the network card does not become a performance bottleneck.
Moreover, performance can also be impacted by serialization/deserialization or compression/decompression cost of your cache objects at the client end.
1. Server-side Counters
You can monitor the following NCache server counters. The nature of your application determines which counters to watch. For example, for a write-heavy application, you may need to keep an eye on the Average μs/Write-thru, Average μs/datasource write or Average μs/insertbulk counters.
Average μs/add | Average μs/addbulk | Average μs/cache operation | Average μs/datasource update | Average μs/datasource write |
Average μs/fetch | Average μs/fetchbulk | Average μs/insert | Average μs/insertbulk | Average μs/Query Execution |
Average μs/Read-thru | Average μs/Write-thru | Average μs/remove | Average μs/removebulk | Lucene Average μs/Write Operation |

2. Client-side Counters
Likewise, you can monitor the client-side counters for latency:
Average μs/add | Average μs/addbulk | Average μs/cache operation | Average μs/datasource update | Average μs/datasource write |
Average μs/fetch | Average μs/fetchbulk | Average μs/insert | Average μs/insertbulk | Average μs/Query Execution |
Average μs/Read-thru | Average μs/Write-thru | Average μs/remove | Average μs/removebulk | Lucene Average μs/Write Operation |

Monitoring NCache Cluster Health in Production
If you’re dealing with a large number of clients in a distributed cache cluster, it goes without saying that you need to ensure that the cluster is healthy and tuned under peak loads. NCache server and client application health can be monitored through NCache tools that show you cluster activity through cache counters.
Data centers usually have very good networks, but we have seen sockets break and networks get interrupted in our customer environments. Even when the whole connection does not break, the interruption delays communication. It is therefore necessary to watch for the cluster becoming partially connected, as this results in split-brain and interrupted client connections.
NCache initiates an auto-recovery mechanism to resolve this, which is an expensive task. Hence, you need to monitor your cluster health.
Using NCache Web Monitor, you can monitor various metrics for cache health:
1. Cluster Health
You can see the status of each server node in a cluster, its connection with the other nodes and the number of connected clients in one glance.
2. Windows Event Logs
You can easily check for any errors in the event log, which also displays a detailed message against each event. So, in case of a partially connected cluster, you can effortlessly diagnose whether it is because of split-brain or some other reason, as logged in the Event Logs window.
3. API Logs
You can also choose to log API calls from the server node to the client. However, this option is memory-intensive.

4. System Resources
To verify that your cluster is healthy, you also need to monitor CPU utilization, memory spikes, and network usage to ensure that your applications are not held back by a shortage of these resources. If you see a sustained rise in CPU utilization, for example, you can choose to increase your CPU resources.

5. NCache Alerts
NCache also provides a mechanism to send alerts on certain events like node start/stop or state transfer started. These are sent to a configured email address so you are notified of any unexpected activity. You can read more about this in NCache Docs. Apart from these, cache health alerts for CPU utilization, queue size, memory, network bandwidth and requests/sec are also logged in alerts.xml if the values cross the pre-configured threshold value.
Monitoring NCache Load/Capacity in Production
You need to determine the typical peak load for your cache cluster and the transactions performed on each server. If traffic increases, say during an annual sale, the environment may become unstable or behave unpredictably. Monitor how many fetches or requests per second each server is handling so that you can quickly perform root cause analysis of any performance-related issue you encounter.
The throughput against the load will determine if there is a need for increasing the capacity. If you have already performed a baseline test for load monitoring, and the statistics show a consistent spike in the number of transactions, you can choose to scale up by increasing the CPU resources or scale out by adding more cache servers.
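For the scale-out option, a rough estimate of how many servers the observed peak calls for might look like the sketch below. It rests on loud assumptions that the text does not make: load spreads evenly across servers, throughput scales linearly with server count, and a 20% headroom is wanted. The function name and sample figures are hypothetical.

```python
import math


def servers_needed(observed_requests_per_sec: list[float],
                   per_server_baseline: float,
                   headroom: float = 0.2) -> int:
    """Estimate server count for the observed peak.

    Assumes even load distribution and linear scaling, which is a simplification.
    """
    if per_server_baseline <= 0:
        raise ValueError("per_server_baseline must be positive")
    peak = max(observed_requests_per_sec)
    return math.ceil(peak * (1 + headroom) / per_server_baseline)


# Cluster-wide requests/sec sampled during a sale; each server handled
# 10,000 requests/sec comfortably in the baseline test (illustrative numbers).
samples = [18000, 21000, 24000, 23500]
print(servers_needed(samples, per_server_baseline=10000))  # 3
```

Treat the result as a starting point for a load test rather than a sizing guarantee, since real clusters rarely scale perfectly linearly.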
1. Server-side Counters
The following counters can be monitored to check the load on the environment at the server end. The most commonly used counters include Cluster Ops/sec, Additions/sec, Fetches/sec and Expirations/sec.
Bytes Received/sec | Bytes Sent/sec | Client Bytes Sent/sec | Cache Size | Data Operations/sec |
Cluster Ops/sec | Bridge Operations Received/sec | Bridge Operations Sent/sec | Client Responses/sec | Client Requests/sec |
Evictions/sec | Fetches/sec | Deletes/sec | Expiration/sec | Messages Published/sec |
Readthru/sec | Queue Count | Requests/sec | Updates/sec |

2. Client-side Counters
At the client end, you can monitor the load and capacity using any of the following counters:
Bytes Received/sec | Bytes Sent/sec | Client Bytes Sent/sec | Cache Size | Data Operations/sec |
Cluster Ops/sec | Bridge Operations Received/sec | Bridge Operations Sent/sec | Client Responses/sec | Client Requests/sec |
Evictions/sec | Fetches/sec | Deletes/sec | Expiration/sec | Messages Published/sec |
Readthru/sec | Queue Count | Requests/sec | Updates/sec |

Summing it Up
NCache is a feature-rich distributed data store with 100% native .NET and Java support. When your cache clusters are running in a high-transaction production environment, it is essential to monitor the nodes, cluster and client connections along with cache resources like memory and network bandwidth. NCache comes packed with multiple tools and alerts to make monitoring your cluster environment as convenient as possible. This not only lets you keep an eye on unexpected spikes in the metrics, but also helps you diagnose the source of performance degradation.