At times it may seem like interpreting performance counters is a closely guarded secret that we don’t want everyone to know. But the fact is, we say that it depends because it truly does. When I tell someone that the answer depends on their workload, server configuration, and a number of other factors, I sometimes get pushback along the lines of, “I know ‘it depends’, but in my case, what should it be? You can tell me.” There are several reasons why we say that target values for performance counters depend, and why we say that you need to baseline your systems (more on that later).
Workloads and server configurations vary wildly. A good number for one system or one workload could be vastly different from another system or workload.
For example, a database with a lot of little look-up tables could have a really high scans-per-second rate and still perform very well. However, a database with a lot of really large tables could experience severe performance problems with a much lower rate of scans per second. The key here is that the workloads are different, and no single number would serve as a good rate for scans per second.
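If you want to see what your own scan rates look like, one place to start is the sys.dm_os_performance_counters DMV. The counter and object names below are the real ones SQL Server exposes, but treat the query as a sketch: per-second counters in this DMV are cumulative totals since startup, so you need two samples over a known interval to compute an actual rate.

```sql
-- Sketch: read the cumulative scan counters from the Access Methods object.
-- These are running totals, not rates; sample twice and divide the delta
-- by the elapsed seconds to get a true scans-per-second figure.
SELECT [object_name], counter_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name IN (N'Full Scans/sec', N'Range Scans/sec')
  AND [object_name] LIKE N'%Access Methods%';
```

Even once you have the rate, the point of the paragraph above stands: whether that number is good or bad depends entirely on what the workload is doing.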
Workloads change. When the behavior of SQL Server changes, one of the first suspicions is that something has changed.
Applications get upgraded, new features get added, new technologies get implemented, existing applications get opened up to new customers, and activity can fluctuate with the seasons or trends. Counters that are normal for a given day, a given month, or a given time of year can jump suddenly with no degradation in performance at all.
SQL Server environments are constantly evolving. Operating systems change, SQL Server changes, hardware changes, networks change.
Two counters come to mind immediately when I think of this. Disk queue length and Page Life Expectancy (PLE) used to be counters that DBAs swore by, and everyone knew what the thresholds for those counters were. The gospel used to be that disk queue length should not exceed 2 to 5 outstanding I/Os per disk, and PLE was okay if it was at least 300. Disk queue length is no longer a reliable counter unless you understand the storage behind it. Most production SQL Servers are on disk arrays or SANs with a large number of disks behind them. Also, today’s storage can have its own caching or internal queuing systems that are completely transparent to SQL Server and the operating system.
The recommendation of a minimum of 300 for PLE was based on old non-NUMA systems that had relatively small amounts of RAM. For more info on PLE and how it has changed, read the following blog posts Page Life Expectancy isn’t what you think… and Troubleshooting Page Life Expectancy Drops by Paul Randal (blog|@PaulRandal).
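Those posts also explain why a single server-wide PLE number can hide problems on NUMA hardware, where each node has its own buffer pool and its own PLE. As a sketch (the counter and object names are the real ones, but object names carry an instance prefix such as 'SQLServer:' or 'MSSQL$InstanceName:', hence the LIKE patterns), you can see both the overall and per-node values:

```sql
-- Sketch: overall PLE (Buffer Manager) plus per-NUMA-node PLE (Buffer Node).
-- On NUMA systems, one starved node can drag performance down even when
-- the server-wide PLE still looks comfortable.
SELECT [object_name], instance_name, cntr_value AS ple_seconds
FROM sys.dm_os_performance_counters
WHERE counter_name = N'Page life expectancy'
  AND ([object_name] LIKE N'%Buffer Manager%'
       OR [object_name] LIKE N'%Buffer Node%');
```

What counts as a healthy value still depends on how much RAM the server has and what the workload looks like, which is exactly the point.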
This calls out the reason why declaring specific values for counters is dangerous. If you tell someone that they should always look for a certain value, and then things change, there is no way to adequately pull that recommendation back. The disk queue and PLE recommendations have not been valid for many years, but there are still blog posts out there recommending them and people relying on them.
So we tell you to baseline so that you know what your system generally looks like when it’s healthy. If you are baselining, you can compare your performance counters to last week’s numbers, last month’s numbers, or even last year’s numbers. But if you’re not already baselining, and you have an emergent issue that you need to investigate right now, that advice is not going to help with this issue. Most of the performance counters are not going to be very helpful, and you will need to dig into the current activity on the server. Look for obvious things like blocking and extremely high degrees of parallelism. Look at the wait statistics for the currently active requests and try to determine whether you have an issue with memory, CPU utilization, or other bottlenecks.
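A starting-point sketch for that kind of live triage is to pull the currently executing requests with their waits and any blockers from sys.dm_exec_requests (the DMV and columns are real; the filter and ordering are just one reasonable way to slice it):

```sql
-- Sketch: active user requests with their current wait and any blocker.
-- A populated blocking_session_id points at blocking; the wait_type column
-- hints at whether you're bound on memory, CPU, I/O, or something else.
SELECT r.session_id,
       r.status,
       r.command,
       r.wait_type,
       r.wait_time            AS wait_time_ms,
       r.blocking_session_id,
       r.cpu_time,
       r.total_elapsed_time
FROM sys.dm_exec_requests AS r
JOIN sys.dm_exec_sessions AS s
  ON s.session_id = r.session_id
WHERE s.is_user_process = 1
ORDER BY r.total_elapsed_time DESC;
```

None of this replaces a baseline; it just tells you what is happening right now, which is all you have to work with when the baseline doesn’t exist.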
To learn more about baselining and establishing thresholds for monitored performance counters, see my new whitepaper published by Idera: Establishing and Setting Thresholds for a SQL Server Environment.