Do You Really Have A HA Issue?

I tend to be involved in many HA-related issues, as is the nature of my current work agreement. The interesting part is that roughly 85% of all the “HA” issues are performance issues (with 10-12% being config issues and the last 3-5% being truly HA issues). Whether it’s an FCI, an AG, Replication, or some combination of the aforementioned, the overwhelming majority of root causes is a lack of performance. Setting FCIs aside, AGs and Replication are extra parts of the high availability puzzle in that they have their own processes, threads, and queues, each with its own workflow, and they are a net add to the amount of work the system must complete. When the system is already running at or near capacity, adding on these items is not “free” from a performance standpoint. These additions can easily cause a system on the verge of overload to tip over and become a dumpster fire in mere seconds.

The largest tell, outside of high CPU or low memory indicators, is the signal wait times in your wait stats, often coupled with the belief that hyperthreaded cores are full execution units and that using HT magically doubles the workload that can be completed (spoiler: they aren’t, and it doesn’t). If there are high signal waits, then we’re having issues getting runtime on a CPU (there are many reasons for that, not just high CPU) and scheduler health may need to be checked. However, if we’re waiting to be scheduled and we’re using a technology such as Availability Groups, then logically there will be extra waiting for any steps the Availability Group needs to perform, such as the synchronous commit code.

A quick example: let’s say the average wait time to be scheduled on a CPU is 5ms. In a vacuum, that’s not a large amount of time. If your queries are able to run in less than a second, you might not even notice these waits are occurring, as the impact on any individual query or execution might not be very large. Now add in an Availability Group with a synchronous commit partner, where there are multiple asynchronous threads, queues, and steps: the data must be captured, packaged up, and sent across the network, a response received, internal data updated, and other threads signaled about the newly hardened log. Phew. If we’re adding 5ms at each of those scheduling points (across multiple threads on both the primary and the secondary), we’re quickly adding in another 100 or so milliseconds, so no wonder we’re going to see hadr_sync_commit waits. Seeing the waits doesn’t mean there is an issue with the Availability Group itself; it’s just how this specific wait (used here as an example because it’s the most obvious AG wait) is impacted. This can have the downstream effect of longer DML execution times, which might hold up other processes, and this can spiral quite quickly.
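If you want a rough first look at this on your own instance, something along these lines is a starting point. This is a minimal sketch, not a full wait-stats methodology: the excluded idle waits are only a small sample of the benign ones, and the scheduler query is a single point-in-time view.

```sql
-- Rough signal wait percentage: how much of the overall wait time was spent
-- waiting to get back on a CPU after the resource wait was already satisfied.
-- The excluded idle/background waits below are a small, non-exhaustive list.
SELECT
    CAST(100.0 * SUM(signal_wait_time_ms)
         / NULLIF(SUM(wait_time_ms), 0) AS decimal(5, 2)) AS signal_wait_pct
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (N'LAZYWRITER_SLEEP', N'SLEEP_TASK', N'WAITFOR',
                        N'XE_TIMER_EVENT', N'SQLTRACE_BUFFER_FLUSH',
                        N'BROKER_TASK_STOP', N'DIRTY_PAGE_POLL');

-- Point-in-time scheduler view: runnable_tasks_count consistently above zero
-- across most schedulers suggests tasks are queuing for CPU time.
SELECT scheduler_id, current_tasks_count, runnable_tasks_count,
       work_queue_count, pending_disk_io_count
FROM sys.dm_os_schedulers
WHERE status = N'VISIBLE ONLINE';

-- Average synchronous commit wait since the last restart or stats clear.
SELECT wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms,
       1.0 * wait_time_ms / NULLIF(waiting_tasks_count, 0) AS avg_wait_ms
FROM sys.dm_os_wait_stats
WHERE wait_type = N'HADR_SYNC_COMMIT';
```

There’s no magic threshold here, but a consistently high signal wait percentage, or schedulers with tasks regularly stacked up in the runnable queue, are the kinds of signals worth chasing before blaming the AG.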

The decision to add high availability to a system might be made at the business level, but it also needs to be checked at the server performance level to see if it is viable. Too many times I see broad generalizations about “performance” that aren’t rooted in reality, or no awareness at all of the performance health of the systems involved, which leads to poor choices in the HA space and ultimately a worse situation. If you don’t want downtime due to crap performance, make sure you’re not falling into the common performance traps, keep an eye on your server, and tune and capacity plan accordingly.
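On the “keeping an eye on your server” point, one low-effort habit is to diff the cumulative wait stats over a short interval instead of reading the totals since the last restart, so you can see what the box is waiting on right now. A rough sketch, where the one-minute delay and the lack of any wait-type filtering are just placeholder choices:

```sql
-- Snapshot cumulative wait stats, wait a minute, then diff the two snapshots
-- to see which waits (and how much signal wait) accrued during the interval.
SELECT wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
INTO #wait_snapshot
FROM sys.dm_os_wait_stats;

WAITFOR DELAY '00:01:00';

SELECT w2.wait_type,
       w2.waiting_tasks_count - w1.waiting_tasks_count AS tasks_delta,
       w2.wait_time_ms        - w1.wait_time_ms        AS wait_ms_delta,
       w2.signal_wait_time_ms - w1.signal_wait_time_ms AS signal_ms_delta
FROM sys.dm_os_wait_stats AS w2
JOIN #wait_snapshot AS w1
    ON w1.wait_type = w2.wait_type
WHERE w2.wait_time_ms - w1.wait_time_ms > 0
ORDER BY wait_ms_delta DESC;

DROP TABLE #wait_snapshot;
```

Run it during real peak load; quiet-hour samples will only tell you what you want to hear.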