“Attempting to recycle all monitors that have loaded the DLL” sqsrvres.dll or hadrres.dll, and you notice a SQL FCI/AG failed its health checks

I won’t say I told you so…

I will, however, point out common AG question #8… I’m not sure why so many of these are coming up lately, but I’ve now had to troubleshoot a few, and I think it’s worth documenting this behavior.

Let’s say you have a Windows cluster that hosts FCIs, AGs, or a combination of both, and it definitely has more than a single instance installed. Eventually (hopefully, at least) one instance on a node gets patched, and you notice that every other instance on that node which is currently the primary (AG) or running on the owning node (FCI) has an availability outage: the instances go offline for a short period and then (should) automatically recover, starting back up (FCI) or coming back online (AG).

This happens due to the way clustered applications are written against the cluster standards. In WSFC there is a component called the Resource Hosting Subsystem (RHS), whose job is to perform resource-specific actions, which mainly means health checks. These are the “IsAlive” and “LooksAlive” checks that everyone likes to talk about but rarely discusses in any depth. When a new version of a cluster resource DLL replaces the old one, the cluster notices and asks every RHS process that has the older version loaded to unload it and recycle, so that the process picks up the new DLL. This might not cause problems for some resource types, as their checks are relatively simple. The more complicated checks that SQL Server does are another matter: the recycle will, in almost all cases, cause a health check failure and at least a small blip in service uptime, though most SQL HA items recover fast enough that clients generally don’t notice.

The tell-tale sign that someone or something did patching is the following entry in the cluster log:

“[RCM] Attempting to recycle all monitors that have loaded the DLL …”

Once I see this, I know what happened, and it’s simple to figure out when the patching was done and who did it. I think we can see the two obvious solutions to this:

  1. Don’t have more than a single instance in your FCI/AG setup
  2. If you must use multiple instances (because you hate having free time to enjoy life) then only patch a node when it is a secondary (AG) or not the current owning node (FCI)
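
If you’d rather not eyeball the cluster log for that tell-tale line, a few lines of Python can do the looking. This is only a rough sketch: it assumes the log has already been read in as text lines, the function name `find_recycle_events` is my own invention, and the sample lines at the bottom are made up for illustration rather than copied from a real log. It matches on the message text quoted above and then checks whether the rest of the line mentions one of the two SQL Server resource DLLs.

```python
# Message text quoted earlier in this post.
MARKER = "[RCM] Attempting to recycle all monitors that have loaded the DLL"

# The two SQL Server resource DLLs this post is about.
SQL_RESOURCE_DLLS = ("sqsrvres.dll", "hadrres.dll")


def find_recycle_events(lines):
    """Return (line_number, dll_name, line) for each recycle message
    that names one of the SQL Server resource DLLs."""
    hits = []
    for number, line in enumerate(lines, start=1):
        position = line.find(MARKER)
        if position == -1:
            continue  # not a recycle message
        # Only look at the text after the marker, case-insensitively.
        tail = line[position + len(MARKER):].lower()
        for dll in SQL_RESOURCE_DLLS:
            if dll in tail:
                hits.append((number, dll, line.rstrip()))
                break
    return hits


# Made-up sample lines, NOT real cluster log output.
sample = [
    "INFO  [RCM] Attempting to recycle all monitors that have loaded the DLL hadrres.dll",
    "INFO  some unrelated line",
]
for number, dll, text in find_recycle_events(sample):
    print(f"line {number}: {dll} was recycled -> {text}")
```

Whatever timestamp sits at the front of the matching line is your starting point for working out when the patching happened.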