Cluster Resource DLL Connectivity In Availability Groups

I was recently asked a question, while fixing a cluster connectivity issue, about how the cluster resource dll knows on which port to connect to SQL Server. It’s not a very exciting answer, which is, “It doesn’t.”. After some days passed I was thinking about the question and answer as though the response I had given is correct, it doesn’t seem to make much sense as to how this whole thing works.

Look At The Registry

There is a large amount of connective glue between the cluster registry and availability groups (doing anything in the registry, let alone the cluster registry, by hand, pretty much voids the warranty – so please don’t mess with it) which can make the whole process a bit more complicated than expected. When it comes to availability groups, each availability group has a map between the group, the replica, and the instance of SQL Server. We’re concerned about the mapping between the replica and the instance, specifically when it comes to connectivity. This is done by the registry key “SqlInstToNodeMap” which lives under the availability group resource. This holds a set of values which are the name of the instance and the name of the node or computer, such as below.

Here you can see that there are two replicas, the first being SQL22C2AGN1 and the second being SQL22C2AGN2. The first one, SQL22C2AGN1, is a default instance and the second, SQL22C2AGN2, is a names instance of “INST2”. You’ll notice there are no port numbers or any other pieces of information in here, so how do we know how to connect to the instances, especially given the second is a named instance? That’s the job for client connectivity…

Client Connectivity Configuration

This is the part where, when setting up an instance, you choose things like enabling SharedMemory, NamedPipes, TCP/IP, or not. The configured ways which are available for clients to connect may change the ability to connect! Since the resource dll is only ever run local to the replica upon which is primary, this means that it’ll always be a local connection. This is where client protocols such as SharedMemory (which is only local) or NamedPipes (can be networked) come into play which makes it very easy an no need to worry about things such as port number lookups.

The Actual Connection

The connection string used is the same across all versions except for the driver. When it comes to SQL Server 2019 and below, this will use the Native Client 11 driver, SQL Server 2022 (currently the latest released version) uses ODBC 17. Here is (without the driver) the connection string that is used, which can be found by a connectivity trace, `SERVER=<ServerName>;Trusted_Connection=yes;DATABASE=master;` A few things you’ll notice: it’s going to use the account which is used to execute this code (which can be different at various levels such as thread and process) and that this account needs access to the master database. This is a big problem I end up seeing and fixing, failure to have a login for the SYSTEM account, let alone any permissions.

When it comes to the SERVER part of the connection string, the typical SQL Server instance name is used. For example, I have SQL22C2AGN1 and SQL22C2AGN2\INST2 and that is exactly what would be used in the connection string for SERVER. There is no protocol in the connection string specified and thus, no port number or specific pipe.

Bringing It All Together

The cluster registry is used to find the replica which is now the local replica by looking at the values and seeing which matches the local server name. Once this is found, that name is then used as the servername in the connection string, and the connection is attempted like any other typical connection. Browser might be needed/invoked and that’s about the only thing that I tend to see hitting people in the face. It just works, it’s hard to mess up.

Where I’ve watched this cause a great number of problems are on “hardened” servers, and I use that in quotes on purpose.

Security Through Obscurity and Asininity

I have a particularly staunch viewpoint on security through obscurity and hardening, especially when it comes to connectivity. Playing dumb games wins you dumb prizes. Doing things like only allowing TCP/IP as a protocol and then changing the default instance’s port from 1433 to something non-standard… because.. SeKuRiTaH… is a great way to mess all of this up. I haven’t had anyone do this yet, and it’s not a challenge I’m putting out there, but if you change the port on a default instance and also force just TCP/IP as a protocol, your AG will absolutely fail to come online. Are there ways around this? Yes, but the best thing is to not shoot yourself in the foot for no reason. Leave the default ports.

Hardening… which is a misnomer in my opinion… is the other area where I do see quite a few issues. I’m all for having the least privileges needed, but there are certain configurations which just need to be left alone. This is one of them. Removing logins for required accounts such as system are going to stop you dead in your tracks. There’s no way around that, that’s needed. So, when the team requiring this comes to you, point them to the documentation. If they don’t accept that, sheer force of will is not going to change how the system works, either change the configuration to allow things to work or find a different system to use.