Read Scale Availability Group – Failed to update Replica status due to exception 35222.

Read Scale Availability Groups can be pretty useful in the right places and for the right things, and they were the latest feature update for AGs until Contained AGs came along in 2022. Read Scale AGs don’t integrate with clustering of any type; they behave somewhat like mirroring used to, where there is no real coordination of resources and it is up to the administrators to make the proper judgment calls or automate whatever scenarios they deem important.

Since read scale doesn’t use any underlying clustering, it relies on its own mechanism, which again came after the initial cluster-integrated feature. It’s not uncommon for updates to features (especially given the scenario coverage) to miss one or two things here and there; no test coverage is perfect and no person is infallible. This brings us to quite the oddity I’ve witnessed with read scale availability groups (it can happen with cluster-integrated AGs as well, but for different reasons which should be investigated), which surfaces as a message in the errorlog:

Failed to update Replica status due to exception 35222.

That’s ominous. If we take a look at error 35222, according to SQL Server it means:

Could not process the operation. Always On Availability Groups does not have permissions to access the Windows Server Failover Clustering (WSFC) cluster. Disable and re-enable Always On Availability Groups by using the SQL Server Configuration Manager. Then, restart the SQL Server service, and retry the currently operation. For information about how to enable and disable Always On Availability Groups, see SQL Server Books Online.

Ok, sure, except we aren’t using a WSFC here. If there is code attempting to open a handle of some sort to the cluster for read scale availability groups, it’s always going to fail, as there is no cluster… on purpose. Is this benign?
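
Both points are easy to sanity check on the instance itself. Here’s a quick sketch using the standard catalog views (sys.messages for the error text, sys.availability_groups for the AG); for a read scale AG the cluster_type_desc comes back as NONE:

-- Look up the text SQL Server ships for error 35222
SELECT message_id, severity, text
FROM sys.messages
WHERE message_id = 35222
  AND language_id = 1033;

-- Confirm the availability group really is read scale (no cluster behind it)
SELECT name, group_id, cluster_type_desc
FROM sys.availability_groups;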

Grabbing the stack when the issue occurs shows the following, which has HadrWsfcUtil::OpenClusterResourceKey at the top, so it’s definitely trying to interact… or at least fake interact… with the cluster.

14 sqlmin!HadrWsfcUtil::OpenClusterResourceKey
15 sqlmin!HadrReplicaOnlineManager::ReplicaOnlineUpdateClusterReg
16 sqlmin!HadrReplicaOnlineManager::HandleLogScanStateChange
17 sqlmin!HadrWorkItem::Execute
18 sqlmin!HadrWorkRoutine
19 sqldk!SOS_Task::Param::Execute
1a sqldk!SOS_Scheduler::RunTask
1b sqldk!SOS_Scheduler::ProcessTasks
1c sqldk!SchedulerManager::WorkerEntryPoint
1d sqldk!SystemThread::RunWorker
1e sqldk!SystemThreadDispatcher::ProcessWorker
1f sqldk!SchedulerManager::ThreadEntryPoint
20 kernel32!BaseThreadInitThunk
21 ntdll!RtlUserThreadStart
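
For the curious, one way to grab a stack like this without attaching a debugger is an Extended Events session with the callstack action on error_reported, filtered to error 35222. This is only a sketch: the session name is made up, and turning the raw frames into readable symbols generally requires trace flag 3656 plus public symbols.

-- Capture the callstack whenever error 35222 is reported
CREATE EVENT SESSION [error_35222_callstack] ON SERVER
ADD EVENT sqlserver.error_reported
(
    ACTION (package0.callstack)
    WHERE ([error_number] = (35222))
)
ADD TARGET package0.event_file (SET filename = N'error_35222_callstack');
GO

ALTER EVENT SESSION [error_35222_callstack] ON SERVER STATE = START;
GO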

If we disassemble the function (public symbols only I’m afraid) and try to see what it’s doing, we can see a few calls to potentially problematic functions:

GetClusterResourceByIdOrName
OpenClusterResourceKey

If we take a look at a read scale availability group in the DMVs, we can see that the resource_id is the same GUID as the group_id, even though resource_id is normally the WSFC cluster resource GUID for that specific AG. We don’t have a cluster, as previously mentioned, but if you look at other DMVs you’ll find that read scale was made to “look” like all of the cluster items are filled in.
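
A quick way to see that for yourself (just a sketch, nothing environment specific) is to put the columns side by side; for a read scale AG the two GUIDs come back identical:

-- For a read scale AG, resource_id mirrors group_id even though no cluster exists
SELECT name,
       group_id,
       resource_id,
       resource_group_id,
       cluster_type_desc
FROM sys.availability_groups
WHERE cluster_type_desc = N'NONE';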

It seems like this is just a simple case of not checking whether the availability group is read scale before attempting to update something in the cluster. It’s benign and doesn’t cause any issues that I could find, aside from writing frightening entries in the errorlog.
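
If you want to see how often it’s being logged on a given instance, searching the errorlog for the message text is enough. xp_readerrorlog is undocumented but is the usual tool for the job; the parameters below are log number (0 = current), log type (1 = SQL Server errorlog), and a search string:

-- Search the current errorlog for the 35222 message
EXEC xp_readerrorlog 0, 1, N'Failed to update Replica status';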