If you’re using availability groups with read-scale or Linux replicas (cluster type = NONE or EXTERNAL), you might want to watch the number of databases placed in a single availability group. There appears to be an issue where a very large number (~200) of databases in an AG works without issue; going higher, however, may result in constant asserts, causing memory dumps to be taken.
If you’re getting close to around 200 databases, the workaround is to create another availability group and put any additional databases in it. Rinse and repeat.
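As a rough sketch of that workaround, the T-SQL below moves one database out of a crowded AG into a new one. All names ([AG1], [AG2], [db201], SQLNODE1, port 5022) are hypothetical, only the primary replica is shown, and secondary replicas would still need to be added and joined separately:

```sql
-- Run on the primary replica. Names below are examples only.
-- 1. Take the database out of the AG that is nearing ~200 databases.
ALTER AVAILABILITY GROUP [AG1] REMOVE DATABASE [db201];

-- 2. Create a new AG of the same cluster type and seed the database into it.
CREATE AVAILABILITY GROUP [AG2]
    WITH (CLUSTER_TYPE = NONE)          -- use EXTERNAL for Pacemaker-managed Linux AGs
    FOR DATABASE [db201]
    REPLICA ON N'SQLNODE1' WITH (
        ENDPOINT_URL        = N'tcp://SQLNODE1:5022',
        AVAILABILITY_MODE   = ASYNCHRONOUS_COMMIT,
        FAILOVER_MODE       = MANUAL,
        SEEDING_MODE        = AUTOMATIC
    );
```

Removing a database from an AG leaves the former secondary copies in a restoring state, so with SEEDING_MODE = AUTOMATIC the new AG can reseed them once the secondaries join.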
What’s The Issue
Without getting too far into the depths of UCS and AGs: in Windows Clustering, the cluster database (registry) is used to store AG metadata in a non-human-readable format. Since clustering takes care of updating the cluster database on each node, this happens seamlessly for SQL Server. Each node then has a proper (hopefully) copy of the cluster database, so failovers and other operations that require checking the metadata of an AG work appropriately.
Read-scale and Linux AGs don’t have a cluster database, so both need some other method to keep the AG metadata up to date on each node. This is done via messages between the nodes on the UCS transport (like every other message used for AlwaysOn). Each database takes a certain amount of space in the in-memory AG metadata; when this metadata grows too large, it starts to hit limits of UCS functionality as it currently exists. Note that WSFC-based AGs do not have this problem – though there is a technical limit to the size of a registry key, it is highly unlikely that it would be hit before some other resource exhaustion takes place, such as running out of worker threads.
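Since the trigger is per-AG database count, a quick catalog query can show how close each AG is to the ~200 range described above. This is a sketch using the standard `sys.availability_groups` and `sys.availability_databases_cluster` catalog views:

```sql
-- Count databases per availability group, largest first.
SELECT  ag.name                AS ag_name,
        ag.cluster_type_desc   AS cluster_type,     -- NONE / EXTERNAL / WSFC
        COUNT(adc.database_name) AS database_count
FROM    sys.availability_groups AS ag
LEFT JOIN sys.availability_databases_cluster AS adc
        ON adc.group_id = ag.group_id
GROUP BY ag.name, ag.cluster_type_desc
ORDER BY database_count DESC;
```

Only AGs with `cluster_type_desc` of NONE or EXTERNAL are affected by the behavior described here; WSFC AGs can be ignored in the results.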
Sample Dump Comment
Expression: m_pcscBoxcar->GetMessageCount () > 0