Well That Escalated Quickly…
Error 35217 translates to: “The thread pool for AlwaysOn Availability Groups was unable to start a new worker thread because there are not enough available worker threads. This may degrade AlwaysOn Availability Groups performance. Use the ‘max worker threads’ configuration option to increase number of allowable threads.” That’s fairly fatal, but I’m not a huge fan of the closing text: “Use the ‘max worker threads’ configuration option to increase number of allowable threads.” It’s basically saying, “Hey, just up the worker threads and you’ll be fine. Trust me, I’m the product and everything will be A-OK.” Except it won’t be.
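Before touching any configuration, it helps to see how close the instance actually is to the ceiling. A minimal check against the scheduler DMVs (run on the affected instance; `max_workers_count` reflects the effective “max worker threads” setting, whether defaulted or configured):

```sql
-- Current worker threads vs. the configured maximum.
-- On a 64-bit instance with 4 to 64 logical CPUs, the calculated default
-- is 512 + (CPUs - 4) * 16.
SELECT
    (SELECT SUM(current_workers_count)
       FROM sys.dm_os_schedulers
      WHERE status = 'VISIBLE ONLINE')                 AS current_workers,
    (SELECT max_workers_count
       FROM sys.dm_os_sys_info)                        AS max_workers;
```

If `current_workers` is sitting near `max_workers`, the error isn’t a fluke; something is pinning threads, which is what the rest of this post is about.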
Let’s Have An Awkward Conversation.
You’re probably wondering why you couldn’t spawn a new thread, why this error happened, and why you shouldn’t just up the max worker threads, and you’re probably also thinking this must be some kind of “bug” in SQL Server. So here’s where our awkward conversation starts… It’s you. Contrary to every relationship breakup you’ve ever had, it’s definitely you. I’m not saying this to be mean but to really drive the point home. The major reasons this occurs are large blocking chains, too much workload for the server size (databases, users, etc.), and/or virtual infrastructure that sucks. There aren’t many ways to get yourself into this situation, and while what I put forth here isn’t exhaustive of every edge case and scenario, it covers the vast majority of the occurrences in the wild that I’ve either worked on or been involved in at some level. Side Note: If you’ve read this far, are shaking your head, calling me names an irate sailor might utter, and telling yourself that the fix is to up the max worker threads as the product error suggests and that Microsoft should fix their bugs, then you can stop reading here, as you’re probably not open to learning why you have issues in your environment.
Taking A Hard Look In The Mirror.
Let’s step through the major reasons this happens in a bit more detail and talk about how we can stop this from happening.
Blocking – Generally large swaths of it
Blocking causes all kinds of issues for various reasons. In this case, the reason is simple and straightforward: each task is tied to a worker, each worker uses a thread, and there are only so many threads SQL Server will use. I completely recommend keeping the default configuration, as it’s a fair trade-off between different hardware resources. When you have blocking, those tasks aren’t making progress, but they’re still eating up a thread. So now you have a blocking chain 100 threads long, just sitting there and getting worse, to the point where the system tasks, HA tasks, and every other item on the server together hit the maximum number of configured workers. When the next task comes in to SQL Server, it won’t be able to create a new worker thread, and you’ll receive this error. In general, this also ends up producing a deadlocked scheduler dump associated with the problematic time (but not always). The fix is simple: monitor your systems and look at why you’re having these large blocking chains. In the real world I’ve witnessed blocking chains 900+ threads long, which immediately shows you how little people are looking at their servers. Fixing this might require schema changes such as adding indexes, changing stored procedures, or other application-based logic. Typical solutions to fixing blocking apply.
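A sketch of the kind of monitoring query that exposes these chains, using the standard request DMVs (every row here is a worker thread parked behind another session):

```sql
-- Requests currently blocked by another session, worst waits first.
-- A long result set (or deep chains of blocking_session_id) means
-- workers are being consumed without making progress.
SELECT r.session_id,
       r.blocking_session_id,
       r.wait_type,
       r.wait_time,          -- milliseconds spent waiting so far
       t.text AS sql_text
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.blocking_session_id <> 0
ORDER BY r.wait_time DESC;
```

Walking `blocking_session_id` back to a session that isn’t itself blocked identifies the head blocker, which is where the actual fix (index, procedure change, application logic) belongs.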
Workload – Specifically too much work for the hardware (or virtualization) apportioned
This will manifest not in large blocking chains such as those discussed above (though they can’t be ruled out), but in general sluggishness until the error occurs. A quick look at the server and its setup can indicate whether this applies to your situation. When I talk to people about their infrastructure, the default response is generally, “well, most of those databases aren’t used often”… until they are, which is why this can be intermittent and seemingly random based on the usage patterns of the users. Just compare the number of databases on the server (especially if they’re in an AG) with the number of CPUs: if there are more than 5 databases per CPU before even adding in peak user load, it’s already a good bet the server is CPU starved. That isn’t taking into account the most likely terrible virtualized setup you’re using (I’m looking at you, VM configured with 7 NUMA nodes and 2 CPUs per node). Each of these databases is going to need multiple threads to run in an availability group, let alone the users running queries if it’s a primary or readable secondary. You can have only a few users on the system and still hit this error; for example, a server with 4 CPUs and 160 databases in an availability group is just too much for the resources given, without even counting user load. The mitigation is to either add hardware or spread the load. And when I say spread the load, I’m not talking about changing which servers are primary in the AG; I’m talking about moving databases off that cluster with the low-resource servers to a different cluster that isn’t overloaded.
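The databases-per-CPU ratio above is easy to check. A minimal sketch (the 5-per-CPU threshold is the rule of thumb from this post, not a documented product limit):

```sql
-- AG databases hosted on this replica vs. logical CPUs available.
SELECT
    (SELECT COUNT(*)
       FROM sys.dm_hadr_database_replica_states
      WHERE is_local = 1)                   AS local_ag_databases,
    (SELECT cpu_count
       FROM sys.dm_os_sys_info)             AS logical_cpus;
```

A 4-CPU server reporting 160 AG databases fails this sanity check before a single user query is counted.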
Virtualization – Things like crappy hosts, overloaded hosts, etc.
This is extremely hard to diagnose from inside the virtual machine, but it will generally manifest in the typical ways: no blocking, but tasks taking a long time; memory pressure inside the VM (dynamic memory in play, ballooning, page file usage, etc.); dispatch queues building up in the schedulers; long IO times; and general sluggishness of the server even under little or no load. Sadly, there isn’t anything SQL Server can do about your virtualization being crap. Please note that a hyperthreaded “logical” CPU is not a full-fledged compute core and shouldn’t be counted as such.
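From inside the guest, the scheduler DMV at least shows the symptoms described above. A sketch of what to look at:

```sql
-- Queued work on each scheduler: persistently high runnable_tasks_count
-- means tasks are ready but can't get CPU time, a classic sign the
-- host isn't delivering the compute the VM thinks it has.
SELECT scheduler_id,
       current_tasks_count,
       runnable_tasks_count,   -- tasks waiting for CPU
       work_queue_count        -- pending tasks with no worker yet
FROM sys.dm_os_schedulers
WHERE status = 'VISIBLE ONLINE';

-- Confirms whether SQL Server even sees itself as virtualized.
SELECT virtual_machine_type_desc
FROM sys.dm_os_sys_info;
```

None of this fixes an oversubscribed host, but it gives you numbers to take to whoever owns the hypervisor.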
What If I Just Up Max Worker Threads Like The Error Said?
Well, you’ll most likely transition from blocking to deadlocked scheduler dumps, lease timeouts because you have more threads than the hardware can handle, health check timeouts because the health checks (SQL cluster DLL) can’t complete, Windows Clustering issues and health check errors, and a whole bunch more that will manifest in various ways. I can hear you now, because I’ve had people say this to me: “But Sean, I upped it by another 500 worker threads and we haven’t had any issues!” Sure, but you didn’t fix the problem; you just masked the symptom. If all you want to do is mask symptoms, I’m sure it’d pay better to become a doctor who just prescribes pills. If you want to fix the actual issue, then forget about changing the max worker threads value and look for the root cause, the main ones being outlined above.
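For completeness, checking (not raising) the currently configured value looks like this; it assumes you have permission to view advanced options:

```sql
-- Read the setting only. A run_value of 0 means "use the calculated
-- default", which is the recommendation throughout this post.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max worker threads';
```

If that reports anything other than 0, odds are someone already took the error message’s advice and masked a symptom that is still waiting to be diagnosed.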