One of the most common items that will cause a memory dump in SQL Server is a non-yielding scheduler (generally referred to as NYS). What the heck does that mean? Why would it cause a memory dump? Is there anything that can be investigated? Good questions, let’s take a look.
Stop, Yield, Blinking Yellow…
To understand what non-yielding means, we must understand what a yield is and why it occurs. I’m not going to deep dive the SQL OS scheduler and integration with Windows and signaling mechanisms, but I am going to discuss a few key items.
SQL Server uses cooperative scheduling (which could also be called non-preemptive scheduling), Windows (client and server) uses preemptive scheduling (fun fact, Windows used to use cooperative scheduling but changed to preemptive in Windows 95/NT3.51). It’s easier to understand preemptive scheduling first, which much as its name suggests allows for some work item to preempt another work item – it’s a way of saying that anyone can just take over and run at any time for any reason and are not guaranteed to run to completion or a “good” stopping point. In a preemptive design, any running task can be changed out at any point in time and may or may not be scheduled again, the task has no control over if it does or doesn’t run. Obviously, this is a great simplification of an otherwise extremely hard problem space, but it’s accurate. If preemptive means the task has no control over running or not running, then cooperative means all the tasks (I’m using tasks as a generic term and does not refer to a SQL Server Task object) must cooperate and coordinate with each other to share the execution time. In the cooperative model, each task decides when it is at a good point to let another task run and to stop monopolizing the available time and processing power which is called, yielding, as in they are yielding control over to the next task. This means that in cooperative scheduling, all our tasks need to be good citizens! A bad actor in there will muck up the whole system, it’s a delicate process.
To Yield or Not to Yield
Now that you know a in a cooperative scheduling environment such as SQL Server, the task needs to be nice to others by yielding intermittently so that others may run. Let me preempt (ha!) some questions and let’s talk about yielding.
The responsibility of the task to voluntarily yield itself is really on the logic (and sort of by the transitive property) the programmer who made it. Some tasks might be trivial (more on this later), some tasks are behemoths or are collections of other tasks, which means it generally depends on what you’re doing in the code to know if you’re in a good place or not. Some good ideas on yielding might be if there is a very long list, it might make sense to yield every so many traversals or operations, maybe after so much time has elapsed (though the time it would take to see your current time, calculate, and then yield is most likely more overhead than would be expected). Please note that while SQL Server uses cooperative scheduling, Windows still used preemptive scheduling (Linux does as well and not to get pedantic but there’s that whole Linux abstraction layer thing) which means at the end of the day, the OS still chooses which threads to run, though SQL Server can present only a few threads to run instead of potentially thousands, allowing for more orderly work and to have tasks that do have things to do (as opposed to those waiting for something) to potentially run.
Speaking of expectations, assumptions and expectations are part of the yielding expectations. Let’s take a “trivial” example with obtaining the private key of a certificate for some type of encryption, for S&G let’s say it’s Transparent Data encryption. Seems like an operation that should be fast, right? Go to either memory or disk, decrypt the key, give it to the application buffer. That’s an assumption that the key is local. What if the key lives in an HSM? Sure, now we have another few hops to get to the HSM. What if the HSM didn’t live close to the server (such as in the same datacenter)? Now we add in a few more hops, let’s say it’s in Azure Key Vault. Now we’re potentially traversing the internet, with networking. What if any point along that line has a failure, maybe a router somewhere in the aether goes down, or a switch port fails in a datacenter, it’ll take some time to reroute the traffic. Now the traffic is rerouted and TCP has already resent the packets a few times, it’s been how many *seconds* of elapsed time? What one might assume is a trivial operation could easily become a complicated mess of long wait times. If you’d like a less complicated closer to home example, have you ever had a message about IO taking longer than 15 seconds in your errorlog?
The above makes it sound like cooperative scheduling really sucks. In same cases, it does, mostly wherever it needs to interface with anything outside of its application boundary. In the example above, a Windows API call is outside of the boundary. To take care of this, these calls are generally transitioned from cooperative scheduling to preemptive scheduling inside of SQL Server. This isn’t free and comes with a lot of overhead, not to mention that other tasks might be waiting for this preemptive task to complete because they are dependent upon the data that is returned. In the TDE example, we can’t encrypt/decrypt the data or log data until the key is returned which would effectively grind SQL Server to a halt.
Because the item already running might not be well behaved, there is a background task responsible for checking the cooperative schedulers which ensures that they are proceeding forward and not stuck on said bad actor. Luckily, many items inside of SQL Server have straightforward names (unlike typical Microsoft marketing names – “Microsoft Windows Azure Skype For Enterprise Business With ML and AI In Containers On Hyper-Scale Big Data Compute Cloud”), in this case it’s called the “Scheduler Monitor”. The name might be familiar as it’s the task that also inserts information in the default health ring buffer inside of SQL Server. If you have a Non-Yielding Scheduler dump, the call stack would be akin to the stack referenced in the “Raiding the Evidence Locker” heading below – which makes it trivial to search via stacks to understand the dump reason.
Now that we’re done with the super dry backstory of cooperative scheduling and yielding, why does the non-yielding scheduler dump occur? There are a few reasons, but the main ones are that the dump is helpful in troubleshooting potential product issues and it could help (but not always) in root cause analysis for the long running task if it’s not a product issue. One example of a product issue might be that a list which holds some information (doesn’t matter what, really) is assumed to always be small, less than 100 items. Traversing said list should be relatively quick, it’s only 100 items… the assumption that the list will always be small may be flawed – especially in a product that has been actively developed for almost 30 years. Maybe some new feature changes the way some upstream item works, now that list could grow to 10 million items, that’s no longer trivially fast. Most of the time, NYS issues aren’t product issues (unless we assume any issue such as calling even a simple WriteFile Windows API is a “product issue” then really, you’re just a terrible person who blames everything on others) and have something to do with the environment in which SQL Server is running. Again, I said _most_ of the time.
The scheduler monitor watches, as discussed, for positive forward progress and monopolization of time on the schedulers. There is various logic to detect different circumstances (not all may be NYS), however if the task has not voluntarily yielded itself in a specific amount of time (generally 70 seconds of wall-clock time though that could change) the scheduler monitor jumps into action. There are various items that are completed as part of this process, however the result has two things we’re interested in. The first is an all threads minidump file, the second is a copy of the currently executing stack for the scheduler that was or is still non-yielding. There is one more extremely helpful piece of information which is written to the SQL Serve Errorlog.
Raiding the Evidence Locker
We have our two main sources of information:
- The minidump file itself
- The Errorlog file with more information
It’s possible to open the minidump up in your favorite debugger, but most people use windbg or WinDBGX, and point to the public Microsoft Symbol Server (V:\Symbols_Public is a local cache directory):
Once the dump is open, we can double check to see if it’s a non-yielding scheduler (though the dump comments should call this out) by finding the thread that dumped. SQLDumpr is the executable used to do the dumping, we can search the stacks for dump, dumping, sqldumper, etc., but if you list all the threads and look through, you should also find it. Below is an example of finding the dumping thread:
> !findstack sqllang!SQLDumperLibraryInvoke
Thread 030, 1 frame(s) match
Great, so debugger thread 0n30 has our dumping stack, let’s check it out:
Yep, a NYS dump. Now what? How do we get any more information with public symbols? This is where the errorlog comes in! Looking at the errorlog, you can find the dump by looking for the tell-tale dump signature:
Server * *******************************************************************************
Server * BEGIN STACK DUMP:
Server * 01/01/01 10:10:10 spid 1234
Server * Non-yielding Scheduler
Server * *******************************************************************************
Right below the dump header above, you’ll see something akin to the following:
Process 0:0:0 (0x22c0) Worker 0x0000000006A32120 appears to be non-yielding on Scheduler 8. Thread creation time: 17253840547306. Approx Thread CPU Used: kernel 62775 ms, user 3475 ms. Process Utilization 19%. System Idle 58%. Interval: 71590 ms.
Let’s break this down, quickly. The first part (0x22c0) is the Windows thread id of the thread that was non-yielding. Then we have the amount of kernel time (OS side) recently used by the thread, which is 62.7 seconds, followed by the user time of 3.5 seconds, and finally the wall-clock time of 71.5 seconds.
This helps us in a few ways:
- The total wall-clock time was ~71 seconds, so the thread ran for at least that long without yielding.
- The amount of user mode time was ~4 seconds, not much out of the total 71 seconds. Some user mode code has run, though most application time should be consumed in user mode.
- The amount of kernel mode time was ~63 seconds, which represents most of the overall time. This means calls into the kernel were taking time.
- We know the thread id of the offending thread.
Now that we know we’re most likely looking for kernel mode transitions (there are two stacks for threads, user mode and kernel mode – the kernel mode side cannot be viewed from user mode). Let’s switch to the offending thread and see what it looks like:
The above stack shows what was running _AS_ the dump was taken which shows we’re transitioning or transitioned into kernel code – in this case, NtWriteFile, which as you’ve guessed, is used to write to files. This was invoked by LazyWriter, since not much time is being spent in user mode code it’s unlikely the issue is with lazywriter itself. It’s more likely that the calls to NtWriteFile (by extension of WriteFile) are taking so long that a single or a few calls into the WriteFile windows API causes it to take 70+ seconds!
Finally, there is one more piece of information that we can glean from this, which isn’t intuitive at all. If you search the public symbols for items related to the stack (because apparently some of us have nothing better to do) then it’s possible to find what appears to be a global variable that might hold some useful data:
<lots of entries>
<lots more entries>
This holds various information, but as this post is almost the size of a book as it is, we’ll cut to the chase. I previously stated that some information was saved before taking the dump, one of those is the stack of the offending thread before the dump is taken (movement can occur between the time the dump is invoked and the time the thread is frozen). We can set the context by doing the following:
.cxr sqlmin!g_copiedStackInfo+0x20; kc
We’ll get the stack as it was before the dump occurred – this may or may not be the same! This can give some extra insight into what was happening before the dump.
You can see that the stacks do not match, but are still in LazyWriter and both have kernel mode items at the top of the stack. This coincides with the data from the Errorlog which stated that most of the time was spent in kernel mode. Not to repeat myself too often, but, since there we only have public symbols available it’s not possible to dig down too far (pedantically we can get some information based on the registers and knowing the Windows API that is called [we have both!])
Just to sum this up (don’t want to leave anyone hanging), if a dump with data similar to this had occurred in your environment, then checking the system in terms of the kernel and executive, along with what other applications are doing on the server would be the next step. Based on the data above this doesn’t seem to be a SQL Server specific issue. You’ll notice that a user mode memory dump is just a snapshot in time, it doesn’t hold what other applications are doing, the state of the system, any performance counters, etc., so set your expectations accordingly that memory dumps aren’t the end all be all of root cause analysis and generally won’t have all the data needed.