If you haven’t already read up on cooperative and preemptive scheduling or aren’t sure what those are, please read the intro to that first, otherwise you’ll be lost.
Much as I’ve discussed before, SQL Server uses a cooperative scheduling model. This doesn’t mean that Windows does, nor does it mean Windows will scheduler whatever SQL Server schedules, in fact much of the time there are many other threads that run before the ones from SQL Server, that’s the job of the operating system to figure out. Due to SQL Server using cooperative scheduling there needs to be a mechanism that exists such that when a resource not under SQL Server’s control needs interaction the scheduler can keep on scheduling and threads can switch in and out (in SQL Server, Windows does what Windows wants). Enter preemptive status and associated waits.
Normally when we think of preemptive things in SQL Server, the first thing that comes to mind are waits. This is natural as we’ve been told to tune/optimize based off of wait stats, and that’s true and helpful in almost all cases (some are red herrings most but not all of the time and there isn’t a hard and fast rule to many things). Many people are familiar with some of these preemptive waits: PREEMPTIVE_OLEDBOPS, PREEMPTIVE_OS_BACKUPREAD, PREEMPTIVE_OS_CREATEFILE, PREEMPTIVE_OS_CRYPTOPS, PREEMPTIVE_OS_PIPEOPS, XTP_PREEMPTIVE_TASK, these tend to be fairly common and can be _read_ in different ways. The general way is the first part describes the wait – in this case, preemptive – if there are only two parts such as PREEMPTIVE_OLEDBPOPS then the second part of the wait name describes what it is waiting on. If, however, there are three parts then the middle part is where the wait is waiting and in pretty much every case that’s going to be OS, which many time the text after OS is the API call (but not always such as BACKUPREAD or CRYPTOPS) such as CREATEFILE.
While we’re interested in all of these waits, what’s the point? Why do we even care about these if SQL Server isn’t directly responsible for them? Why can they cause performance issues with the server and instance? Great questions my astute readers.
When there is a wait, SQL Server goes off and attempts to do other things (the OS will do this too, even if the thread wasn’t swapped by SQL as OS’s are pretty smart these days with scheduling – minus some big.LITTLE architectural stuff). The problem is that the thread (task in SQL) is still running, it wants to do something… for example create a file on an SMB share. It doesn’t care where that share exists, it doesn’t care anything about the underlying infrastructure – such as being over a leased line at 10 Mbps 10k miles away with a latency of _seconds_ – SQL knows that it doesn’t have control over this specific item and thus needs a way to continue doing work, you didn’t pay thousands of dollars for enterprise edition to watch it wait for an SMB share to respond, did you? Where’s the rub, then? Well, SQL Server is cooperative, but since it doesn’t have control over what that outside code is doing or when it’ll get done (if it even does, it may never!) it can’t really switch to another thread (task). In comes preemptive threads and tasks.
If you’re not familiar with the SQL Server execution model (not queries, but infrastructure) then the 10 second 1-million-foot view is SQL Server has OS threads that do work. Those threads have a SQL Server Worker structure that keeps track of certain data points and handle tasks. Tasks are a structure that actually execute whatever the user or system asks it to in addition of tracking different data point, such as how much time it used, waits, etc. These workers are bound to a SQL Server scheduler and (although small incremental changes have occurred) are stuck to said scheduler for their lifetime. This means work does not automatically balance across schedulers.
NOT QUITE 99 PROBLEMS
In order for SQL Server to move on, it should (but isn’t REQUIRED) to make the task preemptive. This is entirely dependent upon the developer to implement and there have been changes to this throughout the years, not every place that waits on something outside of SQL Server gets a preemptive wait nor switches. The changes are generally to add preemptive switches so that a non-yielding scheduler does not occur. Great, so now we can mark a task preemptive, but what does that even do? Since SQL Server is cooperative it needs to mark the task preemptive, then it needs to do some internal state changes, and finally can switch in another thread. Now you’re probably sitting there going, “ok, that doesn’t sound bad, I don’t see a problem with this.” and you’d be right for the most part. The first part of the issue is that switching between preemptive and non-preemptive modes for the tasks aren’t free. There is more code that needs to run, and it also means that if the task completes right away, then it needs to wait to get back on the scheduler, so this puts it at an immediate performance disadvantage.
The second part of the issue is that preemptive tasks can live for a long time. This might not sound so bad, and it generally isn’t, unless these start to pile up. SQL Server is set to spin up so many OS threads for use on tasks – this is the max worker threads setting – and it pretty much never needs to be touched, so if you’ve touched it, you’re most likely already doing it wrong. Each of these will end up taking an OS thread for the lifetime of the task. A few of these shouldn’t be an issue, especially on larger servers, however it seems to be fairly typical to undersize the hell out of servers these days, mostly due to the use of cloud where everything is insanely expensive (but hey, your CTO can say they use cloud.. and probably some other word salad terms). That 4 core VM might not do well when it’s overwhelmed with many long running preemptive tasks as these are leaving the bounds of SQL Server and SQL Server is going to try to execute all the other queries and items you’re throwing at it. Imagine, as a trivial example, that you decide to setup some tasks that run with xp_cmdshell, and that task happens to be running a process that ends up eating 5% cpu time each second. If you run one of those, you might not feel it, but by the 10th one running, half of all your cpu is gone. Just because items go preemptive doesn’t mean they are free. Additionally, this eats up 10 threads, and if you only had a few hundred to begin with… after the system tasks, and basic infrastructure needs, you might end up running out of max worker threads, and then you’re really in a pickle.
The third part of the issue is that you’re giving the OS more items to schedule. Remember that the point of the SQL Server scheduler was to make a single work item available to the OS (thread) available to schedule so that different work items inside of SQL Server get roughly similar time, that no one item is starved (unless you do dumb things with resource governor), that things which can’t make forward progress (such as waiting on a lock) don’t get scheduled as a waste of time, etc., and now you’re giving the OS not only all the preemptive threads to schedule (and vie for cpu resources) but also all of the items on the schedulers for SQL Server after it moved on from the preemptive task. This means more context switching, less time on each thread, more waits on the OS to schedule, and no deterministic solution to the scheduling from the OS level. Might just make some of those queries start taking longer as they wait for cpu… just saying.
One of the easiest ways to understand how things work is with extended events (XE). They aren’t very user friendly, they aren’t the easiest to work with, but they are pretty powerful in helping one dive into the inner workings of SQL Server. Take a quick look at the catalog items for events that are interesting with preemptive items, you’ll find “wait_info_external” which has the descriptive text of, “Occurs when a SQLOS task switches to preemptive mode to execute potentially long running operations.” which seems like a good place to start. Grabbing the callstack and the loaded module information for symbol resolution (there have been changes to add module and offset so that loaded modules isn’t required but these aren’t in all versions or live in all CUs yet) is a must. Let’s take a look at a few items that this captures.
-- 1 -- sqldk!XeSosPkg::wait_info_external::Publish sqldk!SOS_Task::PushWait sqldk!MemoryClerkInternal::AllocatePagesWithFailureMode sqldk!MemoryClerkInternal::AllocatePages sqldk!TVarPageMgr<0>::PbAllocate sqldk!CMemObj::Alloc sqldk!operator new sqlmin!GetSqlServrIMetadataAccess sqlmin!GetSqlServrIMetadataAccess sqllang!CSECAuthMedAccess::InitReadOnly sqllang!CreateLoginTokenFromHandle sqllang!CSecurityContext::GetLoginToken sqllang!CSecurityContext::GetSecContextToken sqllang!CSecurityContext::FIsSysadmin sqllang!ISECManager::FCheckQueryPermissions -- 2 -- sqldk!XeSosPkg::wait_info_external::Publish sqldk!SOS_Task::PushWait sqlmin!SOS_Task::AutoSwitchPreemptive::AutoSwitchPreemptive sqlmin!CTraceIoBuf::IncrementalFlush sqlmin!CTraceFileIoProvider::IncrementalFlush sqllang!CTraceController::IncrementalFlush sqllang!CTraceManagementTask::Invoke sqllang!CTraceBatchTask::Task sqldk!SOS_Task::Param::Execute sqldk!SOS_Scheduler::RunTask sqldk!SOS_Scheduler::ProcessTasks sqldk!SchedulerManager::WorkerEntryPoint sqldk!SystemThreadDispatcher::ProcessWorker sqldk!SchedulerManager::ThreadEntryPoin -- 3 -- sqldk!XeSosPkg::wait_info_external::Publish sqldk!SOS_Task::PushWait sqllang!XpCmdshellExec sqllang!SpCmdShell sqllang!CSpecProc::ExecuteSpecial sqllang!CXProc::Execute sqllang!CSQLSource::Execute sqllang!CStmtExecProc::XretLocalExec sqllang!CStmtExecProc::XretExecExecute sqllang!CXStmtExecProc::XretExecute sqllang!CMsqlExecContext::ExecuteStmts<1,1> sqllang!CMsqlExecContext::FExecute sqllang!CSQLSource::Execute sqllang!process_request sqllang!process_commands_internal sqllang!process_messages
A few interesting notes already, about the above items. First, we can see the XE items at the top of the stack, that’s expected as we asked it to grab the callstack when this wait occurs. Second, different code paths call different functions – for example some add the wait in and do other things, some switch preemptive and then push the wait. Third, a function that looks very interesting shows up in #2 – sqlmin!SOS_Task::AutoSwitchPreemptive::AutoSwitchPreemptive.
XE gave us a starting point, now we can attach a debugger and do a little more investigation. One of the first things that can help is looking at the publish symbols and seeing what we can resolve. In this case I looked for anything that had the word “preemptive” in it, which yielding a surprisingly large number of results. However, this can be whittled down and I chose a few that looked the most promising along with the one we already identified via the XE data.
sqlservr!SOS_Task::AutoSwitchPreemptive::AutoSwitchPreemptive sqllang!SOS_Task::AutoSwitchPreemptive::AutoSwitchPreemptive sqlmin!SOS_Task::SwitchToPreemptiveMode sqlmin!SOS_Task::AutoSwitchPreemptive::AutoSwitchPreemptive sqldk!ThreadScheduler::SwitchPreemptive
Notice that most of these have “SOS_TASK” as their base class. This shouldn’t be too surprising, given that we know tasks are the executable unit in SQL Server. Here is a stack which should look quite familiar to many of those that use Availability Groups with SQL Server:
00 00000064`150fbd88 00007ff8`3f471e62 sqlmin!SOS_Task::AutoSwitchPreemptive::AutoSwitchPreemptive 01 00000064`150fbd90 00007ff8`3f472251 sqlmin!HadrWsfcUtil::OpenLocalCluster+0x42 02 00000064`150fbe40 00007ff8`3f4ecada sqlmin!HadrWsfcUtil::GetClusterName+0x91 03 00000064`150fbf20 00007ff8`3ee698d1 sqlmin!HadrClusterTable::InternalGetRow+0x8a 04 00000064`150fbfd0 00007ff8`3ed05a89 sqlmin!CQScanTVFStreamNew::GetRow+0xac 05 00000064`150fc030 00007ff8`3ed15eee sqlmin!CQScanLightProfileNew::GetRow+0x19 06 00000064`150fc060 00007ff8`426307ff sqlmin!CQueryScan::GetRow+0x80 07 00000064`150fc090 00007ff8`426309dc sqllang!CXStmtQuery::ErsqExecuteQuery+0x3de 08 00000064`150fc200 00007ff8`4262b327 sqllang!CXStmtSelect::XretExecute+0x373 09 00000064`150fc2d0 00007ff8`4262c3ac sqllang!CMsqlExecContext::ExecuteStmts<1,1>+0x8f8 0a 00000064`150fce70 00007ff8`4262ba85 sqllang!CMsqlExecContext::FExecute+0x946 0b 00000064`150fde50 00007ff8`42634865 sqllang!CSQLSource::Execute+0xbc3 0c 00000064`150fe150 00007ff8`4263b3b6 sqllang!process_request+0xdf3 0d 00000064`150fe8a0 00007ff8`4263b4d3 sqllang!process_commands_internal+0x4b7 0e 00000064`150fe9d0 00007ff8`3e7ba7e3 sqllang!process_messages+0x193 0f 00000064`150feb90 00007ff8`3e7ba45f sqldk!SOS_Task::Param::Execute+0x232 10 00000064`150ff190 00007ff8`3e7ba01e sqldk!SOS_Scheduler::RunTask+0xbf 11 00000064`150ff200 00007ff8`3e7df912 sqldk!SOS_Scheduler::ProcessTasks+0x39d 12 00000064`150ff320 00007ff8`3e7df334 sqldk!SchedulerManager::WorkerEntryPoint+0x2a1 13 00000064`150ff3f0 00007ff8`3e7df74a sqldk!SystemThreadDispatcher::ProcessWorker+0x42a 14 00000064`150ff6f0 00007ff8`54f481f4 sqldk!SchedulerManager::ThreadEntryPoint+0x406 15 00000064`150ff7e0 00007ff8`577aa251 KERNEL32!BaseThreadInitThunk+0x14 16 00000064`150ff810 00000000`00000000 ntdll!RtlUserThreadStart+0x21
Looking at the above, we can see that on frame 0xf we’re executing a task, this is in turn processing a request which ends up being some type of T-SQL request on frame 0xb. This T-SQL is in turn executed which was obtaining rows from a DMV which uses the HadrClusterTable TVF (frame 0x3). This class seems to want to get information about the local cluster (frame 0x1) and finally switches to be preemptive on frame 0. The code obviously executes more, but this is where we found a preemptive switch.
Let’s deep dive into another stack and see how this all comes together…
00 00000098`aba7cdd8 00007ffc`ebba52dd sqllang!SOS_Task::AutoSwitchPreemptive::AutoSwitchPreemptive 01 00000098`aba7cde0 00007ffc`ebba5249 sqllang!CServerHeartbeat::UpdateUptimeRegKey+0x5d 02 00000098`aba7dec0 00007ffc`ebb5835a sqllang!CHeartbeatTask::TimerFunction+0x79 03 00000098`aba7df90 00007ffc`e825ad30 sqllang!PeriodicTimerTask::ProcessTskPkt+0xa 04 00000098`aba7dfc0 00007ffc`e825a924 sqlmin!TaskReqPktTimer::ExecuteTask+0xe0 05 00000098`aba7e100 00007ffc`ebb5508c sqlmin!OnDemandTaskContext::ProcessTskPkt+0x3a4 06 00000098`aba7e300 00007ffc`e825a635 sqllang!SystemTaskEntryPoint+0x381 07 00000098`aba7f2a0 00007ffc`e7cea7e3 sqlmin!OnDemandTaskContext::FuncEntryPoint+0x25 08 00000098`aba7f2d0 00007ffc`e7cea45f sqldk!SOS_Task::Param::Execute+0x232 09 00000098`aba7f8d0 00007ffc`e7cea01e sqldk!SOS_Scheduler::RunTask+0xbf 0a 00000098`aba7f940 00007ffc`e7d0f912 sqldk!SOS_Scheduler::ProcessTasks+0x39d 0b 00000098`aba7fa60 00007ffc`e7d0f334 sqldk!SchedulerManager::WorkerEntryPoint+0x2a1 0c 00000098`aba7fb30 00007ffc`e7d0f74a sqldk!SystemThreadDispatcher::ProcessWorker+0x42a 0d 00000098`aba7fe30 00007ffc`ff3181f4 sqldk!SchedulerManager::ThreadEntryPoint+0x406 0e 00000098`aba7ff20 00007ffc`ff45a251 KERNEL32!BaseThreadInitThunk+0x14 0f 00000098`aba7ff50 00000000`00000000 ntdll!RtlUserThreadStart+0x21
We can already pick up a few things from this stack. First, it’s a system task, not a user requested stack as we can see in frame 0x7. System tasks can be short or long lived, up to the point they are marked or considered “permanent” tasks as in they will always be there. I’ll leave this as an exercise, try to find some using T-SQL. Second, this is obviously going preemptive, but why? In frame 0x1 we can see that it’s executing a function called “UpdateUptimeRegKey” and from the name one would suspect it’s reaching out to update a registry key. Generally, registry updates take virtually no time at all and complete extremely fast, however there are a load of different ways one could intercept and play with how this works. Registry manipulation ends up calling kernel functions, additionally there is a plug-in framework in the kernel for registry operations (how do you think procmon and such capture all of these events?) and your favorite anti* software does indeed install itself to monitor these types of operations (among a litany of others). Thus not only are you transitioning to kernel mode, but you’re at the mercy of other software intercepting and doing who knows what before allowing (or blocking, or never returning) the operation to continue.
If we go past the initial function call and let the preemptive switch do its thing, we can then freeze the thread and take a look at SQL Server, so let’s do it! I froze the thread after the thread switches to preemptive mode, let’s see what we can find out about this. The first thing we know is the wait type, it’s going to be preemptive something, right? So let’s take a look…
SELECT session_id, start_time, status, command , database_id, user_id, connection_id, wait_type , last_wait_type, cpu_time, wait_time, scheduler_id , task_address FROM sys.dm_exec_requests WHERE session_id != @@SPID and wait_type like 'PREEMPTIVE%' or last_wait_type like 'PREEMPTIVE%'
This returns just a single result on my test instance, mostly because nothing else is running (no other workloads) and most preemptive waits take very little to no time, thus catching one is really about timing than anything else. I let the frozen thread sit around a while to make a point. The first point is about the non-yielding scheduler or lack thereof, if I look in my errorlog I have no entries, no dumps, nothing. This is as it should be, SQL Server said, “Hey this thing is going to run and is out of my control, we know about it, and it’ll eventually finish. Until then, don’t worry about it.” so the checks for non-yielding don’t apply since it’s not “on” the scheduler per se and other items are being switched in and out. The second is the accounting of cpu time, which continues to increment, even though the thread is frozen and unable to run. This means the cpu times are more elapsed time than burned cpu time.
|70||11/27/22 10:48 AM||background||TASK MANAGER||1||1||NULL||PREEMPTIVE_OS_QUERYREGISTRY||PREEMPTIVE_OS_QUERYREGISTRY||9142121||9083167||3||0x0000024CF8F604E8|
We can see that this is an actively running query (it’s in sys.dm_exec_requests), we also see there’s a task address, so let’s go take a look at the task. We know that tasks don’t sit out on their own, they are bound to workers in a 1-1 fashion, so we might as well get that data too.
You can see in the above that SQL still shows this thread ad running, and according to SQL Server that’s true… however, according to Windows the thread is actually frozen. Here we can see that the associated user and kernel time are more accurately reflected than cpu time in dm_exec_requests. This is indeed on scheduler 3, is session 70, etc., etc., and has an os_thread_id of 4892. If we compare that to our debugger, we can see this is indeed the case. The client id (which is what it’s called in Windows) is process.thread – in this case is ee8.131c which the thread portion in decimal is 0x131c = 4892, also notice that status of the thread is “Frozen”.
64 Id: ee8.131c Suspend: 1 Teb: 00000098`a30cb000 Frozen
Preemptive threads in SQL Server aren’t free, they still cost worker threads, cpu cycles, memory, etc., and compete at a Windows level for cpu resources. Too many of these can start to cause problems, and it’s not easily identifiable which are causing the problem as the typical DMV results (sys.dm_exec_requests) don’t really show the whole picture. What SQL Server believes is true and what is true can be two different things as shown with the cpu usage, thread times, thread states, etc., and thus you’ll need the output of all of these plus the Windows side output to really understand what’s going on. Scheduling in SQL Server doesn’t mean that’s how Windows will schedule it, the longer the list, the more items to schedule, the less time each thread has per unit of time.