Hey Dispatch, We Have a Code 8
If you’ve been following along with this series on demystifying dumps, you’ve probably thought something along the lines of, “All of these issues seem to revolve around stalled or non-yielding things…” and you’d be correct. If you revisit the non-yielding scheduler dump post, I explain a little about cooperative scheduling and how everything needs to play nice; if one piece isn’t doing its part in a timely manner, or isn’t giving up time to others, that’s going to be a very large problem for the system as a whole. The stalled dispatcher falls into the same set of issues.
There are various dispatcher pools in SQL Server, though the most widely known one is for extended events. A dispatcher pool holds pieces of work to be completed, like a dumping ground for “someone pick this up and do this” type of tasks. If it helps, you can think of it much like connection pooling, where background threads do something (like making or updating connections) when needed. A pool can respond to pressure by creating new threads, or trim threads when the work slows down. In SQL Server, each dispatcher pool sets its own attributes, including when an item is considered stuck, the number of threads to use, etc., which means some may never be considered stalled because they’re able to disable their own timeouts.
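To make the shape of this concrete, here’s a minimal sketch of the idea: a shared queue of work items serviced by background threads, with a per-pool stuck timeout. This is Python with illustrative names, not SQL Server’s actual implementation.

```python
import queue
import threading
import time

class DispatcherPool:
    """Toy dispatcher pool: a dumping ground of work items that background
    threads pick up. Names and fields are illustrative only."""

    def __init__(self, num_threads=2, stuck_timeout=5.0):
        self.work = queue.Queue()
        # A pool that sets this to None would never be considered stalled.
        self.stuck_timeout = stuck_timeout
        self.last_progress = {}  # thread name -> last time it picked up work
        for i in range(num_threads):
            t = threading.Thread(target=self._run, name=f"dispatcher-{i}", daemon=True)
            t.start()

    def enqueue(self, task):
        # "someone pick this up and do this"
        self.work.put(task)

    def _run(self):
        while True:
            task = self.work.get()  # idle waiting here is NOT a stall
            self.last_progress[threading.current_thread().name] = time.monotonic()
            task()                  # a task() that never returns is what looks stuck
            self.work.task_done()
```

The distinction in the comments matters later in this post: a dispatcher blocked waiting for work to arrive is healthy, while one that picked up an item and never finished it is the stall case.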
When it comes to a stalled dispatcher and creating a memory dump, the scheduler monitor (a common origin of dump generation, as it’s tasked with checking the health of various components, as discussed in previous posts) is in charge of checking dispatcher health and generating any dumps. You’ll see the callback sqlservr!SQL_SOSStuckDispatcherCallback on the scheduler monitor’s stack; this is what ultimately invokes the dump logic based on the settings of each particular dispatcher. If your head is swimming a little bit, that’s ok. Just know that each dispatcher is responsible for itself and only itself, and might be configured completely differently than another dispatcher, which is why certain dispatcher dumps are more common than others (some of which may never dump).
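As a rough mental model of that “each dispatcher is responsible for itself” behavior, here’s a hedged sketch of a health check where every pool carries its own timeout and some opt out entirely. The field names are made up for illustration; they are not SQL Server structures.

```python
import time

def check_dispatchers(dispatchers, now=None):
    """Toy health check in the spirit of the scheduler monitor: each
    dispatcher carries its own stuck timeout, so a pool with the timeout
    disabled is never reported stalled (and so never dumps)."""
    now = time.monotonic() if now is None else now
    stuck = []
    for d in dispatchers:
        if d["stuck_timeout_s"] is None:
            continue  # timeout disabled: this pool can never be "stuck"
        if now - d["last_progress"] > d["stuck_timeout_s"]:
            stuck.append(d["name"])  # this is what would trigger the dump callback
    return stuck
```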
The Repeater Game
Since the non-yielding/stalled dumps are all similar to debug, I’m not going to spend much time on that part; the other posts cover it well enough. I will point out the difference in the errorlog, an example of which looks like this:
Dispatcher (0x3a9c) from dispatcher pool 'XE Engine main dispatcher pool' Worker 0x0000022F0CE84160 appears to be stuck on Node 0. Approx CPU Used: kernel 0 ms, user 0 ms, Interval: 240026.
The key takeaways from the errorlog output are the thread id and the worker address. Sadly, since public symbols don’t include internal structures, the worker address won’t be of much help, whereas the OS thread id is of great help! In the entry above, the OS thread id is 0x3a9c and the worker address is 0x0000022F0CE84160, and the pool name “XE Engine main dispatcher pool” points us toward extended events usage. Using this information, along with the syntax and skills we learned in previous posts, the debugging would look like this:
~~[0x3a9c]s; kc;

00 ntdll!NtWaitForMultipleObjects
01 KERNELBASE!WaitForMultipleObjectsEx
02 KERNELBASE!WaitForMultipleObjects
03 sqldk!XE_DispatcherQueue::Dequeue
04 sqldk!SOS_DispatcherPool<XE_Buffer,XE_BufferMgr,XE_DispatcherQueue,XE_DispatcherPoolConfig,void * __ptr64>::Dequeue
05 sqldk!Dispatcher<XE_Buffer,XE_BufferMgr,XE_DispatcherQueue,XE_DispatcherPoolConfig,void * __ptr64>::Run
06 sqldk!XE_SessionSrv::DispatcherEntryPointWithContext
07 sqllang!XESQLSetupThreadContext
08 sqldk!DispatcherEntryPoint
09 sqldk!SOS_DispatcherPool<XE_Buffer,XE_BufferMgr,XE_DispatcherQueue,XE_DispatcherPoolConfig,void * __ptr64>::RunDispatcher
0a sqldk!SOS_Task::Param::Execute
0b sqldk!SOS_Scheduler::RunTask
0c sqldk!SOS_Scheduler::ProcessTasks
0d sqldk!SchedulerManager::WorkerEntryPoint
0e sqldk!SystemThreadDispatcher::ProcessWorker
0f sqldk!SchedulerManager::ThreadEntryPoint
10 kernel32!BaseThreadInitThunk
11 ntdll!RtlUserThreadStart
The above sets our thread context to the stalled dispatcher thread that caused the issue, then dumps the call stack without all the extra information we don’t need (and that really doesn’t help in these situations). Remember that this is the call stack as of when the dump was generated, which means it’s entirely possible the thread made forward progress between the time the health check occurred and the time the thread was frozen. You can see in frame 3 that it’s sitting in the Dequeue function under a WaitForMultipleObjects WinAPI call. That means it’s waiting for work… and waiting for work won’t cause a stalled dispatcher. Knowing this, let’s look to see if we have a copy of the stack that was taken before the dump occurred, in case the thread has indeed made forward progress since.
.cxr sqlmin!g_copiedStackInfo+0x20; kc;

00 ntdll!NtWriteFile
01 KERNELBASE!WriteFile
02 sqldk!XE_FileWriter<XE_FileWriterDefaultPolicy<1,&XEPackage0::DefaultFileExtension>,XE_MetadataTracker>::WriteLog
03 sqldk!XE_FileWriter<XE_FileWriterDefaultPolicy<1,&XEPackage0::DefaultFileExtension>,XE_MetadataTracker>::WriteBufferInternal
04 sqldk!XE_FileWriter<XE_FileWriterDefaultPolicy<1,&XEPackage0::DefaultFileExtension>,XE_MetadataTracker>::WriteBuffer
05 sqldk!XEPackage0::XE_FileTarget::ProcessBuffer
06 sqldk!XE_SessionBasicTargetImpl::ProcessBufferAsync
07 sqldk!Dispatcher<XE_Buffer,XE_BufferMgr,XE_DispatcherQueue,XE_DispatcherPoolConfig,void * __ptr64>::Run
08 sqldk!XE_SessionSrv::DispatcherEntryPointWithContext
09 sqllang!XESQLSetupThreadContext
0a sqldk!DispatcherEntryPoint
0b sqldk!SOS_DispatcherPool<XE_Buffer,XE_BufferMgr,XE_DispatcherQueue,XE_DispatcherPoolConfig,void * __ptr64>::RunDispatcher
0c sqldk!SOS_Task::Param::Execute
0d sqldk!SOS_Scheduler::RunTask
0e sqldk!SOS_Scheduler::ProcessTasks
0f sqldk!SchedulerManager::WorkerEntryPoint
10 sqldk!SystemThreadDispatcher::ProcessWorker
11 sqldk!SchedulerManager::ThreadEntryPoint
12 kernel32!BaseThreadInitThunk
13 ntdll!RtlUserThreadStart
The before stack looks much more like a thread that was stuck doing work. Reading from frame 6 down through frame 0, it was attempting to process a buffer from the work pool, which resulted in writing that buffer to disk. Now, it could be that the write took a long time, that there were multiple writes that each took a long time, that this is over a network share that is slow or otherwise disconnected and waiting on reconnect logic, etc. Either way, we know we should be investigating which XE sessions are enabled on the server. Double checking that there aren’t overlapping event captures, or high impact captures such as post-execution showplan XML, would also be helpful. The key thing is that while this did occur with a WriteFile at the top of the stack, much of the time could have been spent elsewhere (beware that some targets are synchronous and some are asynchronous). In this example, we know from the errorlog line that kernel mode usage was 0 ms and user mode usage was 0 ms, which points to a few different possibilities: the write was handed to Windows as a synchronous write that waited a very long time, the thread never got any time on the processor, the IO was lost/dropped/waiting on the underlying storage subsystem, and a host of other issues. Use your best judgement and attempt to gather whatever additional data is available in order to make an informed decision.
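One way to keep that last distinction straight is a tiny heuristic over the “Approx CPU Used” numbers from the errorlog entry. This is a hedged rule of thumb for triage, not an official diagnostic.

```python
def triage_stall(kernel_ms, user_ms):
    """Rough triage of a stalled dispatcher's 'Approx CPU Used' numbers.

    0 ms in both kernel and user mode over the whole interval means the
    thread never ran, pointing at a dead wait (stuck synchronous I/O, a
    lost write, scheduling starvation) rather than expensive work.
    """
    if kernel_ms == 0 and user_ms == 0:
        return "blocked-wait"
    return "cpu-bound-or-slow-work"
```

For the sample entry above (kernel 0 ms, user 0 ms over a 240026 ms interval), this lands on the blocked-wait side, which matches the WriteFile sitting at the top of the copied stack.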