Hey Dispatch, We Have a Code 8
If you’ve been following along with this series on demystifying dumps, you’ve probably thought something along the lines of, “All of these issues seem to revolve around stalled or non-yielding things…” and you’d be correct. If you revisit the non-yielding scheduler dump post, I explain a little about cooperative scheduling and how everything needs to play nice; if one piece isn’t doing its part in a timely manner, or isn’t giving up time to others, that’s going to be a very large problem for the system as a whole. The stalled dispatcher falls into the same set of issues.
There are various dispatcher pools in SQL Server, though the most widely known one is for extended events. A dispatcher pool holds pieces of work to be completed, like a dumping ground for “someone pick this up and do this” type of tasks. If it helps, you can think of it much like connection pooling, where background threads do something (like making or updating connections) when needed. A pool can respond to pressure by creating new threads, or trim threads when the work slows down. In SQL Server, each dispatcher pool sets its own attributes, including when an item is considered stuck, the number of threads to use, etc., which means some may never be considered stalled because they’re able to disable their own timeouts.
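To make the shape of this concrete, here’s a minimal sketch of the idea: a shared queue of work items serviced by background threads, with a per-pool stuck timeout. This is Python with illustrative names, not SQL Server’s actual implementation.

```python
import queue
import threading
import time

class DispatcherPool:
    """Toy dispatcher pool: a dumping ground of work items that background
    threads pick up. Names and fields are illustrative only."""

    def __init__(self, num_threads=2, stuck_timeout=5.0):
        self.work = queue.Queue()
        # A pool that sets this to None would never be considered stalled.
        self.stuck_timeout = stuck_timeout
        self.last_progress = {}  # thread name -> last time it picked up work
        for i in range(num_threads):
            t = threading.Thread(target=self._run, name=f"dispatcher-{i}", daemon=True)
            t.start()

    def enqueue(self, task):
        # "someone pick this up and do this"
        self.work.put(task)

    def _run(self):
        while True:
            task = self.work.get()  # idle waiting here is NOT a stall
            self.last_progress[threading.current_thread().name] = time.monotonic()
            task()                  # a task() that never returns is what looks stuck
            self.work.task_done()
```

The distinction in the comments matters later in this post: a dispatcher blocked waiting for work to arrive is healthy, while one that picked up an item and never finished it is the stall case.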
When it comes to a stalled dispatcher and creating a memory dump, the scheduler monitor (a common origin of dump generation, as it’s tasked with checking the health of various components, as discussed in previous posts) is in charge of checking dispatcher health and generating any dumps. You’ll see the callback sqlservr!SQL_SOSStuckDispatcherCallback on the scheduler monitor’s stack; this is what ultimately invokes the dump logic based on the settings of each particular dispatcher. If your head is swimming a little bit, that’s ok. Just know that each dispatcher is responsible for itself and only itself, and might be configured completely differently than another dispatcher, which is why certain dispatcher dumps are more common than others (some of which may never dump).
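As a rough mental model of that “each dispatcher is responsible for itself” behavior, here’s a hedged sketch of a health check where every pool carries its own timeout and some opt out entirely. The field names are made up for illustration; they are not SQL Server structures.

```python
import time

def check_dispatchers(dispatchers, now=None):
    """Toy health check in the spirit of the scheduler monitor: each
    dispatcher carries its own stuck timeout, so a pool with the timeout
    disabled is never reported stalled (and so never dumps)."""
    now = time.monotonic() if now is None else now
    stuck = []
    for d in dispatchers:
        if d["stuck_timeout_s"] is None:
            continue  # timeout disabled: this pool can never be "stuck"
        if now - d["last_progress"] > d["stuck_timeout_s"]:
            stuck.append(d["name"])  # this is what would trigger the dump callback
    return stuck
```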
The Repeater Game
Since the non-yielding/stalled dumps are all similar to debug, I’m not going to spend much time on that part; the other posts cover it well enough. I will point out the difference in the errorlog, an example of which looks like this:
Dispatcher (0x3a9c) from dispatcher pool 'XE Engine main dispatcher pool' Worker 0x0000022F0CE84160 appears to be stuck on Node 0. Approx CPU Used: kernel 0 ms, user 0 ms, Interval: 240026.
The key takeaways from the errorlog output are the thread id and the worker address. Sadly, since public symbols don’t include internal structures, the worker address won’t be of much help, whereas the OS thread id is of great help! In the entry above, the OS thread id is 0x3a9c and the worker address is 0x0000022F0CE84160, and the pool name “XE Engine main dispatcher pool” points us toward extended events usage. Using this information, along with the syntax and skills we learned in previous posts, the debugging would look like this:
~~[0x3a9c]s; kc;

00 ntdll!NtWaitForMultipleObjects
01 KERNELBASE!WaitForMultipleObjectsEx
02 KERNELBASE!WaitForMultipleObjects
03 sqldk!XE_DispatcherQueue::Dequeue
04 sqldk!SOS_DispatcherPool<XE_Buffer,XE_BufferMgr,XE_DispatcherQueue,XE_DispatcherPoolConfig,void * __ptr64>::Dequeue
05 sqldk!Dispatcher<XE_Buffer,XE_BufferMgr,XE_DispatcherQueue,XE_DispatcherPoolConfig,void * __ptr64>::Run
06 sqldk!XE_SessionSrv::DispatcherEntryPointWithContext
07 sqllang!XESQLSetupThreadContext
08 sqldk!DispatcherEntryPoint
09 sqldk!SOS_DispatcherPool<XE_Buffer,XE_BufferMgr,XE_DispatcherQueue,XE_DispatcherPoolConfig,void * __ptr64>::RunDispatcher
0a sqldk!SOS_Task::Param::Execute
0b sqldk!SOS_Scheduler::RunTask
0c sqldk!SOS_Scheduler::ProcessTasks
0d sqldk!SchedulerManager::WorkerEntryPoint
0e sqldk!SystemThreadDispatcher::ProcessWorker
0f sqldk!SchedulerManager::ThreadEntryPoint
10 kernel32!BaseThreadInitThunk
11 ntdll!RtlUserThreadStart
The above sets our thread context to the stalled dispatcher thread that caused the issue, then dumps the call stack without all the extra information we don’t need (and that really doesn’t help in these situations). Remember that this is the call stack as of when the dump was generated, which means it’s entirely possible the thread made forward progress between the time the health check occurred and the time the thread was frozen. You can see in frame 3 that it’s sitting in the Dequeue function under a WaitForMultipleObjects WinAPI call. That means it’s waiting for work… and waiting for work won’t cause a stalled dispatcher. Knowing this, let’s look to see if we have a copy of the stack that was taken before the dump occurred, in case the thread has indeed made forward progress since.
.cxr sqlmin!g_copiedStackInfo+0x20; kc;

00 ntdll!NtWriteFile
01 KERNELBASE!WriteFile
02 sqldk!XE_FileWriter<XE_FileWriterDefaultPolicy<1,&XEPackage0::DefaultFileExtension>,XE_MetadataTracker>::WriteLog
03 sqldk!XE_FileWriter<XE_FileWriterDefaultPolicy<1,&XEPackage0::DefaultFileExtension>,XE_MetadataTracker>::WriteBufferInternal
04 sqldk!XE_FileWriter<XE_FileWriterDefaultPolicy<1,&XEPackage0::DefaultFileExtension>,XE_MetadataTracker>::WriteBuffer
05 sqldk!XEPackage0::XE_FileTarget::ProcessBuffer
06 sqldk!XE_SessionBasicTargetImpl::ProcessBufferAsync
07 sqldk!Dispatcher<XE_Buffer,XE_BufferMgr,XE_DispatcherQueue,XE_DispatcherPoolConfig,void * __ptr64>::Run
08 sqldk!XE_SessionSrv::DispatcherEntryPointWithContext
09 sqllang!XESQLSetupThreadContext
0a sqldk!DispatcherEntryPoint
0b sqldk!SOS_DispatcherPool<XE_Buffer,XE_BufferMgr,XE_DispatcherQueue,XE_DispatcherPoolConfig,void * __ptr64>::RunDispatcher
0c sqldk!SOS_Task::Param::Execute
0d sqldk!SOS_Scheduler::RunTask
0e sqldk!SOS_Scheduler::ProcessTasks
0f sqldk!SchedulerManager::WorkerEntryPoint
10 sqldk!SystemThreadDispatcher::ProcessWorker
11 sqldk!SchedulerManager::ThreadEntryPoint
12 kernel32!BaseThreadInitThunk
13 ntdll!RtlUserThreadStart
The before stack looks much more like a thread that was stuck doing work. Reading from frame 6 down through frame 0, it was attempting to process a buffer from the work pool, which resulted in writing that buffer to disk. Now, it could be that the write took a long time, that there were multiple writes that each took a long time, that this is over a network share that is slow or otherwise disconnected and waiting on reconnect logic, etc. Either way, we know we should be investigating which XE sessions are enabled on the server. Double checking that there aren’t overlapping event captures, or high impact captures such as post-execution showplan XML, would also be helpful. The key thing is that while this did occur with a WriteFile at the top of the stack, much of the time could have been spent elsewhere (beware that some targets are synchronous and some are asynchronous). In this example, we know from the errorlog line that kernel mode usage was 0 ms and user mode usage was 0 ms, which points to a few different possibilities: the write was handed to Windows as a synchronous write that waited a very long time, the thread never got any time on the processor, the IO was lost/dropped/waiting on the underlying storage subsystem, and a host of other issues. Use your best judgement and attempt to gather whatever additional data is available in order to make an informed decision.
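One way to keep that last distinction straight is a tiny heuristic over the “Approx CPU Used” numbers from the errorlog entry. This is a hedged rule of thumb for triage, not an official diagnostic.

```python
def triage_stall(kernel_ms, user_ms):
    """Rough triage of a stalled dispatcher's 'Approx CPU Used' numbers.

    0 ms in both kernel and user mode over the whole interval means the
    thread never ran, pointing at a dead wait (stuck synchronous I/O, a
    lost write, scheduling starvation) rather than expensive work.
    """
    if kernel_ms == 0 and user_ms == 0:
        return "blocked-wait"
    return "cpu-bound-or-slow-work"
```

For the sample entry above (kernel 0 ms, user 0 ms over a 240026 ms interval), this lands on the blocked-wait side, which matches the WriteFile sitting at the top of the copied stack.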