olved steries #4 – SQL Won’t Shut Down!

Issue Introduction

I was sent a quite an interesting email stating that SQL Server would not shut down. Attempting to stop the service via services or the SQL Server Configuration Manager resulted in a timeout with SQL Server still running. Trying to execute the shutdown with and without NOWAIT T-SQL command resulted in the same, the process still running. Seems quite weird that SQL Server just refuses to shutdown!

Questions That Need Answered

1. Why isn’t SQL Server shutting down properly?
2. Is there anything that can be done to fix it?

Initial Data Collection

Since SQL Server isn’t shutting down, getting a snapshot of what’s going on is a good first step. Using SQLDUMPR is out of the question here, the trusty procdump sysinternals tool works well here. Since a memory dump is a point in time snapshot of the process, it’ll first need to be in the problematic state before a dump will be useful. There are different scenarios when you’d want to get multiple dumps with some time between each, thus is not one of those scenarios though.

Checking The Collected Data

When looking at dumps like this, where there isn’t a specific error, you really have to just take a 10,000 foot view of what the process is doing. This is true not just for SQL Server but for any other process, no matter what operating system it is running on. In this specific situation, there wasn’t anything extremely remarkable… when you look at enough SQL Server dumps you start to get a feel for what looks like it might be an issue and what just seems to be normal. Going through the stacks on the first pass, there was only one thread I noted as being interesting (below) and I think you’ll agree!

00 ntdll!NtSignalAndWaitForSingleObject
01 KERNELBASE!SignalObjectAndWait
02 sqldk!SOS_Scheduler::SwitchToThreadWorker
03 sqldk!SOS_Scheduler::Switch
04 sqldk!SOS_Scheduler::SuspendNonPreemptive
05 sqldk!WaitableBase::Wait
06 sqllang!CServiceBrokerMgr::NotifyInstanceShutdown
07 sqlmin!DBMgr::ShutdownAll
08 sqlservr!SignalProc
09 sqlmin!StartUp::InitDBMS
0a sqldk!SOS_Task::Param::Execute
0b sqldk!SOS_Scheduler::RunTask
0c sqldk!SOS_Scheduler::ProcessTasks
0d sqldk!SchedulerManager::WorkerEntryPoint
0e sqldk!SystemThread::RunWorker
0f sqldk!SystemThreadDispatcher::ProcessWorker
10 sqldk!SchedulerManager::ThreadEntryPoint
11 kernel32!BaseThreadInitThunk
12 ntdll!RtlUserThreadStart

If you look at the stack, frame 7 is calling into DBMgr::ShutdownAll which seems to be a good place to start if you’re going to shut the instance down. Continuing to go further up the stack (toward the lower numbers, the way the stack works is the frame at the top is what is running and the frames below it are what called into that frame [not always true, for example jmp commands]) the next frame, 6, is on sqllang!CServiceBrokerMgr::NotifyInstanceShutdown and that really seems to look like the instance is waiting on service broker. If we continue to move up the stack, frame 5 transitions to a wait from which we then further transition up the stack going through the typical SQL OS implementation of thread scheduling and finally to a WaitForSingleObject WINAPI call. This means the thread is suspended (according to both SQL and Windows) and it will be woken up and scheduled when the waitable object is signaled (which is different depending on the object, for example on a timer will occur when the timer expires whereas waiting on a thread will signal when the thread exits).

The seemingly shutdown thread is waiting on the service broker manager… ok… so that points us in the direction of service broker. Let’s check the dump again, looking for anything dealing with service broker.. and we do find one thread that seems to be running and waiting for the next message.

00 ntdll!NtSignalAndWaitForSingleObject
01 KERNELBASE!SignalObjectAndWait
02 sqldk!SOS_Scheduler::SwitchToThreadWorker
03 sqldk!SOS_Scheduler::Switch
04 sqldk!SOS_Scheduler::SuspendNonPreemptive
05 sqldk!WaitableBase::Wait
06 sqllang!CBrokerWaitFor::ReceiveWaitFor
07 sqllang!CBrokerQueryNotification::Wait
08 sqllang!CXStmtQuery::WaitForResults
09 sqllang!CXStmtDML::XretDMLExecute
0a sqllang!CXStmtReceive::ExecuteWaitFor
0b sqllang!CXStmtReceive::XretExecute
0c sqllang!CMsqlExecContext::ExecuteStmts<1,0>
0d sqllang!CMsqlExecContext::FExecute
0e sqllang!CSQLSource::Execute
0f sqllang!CSbActivationProcTask::Execute
10 sqllang!CSbTask::TaskStart
11 sqllang!CSbTask::BeginTaskStart
12 sqldk!SOS_Task::Param::Execute
13 sqldk!SOS_Scheduler::RunTask
14 sqldk!SOS_Scheduler::ProcessTasks
15 sqldk!SchedulerManager::WorkerEntryPoint
16 sqldk!SystemThread::RunWorker
17 sqldk!SystemThreadDispatcher::ProcessWorker
18 sqldk!SchedulerManager::ThreadEntryPoint
19 kernel32!BaseThreadInitThunk
1a ntdll!RtlUserThreadStart

Notice in frame 0x11 (note that all frames are in hex) we can see a service broker task that was created and eventually picked up and executed… but from where? Moving up the stack, on frame 0xf, this gives us a sqllang!CSbActivationProcTask::Execute which means it’s in an activated stored procedure. Alright, so we’re yet again given a clue but no actual answer. Some of you may be thinking, “But Sean, look at frame 0xe! Why can’t you just look at that and figure out the T-SQL that was running?!” and you’d be right in thinking that. However, to view the local variables one would need to have the private symbols of which I do not have as a member of the general public which is why you and I can’t just go to frame 0xe and checking the locals. You might then say, “Ok, sure, but you could do so many more things to figure out what this might be!” and sure, but that wouldn’t be a very good use of your time to be honest.

We have some information that an activated stored procedure is running, now what? I gave this information back to the person and asked what service broker activated stored procedures on queues they were using and it turns out there was just a single one… for login auditing. Another step forward, wooo! Let’s take another look at the dump and see if we see any login items.

00 ntdll!NtSignalAndWaitForSingleObject
01 KERNELBASE!SignalObjectAndWait
02 sqldk!SOS_Scheduler::SwitchToThreadWorker
03 sqldk!SOS_Scheduler::Switch
04 sqldk!SOS_Scheduler::SuspendNonPreemptive
05 sqllang!SOS_Task::Sleep
06 sqllang!CTraceEvDataQueue::Enqueue
07 sqllang!CTraceEv::Fire
08 sqllang!CTraceController::ProduceRecord
09 sqllang!CSecSpecialAudit::PopulateTraceForLoginFailed
0a sqllang!CSecSpecialAudit::AuditEvent
0b sqllang!AuditCheckSpecialAction
0c sqllang!ReportLoginFailure
0d sqllang!LoginFailedToReasonPhraseMap
0e sqllang!login
0f sqllang!process_login_finish
10 sqllang!process_login
11 sqllang!process_commands_internal
12 sqllang!process_messages
13 sqldk!SOS_Task::Param::Execute
14 sqldk!SOS_Scheduler::RunTask
15 sqldk!SOS_Scheduler::ProcessTasks
16 sqldk!SchedulerManager::WorkerEntryPoint
17 sqldk!SystemThread::RunWorker
18 sqldk!SystemThreadDispatcher::ProcessWorker
19 sqldk!SchedulerManager::ThreadEntryPoint
1a kernel32!BaseThreadInitThunk
1b ntdll!RtlUserThreadStart

There were 6 threads all with failed login processing, which seems like quite a few. If these are failed logins, then there should be an entry in the errorlog for said items. I asked the person to check the errorlog and their response was there are a large number of login failed errors, specifically stating due to the instance shutting down. Here, a large number is a very large number, hundreds per second.

Possible Mitigations and Results

We know it’s service broker causing a problem, specifically with the auditing of the login failures. I presented the following possible mitigations:
1. Check and add, if it didn’t exist, a TIMEOUT value as part of the WAITFOR statement.
2. Disable the queue before shutting SQL down.
3. Change the applications to have a backoff timer for connection attempts (don’t DOS yourself).

The response I received (ha!) was that there wasn’t a timeout but adding a timeout still caused the problem. It wasn’t feasible to change the applications to be written to do backoff, due to the fact that they were setup in a quasi micro-service setup. Disabling the queue did in fact work for them, and that’s how they decided to proceed. Seems a bit heavy handed, but service broker pretty much hasn’t been touched since it was created. It’s a very interesting concept and item to put into a database system, but not really what database systems were made to really do.

Bringing It All Together and Answering Questions

We kind of know what is going on now, and to be quite honest this looks like a possible missed edge case which should be fixed in SQL Server. While a workaround does work, this wouldn’t go well when it comes to things such as patching, automatic failover, etc., I’m just not a fan as shutting down the instance should shut down the instance. Since service broker hasn’t been touched in over a decade and the fact is that not many people use it, I wouldn’t hold my breath for any updates or fixes in this area.

Let’s answer our questions from before:

1. Service broker was firing in response to a large number of failed logins which was stopping the instance from shutting down, which was due to the auditing being done through service broker and the fact that logins will fail when the instance is shutting down coupled with the fact that there were hundreds of failures per second.
2. There’s nothing that can really be done to fix it, this would most likely take a update to SQL Server’s code to truly fix.

Closing Thoughts

Getting a dump isn’t always the thing to do, but in this case it really helped lead to the root cause. Sometimes the main issue doesn’t really show itself and you need to keep following the clues… there seems to be some random thought by everyone where the problem and solution is always very straight forward and it’s only one simple thing and were on the nose. That just doesn’t really happen in most cases. It’s perfectly fine to keep going over the data and following the leads, in fact you must do this for any sufficiently hard problem. Troubleshooting is quickly becoming a lost skill/art in the age of cloud computing, which seems to push the blame on whomever provides the service, whether or not that service is the problem. If you want to spend less time going back and forth with random support companies, dig in a little and do some basics (note that this issue and troubleshooting is _not_ the basics).

Anyway, this was a very interesting issue and I hope you all enjoyed it as well!