I would say I’m surprised but… ahem… I’m not. My own snarkiness aside, I’m not sure what updates CrowdStrike has or hasn’t pushed out, but there have been at least three incidents in the last week, all caused by the software. Whatever the case may be, I wanted to put out a public service announcement (PSA) in case you’ve hit this issue and everyone is telling you it’s a SQL Server problem. It’s not. If you’re unsure, feel free to contact me.
The first, and most telling, sign is that SQL Server will start reporting non-yielding scheduler issues and generating dumps. If you check the errorlog, you should see entries for the worker that appears to be non-yielding:
Process X:Y:X (0xNNNNN) Worker 0xMMMMM appears to be non-yielding on scheduler A [...] Kernel: BBBBBB
If all or most of that time is spent in kernel mode, it’s a major sign (along with the fact that you’re actually running CrowdStrike) that you’re hitting this issue.
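If you want to eyeball this quickly, here’s a minimal sketch that pulls the kernel/user CPU numbers out of a non-yielding scheduler errorlog entry and reports what fraction was kernel time. The exact message text varies by SQL Server build, so treat the regex here as an assumption to adjust against your own errorlog, not an official format spec.

```python
import re
from typing import Optional

# Assumed (not guaranteed) errorlog phrasing for NYS entries, along the lines
# of: "Approx Thread CPU Used: kernel 61406 ms, user 0 ms".
NYS_CPU = re.compile(r"kernel\s+(\d+)\s*ms,\s*user\s+(\d+)\s*ms")

def kernel_share(errorlog_line: str) -> Optional[float]:
    """Return the kernel-mode fraction of reported CPU, or None if not found."""
    m = NYS_CPU.search(errorlog_line)
    if not m:
        return None
    kernel_ms, user_ms = int(m.group(1)), int(m.group(2))
    total = kernel_ms + user_ms
    return kernel_ms / total if total else None

# A made-up example entry in the assumed format:
line = ("Process 0:0:0 (0x1a2b) Worker 0x000000E9A1BC0000 appears to be "
        "non-yielding on Scheduler 5. Approx Thread CPU Used: "
        "kernel 61406 ms, user 0 ms.")
share = kernel_share(line)
print(f"kernel share: {share:.0%}")  # -> kernel share: 100%
```

A share near 100% kernel time on the non-yielding worker is the pattern described above; mostly user time points you somewhere else.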
The second major tell: if you look at the generated dump, the offending thread will be sitting in a Windows API call dealing with disk I/O or handles. For example:
ntdll!ZwClose
KERNELBASE!CloseHandle
sqlmin!FCB::Close
CloseHandle is a Windows API that does a few different things, but essentially it removes the handle’s entry from the process’s handle table and decrements the reference count on the underlying kernel object. When that count drops to 0, the object can be cleaned up, for example by actually closing a file. This should be insanely fast… except CrowdStrike decides that it wants to monitor all of this. It does so from inside the System process, which is where all of the system’s drivers are loaded (not technically correct, but close enough for this discussion). All of that time is attributed to kernel time, hence why a big tell is the all-kernel-time usage in the non-yielding scheduler information.
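To make the handle-table bookkeeping concrete, here’s a toy model in Python. This is purely illustrative, nothing like the real kernel implementation: the point is just that closing a handle removes a per-process table entry and decrements a shared object’s reference count, and only the last close triggers the actual cleanup.

```python
class KernelObject:
    """Stand-in for a kernel object (e.g., an open file)."""
    def __init__(self, name: str):
        self.name = name
        self.refcount = 0
        self.closed = False

class HandleTable:
    """Toy per-process handle table."""
    def __init__(self):
        self._next = 4          # Windows hands out handles in multiples of 4
        self._table = {}

    def open(self, obj: KernelObject) -> int:
        handle = self._next
        self._next += 4
        self._table[handle] = obj
        obj.refcount += 1
        return handle

    def close(self, handle: int) -> None:
        # The CloseHandle analogue: drop the table entry, decrement the
        # object's reference count, and clean up when it hits zero.
        obj = self._table.pop(handle)
        obj.refcount -= 1
        if obj.refcount == 0:
            obj.closed = True   # e.g., the file is actually closed here

table = HandleTable()
f = KernelObject("datafile.mdf")
h1 = table.open(f)
h2 = table.open(f)          # e.g., a duplicated handle to the same object
table.close(h1)
print(f.closed)             # False: another handle still references the object
table.close(h2)
print(f.closed)             # True: last reference gone, object cleaned up
```

In-memory dictionary work like this is why an unhooked CloseHandle should be near-instant; the slowdown described here comes from a filter driver inserting itself into that path, not from the bookkeeping itself.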
I’ve been on multiple calls with CrowdStrike, and the last few have been no different. They vehemently disagree that it’s a problem with their code, even when I have ETW tracing showing the time is spent in their module, csagent.sys, along with the offset into that module. Regardless, capturing an ETW trace on the system while the issue is occurring is a must in order to push back and get some traction. Alternatively, every environment I helped that uninstalled CrowdStrike to test no longer had any issues (reboots are required, so be prepared); upon reinstalling it, the issues reoccurred. As I’ve stated before, “exclusions” aren’t really exclusions. Regardless of telling it not to look at sqlservr.exe, the module is still resident in kernel space and still does what it does. Exclusions will not help you.
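Once you have the trace, the argument usually comes down to which module owns the time. As a rough sketch of that tally: suppose you’ve exported sampled stacks from your ETW trace (e.g., out of Windows Performance Analyzer) into a CSV. The column names `module` and `weight_ms` here are my invention, so adjust them to match your actual export.

```python
import csv
import io
from collections import Counter

def module_weights(csv_text: str) -> Counter:
    """Sum sample weight per module from an exported stack-sample CSV.

    Assumes (hypothetically) one row per sample with 'module' and
    'weight_ms' columns; real exports will need their columns mapped.
    """
    weights = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        weights[row["module"]] += float(row["weight_ms"])
    return weights

# Fabricated example data, just to show the shape of the output:
export = """module,weight_ms
csagent.sys,1520.4
ntoskrnl.exe,210.7
sqlmin.dll,35.2
csagent.sys,980.1
"""
top_module, top_weight = module_weights(export).most_common(1)[0]
print(top_module, round(top_weight, 1))  # csagent.sys dominates the samples
```

If a third-party driver sits at the top of a summary like this during the stall, that’s the concrete evidence to bring to the vendor call.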
Shutting Up Now
If you can update to whatever latest patches CrowdStrike has, to see if that fixes your issues, I’d do so. I’m a big proponent of keeping software up to date (unless there is a known issue, etc.). It is concerning, though, how many run-ins I have with this software not working well and taking down critical systems. I don’t want to be on a call at 3pm on a Saturday any more than you do, trying to figure out what is going on. I also don’t want to hear that it’s a SQL Server issue just because SQL Server is the one complaining there is an issue. I’ve said that before, and I believe it was worth restating.
Hopefully not many other people are running into this, but if you are, this should help steer you in the right direction.