olved steries #1 – SQL FCI Failovers

Overall Introduction

I’ve had some feedback that people would like to know some of the stuff I see, how I end up finding the issue, and how I troubleshoot the problem. I present to you a new heading/series called “olved steries”. If you’re wondering, “What is this name and do you not know how to spell?” then I invite you to solve this mystery. This series will try to make some of the interesting items I work on, day to day, easy to digest for all levels of SQL people. Some parts might be more or less technical, but I’ll endeavor to make it easily understandable.

Issue introduction

This issue comes in the form of a SQL Server FCI on physical hardware that keeps “failing”. It’s not happening just once or twice, but constantly. When the issue occurs, it’s not possible to contact the machine; for example, an administrator can’t create a remote session. Basically, not a whole lot to go on, and a common sort of complaint. The big key point is that it’s not just SQL Server that seems to be having an issue, but the entire server.

Questions That Need Answers

  1. Why is this happening?
  2. What is the root cause of the issue?

Where Do You Start?

First, we know a few things.

  1. This is SQL Server, so we should have an errorlog.
  2. This is a SQL Server FCI, so we’ll want the cluster log to see what the cluster was (and is) thinking.
  3. The server is running Windows, so we have multiple logs to check, such as the Application and System event logs.
  4. This issue is affecting more than just SQL Server.

If it’s a SQL Server error or has to do with high availability such as integration with clustering, then you’re always going to start with the logs. Full stop. I can’t stress this enough as a simple stroll through the SQL/Cluster log can answer the vast majority of SQL/HA questions.

Initial Log Analysis

SQL Server Errorlog

Looking at the errorlog revealed a few different items, some of which were expected and some of which were not. One of the tell-tale signs that something is wrong is seeing a lot of I/O piling up, for example:

SQL Server has encountered 1507 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [D:\SQLDATA\MyDatabase.mdf] in database id 5. The OS file handle is 0x000000000000100C. The offset of the latest long I/O is: 0x0000410ab91000
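
If you want to find these messages quickly rather than scrolling the whole log, you can search the errorlog directly. Below is a minimal sketch using the undocumented xp_readerrorlog procedure; the search strings are just examples matching the message above, so adjust them to taste:

-- Search the current errorlog (file 0) of type 1 (SQL Server) for the long I/O warning.
-- Both search strings must match. xp_readerrorlog is undocumented, so treat this purely as a convenience.
EXEC master.dbo.xp_readerrorlog 0, 1, N'occurrence(s) of I/O requests', N'longer than 15 seconds';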

If you see there are long I/O requests, there are a few ways to look deeper into it. I like to start with sys.dm_io_pending_io_requests to see if the I/O is still pending according to Windows or according to SQL Server. If the io_pending column is 1, then Windows has not updated the status of the overlapped structure that is used for asynchronous requests. This means that further investigation outside of SQL Server is warranted and looking inside SQL Server will show you nothing, since we aren’t waiting on SQL Server. If, however, the majority of the io_pending values are 0, then we should investigate inside of SQL Server, specifically around the availability of CPUs – whether we have enough, whether there are long running tasks or non-yielding schedulers, etc. – especially if all the pending I/O is on a single scheduler. Here, the DMV output clearly showed a 1, which means we should not spend any further time investigating SQL Server for reasons contributing to slow I/O.
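
For reference, here is a minimal sketch of that DMV check. The join to sys.dm_os_schedulers is just an optional nicety to turn the scheduler address into a scheduler id, which makes it easy to spot everything piling up on a single scheduler:

-- Are the pending I/Os stuck outside SQL Server (io_pending = 1) or inside it (io_pending = 0)?
SELECT  p.io_pending,
        s.scheduler_id,
        COUNT(*)                   AS pending_requests,
        MAX(p.io_pending_ms_ticks) AS max_pending_ms
FROM    sys.dm_io_pending_io_requests AS p
LEFT JOIN sys.dm_os_schedulers AS s
        ON s.scheduler_address = p.scheduler_address
GROUP BY p.io_pending, s.scheduler_id
ORDER BY max_pending_ms DESC;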

Cluster Log

There are two parts to an FCI: the first is SQL Server and the second is Windows Failover Clustering. Since we checked SQL Server’s log, we should also check the Windows cluster log to see why it wanted to fail the resource.

WARN [RES] SQL Server <SQL Server>: [sqsrvres] Failed to retrieve data column. Return code -1
DBG [RHS] IsAliveCall::CallIntoResource(SQL Server)
ERR [RES] SQL Server <SQL Server>: [sqsrvres] Failure detected, diagnostics heartbeat is lost
INFO [RES] SQL Server <SQL Server>: [sqsrvres] IsAlive returns FALSE

Even with just the above we know quite a bit of information (I removed some other cluster log related information as it’s not pertinent to this discussion). First, it’s an FCI, as the resource DLL involved is “sqsrvres”; had it been an AG it would fall under “hadrres”. The second big piece of information is that this resource failed the IsAlive call, which just executes “sp_server_diagnostics” inside of SQL Server. Unfortunately, as of this writing, it will always return the error code of -1. Based on the (removed) timestamps, these two items – the I/O taking a long time and the resource failure – happened within a minute of each other. Correlation is not causation, but it should be investigated further.
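
If you’re curious what that health check is actually evaluating, you can run the same procedure yourself from a normal connection – keeping in mind this is only an approximation, since the cluster applies its configured failure condition level to the results to decide whether the resource has failed:

-- Returns one row per health component (system, resource, query_processing, io_subsystem, events)
-- with a state_desc of clean, warning, error, or unknown. With no repeat interval it runs once and returns.
EXEC sp_server_diagnostics;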

System and Application Event Logs

There was nothing specifically of note. There were various other issues, such as TLS errors to various clients, but none were related to this specific item.

Hypothesis

We know that there was a lot of slow I/O, at least 1,507 I/O requests that were taking a long time™. We know that the resource failed health checks shortly after SQL Server reported it was still waiting for that I/O to be completed. Since this isn’t much data to go on, more should be gathered, and that is the next step.

Gathering More Data Around I/O

Perfmon

The first and most basic way to start gathering data is a perfmon (logman) capture. This collects counter information from the system at a specified interval for whatever counters you choose. In this case, without going too far into explaining the storage stack in Windows, collecting physical and logical disk information along with CPU, memory, network, and process related items should give a good picture of what’s happening around the time of the issue. Remember that the issue made the whole server become unresponsive, and while slow I/O alone can and will do that, it makes sense to understand the broad picture before diving too deep into any specific item.

Upon collection and review of the performance data, there is massive queueing of I/Os about a minute before the issue occurs. The average disk queue length sat at 2,200 for 30+ seconds, after which the queue drained as the resource (SQL Server) was failed and taken offline. That’s fairly damning in terms of what’s causing the issue – currently it does seem to be poor I/O performance. Having worked with clustering since Windows 2000, I know that poor I/O will absolutely result in your system going down or otherwise failing resources. It’s also a common cause of “My computer/server is frozen”.

The interesting part is that while there is massive queueing, the latency for the requests is all over the place. In today’s world of local NVMe and NVDIMMs, I would personally consider even 1 millisecond (ms) of latency to be bad. However, there are various types of I/O, connections, caches, etc., and so you really must understand what the underlying stack looks like to know whether the values from perfmon are good or bad. In this case, the underlying stack was very good hardware and should give sub 10 millisecond latencies. The data shows that while latency wasn’t great, hovering around 8 ms, it also wasn’t terrible. There were spikes into 60 ms, but a few small spikes aren’t enough to cause this kind of issue.
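
As a complement to perfmon, SQL Server keeps its own cumulative view of file latency, which is handy for sanity-checking the counters. A rough sketch follows; note that these numbers are cumulative since each file was opened, so comparing two samples taken a few minutes apart gives a more honest picture:

-- Average read/write latency per database file as measured by SQL Server.
SELECT  DB_NAME(vfs.database_id) AS database_name,
        mf.physical_name,
        1.0 * vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)  AS avg_read_ms,
        1.0 * vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_ms
FROM    sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN    sys.master_files AS mf
        ON  mf.database_id = vfs.database_id
        AND mf.file_id     = vfs.file_id
ORDER BY avg_write_ms DESC;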

We know there are high queues, but not high latency. Are we even on the right track, though? Currently it would appear so: SQL Server is seeing a large amount of outstanding I/O in the DMVs, perfmon is also seeing a large amount of outstanding I/O, and after this situation occurs the cluster resource goes down. The next logical thing to do is collect a storport trace that covers two different time periods. The first is when the system is fine; this is a baseline. The second is when the system is having the issue. Since you’re collecting a known good and a known bad, it’s easier to spot what is and isn’t normal in the environment.

Storport Trace

Storport is generically the bottom of the Windows storage stack (this isn’t pedantically correct, but it’s also not incorrect). The name comes from the fact that the port driver is called storport.sys, and this is where the tracing occurs. Generically, I/O requests (IRPs) are tagged when they leave this driver and head off outside of Windows’ control, and are tagged again when they come back in through the driver. This means you can look at things like individual IRPs that come back to check how much time they took from leaving Windows to returning, or which path they took (for example, whether some requests are coming back with errors or taking a long time), etc.

In this situation, nothing was out of the ordinary. Sure, some requests took longer than average and stood out, but none took more than 60 ms. That’s not enough to cause the issues we’re seeing of thousands of queued I/Os not being drained and ending up taking more than 15 seconds to complete. We knew from perfmon that latency wasn’t bad, and to be sure we checked storport; storport didn’t show high latency or an exceedingly large I/O queue either. Now what?

That was the bottom of the Windows I/O stack; however, there are also items at the top. One of the major items at the very top of the I/O stack is the minifilter driver – commonly referred to as a filter driver – which sits above the filesystem and other drivers, specifically the volume and partition manager drivers where many of the disk performance counters are collected (for example, the logical disk and physical disk counters). Luckily, there is a very easy way to determine the performance of this layer, and for that we turn to Xperf/WPR.

WPR Trace

If you’re tired of collecting data and analyzing it, then the good news is we’re almost done; the bad news is that troubleshooting complex issues probably isn’t for you. WPR (Windows Performance Recorder) plugs into the ETW (Event Tracing for Windows) providers to collect data from various parts of the system that would otherwise be extremely difficult or very invasive to get at. In my opinion, this is one of the great things that Windows has going for it: a rich environment for capturing and gathering data with easy-to-use, common tooling. WPR is installed by default in all versions of Windows Server since 2012, which makes collecting the data a trivial task, with nothing to install like we needed back in the 2008/xperf days.

Here, collecting a minifilter trace specifically will help us look at the top of the stack where the filter driver callbacks happen.

Looking at the resulting trace data, there are some very interesting points. The first is that I’ve filtered it to just SQL Server and ordered it by the duration spent in each IRP major function. When writing a filter driver, you create a dispatch table of the items you’re interested in filtering and whether each is a “pre” call – which means before the IRP is passed down to lower levels – and/or a “post” call – which means the IRP has been serviced and is on the way back up the stack to the application. In this case, ordered by duration, it’s easy to see that the majority of the time (in microseconds) is spent in major function 4, which translates to IRP_MJ_WRITE. We’re spending a decent amount of time in this driver – in fact, a disproportionate amount compared to the other major functions within the same driver. How much is too much?

Generally, since the duration is in microseconds, I don’t start to get interested in the value unless there are at least two commas (that is, a million microseconds, or a second or more). In this case, the trace ran for 11 minutes in total and this driver was responsible for adding, on average, 150 ms of overhead to each write I/O. That of course isn’t more than 15 seconds, but it is an *average*, and one that is terrifyingly horrible. Let’s go look at whether there is a large amount of overhead from this filter driver on any of the individual IRPs.

Drilling into the individual IRPs, many of them were taking ~23,640,000 microseconds to complete. Doing the math, that’s 23.64 SECONDS of time per I/O. That’s non-trivial; in fact, it matches exactly what SQL Server was complaining about – that it was waiting *longer than 15 seconds* for I/O it had asked Windows to complete. No wonder the I/O was piling up. Side Rant: If you’re one of those people saying, “No other application is complaining about slow I/O” as a way to dismiss these messages, then you’re part of the problem. Just because notepad doesn’t tell you there’s an issue doesn’t mean it doesn’t exist. Stop being a tool.

That’s quite the smoking gun, so to speak. The filter driver racking up all of that time was CSAgent.sys. So, what exactly is CSAgent.sys? According to official documentation (and you can also look at the module information) it’s CrowdStrike LTD., which of course is the security software that everyone in security is currently raving about. There are two very, very important points I want to make here. The first is that, specifically with CrowdStrike, the security people like to say, “It’s in monitor(ing) mode and isn’t doing anything on the system. It’s not hurting anything or adding to any performance problems. It’s not blocking any actions.”, which in this specific case was also given to me verbatim by their security team. The second is, “We’ve disabled it, it is no longer doing anything.”, which is also untrue. Unless the driver is uninstalled, it’ll be loaded into the kernel and will continue to be in the path of any dispatch table entries it created. Whether or not another component of it takes action is irrelevant. It must be uninstalled to be out of the paths and the stack. Period. Full stop.

Bringing It All Together

At the start of this, there were two questions that needed answers:

  1. Why is this happening?
  2. What is the root cause of the issue?

Let’s answer them now that we’ve gathered enough data.

  1. This is happening because I/O is piling up on the system and isn’t being serviced in a timely manner. This causes long wait times in both the system and applications, and as such causes instability and long application response times.
  2. The root cause of the issue is CrowdStrike and its filter driver. Uninstalling the CrowdStrike software and rebooting the system stopped the long I/O and bad behavior. The system did not have any further issues.

Hopefully, you found this informative and entertaining. I want to point out that I have nothing against CrowdStrike, although I personally believe it does more harm than good on the systems I’ve had to troubleshoot. Invasive drivers, whether deviceless or not, are not something to trivially install on a system, especially when, as on this system, there is more than one of these “security” software packages running.

If you have thoughts, comments, etc., feel free to leave them below or to email me directly. If you have an interesting item that you’d like me to look at (in my non-existent spare time) feel free to shoot me an email.

7 thoughts on “olved steries #1 – SQL FCI Failovers”

  1. Great stuff, thanks! So typical that it’s a security tool causing the issue. Was there also CPU usage by some related process?

    1. Alex,

      It is very typical, sadly. Especially when the security team tells you it’s doing nothing and everyone just believes them. In any case, the CPU usage was very low and didn’t really stand out for this one. I don’t have CrowdStrike’s source code, but my guess is there is an internal bug that is hit with some sort of timeout, which is why many of the IRPs spent exactly ~23 seconds in it. I also do not know if they’ve come out with an update to fix it or not, but I’ve seen this particular issue happen on FCIs and stand-alone machines.

      -Sean

  2. It never is a surprise to find that AV or some other security-related app is causing an issue with a DB server, but the way this article explains how the issue was discovered, along with the undeniable proof, made it a really great read. Thanks for putting this out.

    1. James,

      I’m glad you enjoyed it! This really wasn’t a hit piece on security but a common theme in what I see. It happens often enough and has so many various symptoms that I thought it’d make a good article. It covers many different tools and items that are both internal and external to SQL Server.

      -Sean

  3. Great detail Sean. When you got to the point of discovering that CSAgent.sys was responsible, I also felt the joy that you must have felt when you went back to CrowdStrike with that data. Brilliant 🙂

    Hope they like their humble pie.

    1. Tom,

      The interesting thing was that CrowdStrike denied that this was a problem. They asked for “more data” and I don’t even know what that means in this context so I asked. It meant a kernel dump so they could look at the IRPs… which is between them and the company running their stuff…

      -Sean
