In this solved mystery, the problem comes as a slow database restore. That’s what I was given, that’s all I have. It’s super fun when someone tells you the issue and there is absolutely no information or data around it, right?
Questions That Need Answering
- Why is the restore “slow”?
- What factors, if any, are involved?
- What exactly does “slow restore” even mean?
This came with little to no data available, and to be quite honest, saying “slow restore” doesn’t really mean much. The initial analysis needs an actual set of concrete data that describes the issue, what is normal, and what outliers, if any, exist. Since we have none, we can’t even start to analyze anything, so we need to clarify the problem statement and understand a little more about the issue.
Clarification Of The Issue
What is a “slow restore”? I asked about this, as it’s the best place to start: let’s all agree on what is meant by “slow” and make sure that we’re also agreeing on what a “restore” entails. When it comes to SQL Server there are a plethora of ways to back up and restore databases. There’s the most common native backup, the next most common VDI/VSS backup, the insanely stupid “shut the vm down and take a backup” backup, the “don’t shut the vm down and don’t use VSS to take a backup” corrupted backup, no backup (more common than one would tend to believe), SAN disk snap backup (which isn’t technically a backup), and finally the “export all the data to flat files” backup. There are also variations such as bacpacs; while a bacpac is technically a backup, it isn’t commonly used to back up and restore between servers for anything other than moving data around (yes, Azure SQL DB is *special* in that way).
After asking about this, I was finally gifted with some needed information. We can now answer Question #3 above: in this scenario a slow restore means, “a database that used to restore from a native backup on a common file share is now taking 10 hours where it used to take 4 hours on the same server”. Great, so our baseline is 4 hours, it’s a native backup, it resides on a file share which is most likely on a network appliance or other server somewhere, and now it’s taking 10 hours. Was it all databases or just one? Is there anything special about this database? No information was known, so… we can’t really assume anything.
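To get a feel for the scale of the slowdown, it helps to translate wall-clock hours into throughput. We were never told the backup size, so the number below is purely a made-up placeholder to illustrate the arithmetic, not a figure from the actual case:

```python
# Illustrative only: the backup size was never shared, so 500 GB is a
# hypothetical number used purely to show the scale of the slowdown.
BACKUP_GB = 500

def implied_throughput_mb_s(backup_gb: float, hours: float) -> float:
    """Average read throughput implied by a wall-clock restore duration."""
    return backup_gb * 1024 / (hours * 3600)

baseline = implied_throughput_mb_s(BACKUP_GB, 4)
slow = implied_throughput_mb_s(BACKUP_GB, 10)
print(f"baseline: {baseline:.1f} MB/sec, slow: {slow:.1f} MB/sec")
# baseline: 35.6 MB/sec, slow: 14.2 MB/sec
```

Whatever the real backup size is, going from 4 hours to 10 hours on the same data means average throughput dropped to 40% of what it was, which is the kind of concrete framing the problem statement was missing.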
Of course, “nothing has changed” and “it’s not the network”. Great.
Initial Data Collection
Since restoring a database across a network to SQL Server involves the server itself (cpu, memory), the network, SQL Server, and the disk subsystem, it only makes sense to get data during a restore. That means collecting things like the restore throughput from SQL Server counters, network speed and utilization, bytes/packets sent and received, all the disk counters for the restore disks, and, if possible, the same (sans SQL Server counters) from the server or appliance hosting the shared location.
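Once samples like those are collected, the first triage pass is just comparing averages against rough rules of thumb. As a small sketch (the counter names and thresholds below are my own generic rules of thumb, not values from this investigation):

```python
# Rough triage of collected perfmon-style counter samples. The thresholds
# are common rules of thumb, not values taken from the actual case.
THRESHOLDS = {
    "Avg. Disk sec/Read": 0.020,      # more than ~20 ms per read is suspect
    "Avg. Disk sec/Write": 0.020,
    "% Processor Time": 90.0,
    "Avg. Disk Queue Length": 2.0,
}

def flag_outliers(samples: dict) -> dict:
    """Return counters whose average value exceeds its threshold."""
    flagged = {}
    for counter, values in samples.items():
        limit = THRESHOLDS.get(counter)
        if limit is not None and values:
            avg = sum(values) / len(values)
            if avg > limit:
                flagged[counter] = avg
    return flagged

# Healthy-looking disk reads: nothing gets flagged.
print(flag_outliers({"Avg. Disk sec/Read": [0.005, 0.009, 0.012]}))  # {}
```

The punchline of this case is exactly what such a pass produces here: nothing flagged, which is itself a finding.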
Unfortunately, the shared location was “off limits” and “working fine”, so no data could be collected from it. This brings up a point: without data it’s not possible to make informed and accurate decisions. While best guesses are possible based on previous findings or other data points, having the information to show and help guide the analysis is required.
Dissecting Perfmon Data
There wasn’t anything extremely interesting about the data collected. Yes, there were spikes here and there, but the interesting thing was that it *wasn’t* interesting. What does that even mean? When troubleshooting an issue, the expectation is that the data will show *something* awry: high cpu usage, low amounts of memory available, disk reads or writes that take a long time, high disk queueing, etc. Not finding any of that generally means one of a few things: something is happening outside of the area that was captured, the incorrect data was captured, some other “bad actor” is influencing things, or a combination of any of these.
What makes the non-interesting data interesting is that we weren’t allowed to capture data from the source file machine, and that the network throughput seemed a bit low, which matched the restore throughput in SQL. If we’re getting slow network throughput, then yes, it’s going to take some time to get the data to where it needs to be so that processing can even _start_.
What next? Well, since we can’t get data from the source system, let’s look at the data the requesting system can see when it comes to the file IO of that backup file on the file share. In this case, WPR can be a helpful tool, so to WPR we go!
The database was restored again, and I didn’t want to wait 10+ hours to get some data. When issues are easily reproducible, a full capture of the entire issue generally isn’t needed. Small captures which can be completed quickly and give insight are better than longer captures in various ways: they produce smaller discrete files to work with, take less time to capture, and can separate multiple issues into their distinct sets rather than having multiple issues look like a single issue over the long timespan of the capture.
In this case we were strictly interested in the file IO over the network to the source machine when reading the backup file, so a quick capture of around 5 minutes while the restore was running would do.
You can see in the picture above that the durations of read operations on the file are quite large, hitting heights of 2.6 seconds. While it doesn’t stay at 2.6 seconds the whole time, the average latency for the ~5-minute capture is 137 milliseconds, with an average throughput of ~8MB/sec. Having to wait an average of 137 milliseconds per read operation is going to cause significant delays. This doesn’t necessarily mean there is an issue with the network, as we have no performance data from the server/appliance that the files are sitting on, so it’s really a crapshoot as to where that problem exists.
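Those two numbers hang together with some back-of-the-envelope math. Assuming the restore issues roughly 1 MB reads against the backup file (1 MB is the default MAXTRANSFERSIZE for native backups; the capture summary didn’t show the actual read sizes) and that the reads are mostly serial, latency alone predicts the throughput we saw:

```python
# Back-of-the-envelope check: with mostly serial reads, throughput is
# bounded by read_size / latency. The ~1 MB read size is an assumption
# (default MAXTRANSFERSIZE for native backups), not from the capture.
avg_latency_s = 0.137   # average read latency from the WPR capture
read_size_mb = 1.0      # assumed read size

throughput_mb_s = read_size_mb / avg_latency_s
print(f"~{throughput_mb_s:.1f} MB/sec")  # ~7.3 MB/sec
```

~7.3 MB/sec lines up nicely with the ~8 MB/sec the capture reported, which is a good sanity check that the per-read latency is the bottleneck, wherever that latency actually originates.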
Since we’re not going to get data from that, looking at the network portion and beyond doesn’t make much sense. If we’re not going to be getting data, then troubleshooting that segment is throwing darts blindfolded. What happens if we copy the file locally and restore? That eliminates the whole set of network + source server/appliance issues we’re seeing in this data capture.
Mitigation – Don’t use that network server/appliance
While many people would scoff at this and think to themselves, “well of course it’ll be faster local, that’s not a mitigation”, it is when the people you are working with refuse or are otherwise unable to obtain or provide any data from the source system. /smh
When restoring from a local file, things are much better: what was taking 10 hours is now taking around 8 instead. Most people would call this a great step forward, but I tend to not work with most people… so we’re back to troubleshooting. A major portion of the time has been cut, we’re only looking for ~4 hours’ worth of “extra” time (or so it would be believed), and now our entire restore path is “local” (I’m discounting disk infrastructure such as any HBAs).
What to collect? The disk itself didn’t seem too terribly bad, though it was adding some overhead with the files being local. Still, it wasn’t bad enough to warrant an increase in overall time to the tune of 4 hours. Nothing else really jumped out, and again “nothing changed” on the server. It was time to ask some more questions, as I felt I wasn’t being given all of the information.
Q: Does this happen to all databases or just the one?
A: Not sure.
Q: Is there any special feature usage in this database?
A: I think filestream.
Ugh. Like pulling teeth unless you’re into that sort of thing.
Q: Can you restore another database locally and see if it’s a normal time +/- some wiggle room?
Well, at least that’s something. After testing a couple of other databases, it turns out that restoring locally, for them, took a normal amount of time. Great, one more data point, so what was special about _this_ database? I asked if any of those other databases used filestream and was informed that they did not… so I asked them to double check that the one being “slow” did in fact use filestream. Finally, some concrete confirmation that it did use filestream… but how much data was in the container, and what’s the average file size? If there is very little data, say a few files with a few MB of data, then one wouldn’t expect this to cause a whole lot of issues. If, however, there is a large amount of data or many files, this tends to be a trigger point.
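They didn’t want to query for it, but for completeness, sizing up a filestream container is cheap from the filesystem side as well. A minimal sketch (the container path below is hypothetical; in practice you’d point it at the real filestream directory, or get the same answer from the database itself):

```python
import os

def summarize_directory(path: str):
    """Return (file_count, average_size_in_bytes) for all files under path."""
    total_bytes = 0
    count = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total_bytes += os.path.getsize(os.path.join(root, name))
                count += 1
            except OSError:
                pass  # file vanished or is inaccessible; skip it
    return count, (total_bytes / count if count else 0.0)

# Hypothetical container path, for illustration only:
# count, avg_bytes = summarize_directory(r"E:\SQLData\MyDb_fs")
```

Many small files pushes you toward a “lots of file creates” problem; a few huge files pushes you toward a throughput problem. That distinction matters for what comes next.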
It turns out that there are many files, but they didn’t know the average size and didn’t want to query for it. Ok, I guess that’s better than nothing, considering I at least was given some data… so now what? Filestream is implemented via a kernel-mode filter driver, so it makes sense to investigate the filter driver, as the only database having an issue when restored locally is the one using a filter driver specifically implemented by SQL Server. Remember, none of the perfmon data showed that the disk was a major issue (it wasn’t great, but it wasn’t 100 ms/transfer either).
WPR Data Number 2
Time to get another collection, specifically looking at the SQL Server RSFX filestream filter driver – let’s collect the data for about 5 minutes during the restore and see what it says.
Looking at the data for the filter driver (below), there was a total of 2.6 seconds (units are ms in the picture) of time spent in the filestream filter driver. Clearly this isn’t the problem, but does anything else look as though it’s taking a large amount of time?
What does the data say?
According to the table above, out of ~5 minutes of data capture (~300 seconds), ~264 seconds were spent in a filter driver called “flightRecorder.sys”. That’s a non-trivial amount of time, roughly 88% of the capture. Let’s see if it has some long stacks anywhere, and if they even have to do with our process (a SQL Server database restore), as this could be a side effect of something else.
Ouch. Looking at the above, we can see that the largest amount of time this driver impacted was in a SQL Server call, specifically dealing with creating files during a restore from a backup. That’s exactly what we’re doing… looks like we have a pretty good culprit. However, it’s hard to say what specific item in the driver is causing grief; after all, it should be hurting the other restores as well! Looking at the stack, one can see that the last user-mode call to the Windows API is CreateFile. CreateFile is kind of a misleading name, as it isn’t used _just_ to create files: it is used to create, open, and in specific settings also delete files. Let’s investigate which callbacks this driver implemented and which are causing an issue. Do notice that in the hot stack above, the filter manager is specifically in the “pre” functions, which means this driver has at least registered for “pre” operations: the driver is passed the IRP *before* it makes its way further down the stack or to any other drivers in line behind it. The other option is to register for “post” operations, which means the IRP has been completed and is travelling back up the IO stack (rather than down, as in the pre phase, with the IRP not yet completed).
If we aggregate the driver time by IRP major function (which is how the function table is registered in a filter driver: you declare each major function you’ve implemented and pass a function pointer along with it), one can see that the majority of time is spent in create (IRP_MJ_CREATE). That isn’t surprising, given that filestream data is held in files under the root container and we knew from asking earlier that there were a large number of files in the container. Write (IRP_MJ_WRITE) was the second most time-consuming item; both are good data points to have.
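The aggregation itself is trivial once the trace rows are exported. With made-up sample numbers (the real values came out of the WPA table), it looks something like:

```python
from collections import defaultdict

# Made-up sample rows shaped like (irp_major_function, duration_ms);
# the real values came from the WPR/WPA trace table, not these numbers.
events = [
    ("IRP_MJ_CREATE", 180.0),
    ("IRP_MJ_WRITE", 40.0),
    ("IRP_MJ_CREATE", 210.0),
    ("IRP_MJ_READ", 5.0),
    ("IRP_MJ_WRITE", 55.0),
]

def time_by_major_function(rows):
    """Sum driver time per IRP major function, largest first."""
    totals = defaultdict(float)
    for major, duration_ms in rows:
        totals[major] += duration_ms
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))

print(time_by_major_function(events))
# {'IRP_MJ_CREATE': 390.0, 'IRP_MJ_WRITE': 95.0, 'IRP_MJ_READ': 5.0}
```

Sorting by total time makes the create-heavy skew obvious at a glance, which is exactly the shape the real data had.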
Who is FlightRecorder.sys?
We have at least one more part to this, which is: who the hell is FlightRecorder.sys? If we use the Windows Docs handy lookup page:
Well, hello, color me unsurprised. Yet another antivirus software that causes issues. I guess you’re secure if nothing works or it takes the virus 100 days to do anything of use? I don’t know. Whatever.
Now we’re at a point where it’s not possible to just “not scan” or “whitelist” files/programs/directories in the antivirus application. Why? The thing is, filter drivers are loaded, and when they are loaded, regardless of what the security dude from insert company tells you, all of the IRPs still go through the driver. It may not call back to user mode, it may not actively do whatever scanning or inserting it normally does, but it is actively passed the IRP and must do something with it, which at a minimum is to return the IRP with a specific status to pass it to the next driver in line. There is nothing to say that the handling of the IRP itself isn’t the part causing the problem here; it doesn’t have to be the “scanning” or “auditing” portion that causes issues.
This means uninstalling the software. Period. End of discussion.
In this specific instance, the request to uninstall the software was refused. Exclusions were set, and unsurprisingly the issues still occurred. The same backup was restored on a machine which didn’t have the software installed, and the restore ran in the normal amount of time, plus or minus a little bit.
They still refused to uninstall the software, so there’s nothing more I can do. Mitigation #2 is to uninstall, update, or work with the 3rd party vendor to fix the issue in their code. I can’t help with that.
Bringing It All Together
Let’s answer our outstanding questions from the beginning (#3 was already answered):
- Why is the restore “slow”?
  - The restore was slow for two reasons. The first was that network delays added a decent amount of overhead, and the second was that a local piece of antivirus software was causing overhead for each file created on the system.
- What factors, if any, are involved?
  - Network, file creates, file writes (to an extent, but not as impactful), and most likely something on the source server (if this software was installed there, it could be causing issues at that end as well for other items).
I want to end this by pointing out that most people are looking for a silver bullet, i.e., a single item/configuration/etc. that can be easily changed to “fix” the problem. In many cases, however, there are overlapping or multiple issues all occurring at the same time. Sometimes this culminates in one vast issue; other times each part plays its own small role, where it’s just enough to cause a problem. Whichever the case, many of these issues require making a change or mitigation and then collecting more data and information. It’s an onion-peeling process, going through the multiple layers of issues, and isn’t always just a single easy and quick change.