One of my favorite topics to discuss is database corruption, how it happens, and how to fix it. I had noticed an outbreak of corruption questions on the MSDN forums and had replied with the best information I could, but some things are just too in depth and involved for a forum discussion. Corruption can happen for a multitude of reasons and when it strikes and the database is down, time is never on your side. Since forums posts are answered and replied to at a whim, it just isn’t the style of communication needed to really dig deep and get an understanding to the extent of the damage and what the best course of action to fix it would be. Some of the answers to these posts included links to various 3rd party software for database recovery of which I have never used a single one.
Recently I was approached by Stellar Data Recovery and asked if I would do a review on the SQL Server recovery software. Since I had never used a 3rd party piece of software for database recovery before and seeing the amount of people posting links to it I decided that it would be a great idea to have an unbiased review of different scenarios and how/where the recovery software would fit in.
Since some database corruption scenarios can be fixed quite easily with little to no downtime, I wanted to focus are the extremely difficult and pernicious types of database corruption. The end results is that we want to get all of the data back in a transactionally consistent state or as close as possible given the level of corruption.
All of these tests were run on SQL Server 2008R2 with SP2 (Version 10.50.4000). Corruption was created as though a hardware issue had happened resulting in page writes of either all zeroes or all ones. The database used was AdventureWorks2008R2 from codeplex which does have page checksums enabled.
For the tests involving data, the Production.TransactionHistory table was used. It has a good deal of information in it and had I tested all of the tables the amount of time it would have taken would be tremendous. Even though it’s a single table in the test, it is still representative of the same corruption happening on other tables.
In all of the testing, it’s assumed that some type of corruption had happened to the database and NO backups are available. Obviously this is the WORST case scenario and we must somehow either fix the corruption or get the data back, you should ALWAYS have known good backups!
The following tests were run. Below is a brief description about what type of corruption was involved and a little blurb about what this means in terms of SQL Server, database availability, and possibly recovery.
File Control Block (FCB) corruption – The FCB is always the first page (page 0 [that’s a zero]) of each file. In the case of the AdventureWorks2008R2 database, there is a single data file which is the main data file (MDF). The FCB holds different information about the file itself and how it belongs to the database. Since this is also the MDF, this is extremely important and any issues with this page will cause SQL Server to throw a fatal error and the database will not come online. The data still exists in the database but SQL Server cannot open the database and thus there is nothing that can be done in terms of out of the box solutions to get the data back (remember, we have no backups). This page can’t be single page restored.
Boot Page corruption – In SQL Server there are many special pages and other than the FCB, the boot page is probably the most important. The boot page holds many different pieces of information about the database such as the last checkpoint lsn which is a must for transactional consistency with recovery. If SQL Server can’t read the boot page or there is otherwise an issue, the database will not come online and a data error will be thrown. Once again, like with the FCB, the data still exists in the database we just can’t mount the database and get the information through SQL Server. There is nothing that can be done with this type of corruption as in there is no fix. This page can’t be single page restored.
GAM/SGAM/PFS page corruption – The GAM/SGAM/ and PFS pages are another type of special pages that SQL Server uses to track extent allocation and use (GAM/SGAM) and page allocation with different types of information (PFS). Corruption in any of these pages are a lights out issue as in there is no way to recover. The best chance for data retrieval is to script out as many objects and data as possible and then execute the scripts in a new database to recreate all of those objects and associated data. These pages can’t be single page restored.
Consistency corruption – The log files holds all of the database modifications that have happened as well as the logical flow of transactions that may or may not be committed. When corruption in the log file occurs, different things may happen. One of the most common ways of getting around this is to rebuild the log file or attach the database through the create database statement. Either of those will have some pages that are transactionally inconsistent and as such may have new data that isn’t correct. The log file can’t be single page restored and since the log holds the modifications that have happened, without a backup there is really no way to know what data is good and what is bad – though some 3rd party log reader tools may be able to give you an idea.
Clustered Index Leaf Level corruption – The leaf level of a clustered index (index id 1 level 0 [zero]) is the actual table data. This is the master copy of the data, any issues with this data structure at the leaf level will result in lost or missing data. Since this is the master copy, the index can’t be dropped and rebuilt like with a non-clustered index. This can be single page restored as long as the transaction logs from the restore medium to the latest point is available. If you have enterprise edition this page restore can be completed online – which is a huge benefit! Remember though, for these tests we’re assuming the worst case scenario so we do not have any backups. I corrupted two different parts for this test, the first part being the page header and the second part being the actual data on the page. When corrupting the data on the page I decided to do it in a manner that would make sense, which means that the page has an incorrect checksum and values were changed but those values are in the data domains of their respective fields. While this is more of a synthetic test versus a “real world” example, if the page was corrupted with non-sense then it wouldn’t have any useful data on it and thus this would be a moot point. Since page checksum errors will stop SQL Server from returning a result set for that query, this is a great example of how data can still be returned even with a corrupt page when some of the information on it is still good.
FCB Corruption Test Results
After corrupting the FCB of the Adventure works database, I attempted to bring the database online in the instance. This will obviously not work, pointing to corruption of the FCB in the error message.
Normally this would be a show stopper if there were no backups for the database and in fact, there is no “supported” recovery method that I know of for a damage FCB since it can’t be single page restored. If you’re thinking that setting the database into emergency mode might work, think again. Attempting to put the database into emergency mode will result in an error. This is truly a case of the database is down and there is no way to get the data.
It was at this point, not being able to get the database into even emergency mode I opened up Stellar Phoenix SQL Recovery tool. Being the very first test and not having used it before I was interested to see how it was going to hold up, after all SQL Server wasn’t even bringing the database online and the data was “there” (since I know the data is good and not garbled).
Selecting the MDF file and running the scan gave me a great surprise, the software had read the data! All of the data was there and I could peruse the tables, data, views, and stored procedures. Now it was time to recover this data… Choosing to scan and recover the data, you’ll be presented with some options about where you would like the recovered database to be saved and what instance the recovered database should be attached to. Once you’re past that it’s just sitting back and waiting for the software to do the work. Again, as I explained in the test setup I only exported the single table to represent the entire database. In this case, the entire database was down and with no backups any information that could be gleaned from it should be worth while.
When the process completed I checked the instance and a new database was created with “recovered_” prepended to the original database name. Upon checking the data against a “gold” (original known good) copy of the database I found all of the records to be there and correct.
So, what did I think of the software for this test? Well, if I didn’t have any backups (I can’t stress this ENOUGH) and everything came crashing down to a point where the database wasn’t accessible, I would definitely look at this software as a possible solution since it did recover the table and all of the data associated with it. Overall, this type of corruption seems to be very rare from the people I have asked and is generally shown in more severe hardware or drive failures – as such the chances of the other data not being damaged is extremely small. I also ran a test recovering the whole database and all of the views and stored procedures were there as well. I would give a win to Stellar Phoenix for this test.
Boot Page Corruption Test Results
This test is very similar to the FCB test in that a needed system page has been damaged. The database boot page is always file 1 page 9 and holds different key pieces of information that is needed for a clean database start up (or mount depending on who you talk to). One of the key pieces of information that this page holds is the database version. You might be wondering why this is important and at first I didn’t think this would be that big of a deal to lose, after all as the admin you’d know what version the database was. If we take a few minutes to really think about what this means, though, is that the database version is tied to physical code structures which change between versions. Since the version isn’t known, how should it be interpreted? Even when talking about the same “major” release such as SQL Server 2005 we saw a database version change when SP2 added support for vardecimal compression. This has huge implications for how the data structures need interpreted and something seemingly as trivial as the version actually causes a fairly large impact. Interesting!
Since SQL Server couldn’t mount or start up the database I attempted the same with Stellar’s software. Upon selecting the database file I was prompted to choose which version the database was… I had chosen correctly but the file still failed to open. I attempted this a few times to make sure I hadn’t incorrectly chosen the wrong values and each time it failed to read the database information.
It seems as though a boot page corruption will cause the database to be dead in the water. While some of the information in the boot page could be skipped if the correct database version was chosen, there are other pieces of information that would still be highly valuable and without those it seems like a shot in the dark to correctly read the information. I believe that the software, if implemented, could try and brute force read information based on the given version that was chosen but that would be a guess at best. In this test, Stellar Phoenix SQL Recovery could not recover any of the data.
GAM/SGAM/PFS Corruption Test
Already described in the brief overview, the GAM, SGAM, and PFS pages are another set of system specific pages that holds key information. In this case extent and page allocation information that is needed for the system to run properly. I had corrupted all three types of these pages but I am only including the screenshots for the GAM as all of tests had the same result.
SQL Server would not start with the corrupt pages and was marked as suspect. Given that the database used was the AdventureWorks2008R2 database with an overall size of 260 MB, a single GAM and SGAM page will cover the entire database – the GAM/SGAM pages are BITmaps and cover a range of 511232 pages or roughly 4 GB of space. PFS pages on the other hand are BYTEmaps and cover a range of 8088 pages or roughly 64 MB. After corrupting the information I was no longer able to access the database unless I put it in Emergency mode. This could be useful to script out all of the objects and data then import them into another database and continue on since these pages can’t be single page restored.
Using Stellar’s software I was able to read all of the information in the database and recover it successfully. It was the same process used as each, select the file in question, scan it, choose what data you want recovered and then tell the software to recover it. The recovered database was created and all of the information was available. This was really a tie with a learn toward the software. Since I was still able to recover the database through scripting out of the data and objects, the software did it for me which really saves time and effort.
Consistency Corruption Test
The brief introduction to this test really doesn’t do it justice when the entire implications are laid out. This really wasn’t a test of the “log file” per se as it sounded like from the brief in the beginning. In fact the only reason that I mentioned the log file is to give background on where transactional consistency is stored. For the test, I wanted to see if recovered data was transactionally consistent or not and then elaborate on the results either way – since the transaction log is integral to this I felt compelled to mention it.
This test, as aforementioned, is really about how “good” the recovered data is and if the software takes into account or reads the transaction log in order to give a consistent view of the data. After changing some data and simulating an issue, I loaded up the software to see what I would find. In this test I created my own database and table with a fresh transaction log in the full recovery model and had taken an initial full backup. The test involved inserting 100,000 rows before the simulated failure.
After loading up Stellar’s SQL Recovery software and pointing it toward the database file, I ran the scan to see what would come up. I was very surprised with the results! The test involved the insertion of 100,000 rows and the recovery software showed 70,566 in the table. What is surprising is that data was recovered and this is evidence that the log file is not used when recovering the data. This means that data recovered on a busy database will most likely be transactionally inconsistent. In the event of a real emergency and a database down situation with absolutely no backups I’m guessing some inconsistent or incorrect data is better than no data at all… but depending on the results, even with incorrect data it may be too late and causes an extremely large company loss or even worse, could cause the company to close the doors (think about a bank having this issue with inconsistent data). I know this is a trivial test but it shows the basic principles of how the software is going to work, once that is know it is possible to theorize how it will work in other more complicated scenarios.
This wasn’t as much of a win/lose test as it was a test in understanding, as in understanding what the cost (in terms of data) is for using a recovery tool. In this case it could be incorrect or inconsistent data… does that help to even recover it? Different situations will warrant different answers, but in the end the best answer is that you can restore from backups and lose little to no data.
In the screenshots below, you’ll notice that when the database comes back online with the correct log file that the transaction was correctly rolled back and that the table contains no data. If a low isolation level (read uncommitted) is used before the crash, 100,000 records would be found. What is interesting is that the software found 70,566 rows which may leave you to wonder why – this is a combination of how SQL Server works and background threads such as checkpoint. It’s good fun to go behind the scenes are figure this out, though that’s a post for another time.
Clustered Index Leaf Level Corruption Test
This test involved two different parts, the first being a corruption of the data page header and the second being a corruption of the data page itself.
The data page header corruption part of the test was to check if the software would correctly process a data page with a corrupt header. I highly doubted that this would be possible as the header holds the information about the page, but I didn’t know if the software would be smart enough to attempt to read the page by walking through the links from other pages and IAM pages.
The data page corruption of the data test was to see if checksum corruption of a page with incorrect data on it would be read or discarded. This is important as it represents a real life scenario where data has been garbled in a clustered index. If the data is garbled, I’d argue the real use for any type of recovery software as the information isn’t information anymore but a mass of wasted bits. However there are certain scenarios where the data could be in the proper domain but still have a checksum error which is what I am testing here.
First I wanted to start off with the page header corruption test. Asking for data from a page with a corrupt header will cause issues within SQL Server, the connection will be terminated and an 824 error will be returned. This could have huge business impact if that data is needed and can’t be returned or rebuilt from another source. In this case the TransactionHistory table was once again used, which has a density on the page of 144 records (for future reference). If read correctly all of the information should show up, if not then we should be missing 144 records. After causing the corruption and verifying the data still exists in an uncorrupted state (just the header was corrupted) I tested the software.
The results of the header test were as I expected. While the table was able to be read, it was missing 144 records which translated into the missing values on the page I corrupted. Since I didn’t expect the software to traverse the links and attempt to interpret a corrupted page this would be what I consider a normal result. It’s extremely hard to create software that deals with corruption, especially something like database corruption from a closed source vendor such as Microsoft. I’m not saying that it isn’t possible, but the company would have to work closely with the vendor to make sure all bases are covered – this is extremely hard to do and throwing corruption into the mix and potential recovery algorithms makes it for a very large and expensive guessing game. I applaud the effort of the software for recovering as much as it did, I do wish that an attempt to traverse chain linkages (from my observations) and attempts to interpret data pages from those would be extremely valuable. Granted if the header is corrupt the data is most likely going to be corrupt as well, so the pay off might not be in there for time spent on this – just a nice to have.
Stellar Phoenix SQL Recovery was able to recover the table minus the 144 records on the corrupt page. This would most certainly be useful if there was rampant corruption across the table as this provides an automated way of recovering the table and information instead of manually scripting the objects and data out. In the screenshots below I show that the data does exist and is still intact by manually converting the row information. If done manually by hand, in this scenario, all of the data could be recovered – this assumes an extremely deep knowledge of SQL Server internals as not all rows are all fixed length such as this and could include things like LOB columns or other off-row storage which could make manual recovery not viable.
Next up was the data page corruption test. For this test, the corruption of both the checksum and the data on the page was done – though I did keep the data in the correct domain (int stayed as int, etc). This could be validated in the real world as some companies are still running SQL Server 2000 which could result in column values that are outside the data domain. While this could be fixed through an update statement, if other corruption existed it may or may not be a viable route to go down. This was more a baseline testing of if checksums would be checked and if so would the rows still be processed.
The results from running the recovery software against this type of corruption were very good. All of the records were returned successfully, though the incorrect (corrupted) data was also returned. This may or may not be acceptable to the company attempting the recovery but is much better than losing all of the information on the page. If this was attempted to be recovered through SQL Server and no manual row information cracking as I did in the header example above, all 144 records would be lost. In this case, the recovery software was able to recover all of the data which means only a single row (in this test) would have to either be researched or thrown away and the other 143 could still be used. This lowers the overall loss of data – but as always, having known good backups and an action plan for something like this happening is the best course of action. Imagine doing this across an entire database… manually it would be a nightmare to say the least.
I feel that Stellar Phoenix SQL Recovery did an acceptable and fair job and retrieving the information from the corrupted pages. While this process could be investigated and accomplished manually, I don’t feel that anything more than a few corrupted pages would be possible to fix through that process. Introducing software that automates this dramatically cuts down the amount of time that would be needed to accomplish the same task – albeit with a possibility of slightly less accuracy. Overall though I feel it served the purpose it was designed for.
I wouldn’t look at this product as the holy grail of database recovery but I would look at it as a good tool to use depending on the type of corruption that you’re facing. With it’s abilities to recover data from databases with corrupt GAM/SGAM/PFS/FCB pages I feel that this could most certainly be invaluable to a corruption issue of those types. Outside of that I feel that the automation part of the recovery is extremely useful over attempting the same feat through a manual method. Overall I would say it’s a fairly good product for what it was designed for though it isn’t a replacement for known good backups and isn’t a magic bullet for all corruption needs (though nothing currently is).
With a price of $399 for a single install and use version, this could offset the money lost in the event of corruption issues – especially when that issue involves one of the corruption types that SQL Server can’t recover from (such as the FCB page). This would need to be a management decision and of course you can download and run the scan for free to see if it would make sense to pay the price for the information that it shows it could recover.