Hopefully, you’re not finding this entry because you’re scouring the web for what reverting means while your production server is in this state… but chances are, that’s the case.
What is reverting? How did I get here? Is this real life? All great questions.
Each time a secondary replica connects to the primary, the assumption is that they don’t really know any metadata about each other and the databases. This isn’t to say that each instance doesn’t know their information about the database(s) in the availability group, but it does mean that SQL Server doesn’t assume anything and as a secondary replica connecting to a primary, that all the metadata which the secondary should know and agree upon should be negotiated. This happens in various other protocols, the one most will be familiar with is TCP, for example.
Great, our secondary connects to the primary and the negotiations start. One of the key items is the exchange of recovery fork information wherein the secondary and primary must find a common point in which the secondary should start with its recovery so that it is in sync with the primary (this is regardless of the synchronization type such as synchronous or asynchronous). Sometimes this recovery fork could be so far back that the primary doesn’t have the data – for example, restoring a copy of the database to a new replica using the backup and restore method when a log backup on the primary wiped out the common lsn point. Other times, the secondary replica is ahead of the primary – for example, after a failover for synchronous commit [log can be different by even a few log blocks] or after a forced failover and the previous primary comes back online as a secondary and checks in with the new primary. When this happens, the secondary needs to come back to a common point with the new primary – called Undo of Redo – and this is when the reverting phase for the secondary database starts (this can be gathered via the sys.dm_hadr_database_replica_states, synchronization_state = 3). Again, this can and does happen quite often, and generally is quick enough that no one really is aware (fun fact, database mirroring had basically a similar implementation [Database Mirroring has been deprecated since 2012, stop using it!]), it’s when it takes a long time that really causes an issue.
When the secondary database replica is reverting, a series of processes are happening to get it in line with the primary database replica. It does this by asking for pages from the primary which have changed since the recovery fork that the secondary has undone to in its local log. During this time, the secondary database is in an inconsistent state as it contains a mix of the original pages and any pages received from the primary. Since it’s not just pages that have changed, but file sizes, number of files, etc., any issues at this stage such as removing the database from the availability group will leave you with a database that is dead – as in, delete the files and restore from a backup, dead. *** This means when your secondary replica shows that it is reverting, let it alone to finish what it is doing. *** Revert will also take care of Hekaton, Filestream, etc., all have their functions to perform, which can be disk, cpu, and memory intensive. It _shouldn’t_ need to be stated but I will anyway because someone will undoubtedly ask, “can I query the database in reverting?” – no.
While reverting is running and happily chugging along to get the secondary database in line with the primary, there are some counters that you can look at to get a sense of how far the process has progressed and then use some of your favorite math algorithms to give you an estimated time to completion. Here are the two major items to monitor for the process:
- Perfmon, SQLServer:DatabaseReplica\Log Remaining For Undo
- Perfmon, SQLServer:DatabaseReplica\Total Log Requiring Undo
The names are on the nose, so it shouldn’t take much to ascertain that the “total log requiring undo” is the total amount of log that needs to be undone, and the “log remaining for undo” is how much is remaining. The rate of change coupled with your favorite smoothing algorithm should give you an estimated time to completion for the reverting phase only. Once this completes, the database should restart and move on to the initializing phase.
Initializing is the process to bring the database into a consistent state after the reverting process has completed. Initializing accepts the log stream since the agreed upon fork and brings the log which was snipped back to the recovery fork to bring all of the other pages and items in line with the ones copied from the current primary – which could be far ahead of the database at this time. During this point, you should see the log streaming to the secondary (along with the secondary state updated) and the typical Always On counters for availability groups will apply to understand the current status.
Once the log stream catches up, the secondary will switch to either synchronizing or synchronized (based on various items). The process is fully complete and the database should be open for users to query if it’s readable.
Common Questions and Answers
Q: What happens if I fail over during reverting or initializing?
A: SQL Server should try to compare the recovery forks and find a commonality. If so, it should restart the reverting phase, from the beginning.
Q: What happens if I restart SQL Server on the secondary that is reverting?
A: You must restore the database that was reverting. Period.
Q: What happens if I failover the AG that holds the reverting database to the secondary that is reverting for that individual database?
A: Recovery of the database will fail, you’ll have to failover to another replica.
Q: What happens if I restart SQL Server on the secondary that is Initializing?
A: The process should start from the point it left off in the Initializing phase.
Q: What happens if I remove the secondary database from the AG while it is reverting?
A: You must restore the database. Period.
Q: What happens if I remove the secondary database from the AG while it is Initializing?
A: Restore the log to bring the database back up to speed and then join it to the AG.