Clustered indexes are the base data. This means we have to go to a restore scenario. Hopefully, the corruption isn’t wide spread. if it’s just 1 or a few pages, we can do page level restores. If it’s a lot of pages, it may be faster just to do a full restore. It may come down to a judgement call as to which you think is faster.
If you missed any of the earlier posts in my DR series, you can check them out here:
- 31 Days of disaster Recovery
- Does DBCC Automatically Use Existing Snapshot?
- Protection From Restoring a Backup of a Contained Database
- Determining Files to Restore Database
- Back That Thang Up
- Dealing With Corruption in a Nonclustered Index
- Dealing With Corruption in Allocation Pages
- Writing SLAs for Disaster Recover
- Resolutions for All DBAs
- Use All the Checksums
- Monitoring for Corruption Errors
- Converting LSN Formats
- Extreme Disaster Recovery Training
- Standard Backup Scripts
- Fixing a Corrupt Tempdb
- Running DBCC CheckTable in Parallel Jobs
- Disaster Recovery Gems From Around The Net
- When are Checksums Written to a Page
- How to CHECKDB Like a Boss
- How Much Log Can a Backup Log
- The Case of the Backups That Wouldn’t Restore
- Who Deleted That Data?
- Which DBCC CHECK Commands Update Last Known Good DBCC
- Restoring Differential Backups With New Files
Identify the Corruption
I’ve created a sample database for us to use for this scenario. It has a table in it with a corrupt clustered index. It also has clean backups from before the corruption occurred so we can perform a page or a full restore. In order to facilitate recovery, I had to do something we tell you to NEVER do with a corrupt database. Instead of providing you with a backup of the corrupt database, I detached the database and copied the files. To run through this demo on your own, you’re going to need to attache the files as a new database.
The scenario here is that the database was offline (let’s say the server rebooted), and corruption occurred during that time. The database is online now and seems fine. We insert some data into our table, and everything works fine. Then we query the data, and we discover the corruption.
-- Take database offline and poof, corruption occurs -- We don't know that yet though -- Add more data Insert Into dbo.FactInternetSales Select Top(500) * From AdventureWorksDW2012.dbo.FactInternetSales; Go -- Everything seems to be fine. Let's query the data Select * From dbo.FactInternetSales; Go -- Error occurs
Msg 824, Level 24, State 2, Line 2 SQL Server detected a logical consistency-based I/O error: incorrect pageid (expected 1:298; actual 0:0). It occurred during a read of page (1:298) in database ID 8 at offset 0x00000000c14000 in file 'C:\Program Files\Microsoft SQL Server\MSSQL11.SQL12\MSSQL\DATA\CorruptDB.mdf'. Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.
This is a fatal error, so our query window gets disconnected when the error pops up. We know this is a serious error. We need to determine the extent of the corruption. You may be tempted to run DBCC CHECKTABLE at this point, but the corruption could be more than just a single object. I highly recommend running the DBCC CHECKDB in almost every case.
-- Let's run DBCC CHECKDB (there may be other corrupt tables) DBCC CHECKDB(CorruptDB) With No_InfoMsgs, All_ErrorMsgs, TableResults; Go -- Errors returned for: -- Database ID 8 (CorruptDB) -- Object ID 245575913 (dbo.FactInternetSales) -- Index ID 1 (clustered index)
Okay, CHECKDB shows lots of error messages, but we need to look at the ones with a Severity of 16. It appears that just a single table and a single index is corrupted. At first look, it appears that there are 2 pages that are corrupt. Taking a closer look at the error message for page #299 shows that it is reporting that the pointer to it from page #298 is missing. Page #298 is the only one that is corrupted.
If we take a look at the corrupt page using DBCC PAGE, we can see that about half of the page was overwritten with repeating 0’s. This is a sure sign of disk corruption. You need to follow this up by having the disks checked. Disk corruption may be a sign of a failing disk so it is important to investigate right away.
Decide How to Deal With It
Since it is only a single page that is corrupted, I prefer to do a single page restore here. In order to do that, I have to have certain things already in place:
- Good backup with a non-corrupted copy of the page in it. This is where having tested your backups previously really pays off. I can’t tell you how many times tables have sat corrupted for months on end before someone finds it and by then they no longer have sufficient backups to support a proper restore or they have to wait for many hours or days for another team to recover backups from tape that is in storage. TEST YOUR BACKUPS.
- Log backups to bring the page current with the rest of the database. This means that if we’re in simple recovery mode or if we don’t have all the log backups, we’re dead in the water. Our only choice in that case would be to do a full restore and save as much of the data as possible.
In our case, we see that we have a full backup and a log backup. We will need to do a tail log backup as well to really bring it current. for this restore, I’m going to put the database into restricted-user mode, take the tail log backup, and then proceed with the restore process.
To process this restore, I’m going to put the database in restricted-user mode to kick the non-admin users out of the database. Then I will do the page-level restore including a tail log backup. I’ll restore the full backup specifying the page I want to restore. Then I will restore the existing log backup and then take a tail log backup I just took. Then I can recover the database and run DBCC CHECKDB again to ensure that the corruption is fixed. If the database is clean, we can let the users back in to the database.
-- Only the one page is corrupt, so let's do a page restore -- Switch to master to restore the damaged page USE master; Go -- Set the database in restricted user mode to keep average users out Alter Database CorruptDB Set Restricted_User With Rollback Immediate; Go -- Restore the corrupt page from the good full backup Restore Database CorruptDB Page = '1:298' From Disk = 'C:\Users\SQLSoldier\Documents\BlogFiles\CorruptDB\Backups\CorruptDB.bak'; Go -- Restore the 1st pre=existing log backup to bring the page current -- SQL knows which transations to apply, no need to specify any special commands Restore Log CorruptDB From Disk = 'C:\Users\SQLSoldier\Documents\BlogFiles\CorruptDB\Backups\CorruptDB.trn' With NoRecovery; Go -- If there were more pre-existing log backups, we would restore them in order -- Now backup the tail of the log... Backup Log CorruptDB To Disk = 'C:\bak\CorruptDB_LOG_TAIL.trn' With init; Go -- Restore the tail of the log bringing the page current Restore Log CorruptDB From Disk = 'C:\bak\CorruptDB_LOG_TAIL.trn' With NoRecovery; Go -- Finally, recover the database to bring it online Restore Database CorruptDB With Recovery; Go -- Recheck the database for corruption again DBCC CheckDB(CorruptDB) With All_ErrorMsgs, No_InfoMsgs, TableResults; Go -- Allow users back in Alter Database CorruptDB Set Multi_User; Go -- Run the original query again Use CorruptDB; Select * From dbo.FactInternetSales; Go
At first glance, a page-level restore may seem a little tricky, but once you’ve done it several times, it starts to become old hand. Just remember the steps I recommend taking, and follow them one at a time. It’s not as difficult as it seems.
Download the demo scripts and sample corrupt database in zip format: Demo_ClusteredIndexCorruption.zip (5.03 MB)
Special thanks to Paul Randal (blog|@PaulRandal) for his guidance when I saw something weird happening. Something weird was happening, but not what I thought was happening. Talking to him about it cleared it up for me and allowed me to see my error that I was overlooking.