If you missed any of the earlier posts in my DR series, you can check them out here:
- 31 Days of Disaster Recovery
- Does DBCC Automatically Use Existing Snapshot?
- Protection From Restoring a Backup of a Contained Database
- Determining Files to Restore Database
- Back That Thang Up
- Dealing With Corruption in a Nonclustered Index
- Dealing With Corruption in Allocation Pages
- Writing SLAs for Disaster Recovery
- Resolutions for All DBAs
- Use All the Checksums
- Monitoring for Corruption Errors
- Converting LSN Formats
- Extreme Disaster Recovery Training
- Standard Backup Scripts
- Fixing a Corrupt Tempdb
- Running DBCC CheckTable in Parallel Jobs
- Disaster Recovery Gems From Around The Net
- When are Checksums Written to a Page
- How to CHECKDB Like a Boss
- How Much Log Can a Backup Log
The Case of the Backups That Wouldn’t Restore
When I worked in operations at Microsoft, I was on a team of about 40 operations engineers managing about 80 applications. Only a small portion of the engineers were DBAs, so the core set of DBAs were often called in to consult for the other engineers when they needed in-depth SQL Server knowledge for a specific problem. I was contacted one evening by one of the engineers for a critical application. It was a SharePoint application that had been hit by a newly discovered bug that caused SharePoint to corrupt all of its content data. To be clear, the content database itself wasn’t corrupted, but the data it contained had been sort of mangled. It wasn’t the kind of thing that could be fixed; they had to restore.
After more than 3 hours of trying to restore the database themselves, they finally called me at home to ask for my help. They told me that most of the backups were corrupted and that they would lose way too much data if they restored the newest backup that worked. They were hoping that I would have some trick up my sleeve to keep them from losing 4 days of data.
I asked them to break down for me what they had been attempting to do and where it was failing. It turned out that they were using my Standard Backup Scripts with the default settings. This meant they were doing weekly full backups, daily differentials, and log backups every half hour. They said the full backup restored successfully, but the 3 most recent differential backups would not restore. They could restore the 4th differential backup, but that would mean losing too much data. The solution was simple: restore the most recent differential backup that could be restored and then restore all of the log backups from that point forward to the last known good point … right?
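In T-SQL, that plan looks roughly like the sketch below. This is a minimal sketch only; the database name, file paths, and STOPAT time are hypothetical placeholders, not the actual values from that night.

```sql
-- 1. Restore the weekly full backup and leave the database in a restoring state.
RESTORE DATABASE SharePoint_Content
FROM DISK = N'D:\Backup\SharePoint_Content_Full.bak'
WITH NORECOVERY, REPLACE;

-- 2. Restore the most recent differential that still restores cleanly.
RESTORE DATABASE SharePoint_Content
FROM DISK = N'D:\Backup\SharePoint_Content_Diff_Day4.bak'
WITH NORECOVERY;

-- 3. Roll forward each log backup taken after that differential...
RESTORE LOG SharePoint_Content
FROM DISK = N'D:\Backup\SharePoint_Content_Log_01.trn'
WITH NORECOVERY;

-- ...repeating for every subsequent log backup, then stop at the last known good point.
RESTORE LOG SharePoint_Content
FROM DISK = N'D:\Backup\SharePoint_Content_Log_NN.trn'
WITH STOPAT = N'2013-01-24 18:00:00', RECOVERY;
```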
I told them my plan, and they didn’t seem very enthused. It turns out that when they were trying to restore the database themselves, they decided to simplify the task in front of them by deleting a bunch of the older log backups. The log backups they had not deleted only went back two days. If there was any hope of using the log backups for the restore, we had to somehow get one of the two most recent differentials to restore. They didn’t recall the exact error they got when they tried to restore the newer differential, so I tried it myself. I tested the differentials with RESTORE VERIFYONLY and found no problems. They did not appear to be corrupted, so I tried to restore one of them.
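For reference, the verification step was nothing fancier than RESTORE VERIFYONLY against each differential backup file (the path below is hypothetical). Keep in mind that VERIFYONLY only tells you the backup set is complete and readable; it does not prove the backup can actually be applied to the database in its current state, which is exactly the trap here.

```sql
-- Checks that the backup set is complete and readable. If the backup was
-- taken WITH CHECKSUM, the CHECKSUM option also validates the page
-- checksums stored in the backup.
RESTORE VERIFYONLY
FROM DISK = N'D:\Backup\SharePoint_Content_Diff_Day7.bak'
WITH CHECKSUM;
```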
This was the error I received when I tried to restore the differential backup:
Msg 3136, Level 16, State 1, Line 6
This differential backup cannot be restored because the database has not been restored to the correct earlier state.
Msg 3013, Level 16, State 1, Line 6
RESTORE DATABASE is terminating abnormally.
That error means the differential’s base does not match the full backup that was restored, so it was apparent that someone had created an out-of-band full backup of the database after our weekly full. The operations engineer for the application was insistent that no one who worked on the application would do that, much less create the backup and then delete it. Sure enough, a full backup had been created during that time frame, and it was NOT in the backup folder. The database had been backed up to a share on the SharePoint server. The admin had done a full site backup through SharePoint, and he was not aware that a full SharePoint backup includes a full backup of the database. I checked the share on the SharePoint server, and the full backup was still there.
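If you need to confirm an out-of-band full backup yourself, one place to look is the msdb backup history. This is the general technique, not necessarily the exact query used that night, and the database name is the same hypothetical placeholder used above:

```sql
-- List the full backups recorded in msdb, newest first, along with the
-- device each one was written to. An unexpected device path (like a share
-- on the SharePoint server) points to an out-of-band backup.
SELECT bs.database_name,
       bs.backup_start_date,
       bs.is_copy_only,
       bmf.physical_device_name
FROM msdb.dbo.backupset AS bs
INNER JOIN msdb.dbo.backupmediafamily AS bmf
    ON bmf.media_set_id = bs.media_set_id
WHERE bs.database_name = N'SharePoint_Content'
  AND bs.type = 'D'  -- D = full database backup
ORDER BY bs.backup_start_date DESC;
```

Worth noting: a copy-only full backup would not have reset the differential base, which is why out-of-band backups taken for other purposes should generally be taken WITH COPY_ONLY.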
I restored to a recent known good point in standby mode so the engineer could query the data and verify that it was good. They wanted to get closer to the point where the data corruption occurred, so I used the technique I outlined in my post SQLU DBA Week – Recovering Lost Data and restored the log backups progressively in standby mode, repeatedly querying the data to make sure we got to the most recent good point. Once we found the point we felt was the best restore point, I recovered the database, and they were able to begin repopulating all of their SharePoint catalogs.
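The standby restores themselves look roughly like this (the undo file path and backup names are hypothetical). After each log restore the database is readable, so you can query it before deciding whether to apply the next log backup:

```sql
-- Restore the next log backup but leave the database readable (STANDBY).
RESTORE LOG SharePoint_Content
FROM DISK = N'D:\Backup\SharePoint_Content_Log_42.trn'
WITH STANDBY = N'D:\Backup\SharePoint_Content_Undo.dat';

-- Query the data here to check whether it is still good, then repeat with
-- the next log backup until the corruption point is found.

-- Once the best restore point has been identified, bring the database online.
RESTORE DATABASE SharePoint_Content WITH RECOVERY;
```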
Moral of the Story
There are several things that can be learned from this experience. After the recovery had completed and everyone had gotten a good night’s sleep, we worked on improving their processes to prevent this kind of problem from happening again. Hopefully, this tale will help you avoid making the same mistakes. Get your recovery plans in order and take heed of the takeaways below from this experience:
- Test your backups – Backups are critical, but they are useless if they can’t be restored. If they had implemented some process to test restores of their backups (see the sketch after this list), they would have learned several days earlier that the most recent differential backups could not be restored. They could have addressed this problem a day or two prior to the corruption occurring and been ready to restore when the corruption was found.
- Practice your recovery process – One of the reasons you practice your recovery process is so that when something goes wrong, you know what to do because you’ve already practiced that scenario. If they had practiced different scenarios, they might have realized that they could have simply used the log backups to complete the restore process.
- Don’t be afraid to ask for help – If things aren’t going well and you need help, don’t spend more than 3 hours trying to figure it out on your own. Ask for help. It would have saved them 3 hours of time, and it would have been early evening instead of almost bedtime when they finally called me. It would have been better for them and for me.
- Know what’s going on in your environment – Part of their problem was that they were taking weekly full site backups through SharePoint with no idea of what that actually meant. It wasn’t a fluke that they had the problem that week; they would have had the same problem no matter which week it occurred. This is especially true if you are not the SharePoint admin (or whatever admin applies): it is important that you communicate and plan your disaster recovery together. As this experience showed, it doesn’t work well if the pieces of the same application are each doing their own thing in terms of disaster recovery.
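To make the first takeaway concrete, here is a minimal sketch of what an automated restore test could look like. All names, paths, and the scratch database are hypothetical; the point is simply that the only real proof a backup is good is restoring it, ideally followed by DBCC CHECKDB against the restored copy.

```sql
-- Restore the latest full backup to a scratch database on a test server.
RESTORE DATABASE SharePoint_Content_RestoreTest
FROM DISK = N'\\BackupShare\SharePoint_Content_Full.bak'
WITH MOVE N'SharePoint_Content'     TO N'T:\RestoreTest\SPC_Test.mdf',
     MOVE N'SharePoint_Content_log' TO N'T:\RestoreTest\SPC_Test.ldf',
     NORECOVERY, REPLACE;

-- Apply the most recent differential. If this step fails, you find out
-- days before you actually need the backup.
RESTORE DATABASE SharePoint_Content_RestoreTest
FROM DISK = N'\\BackupShare\SharePoint_Content_Diff.bak'
WITH RECOVERY;

-- Verify the restored copy is free of corruption, then clean up.
DBCC CHECKDB (SharePoint_Content_RestoreTest) WITH NO_INFOMSGS, ALL_ERRORMSGS;
DROP DATABASE SharePoint_Content_RestoreTest;
```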