• SQL Server
  • Log Shipping Tricks Demo
  • SQLCruise Alaska 2012 Pics
SQLSoldier News From the Frontlines

Day 20 of 31 Days of Disaster Recovery: The Case of the Backups That Wouldn’t Restore

January 24, 2013 11:13 pm / 8 Comments / SQLSoldier

31 Days of Disaster Recovery

31 Days of Disaster Recovery

I have decided to spend day 20 of my 31 Days of Disaster Recovery series by relating a true tale from my harried past days of Production DBA Operations. This is a cautionary tale with an important moral. This is the case of the backups that wouldn’t restore.

If you missed any of the earlier posts in my DR series, you can check them out here:

    31 Days of disaster Recovery

  1. Does DBCC Automatically Use Existing Snapshot?
  2. Protection From Restoring a Backup of a Contained Database
  3. Determining Files to Restore Database
  4. Back That Thang Up
  5. Dealing With Corruption in a Nonclustered Index
  6. Dealing With Corruption in Allocation Pages
  7. Writing SLAs for Disaster Recover
  8. Resolutions for All DBAs
  9. Use All the Checksums
  10. Monitoring for Corruption Errors
  11. Converting LSN Formats
  12. Extreme Disaster Recovery Training
  13. Standard Backup Scripts
  14. Fixing a Corrupt Tempdb
  15. Running DBCC CheckTable in Parallel Jobs
  16. Disaster Recovery Gems From Around The Net
  17. When are Checksums Written to a Page
  18. How to CHECKDB Like a Boss
  19. How Much Log Can a Backup Log

The Case of the Backups That Wouldn’t Restore

When I worked in operations at Microsoft, I was on a team of about 40 operations engineers managing about 80 application. Only a small portion of the engineers were DBAs, so the core set of DBAs were often called in to consult for the other engineers when they needed in-depth SQL Server knowledge for a specific problem. I was contacted one evening by one of the engineers for a critical application. It was a SharePoint application that had been hit by a newly discovered bug that caused SharePoint to corrupt all of it’s content data. To be clear, the content database itself wasn’t corrupted, but the data it contained had been sort of mangled. It wasn’t the kind of thing that could be fixed, they had to restore.

After more than 3 hours of trying to restore the database themselves, they finally called me at home to ask for my help. They told me that most of the backups were corrupted, and they would lose way too much data if they restore the newest backup that works. They were hoping that I would have some trick up my sleeve to prevent them from losing 4 days of data.

I asked them to break down for me what they had been attempting to do and where it was failing. It turns out that they were using my Standard Backup Scripts with the default settings. This meant they were doing weekly full backups, daily differentials, and log backups every half hour. They said the full backup restored successfully, but the 3 most recent differential backups would not restore. They could restore the 4th differential backup, but that would be losing too much data. Solution was simple, simply restore the most recent differential backup that could be restored and then restore all of the log backups from that point forward to the last known good point …. right?

I told them my plan, and they didn’t seem very enthused. Turns out that when they were trying to restore the database themselves, they decided to simplify the task in front of them by deleting a bunch of the older log backups. The log backups they had not deleted only went back two days. If there was any hope of using the log backups for the restore, we had to somehow get one of the two most recent differentials to restore. they didn’t recall the exact error they got when they tried to restore the newer differential, so I tried it myself. I tested the differentials with RESTORE VERIFYONLY with no problems. They did not appear to be corrupted so I tried to restore one of them.

This was the error I received when I tried to restore the differential backup:

Msg 3136, Level 16, State 1, Line 6
This differential backup cannot be restored because the database has not been restored to the correct earlier state.
Msg 3013, Level 16, State 1, Line 6
RESTORE DATABASE is terminating abnormally.

It was apparent that someone had created an out-of-band full backup of the database. The operations engineer for the application was insistent that no one that works on the application would do that, much less create the backup and then delete it. Sure enough, a full backup had been created during that time frame, and it was NOT in the backup folder. The database had been backed up to a share on the SharePoint server. The admin had done a full site backup through SharePoint, and he was not aware that a full SharePoint backup included a full backup of the database. I checked the share on the SharePoint server, and the full backup was still there.

I restored to a recent known good point in standby mode so the engineer could query the data and verify that it was good. They wanted to get closer to the point where the data corruption occurred so, I used the technique I outlined in my post SQLU DBA Week – Recovering Lost Data and recovered the log files progressively in standby mode repeatedly querying for the data to ensure we get to the most recent point. Once we found the point we felt was the best restore point, I recovered the database, and they were able to begin repopulating all of their SharePoint catalogs.

Moral of the Story

There are several things that could be learned from this experience. After recovery had completed, and everyone had gotten a good night’s sleep, we worked on improving their processes to prevent this kind of problem again. Hopefully, this tale will help you avoid making the same mistakes. Get your recovery plans in order and take heed of the below takeaways from this experience:

  • Test your backups - Backups are critical, but they are useless if they can’t be restored. If they had implemented some process to test restores of their backups, they would have learned several days earlier that the most recent differential backups could not be restored. They could have addressed this problem a day or two prior to the corruption occurring and been ready to restore when the corruption had been found.
  • Practice your recovery process - One of the reasons you practice your recovery process is so that when something goes wrong, you know what to do because you’ve already practiced that scenario. If they had practiced different scenarios, they might have realized that they could have simply used the log backups to complete the restore process.
  • Don’t be afraid to ask for help - If things aren’t going well and you need help, don’t spend more than 3 hours trying to figure it our on your own. Ask for help. It would have saved them 3 hours of time, and it would have been early evening instead of almost bed time when they finally called me. It would have been better for them and for me.
  • Know what’s going on in your environment - Part of their problem was that they were taking weekly full site backups through SharePoint with no idea of what that actually meant. It wasn’t a fluke that they had the problem that week. They would have had the same problem no matter which week it had occurred. This is especially true if you are not the SharePoint admin or “whatever admin”, it is important that you are communicating and planning your disaster recovery together. It doesn’t work well, as this experience showed, if the pieces of the same application are each doing their own thing in terms of disaster recovery.
Posted in: SQL Server / Tagged: 31 Days of Disaster Recovery, Disaster Recovery, SharePoint

8 Thoughts on “Day 20 of 31 Days of Disaster Recovery: The Case of the Backups That Wouldn’t Restore”

  1. Amit Banerjee on January 24, 2013 at 11:30 pm said:

    Out of band backups are a nightmare if you are not aware of them!

    Reply↓
    • SQLSoldier on January 25, 2013 at 10:06 am said:

      Yes, they are. Even more so when they are not in the backup folder.

      Reply↓
  2. Mirza Baig on January 25, 2013 at 5:15 pm said:

    Hi Robert,
    My understanding after reading the article is that doing full site backup through SharePoint does a full backup of the database without using “COPY_ONLY” command. Had the backup been done using the COPY_ONLY command, it would not have changed the differential base GUID and the differentials would have worked.

    Is my understanding correct? If so, can the SharePoint admin do a full site backup without a full DB backup or do a full site backup with a copy_only command?

    thanks,
    Mirza

    Reply↓
    • SQLSoldier on January 25, 2013 at 5:33 pm said:

      You are correct about the copy_only option. A copy_only backup does not affect the log chain. I’m not a SharePoint admin, so I can’t say for sure, but I don’t think the SharePoint backup options have copy_only as an option.

      Reply↓
  3. Pingback: Day 21 of 31 Days of Disaster Recovery: Who Deleted That Data? | SQLSoldier

  4. Pingback: Day 20 of 31 Days of Disaster Recovery: The Case of the Backups … « Quick Disaster Recovery.com

  5. Pingback: Day 23 of 31 Days of Disaster Recovery: Restoring Differential Backups With New Files | SQLSoldier

  6. Pingback: Day 30 of 31 Days of Disaster Recovery (T-SQL Tuesday #40): Using Partial Availability and Initialize from Backup to Replicate a Partial Database | SQLSoldier

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

Post Navigation

← Previous Post
Next Post →
<

Remote DBA Services
- serious SQL Server expertise for less than a full-time DBA
My Articles
 
My Book
Check out my interview on

Extreme Data Recovery (with Argenis Fernandez)
10 Things all BI System Administrators Should Know
Upcoming Events
    All events shown in Pacific Time

    No events to show

RSS My SQL Server Magazine Articles

  • Database Mirroring for Disaster Recovery September 16, 2011
  • Comparative Review: Database Schema Comparison Tools August 24, 2011
  • 3 Log Shipping Techniques June 22, 2011
  • Hardening SQL Server June 20, 2011
  • Review: ScriptLogic Security Explorer for SQL Server February 8, 2011

Tags

31 Days of Disaster Recovery Architecture Automation CDC & Change Tracking Data Architecture VC Database Mirroring DBCC Denali Disaster Recovery Dynamic Management Views Extended Events Gamers & Geeks General Discussion High Availability How do I ... ? Humor Idera ACE Program Internals MCM Meme Monday Performance & Optimization PowerShell Professional Development Replication Security SQLBits SQL PASS SQL PASS Summit SQLRally SQL Saturday SQL Server Magazine SQL University SSAS & BI SSIS SSMS SSRS T-SQL T-SQL Tuesday tempDB Tips & Tricks Travel Troubleshooting Undocumented Stuff Whitepapers XML in SQL

News

Download my Powershell Scripts

The following scripts can be downloaded as text files. You will need to change the file extension to .ps1 in order to execute them.

Backup a database
Restore a database
Scan a server to find a free port
Query DNS to get the FQDN of a server


To see some examples of my other forms of writing, please visit my page on WritersCafe.org. It is almost exclusively horror fiction, but I sometimes throw other things in there too from time to time. There's one science fiction story, a couple of poems, and quite a few humor pieces as well.


Look for me in the SQL Q&A section of the August, 2007 issue of TechNet Magazine.
August issue of TechNet Magazine's SQL Q&A column

Protect our Heroes

© Copyright 2012 - Robert L Davis
Infinity Theme by DesignCoral / WordPress

Twitter Twitter 
LinkedIn LinkedIn 
TLF TLF RSS RSS 
WritersCafe WritersCafe 
SQLPASS SQLPASS 
Facebook Facebook
grab this