SQLSoldier News From the Frontlines

Day 26 of 31 Days of Disaster Recovery: The Mysterious Case of the Long Backup

February 10, 2013 2:49 pm / 4 Comments / SQLSoldier

31 Days of Disaster Recovery

Welcome back for day 26 of my series 31 Days of Disaster Recovery. Today I want to share the tale of a mysterious backup that kept running longer and longer even though, as the SAN admin reported, nothing had changed in the configuration of the SAN or our LUNs. We eventually tracked down the issue, and it was something none of us had considered or even thought to look for while we were investigating.

If you missed any of the earlier posts in my DR series, you can check them out here:

    31 Days of Disaster Recovery

  1. Does DBCC Automatically Use Existing Snapshot?
  2. Protection From Restoring a Backup of a Contained Database
  3. Determining Files to Restore Database
  4. Back That Thang Up
  5. Dealing With Corruption in a Nonclustered Index
  6. Dealing With Corruption in Allocation Pages
  7. Writing SLAs for Disaster Recovery
  8. Resolutions for All DBAs
  9. Use All the Checksums
  10. Monitoring for Corruption Errors
  11. Converting LSN Formats
  12. Extreme Disaster Recovery Training
  13. Standard Backup Scripts
  14. Fixing a Corrupt Tempdb
  15. Running DBCC CheckTable in Parallel Jobs
  16. Disaster Recovery Gems From Around The Net
  17. When are Checksums Written to a Page
  18. How to CHECKDB Like a Boss
  19. How Much Log Can a Backup Log
  20. The Case of the Backups That Wouldn’t Restore
  21. Who Deleted That Data?
  22. Which DBCC CHECK Commands Update Last Known Good DBCC
  23. Restoring Differential Backups With New Files
  24. Handling Corruption in a Clustered Index
  25. Improving Performance of Backups and Restores

Backup Performance

We had 2 databases on the server in question: the small one was 500 GB and the large one was 1.75 TB. The smaller database was basically used for authentication and only had a few hundred updates per day, so we rarely paid much attention to it. The main transactional database, the big one, was extremely busy 24 hours a day, 7 days a week. There was no maintenance period as it handled transactions from users everywhere. Our busiest times were ... weekends, followed by business hours in the United States and business hours in Japan. The system was used by 30,000+ support agents around the globe. Okay, you get the picture. There was no time when it wasn’t busy.

We had highly tuned our backups. We would back up the smaller database at midnight (it took less than half an hour) and the large database at 1 AM (US Pacific Time). The large database took 2 hours. We published performance metrics reports daily, and you could see that there was a small performance drop in the main database while the backup was running. From 1 AM to 3 AM, it wasn’t an issue as we had performance to spare. Over time, the backups kept taking longer and longer. This became an issue when the large backup started taking longer than 4 hours. This put the backup completion time after 7 AM Eastern US Time, which meant we were getting close to when business started picking up again. We were still fine performance-wise in the application, but we were approaching the time when it would become a problem. Furthermore, the backup of the smaller database was now taking more than an hour, so it was still running when the bigger one started.

To keep the size of this database in check, we aggressively purged data from it 4 times a day, deleting support cases that were closed and had seen no action for at least 90 days. We were deleting millions of rows daily. We tracked and plotted the amount of data that we purged as well as the size of the database in our daily performance reports. There were no significant changes in either of those metrics.

We also investigated the amount of activity during the backup time frame, which was in our performance reports as well, and there were no big changes there either. All performance metrics throughout the day looked completely normal. There was no slowness during the day, only during the backup window. We escalated it to the SAN team, and they confirmed that none of our settings on the SAN had changed and that everything looked healthy on the SAN. All SAN metrics looked good. We were on a shared SAN with many other applications, and the SAN admin said that none of the others were complaining, only us.

Digging deeper, we discovered that while the backups were running, our throughput to the SAN went way down and then sometime in the 3 AM hour, throughput would return to normal. The bulk of the backup was being performed after this time. We had a theory and we needed to confirm it. We asked the SAN admin to validate the same findings on his side of the SAN.

Sure enough, the SAN was being flooded between midnight and 3 AM and bottlenecking on throughput because everyone else on the SAN was running their backups at midnight as well. We changed our backup schedule to work around this. We would back up the smaller database at 11 PM and then start the large database at 3 AM. Backup times returned to normal, and we were good again.
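
The schedule change above can be illustrated with a small sketch that measures how much of each backup window falls inside the SAN's congested period. This is only an illustration, not the tooling we used: the dates, the dictionary names, and the helper functions `window` and `overlap_hours` are hypothetical, and the durations are the normal backup times mentioned earlier (about half an hour and 2 hours).

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M"

def window(start, hours):
    """Build a (begin, end) pair from a start time and a duration in hours."""
    begin = datetime.strptime(start, FMT)
    return begin, begin + timedelta(hours=hours)

def overlap_hours(a, b):
    """Hours shared by two (begin, end) windows; 0.0 if they do not overlap."""
    latest_start = max(a[0], b[0])
    earliest_end = min(a[1], b[1])
    return max((earliest_end - latest_start).total_seconds() / 3600, 0.0)

# Congested period on the shared SAN: midnight to 3 AM (hypothetical date).
BUSY = window("2013-02-10 00:00", 3)

# Old schedule: small database at midnight, large database at 1 AM.
OLD = {"small": window("2013-02-10 00:00", 0.5),
       "large": window("2013-02-10 01:00", 2)}

# New schedule: small database at 11 PM, large database at 3 AM.
NEW = {"small": window("2013-02-09 23:00", 0.5),
       "large": window("2013-02-10 03:00", 2)}

OLD_OVERLAP = {name: overlap_hours(w, BUSY) for name, w in OLD.items()}
NEW_OVERLAP = {name: overlap_hours(w, BUSY) for name, w in NEW.items()}
```

Under the old schedule both backups sit entirely inside the congested period, while under the new one neither touches it.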

Summary

When you run into performance problems with your backups, it is important to look at the usual suspects first such as disk performance, activity on the server, etc. Our investigation was made much easier by having baselines of the activity that we could compare to the current levels to determine if anything had truly changed. Ultimately though, we had to trust our findings and look outside the box. We had to look outside our system at external factors that were affecting us.
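
As a rough sketch of the kind of baseline comparison described above, the snippet below computes average backup throughput from backup history rows and flags runs that fall well below a baseline. It assumes the rows have already been exported from somewhere such as the `msdb.dbo.backupset` history table; the database name, dates, sizes, the 75% threshold, and the function names are all made up for illustration.

```python
from datetime import datetime
from statistics import median

FMT = "%Y-%m-%d %H:%M"

# Hypothetical backup history rows: (database, start, finish, size in GB).
# The values are invented purely for illustration.
HISTORY = [
    ("BigDB", "2013-01-07 01:00", "2013-01-07 03:00", 1750),
    ("BigDB", "2013-01-08 01:00", "2013-01-08 03:05", 1752),
    ("BigDB", "2013-01-09 01:00", "2013-01-09 02:58", 1751),
    ("BigDB", "2013-02-04 01:00", "2013-02-04 05:10", 1755),
]

def throughput_mb_per_sec(start, finish, size_gb):
    """Average throughput of one backup in MB per second."""
    elapsed = datetime.strptime(finish, FMT) - datetime.strptime(start, FMT)
    return size_gb * 1024 / elapsed.total_seconds()

def flag_slow_backups(history, baseline_count=3, threshold=0.75):
    """Return backups slower than threshold * baseline throughput.

    The baseline is the median throughput of the first baseline_count rows.
    """
    rates = [throughput_mb_per_sec(s, f, gb) for _, s, f, gb in history]
    baseline = median(rates[:baseline_count])
    return [(row[0], row[1], round(rate, 1))
            for row, rate in zip(history, rates)
            if rate < threshold * baseline]

SLOW = flag_slow_backups(HISTORY)
```

If the backup size stays flat while throughput drops, as in the last sample row, data growth is not the explanation, and it is time to look at the I/O path and whatever else shares it.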

Posted in: SQL Server / Tagged: 31 Days of Disaster Recovery, Disaster Recovery, Performance & Optimization

4 Thoughts on “Day 26 of 31 Days of Disaster Recovery: The Mysterious Case of the Long Backup”

  1. Pingback: Day 26 of 31 Days of Disaster Recovery: The Mysterious Case of the … « Quick Disaster Recovery.com

  2. Pingback: Day 29 of 31 Days of Disaster Recovery: Using Database Snapshots to Restore Replicated Databases in Test | SQLSoldier

  3. alzdba on February 27, 2013 at 11:33 am said:

    Nice case, we actually had the same issue a couple of years ago. Spreading loads solved it too. :-)

    Another big issue we had (a couple of times) was when a Windows admin changed the backup folder to be compressed. Very, very bad idea. Turning it off solved the backup issues for those cases.

    Thanks again for this great series on Disaster Recovery.

    • SQLSoldier on February 27, 2013 at 2:12 pm said:

      Thanks for sharing your issues as well.

© Copyright 2012 - Robert L Davis