Day 7 of 31 Days of Disaster Recovery: Writing SLAs for Disaster Recovery
Today is day 7 in my series on disaster recovery. I thought I would switch gears for today and write about disaster recovery Service-Level Agreements (SLAs). Specifically, I’m talking about Recovery Point Objective (RPO) and Recovery Time Objective (RTO) SLAs.
Recovery SLAs are very rarely written first. They are usually either ignored altogether or written as the last step, merely to record how long the existing restore process is expected to take. This usually leaves the business unhappy when a disaster occurs.
The SLAs should be the first thing you write, followed by determining the restore process required to meet them, and finally writing a backup policy that supports that restore process. These SLAs should not just be numbers you come up with on your own. Involve the business stakeholders to make sure that their expectations match your SLAs.
If you missed any of the earlier posts in the series, you can check them out here:
- 31 Days of Disaster Recovery
- Does DBCC Automatically Use Existing Snapshot?
- Protection From Restoring a Backup of a Contained Database
- Determining Files to Restore Database
- Back That Thang Up
- Dealing With Corruption in a Nonclustered Index
- Dealing With Corruption in Allocation Pages
Recovery Time Objective
Recovery Time Objective (RTO) is the maximum amount of time that is considered acceptable to be down when a disaster event occurs. RTO will dictate the restore path you need to take to restore the database. Typically, a shorter RTO means minimizing the number of files that must be restored. An individual log backup restores very quickly, but when you have a large number of them to restore, the total restore time adds up fast. If you back up the log every half hour (the longest interval I recommend) and take a full or differential backup once a day, you could conceivably have to restore 48 log backups. If you back up the log every 5 minutes, that 48 becomes 288 backups. It quickly escalates into a long recovery period. You may just need to add some additional differential backups to meet the RTO SLA.
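The arithmetic above is easy to replay for other schedules. Here is a minimal Python sketch of it; the function name `worst_case_log_restores` and the 6-hour differential interval are my own illustrative assumptions, and it only counts files to restore, not how long each one takes.

```python
import math

MINUTES_PER_DAY = 24 * 60


def worst_case_log_restores(base_interval_minutes: int, log_interval_minutes: int) -> int:
    """Upper bound on the log backups taken since the last full or differential backup.

    base_interval_minutes: time between full/differential backups (hypothetical input).
    log_interval_minutes: time between log backups (hypothetical input).
    """
    if base_interval_minutes <= 0 or log_interval_minutes <= 0:
        raise ValueError("intervals must be positive")
    return math.ceil(base_interval_minutes / log_interval_minutes)


# Compare a once-a-day base backup with an assumed differential every 6 hours.
for log_every in (30, 5):
    daily = worst_case_log_restores(MINUTES_PER_DAY, log_every)
    six_hourly = worst_case_log_restores(6 * 60, log_every)
    print(f"log every {log_every:>2} min: {daily} logs (daily base), "
          f"{six_hourly} logs (base every 6 h)")
```

With a differential every 6 hours, the 5-minute log schedule drops from 288 log backups to at most 72, which is the kind of reduction extra differentials are meant to buy.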
Recovery Time Objective (RTO): The amount of time it is acceptable to be down while recovering from a disaster.
Recovery Point Objective
Recovery Point Objective (RPO) is the point in time to which you need to recover the data after a disaster occurs. Or in simpler terms, how much data can you afford to lose? This will dictate your recovery model and the longest interval you can allow between log backups. Unless your RPO is very lenient (that is, you can tolerate losing a lot of data), the simple recovery model is not going to be an option for you. The amount of data you are susceptible to losing is the maximum length of time between backups. If you are only doing daily full backups, then you are susceptible to losing 24 hours' worth of data. So chances are good that you will be using the full or bulk-logged recovery model in production in almost all cases. I always assume that I will be using the full recovery model until someone presents justification for using either the simple or bulk-logged recovery model.
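To make the "maximum length of time between backups" rule concrete, here is a small Python sketch that measures the largest gap in a backup schedule and compares it to an RPO. The helper names `max_backup_gap` and `meets_rpo`, the start date, and both sample schedules are assumptions made up for illustration; a real check would read backup times from your own backup history.

```python
from datetime import datetime, timedelta
from typing import Sequence


def max_backup_gap(backup_times: Sequence[datetime]) -> timedelta:
    """Largest gap between consecutive backups, i.e. the worst-case data loss window."""
    if len(backup_times) < 2:
        raise ValueError("need at least two backup times to measure a gap")
    ordered = sorted(backup_times)
    return max(later - earlier for earlier, later in zip(ordered, ordered[1:]))


def meets_rpo(backup_times: Sequence[datetime], rpo: timedelta) -> bool:
    """True when the worst-case gap between backups does not exceed the RPO."""
    return max_backup_gap(backup_times) <= rpo


start = datetime(2024, 1, 1)  # arbitrary illustrative date
daily_fulls = [start + timedelta(days=d) for d in range(4)]
half_hour_logs = [start + timedelta(minutes=30 * i) for i in range(48 * 3 + 1)]
rpo = timedelta(minutes=30)

print("daily fulls only:", max_backup_gap(daily_fulls), meets_rpo(daily_fulls, rpo))
print("logs every 30 min:", max_backup_gap(half_hour_logs), meets_rpo(half_hour_logs, rpo))
```

The sketch treats every backup as a usable recovery point and ignores how long a backup takes to complete, so it gives a lower bound on exposure rather than a guarantee.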
One situation where I have dealt with a nonstandard RPO was a reporting database whose data we could re-import from source systems. We had an RPO of 4 days and could do a full import of the data (7 TB) in that time frame if we needed to. In this case, we used the simple recovery model and performed weekly full backups to give us a more recent starting point and to ensure we had a current backup with the schema in it.
Recovery Point Objective (RPO): The point in time in relation to the disaster event to which you need to recover the data.
Working With Business
When I involve the business in setting the recovery SLAs, I like to express each SLA as a range of values: a “goal” and a “target” for both RPO and RTO. If you just ask the business interests how long you can be down and how much data you can lose, they will almost always say that they want 0 downtime and 0 data loss. That’s not realistic, and attempts to approach it can be very expensive. I like to stress to the business that our goal is to try for 0 data loss and 0 downtime, but that we need them to set realistic target values that will be deemed acceptable.
Here is an example request for the RPO and RTO SLA that I might send to business interests:
In order to determine the most effective and cost-efficient plan for meeting business requirements, we need the Recovery Point Objective (RPO) and Recovery Time Objective (RTO) defined by business. RPO and RTO are defined as ranges.
Recovery Point Objective: The point in time in relation to the disaster event to which we need to recover the data.
Goal RPO: 0 minutes
Target RPO: 30 minutes
Recovery Time Objective: The amount of time it is acceptable to be down while recovering from a disaster.
Goal RTO: 0 minutes
Target RTO: 30 minutes
In order to meet business requirements for RPO and RTO as defined above, we will implement log shipping to a remote location as our primary method of disaster recovery, and we will implement a backup and restore strategy as a secondary method of disaster recovery.
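The goal/target ranges in a request like the one above can also be captured as a small data structure, which makes it easy to score a restore test or a real incident against them. This is only a sketch: the `RecoveryObjective` class, its `classify` method, and the result labels are hypothetical names of my own, and the measured values in the example are invented.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjective:
    """An SLA expressed as a range: the goal we aim for and the target we must meet."""
    name: str
    goal: timedelta
    target: timedelta

    def __post_init__(self) -> None:
        if self.goal > self.target:
            raise ValueError("goal must not be looser than target")

    def classify(self, measured: timedelta) -> str:
        """Label a measured downtime (RTO) or data-loss window (RPO)."""
        if measured <= self.goal:
            return "met goal"
        if measured <= self.target:
            return "within target"
        return "missed target"


# Values taken from the example request: goal 0 minutes, target 30 minutes.
rpo = RecoveryObjective("RPO", goal=timedelta(0), target=timedelta(minutes=30))
rto = RecoveryObjective("RTO", goal=timedelta(0), target=timedelta(minutes=30))

# Invented measurements from a hypothetical restore test.
print(rpo.name, rpo.classify(timedelta(minutes=12)))
print(rto.name, rto.classify(timedelta(minutes=45)))
```

Keeping the goal and target as separate fields mirrors the conversation with the business: the goal stays at 0, while the target is the number they have agreed to accept.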
One of my favorite sayings is that the first and last thing a DBA should do when inheriting a server is backups. If I inherit a server, the first thing I do is to put my “standard backup process” on the server and make sure that all databases have at least a current full backup. Then once other critical things are taken care of, I will revisit backups and that is when I start with generating SLAs for recovery and then move on to developing the restore plan and finally the backup plan. It’s critical that we get the SLAs in place and then build our recovery plan on that.