• SQL Server
  • Log Shipping Tricks Demo
  • SQLCruise Alaska 2012 Pics
SQLSoldier News From the Frontlines

Day 10 of 31 Days of Disaster Recovery: Monitoring for Corruption Errors

January 10, 2013 7:59 pm / 7 Comments / SQLSoldier

31 Days of Disaster Recovery

31 Days of Disaster Recovery

It’s day 10 of my 31 Days of Disaster Recovery series, and I want to talk about monitoring for corruption errors. There are four errors related to corruption for which everyone should raise alerts and send notifications. The four alerts are 823, 824, 825, and 829. The sooner you identify and address corruption, the greater the chance that it can be resolved without data loss and with minimal downtime.

If you missed any of the earlier posts in my DR series, you can check them out here:

    31 Days of disaster Recovery

  1. Does DBCC Automatically Use Existing Snapshot?
  2. Protection From Restoring a Backup of a Contained Database
  3. Determining Files to Restore Database
  4. Back That Thang Up
  5. Dealing With Corruption in a Nonclustered Index
  6. Dealing With Corruption in Allocation Pages
  7. Writing SLAs for Disaster Recover
  8. Resolutions for All DBAs
  9. Use All the Checksums

823 Errors

An 823 error is raised when a read of a page fails at the OS layer. The OS returns the failure to SQL Server who re-throws the error to the user. The attempt to read the page is retried 3 time and on the fourth failure, the error is raised. This means that when you see this error, it actually failed four times. If you see it frequently, then it has failed four times as many errors. This error is generally a sign that of problems with the storage system. It could be a disk going bad, faulty drivers, incorrectly configured disk subsystem, etc. If you see me at an event some day, ask me to share the funny story about a misconfigured SAN causing rampant file corruption on a production SQL Server (discovered before we put it into active use, fortunately).

When you receive this error, the message text will indicate the underlying cause of the error. It is important to note that not all 823 errors are an indication of corruption. For example, error 21 (Device is not ready) is often the result of the drive the file is on going offline or the service account losing permissions to access the drive. If the OS error returned is error 23 (Data error – cyclic redundancy check), then you definitely have a corruption of the file. In fact, OS error 23 is the only 823 error that will trigger automatic page repair in database mirroring and Availability Groups.

If you are receiving 823 errors, I highly advise having the disk system checked as well as it could be an indicator of pending hardware/disk failure.

824 Errors

Error 824 means that the read of the page from the file system was successful, but SQL Server detecting corruption. Just like error 823, the read is retried 3 times, and the error is raised only after the final failure. If you receive an 824 error, it has actually occurred 4 times. This error is a clear indicator of corruption and you should take corrective steps.

825 Errors

As I explained above, reads encountering 823 and 824 errors are retried 3 times and only raised if they fail all retries. An 825 error indicates that one of the retries was successful. Unlike 823 and 824 errors, 825 is NOT a fatal error and does NOT get bubbled up to the user. 825 errors can occur undetected by anyone. The error is logged in the SQL log and in the event log, so there are many ways you can detect these, but most people don’t think to do so. If you are getting many 825 errors, it can be a sign of impending disk failure, and the disks should be checked as soon as possible.

829 Errors

829 errors are lesser known errors, but they are potentially a sign of corruption caused by the disk subsystem. The error means that a page has been marked RestorePending, The only way to fix this problem is to restore the page (or file/database). It can’t be repaired using DBCC. This error, in addition to 823 (CRC failure) and 824, can trigger automatic page repair in database mirroring or Availability Groups. Ironically, when automatic page repair is triggered, it sets the page to a status of RestorePending to ensure that no other transactions can be run against it while the page repair is being performed. If you have this error occurring and it is not the result of automatic page repair, check the disk system .

Creating Alerts

Alerts can be easily created by using the GUI (under the SQL Server Agent node in Object Explorer) or via T-SQL. First, create an Operator to receive notifications and then create the alerts. Below is an example of how to create an Operator to receive emails and alerts for the four error codes assigned to email the Operator.

USE msdb;

-- Create operator
Exec msdb.dbo.sp_add_operator
        @name=N'DBAs', 
        @enabled=1,
        @email_address=N'';
Go

-- Create alert
Exec msdb.dbo.sp_add_alert
        @name=N'Corruption (823) detected', 
        @message_id=823,
        @enabled=1;

-- Assign operator to alert
Exec msdb.dbo.sp_add_notification
        @alert_name=N'Corruption (823) detected',
        @operator_name=N'DBAs',
        @notification_method = 1;
Go

-- Create alert
Exec msdb.dbo.sp_add_alert
        @name=N'Corruption (824) detected', 
        @message_id=824,
        @enabled=1;
        
-- Assign operator to alert
Exec msdb.dbo.sp_add_notification
        @alert_name=N'Corruption (824) detected',
        @operator_name=N'DBAs',
        @notification_method = 1;
Go

-- Create alert
Exec msdb.dbo.sp_add_alert
        @name=N'Error 825 occurred', 
        @message_id=825,
        @enabled=1;
        
-- Assign operator to alert
Exec msdb.dbo.sp_add_notification
        @alert_name=N'Error 825 occurred',
        @operator_name=N'DBAs',
        @notification_method = 1;
Go

-- Create alert
Exec msdb.dbo.sp_add_alert
        @name=N'Page RestorePending (829) detected', 
        @message_id=829,
        @enabled=1;
        
-- Assign operator to alert
Exec msdb.dbo.sp_add_notification
        @alert_name=N'Page RestorePending (829) detected',
        @operator_name=N'DBAs',
        @notification_method = 1;
Go

Summary

Hopefully I was successful in stressing the importance of raising alerts and sending notifications for these errors. It is critical that we identify and deal with corruption as soon as possible to reduce the chance that we will lose data. So create these alerts and learn how to handle them.

Posted in: SQL Server / Tagged: 31 Days of Disaster Recovery, Disaster Recovery, Internals

7 Thoughts on “Day 10 of 31 Days of Disaster Recovery: Monitoring for Corruption Errors”

  1. Pingback: Day 11 of 31 Days of Disaster: Converting LSN Formats | SQLSoldier

  2. Pingback: Day 13 of 31 Days of Disaster Recovery: Standard Backup Scripts | SQLSoldier

  3. Pingback: Day 15 of 31 Days of Disaster Recovery: Running DBCC CheckTable in Parallel Jobs | SQLSoldier

  4. Pingback: Day 16 of 31 Days of Disaster Recovery: Disaster Recovery Gems From Around The Net | SQLSoldier

  5. Pingback: Day 17 of 31 Days of Disaster Recovery: When is a Checksum Written to a Page | SQLSoldier

  6. Pingback: Day 17 of 31 Days of Disaster Recovery: When are Checksums Written to a Page | SQLSoldier

  7. Pingback: Day 21 of 31 Days of Disaster Recovery: Who Deleted That Data? | SQLSoldier

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

Post Navigation

← Previous Post
Next Post →
<

Remote DBA Services
- serious SQL Server expertise for less than a full-time DBA
My Articles
 
My Book
Check out my interview on

Extreme Data Recovery (with Argenis Fernandez)
10 Things all BI System Administrators Should Know
Upcoming Events
    All events shown in Pacific Time

    No events to show

RSS My SQL Server Magazine Articles

  • Database Mirroring for Disaster Recovery September 16, 2011
  • Comparative Review: Database Schema Comparison Tools August 24, 2011
  • 3 Log Shipping Techniques June 22, 2011
  • Hardening SQL Server June 20, 2011
  • Review: ScriptLogic Security Explorer for SQL Server February 8, 2011

Tags

31 Days of Disaster Recovery Architecture Automation CDC & Change Tracking Data Architecture VC Database Mirroring DBCC Denali Disaster Recovery Dynamic Management Views Extended Events Gamers & Geeks General Discussion High Availability How do I ... ? Humor Idera ACE Program Internals MCM Meme Monday Performance & Optimization PowerShell Professional Development Replication Security SQLBits SQL PASS SQL PASS Summit SQLRally SQL Saturday SQL Server Magazine SQL University SSAS & BI SSIS SSMS SSRS T-SQL T-SQL Tuesday tempDB Tips & Tricks Travel Troubleshooting Undocumented Stuff Whitepapers XML in SQL

News

Download my Powershell Scripts

The following scripts can be downloaded as text files. You will need to change the file extension to .ps1 in order to execute them.

Backup a database
Restore a database
Scan a server to find a free port
Query DNS to get the FQDN of a server


To see some examples of my other forms of writing, please visit my page on WritersCafe.org. It is almost exclusively horror fiction, but I sometimes throw other things in there too from time to time. There's one science fiction story, a couple of poems, and quite a few humor pieces as well.


Look for me in the SQL Q&A section of the August, 2007 issue of TechNet Magazine.
August issue of TechNet Magazine's SQL Q&A column

Protect our Heroes

© Copyright 2012 - Robert L Davis
Infinity Theme by DesignCoral / WordPress

Twitter Twitter 
LinkedIn LinkedIn 
TLF TLF RSS RSS 
WritersCafe WritersCafe 
SQLPASS SQLPASS 
Facebook Facebook
grab this