DBCC Checks and Terabyte-Scale Databases

Rate This
  • Comments 3
Author:  Stuart Ozer  Contributors:  Suresh Kandoth, Ryan Stonecipher

Technical Reviewers:  Peter Carlin, Burzin Patel, Bob Ward

 

1.       Overview:

 

Historically, the SQL Server product team and the support organization have recommended that all maintenance plans for Microsoft® SQL Server® include running DBCC CHECKDB on a regular basis to detect and correct potential corruption in the database. In the case of mission-critical applications such as trading systems, medical records, and banking operations, many customers have interpreted this recommendation as requiring even a daily DBCC check to ensure that no errors are present before performing a backup operation.

 

However, since the SQL Server 6.5 days, in which a daily DBCC CHECKDB may have been a best practice, there have been several trends - both challenges and opportunities -- that make it worth reconsidering the way that DBCC should fit into SQL Server database maintenance plans, especially for large databases.

 

Challenges:

  • Many mission-critical databases have grown to massive size. It is not unusual to find databases in the 5-20 terabyte range on SQL Server, in which a DBCC operation requires many hours to complete. This is usually incompatible with a daily maintenance window, if feasible at all on a 24x7 system.
  • Every new version of SQL Server has expanded the range of logical checks performed by DBCC so that it has become more time-consuming to execute. For example, SQL Server 2005 introduced data purity checks to validate that data present in columns adheres to valid ranges for the data type. SQL Server 2005 also introduced checks that verify the integrity of indexed views.

 

Opportunities:

  • Enterprise-class storage subsystems and the widespread use of RAID, especially in mission-critical settings, have made the storage tier much less prone to physical corruption.
  • SQL Server has introduced new mechanisms to detect the most frequent forms of physical corruption independent of DBCC. For example, torn page detection was introduced in SQL Server 7.0, checksum verification of physical pages was introduced in SQL Server 2005, and checksum protection of tempdb pages was introduced in SQL Server 2008. 
  • We continue to invest in technologies to automatically detect and correct sources of corruption - such as the introduction in SQL Server 2005 of an in-memory page scan to find checksum violations due to ECC failures that might occur during manipulation of read-only pages.

 

As more customers are deploying databases at the scale of terabytes or tens of terabytes, frequent DBCC checks are becoming less practical. But the many advances in both SQL Server and enterprise-class hardware offerings are enabling customers to back-away from the "Daily DBCC" best practices of the SQL Server 6.5 era, while still maintaining high confidence in the integrity of their data.

 

2.       Suggestions for customers

 

DBCC CHECKDB remains an important tool for detecting and correcting logical consistency problems and physical corruption in the database. However, for large-scale databases utilizing a high quality SAN or storage subsystem, the specific recommendations this technical note presents can reduce the frequency of DBCC and certainly relax the prior standard of running such checks on daily basis.

 

SQL Server 2005 introduced an important new mechanism to proactively detect physical corruption in database pages: a database option that adds a checksum to each page as it is written to the I/O system and validates the checksum as it is read. This database option is called PAGE_VERIFY = CHECKSUM, and utilizing it by default is a powerful alternative to scheduling frequent runs of DBCC CHECKDB. For more information, see ALTER DATABASE SET Options (Transact-SQL) in SQL Server Books Online: http://msdn.microsoft.com/en-us/library/bb522682.aspx.

 

   A.      Recommendations for new databases (created in SQL Server 2005 or SQL Server 2008):


We encourage customers who have created their databases in SQL Server 2005 or SQL Server 2008 to rely on a strategy of using this PAGE_VERIFY = CHECKSUM feature and reduce the scheduled use of DBCC to a minimum. Below are some guidelines on how to accomplish this.

 

1) Use PAGE_VERIFY = CHECKSUM on all new databases created in SQL Server 2005 or later (this is the default setting for new databases).

 

2) Perform daily incremental or differential backup.

By performing an incremental/differential BACKUP WITH CHECKSUM operation each day (or even a full backup if feasible), you are guaranteed to read each page that has been added or modified since the prior backup. Because the reads performed during backup will validate the page checksums, any corruption that has occurred due to I/O errors will be detected, and can be corrected by using the current backup set to restore corrupt pages. (If the database is using the bulk-logged or full recovery model, page-level restore can be used to minimize time for recovery.)

 

The steps above will ensure that most physical page corruption due to I/O errors will be detected on a daily basis and can be promptly corrected. Because I/O errors have historically been the largest source of database corruption, PAGE_VERIFY = CHECKSUM can significantly reduce the frequency of DBCC operations needed to maintain a high level of confidence in the integrity of the physical database.

 

There are two other sources of corruption that can still arise even if PAGE_VERIFY = CHECKSUM is used, although they are rare - and they are problems that only DBCC CHECKDB can discover proactively:

 

a) ‘Scribbler' induced errors - where in-memory pages are corrupted either by third-party code running inside the SQL Server process or by drivers or other software with sufficient privileges executing in Windows® kernel mode.

SQL Server allows a variety of customer or third-party code to access the SQL Server address space, including extended stored procedures (XPs), unsafe SQL CLR assemblies, EKM providers, and OLE DB drivers for linked servers. Scribbler errors can arise as a result of bugs in these extensions, or from software such as malware protection tools or I/O drivers running in the kernel.

 

b) Potential SQL Server bugs that create logical errors.

We are not aware of any bugs in the current version of SQL Server that lead to logical errors in database objects, but the logical checks in DBCC can locate problems due to legacy errors or unknown errors in current versions of the product. Logical errors can usually be repaired by rebuilding indexes or re-establishing foreign key constraints - they are not typically fixed by restoring from a backup.

 

In addition, CHECKSUM cannot detect I/O problems if a page header itself is corrupt, but DBCC CHECKDB can.

 

These and other additional sources of errors, external to SQL Server, are reasons why customers should not eliminate DBCC CHECKDB entirely from maintenance plans. But the fact that the majority of errors located by DBCC CHECKDB result from I/O channel problems that are effectively caught using CHECKSUM allows customers to reduce the frequency of complete DBCC CHECKDB executions.

 

   B.      Recommendations for migrated databases:

 

If you are using a database that was created prior to SQL Server 2005, or if you are using a database that was created with a more recent version but without using PAGE_VERIFY = CHECKSUM, you need to be aware that checksums are added only as pages are added or modified. An initial run of DBCC CHECKDB using the PHYSICAL_ONLY option, after enabling the PAGE_VERIFY = CHECKSUM option, will check the integrity of all legacy pages but will not add a checksum.

 

Is there any risk that legacy pages from a migrated database can become corrupt and will not be detected by the checksum mechanism? It turns out that there are rare circumstances in which data on storage media can become corrupted in the absence of I/O ("bit rot") and not be detected by parity or RAID redundancy mechanisms, especially on low-end storage systems. If this occurs on a page without a SQL Server checksum, it can lead to database corruption that can only be detected by regular PHYSICAL_ONLY checks of DBCC CHECKDB.

 

So, in order to eliminate frequent DBCC CHECKDB runs for a database that was not originally created with PAGE_VEIRFY = CHECKSUM, we recommend either:

 

  • Rebuilding all clustered indexes and recreating any heaps. This will ensure that all data and index pages are written once with CHECKSUM enabled.
  • Or, less practically, creating a new database using PAGE_VEIRFY = CHECKSUM and migrating all objects to the new database.

 

The alternative, in a legacy database for which not all pages have a CHECKSUM, is to run DBCC CHECKDB with the PHYSICAL_ONLY option regularly prior to taking any full backup. If a problem is detected on a non-checksum page, the prior backup set can be used to restore a correct image of that page.

 

3.       Periodic DBCC Strategies

 

Even if daily DBCC checks need no longer be part of an enterprise-class maintenance strategy, periodic DBCC checks remain an important tool. Frequency of DBCC checks should take into account whether scribbler errors are possible due to third party extensions running in-process, the kinds of additional privileged software running on the server, and the reliability and sophistication of the storage tier.

 

There are a variety of strategies to make DBCC compatible with shorter maintenance cycles and high availability production environments. Among those that are popular for large-scale, mission-critical deployments are:

 

  • Utilizing a SAN-based snapshot, mounted to another instance of SQL Server that runs DBCC independently of the production system
  • Performing DBCC on a per-filegroup basis, on a rotating schedule
  • Using Resource Governor to adjust the degree of parallelism (MAXDOP) of the session running DBCC operations, either to run highly parallelized during a short maintenance window, or single-threaded in the background during non-critical production hours
  • Using DBCC to perform the faster physical page validation checks (PHYSICAL_ONLY) more frequently than the logical checks. Such tests will bypass time-consuming validation of foreign key references and nonclustered indexes, but will ensure that all allocated pages are readable and will detect the majority of problems that can arise from problems in the I/O channel.

 

Some of these, and other strategies for executing DBCC operations, can be found in the recent blogs by Paul Randal and Bob Dorr, at:

 

http://www.sqlskills.com/blogs/paul/post/CHECKDB-From-Every-Angle-Consistency-Checking-Options-for-a-VLDB.aspx

 

http://blogs.msdn.com/psssql/archive/2009/02/20/sql-server-is-checkdb-a-necessity.aspx

 

4.       Conclusion

 

While the reliability of SQL Server has improved dramatically over the past decade, database integrity still depends on the reliability of the storage tier and can be influenced by high-privileged, third-party code running in Windows kernel mode or permitted inside the SQL Server address space. SQL Server has invested in significant technologies to automatically detect and correct problems with data observed on-disk and in-memory, and this has reduced the need for frequent complete database integrity checks using DBCC CHECKDB. Periodic scheduled DBCC operations remain a best practice, but not at the near-daily frequency recommended in the past for mission-critical deployments. The recommendations here for using CHECKSUM features, along with periodic DBCC CHECKDB strategies, can help achieve a balance between a reasonable maintenance cycle and high confidence in database integrity.

 

Please refer to the following posts on more information about the CHECKSUM facilities:

 

                http://blogs.msdn.com/sqlserverstorageengine/archive/2006/06/29/Enabling-CHECKSUM-in-SQL2005.aspx

 

                http://blogs.msdn.com/sqlserverstorageengine/archive/2008/03/23/checksum-and-tempdb.aspx

 

                http://technet.microsoft.com/en-us/library/cc917726.aspx

 

 

Your comment has been posted.   Close
Thank you, your comment requires moderation so it may take a while to appear.   Close
Leave a Comment
  • Post
Page 1 of 1 (3 items)