Top 10 Best Practices for Building a Large Scale Relational Data Warehouse

Rate This
  • Comments 26

Contributors: Stuart Ozer, with Prem Mehra and Kevin Cox

Technical Reviewers: Lubor Kollar, Thomas Kejser, Denny Lee, Jimmy May, Michael Redman, Sanjay Mishra

Building a large scale relational data warehouse is a complex task. This article describes some design techniques that can help in architecting an efficient large scale relational data warehouse with SQL Server. Most large scale data warehouses use table and index partitioning, and therefore, many of the recommendations here involve partitioning. Most of these tips are based on experiences building large data warehouses on SQL Server 2005.

1

Consider partitioning large fact tables 

  • Consider partitioning fact tables that are 50 to 100GB or larger.
  • Partitioning can provide manageability and often performance benefits.
    • Faster, more granular index maintenance.
    • More flexible backup / restore options.
    • Faster data loading and deleting
  • Faster queries when restricted to a single partition..
  • Typically partition the fact table on the date key.
    • Enables sliding window.
  • Enables partition elimination.

2

Build clustered index on the date key of the fact table

  • This supports efficient queries to populate cubes or retrieve a historical data slice.
  • If you load data in a batch window then use the options ALLOW_ROW_LOCKS = OFF and ALLOW_PAGE_LOCKS = OFF for the clustered index on the fact table. This helps speed up table scan operations during query time and helps avoid excessive locking activity during large updates.
  • Build nonclustered indexes for each foreign key. This helps ‘pinpoint queries' to extract rows based on a selective dimension predicate.Use filegroups for administration requirements such as backup / restore, partial database availability, etc.

3

Choose partition grain carefully

  • Most customers use month, quarter, or year.
  • For efficient deletes, you must delete one full partition at a time.
  • It is faster to load a complete partition at a time.
    • Daily partitions for daily loads may be an attractive option.
    • However, keep in mind that a table can have a maximum of 1000 partitions.
  • Partition grain affects query parallelism.
    • For SQL Server 2005:
      • Queries touching a single partition can parallelize up to MAXDOP (maximum degree of parallelism).
      • Queries touching multiple partitions use one thread per partition up to MAXDOP.
    • For SQL Server 2008:
      • Parallel threads up to MAXDOP are distributed proportionally to scan partitions, and multiple threads per partition may be used even when several partitions must be scanned.
  • Avoid a partition design where only 2 or 3 partitions are touched by frequent queries, if you need MAXDOP parallelism (assuming MAXDOP =4 or larger). 

4

Design dimension tables appropriately

  • Use integer surrogate keys for all dimensions, other than the Date dimension. Use the smallest possible integer for the dimension surrogate keys. This helps to keep fact table narrow.
  • Use a meaningful date key of integer type derivable from the DATETIME data type (for example: 20060215).
    • Don't use a surrogate Key for the Date dimension
    • Easy to write queries that put a WHERE clause on this column, which will allow partition elimination of the fact table.
  • Build a clustered index on the surrogate key for each dimension table, and build a non-clustered index on the Business Key (potentially combined with a row-effective-date) to support surrogate key lookups during loads.
  • Build nonclustered indexes on other frequently searched dimension columns.
  • Avoid partitioning dimension tables.
  • Avoid enforcing foreign key relationships between the fact and the dimension tables, to allow faster data loads. You can create foreign key constraints with NOCHECK to document the relationships; but don’t enforce them. Ensure data integrity though Transform Lookups, or perform the data integrity checks at the source of the data.

5

Write effective queries for partition elimination

  • Whenever possible, place a query predicate (WHERE condition) directly on the partitioning key (Date dimension key) of the fact table.

6

Use Sliding Window technique to maintain data

  • Maintain a rolling time window for online access to the fact tables. Load newest data, unload oldest data.
  • Always keep empty partitions at both ends of the partition range to guarantee that the partition split (before loading new data) and partition merge (after unloading old data) do not incur any data movement.
  • Avoid split or merge of populated partitions. Splitting or merging populated partitions can be extremely inefficient, as this may cause as much as 4 times more log generation, and also cause severe locking.
  • Create the load staging table in the same filegroup as the partition you are loading.
  • Create the unload staging table in the same filegroup as the partition you are deleteing.
  • It is fastest to load newest full partition at one time, but only possible when partition size is equal to the data load frequency (for example, you have one partition per day, and you load data once per day).
  • If the partition size doesn't match the data load frequency, incrementally load the latest partition.
  • Various options for loading bulk data into a partitioned table are discussed in the whitepaper http://www.microsoft.com/technet/prodtechnol/sql/bestpractice/loading_bulk_data_partitioned_table.mspx.
  • Always unload one partition at a time.

7

Efficiently load the initial data

  • Use SIMPLE or BULK LOGGED recovery model during the initial data load.
  • Create the partitioned fact table with the Clustered index.
  • Create non-indexed staging tables for each partition, and separate source data files for populating each partition.
  • Populate the staging tables in parallel.
    • Use multiple BULK INSERT, BCP or SSIS tasks.
      • Create as many load scripts to run in parallel as there are CPUs, if there is no IO bottleneck. If IO bandwidth is limited, use fewer scripts in parallel.
      • Use 0 batch size in the load.
      • Use 0 commit size in the load.
      • Use TABLOCK.
      • Use BULK INSERT if the sources are flat files on the same server. Use BCP or SSIS if data is being pushed from remote machines.
  • Build a clustered index on each staging table, then create appropriate CHECK constraints.
  • SWITCH all partitions into the partitioned table.
  • Build nonclustered indexes on the partitioned table.
  • Possible to load 1 TB in under an hour on a 64-CPU server with a SAN capable of 14 GB/Sec throughput (non-indexed table). Refer to SQLCAT blog entry http://blogs.msdn.com/sqlcat/archive/2006/05/19/602142.aspx for details.

8

Efficiently delete old data

  • Use partition switching whenever possible.
  • To delete millions of rows from nonpartitioned, indexed tables
    • Avoid DELETE FROM ...WHERE ...
      • Huge locking and logging issues
      • Long rollback if the delete is canceled
    • Usually faster to
      • INSERT the records to keep into a non-indexed table
      • Create index(es) on the table
      • Rename the new table to replace the original
  • As an alternative, ‘trickle' deletes using the following repeatedly in a loop


    DELETE TOP (1000) ... ;

    COMMIT
  • Another alternative is to update the row to mark as deleted, then delete later during non critical time.

9

Manage statistics manually

  • Statistics on partitioned tables are maintained for the table as a whole.
  • Manually update statistics on large fact tables after loading new data.
  • Manually update statistics after rebuilding index on a partition.
  • If you regularly update statistics after periodic loads, you may turn off autostats on that table.
  • This is important for optimizing queries that may need to read only the newest data.
  • Updating statistics on small dimension tables after incremental loads may also help performance. Use FULLSCAN option on update statistics on dimension tables for more accurate query plans.

10

Consider efficient backup strategies

  • Backing up the entire database may take significant amount of time for a very large database.
    • For example, backing up a 2 TB database to a 10-spindle RAID-5 disk on a SAN may take 2 hours (at the rate 275 MB/sec).
  • Snapshot backup using SAN technology is a very good option.
  • Reduce the volume of data to backup regularly.
    • The filegroups for the historical partitions can be marked as READ ONLY.
    • Perform a filegroup backup once when a filegroup becomes read-only.
    • Perform regular backups only on the read / write filegroups.
  • Note that RESTOREs of the read-only filegroups cannot be performed in parallel.
Your comment has been posted.   Close
Thank you, your comment requires moderation so it may take a while to appear.   Close
Leave a Comment
  • Post
  • thanks for your tips

  • 2 Adds:

    Recommend the use of a physical archive for each CPU and each drive for each database file. ETL should get benefit from multiple CPU write on a single Database

    Update Statistics with full scan is not a good recomendation on specially big tables as statistics for 300 hundred million rows lasts more than 5 hours. Y have near 25 like that. Maybe better update only index statistics. (8 minutes)

  • Athrarrn, I fully agree with your comment on the "full scan statistics". Observe that we recommend updating stats with fullscan for the dimension tables only, not the fact tables. This may be redundant recommendation becasue for most dimension tables we will use fullscan by default anyway since on tables smaller than 1024 pages SQL Server always performs fullscan even if a smaller sample is requested.

    It seems to me that you may have interesting environment - would you mind sending me email referring to this blog to "lubork  at microsoft.com"? I would like to ask you some questions about your setup.

  • Hi there,

    Thanks for compiling a good list of best practices for large scale data warehouses.  One thing I didn't notice was any mention of aligned versus non-aligned indexing.  Aligned indexing has typically been more effective in our environment but there have also been instances where we've benefited from the use of non-aligned indexes, so I thought it may warrant a mention.

    Also, I was curious if you've had any experience with replicating partitioned tables.  We have several partitioned tables that are replicated, thus preventing us from dropping partitions as part of a sliding window.  I've heard that 2008 is supposed to support swapping of partitions, but in 2005 it appears we're out of luck.

    Until we receive approval to upgrade to 2008 (lol), we're looking to develop a SSB-based solution that would replace replication on our largest tables (> 1 billion rows).  Have you any experience, warnings, or suggestions for service broker with high volume partitioned tables (average of 10mm rows with spikes up to 100mm rows written per hour)?

    Best Regards,

    Michelle

  • To #3) If I'm not mistaken in SS2008 the problem with queries touching more than 1 partition was fixed and now those queries SHOULD parallelize upto MAXDOP as well. The entire query should parallelize upto MAXDOP respectively, no matter how many partitions it's touching.

  • To #4 - Could you please explain the logic on what's the benefit when using clustered key on business key rather than surrogate key? Does it apply to all SCD0,1,2? I can see that 20 mi SCD2 could cause lot of sorting when inserting new rows.

    We have successfully implemented the surrogate key scenario and looking at the exec plans we're pretty happy. In our case we don't filter on the business key as much, our common query mostly for reporting is joining multiple dimensions (ex. 10-20) to fact table to show all business keys and descriptions and filter on 1 to 2 dim business keys. In that way we get clustered index seeks for the joins and nonclustered index seeks (or scans depending on the filter) for the 1 to 2 dimensions that are part of the filter. If I picture (as I have not tried it) your scenario placing the clustered indexes on business keys, most likely I would get several non clustered index scans for the joins and clustered index seeks for the filters. To me such execution plan should perform worse.

  • El equipo de SQLCAT ha publicado tips sobre la optimización de DW en SQL Server 2008, pueden consultar

  • El equipo de SQLCAT ha publicado tips sobre la optimización de DW en SQL Server 2008, pueden consultar

  • Pingback from  Mejores prácticas para construir DatawareHouse « Eduardo Castro Blog

  • El equipo de SQLCAT ha publicado tips sobre la optimización de DW en SQL Server 2008, pueden consultar

  • El equipo de SQLCAT ha publicado tips sobre la optimización de DW en SQL Server 2008, pueden consultar

  • Pingback from  Large Scale EDW Best Practices « Business Intelligence 101

  • I like data partitioning in warehouse systems. The only problem with data partitioning is maintenance.  Any have have experience on SAP BI systems data portioning? Please really appreciate your response.

  • Good article. Anyone have experience on SAP BI systems data partitioning? How you implement data partitioning on SAP BI systems? You response is much appreciate.

  • If you are looking for business ideas to start a business, you are lucky to drop in at the right place. Welcome to our site, where you will get a plethora of business ideas from one of the top notch companies in the arena of franchise business

Page 1 of 2 (26 items) 12
Your comment has been posted.   Close
Thank you, your comment requires moderation so it may take a while to appear.   Close
Leave a Comment
  • Post