Contributors: Stuart Ozer, with Prem Mehra and Kevin Cox
Technical Reviewers: Lubor Kollar, Thomas Kejser, Denny Lee, Jimmy May, Michael Redman, Sanjay Mishra
Building a large scale relational data warehouse is a complex task. This article describes some design techniques that can help in architecting an efficient large scale relational data warehouse with SQL Server. Most large scale data warehouses use table and index partitioning, and therefore, many of the recommendations here involve partitioning. Most of these tips are based on experiences building large data warehouses on SQL Server 2005.
Consider partitioning large fact tables
Build clustered index on the date key of the fact table
Choose partition grain carefully
Design dimension tables appropriately
Write effective queries for partition elimination
Use Sliding Window technique to maintain data
Efficiently load the initial data
Efficiently delete old data
Manage statistics manually
Consider efficient backup strategies
thanks for your tips
Recommend the use of a physical archive for each CPU and each drive for each database file. ETL should get benefit from multiple CPU write on a single Database
Update Statistics with full scan is not a good recomendation on specially big tables as statistics for 300 hundred million rows lasts more than 5 hours. Y have near 25 like that. Maybe better update only index statistics. (8 minutes)
Athrarrn, I fully agree with your comment on the "full scan statistics". Observe that we recommend updating stats with fullscan for the dimension tables only, not the fact tables. This may be redundant recommendation becasue for most dimension tables we will use fullscan by default anyway since on tables smaller than 1024 pages SQL Server always performs fullscan even if a smaller sample is requested.
It seems to me that you may have interesting environment - would you mind sending me email referring to this blog to "lubork at microsoft.com"? I would like to ask you some questions about your setup.
Thanks for compiling a good list of best practices for large scale data warehouses. One thing I didn't notice was any mention of aligned versus non-aligned indexing. Aligned indexing has typically been more effective in our environment but there have also been instances where we've benefited from the use of non-aligned indexes, so I thought it may warrant a mention.
Also, I was curious if you've had any experience with replicating partitioned tables. We have several partitioned tables that are replicated, thus preventing us from dropping partitions as part of a sliding window. I've heard that 2008 is supposed to support swapping of partitions, but in 2005 it appears we're out of luck.
Until we receive approval to upgrade to 2008 (lol), we're looking to develop a SSB-based solution that would replace replication on our largest tables (> 1 billion rows). Have you any experience, warnings, or suggestions for service broker with high volume partitioned tables (average of 10mm rows with spikes up to 100mm rows written per hour)?
To #3) If I'm not mistaken in SS2008 the problem with queries touching more than 1 partition was fixed and now those queries SHOULD parallelize upto MAXDOP as well. The entire query should parallelize upto MAXDOP respectively, no matter how many partitions it's touching.
To #4 - Could you please explain the logic on what's the benefit when using clustered key on business key rather than surrogate key? Does it apply to all SCD0,1,2? I can see that 20 mi SCD2 could cause lot of sorting when inserting new rows.
We have successfully implemented the surrogate key scenario and looking at the exec plans we're pretty happy. In our case we don't filter on the business key as much, our common query mostly for reporting is joining multiple dimensions (ex. 10-20) to fact table to show all business keys and descriptions and filter on 1 to 2 dim business keys. In that way we get clustered index seeks for the joins and nonclustered index seeks (or scans depending on the filter) for the 1 to 2 dimensions that are part of the filter. If I picture (as I have not tried it) your scenario placing the clustered indexes on business keys, most likely I would get several non clustered index scans for the joins and clustered index seeks for the filters. To me such execution plan should perform worse.
El equipo de SQLCAT ha publicado tips sobre la optimización de DW en SQL Server 2008, pueden consultar
Pingback from Mejores prácticas para construir DatawareHouse « Eduardo Castro Blog
Pingback from Large Scale EDW Best Practices « Business Intelligence 101
I like data partitioning in warehouse systems. The only problem with data partitioning is maintenance. Any have have experience on SAP BI systems data portioning? Please really appreciate your response.
Good article. Anyone have experience on SAP BI systems data partitioning? How you implement data partitioning on SAP BI systems? You response is much appreciate.
If you are looking for business ideas to start a business, you are lucky to drop in at the right place. Welcome to our site, where you will get a plethora of business ideas from one of the top notch companies in the arena of franchise business