Author: Alexei Khalyako

Reviewers: Stuart Ozer, Thomas Kejser, Eric Kraemer, Kevin Cox, Mike Ruthruff

Technical Editor: Beth Inghram

 

       Make sure your intended workload is scan intensive. Avoid using Fast Track for query scenarios that mainly involve pinpoint lookup operations or very high volumes of small transactions, because these typically require support for more random IOPS than the Fast Track architectures are designed for.

 To determine whether your workload is scan intensive, use the techniques and recommendations described in the Fast Track Reference Guide.

       Consider using very few or no nonclustered indexes on fact tables. As stated in #1, the Fast Track architecture is designed for scan-intensive workloads. Typically, pinpoint queries that return few rows benefit from using nonclustered indexes. However, in a data warehouse environment , the query optimizer may decide to use a nonclustered index scan or seek with following clustered index lookup (in the query plan you will see a nested loop) trying to retrieve big amount of data. This will generate a lot of random I/O operations.

       Apply data-loading and data-maintenance techniques that avoid or minimize data fragmentation.

Loading data in parallel directly into a clustered index table can cause extent fragmentation. In some projects we observed that if data was loaded directly into a partition of the partition table with multiple loading processes, clustered index fragmentation exceeded 90 percent. The best recommendation here would be to load data into the new filegroup using a single-threaded INSERT SELECT WITH MAXDOP1 statement, as illustrated in picture 1. We observed that for a single INSERT thread, a DOP level of 8 or less gives acceptable fragmentation without any significant impact on the performance.

 Rebuilding the index is not the best solution, because this approach may leave some physical data fragmentation in the filegroup.

Picture 1: Architecture solution for separating volatile and static parts of the data 

       Put highly volatile tables into a separate filegroup. Highly volatile tables are tables in which many rows are deleted or updated, such as dimension tables; or smaller tables that are frequently re-created, such as staging tables.

         Defining  partitioning and filegroup strategy. We recommend using one filegroup per fact table. The practice of using a separate filegroup for each separate partition can be dangerous with the Fast Track architecture (an example in which this recommendation could be applied is described here).  Because in Fast Track data warehouses every filegroup is likely to be spread across all available LUNs, applying this best practice will result in a huge number of files and filegroups to support, becoming a challenge to manage.

 

 Picture 2. File group layout in a traditional data warehouse environment

 Picture 2 shows an example of how this layout could appear in a Fast Track environment (this configuration is NOT RECOMMENDED)

 

Picture 3. A traditional filegroup layout used with Fast Track. Note the higher number of files in the filegroups that you will need to support.

 Instead, consider the following approach for Fast Track.

Picture 4. Recommendation for Fast Track: one filegroup spread over multiple LUNs

 Spreading the partition file group across the LUNs makes sense if you spread it the same way as the database files (create one file group across all the data LUNs, then each LUN should store only one file of this filegroup)  ; however, this approach results in a HUGE number of files that you must then monitor and support.

        Very often we saw implementations where several data-loading processes contained data from the range of a current time period (like current day data or current month data) and no or very few new entries for the previous time range. As an example, imagine a fact table partitioned by a DAY KEY table, where most of the load goes against the current day’s partition, and where the previous day’s partitions (let’s call them historical) are not touched at all. The current day range becomes a highly volatile part of the table, characterized by high level of fragmentation. To address this challenge, separate fragmented current data from stable historical data by using different file groups. After data is loaded into the current data range and late-arriving data is no longer expected, move data by using INSERT…SELECT with SORT IN TEMPDB into a historical table. The following picture illustrates this approach. First, create a table for the current data range. Make sure this table is in its own file group, which is spread across the LUNS. Finally, create a partitioned "historical table" with its own file group spread again all over the LUNs. To satisfy queries against the historical and current time ranges, consider using VIEW on TOP of both tables. The downside of this approach is that it will be required to manage the view  in order to make sure that no data duplicates exposed  to the client application. However the advantages you that you get are paying off:

a.        online “SWITCH”, since you will not need to deal with the  SCM_M_LCK limitation

b.       you will have separated nicely  historical (not fragmented) and current ( volatile) data  nicely ability This approach will require you to manage the VIEW

  

 Picture 5: Architecture solution for separating volatile and static parts of the data

Note: In order to make sure that queries touch only the ranges they need, create constraints on both tables:

ALTER TABLE dbo.February_Static ADD CONSTRAINT CK_DATE CHECK ([ChargingDateTime] >= '2009-02-01' and [ChargingDateTime] <'2010-02-26')

GO

ALTER TABLE dbo.FebruaryCurrent_Day ADD CONSTRAINT CK_Current_DATE CHECK ([ChargingDateTime] >= '2009-02-08' and [ChargingDateTime] <'2010-02-29')

GO

        Instead of using the autogrow option for tempdb and log files, create them at the maximum size you expect to need. Make sure that all tempdb files are equally sized.  This may significantly save time by eliminating wait time on the new extent allocations.

         Loading data in parallel into the database is a great tool for moving a high volume of data from the destination to Microsoft SQL Server. One way to achieve parallel load is to load data into heap tables. However, this can introduce resource contentions such as PAGELATCH or ALLOC_FREESPACE_CACHE. In order to work around this situation, consider applying the hash-partitioning technique described in this technical article:

Resolving PAGELATCH Contention on Highly Concurrent INSERT Workloads.

         One of the best strategies for loading a large amount of data and avoiding fragmentation is to load data into staging table and then move the data into the target partition of the partitioned table with the INSERT SELECT statement and MAXDOP(1) options. There are two ways to do this:

a.       Use a staging table with a heap structure; load data into a partition, which may have clustered and nonclustered indexes built on it.

b.      Use a staging table that has the same index built as the target partitioned table.

 If you check the query plan of the first option, you will see that the output of the query is sorted before it is inserted into the clustered Index. The second option avoids a SORT operation.

The following pictures illustrate the query plans of these two options.

  

 

Picture 5. Inserting data from the heap staging table into the clustered index target partitioned table

Picture 6. Inserting data from the staging table into the target partitioned table, where both the staging and the partitioned tables have clustered indexes built on OrderDate

    Follow performance best practices for building large-scale data warehouses as published here: http://sqlcat.com/top10lists/archive/2008/02/06/top-10-best-practices-for-building-a-large-scale-relational-data-warehouse.aspx