A Real Practices Case Study at Xbox LIVE
Authors: Joseph Szymanski, Tyson Solberg, Denny Lee
Technical Reviewers: Lindsey Allen, Akshai Mirchandani, Heidi Steen
To expand on the distinct count optimization techniques provided in the Analysis Services Distinct Count Optimization white paper, this technical note shows how using solid state devices (SSDs) can improve distinct count measures. We recount the experiences of the Microsoft Entertainment and Devices Data Warehousing Team (known for Xbox, Xbox LIVE, XNA, and Zune) in our analysis of applying SSDs to a real-world, distinct count heavy, Microsoft SQL Server Analysis Services customer environment. The key conclusion is that enterprise SSD devices, when combined with a well optimized Analysis Services MOLAP cube, will drastically improve the performance and scalability of the cube when it accesses distinct count measures. It can also improve non-distinct count measures, if the calculations being performed rely heavily on storage-engine calculations.
Analysis Services distinct count measures are extremely expensive in every aspect of an Analysis Services solution: long data processing times, long query durations, and large storage space requirements. Often the best approach is to convince the analysts using your cubes to use alternate measures or calculations. However, in many cases distinct count-based Key Performance Indicators (KPIs) are key components of business analytics systems. In such cases, the focus has to move from "Are you sure you need distinct count?" to "How can we make distinct count queries fast(er)?"
One way to improve the performance of distinct count measures is to change the business problem that they attempt to measure (for example, limiting the time range for distinct counts to specific years or months rather than all years). However, when you have exhausted your ability to reduce the solution’s complexity by simplifying business processes, you can turn to other techniques to improve your solution. This paper endeavors to share several that we came across in real-world enterprise usage scenarios.
As originally described in Analysis Services Distinct Count Optimization, distinct count calculations are Storage-Engine-heavy, resulting in heavy disk I/O utilization (that is, higher latencies and slower throughput). This leads us to the first and foremost among those techniques: the use of solid state devices (SSDs), which have taken the world by storm. Given the high initial cost, relative newness, and lower capacity per device, it is often difficult to find a good fit for them in the enterprise. However, Analysis Services dovetails well with the SSD story because SSDs offer ultra-high read speeds for both random and sequential I/O. Analysis Services queries generally access a large amount of relatively static data, and SSDs provide the speed and flexibility required.
This paper covers the business scenario in which business users seek to improve the performance of their Analysis Services distinct count operations. As an example, the Microsoft SQLCAT team worked with the Microsoft Entertainment and Devices Data Warehousing team, who needed to build their critical KPIs based on distinct counts. We decided that a significant research effort to find a way to make their distinct count queries run faster, through design methodologies and the use of SSDs, was worthwhile.
The engineering team, examining the scenario at hand, endeavored to find out the following:
a) Can distinct counts be made to run faster:
   a. Given a faster I/O system?
   b. Given better cube design?
b) Will such improvements be cost effective?
c) Will such improvements improve or hinder scalability, compared with the current design?
The database is a real-world production dataset with the following characteristics.
The cube design, for the distinct counts, is relatively simple, as noted in Figure 1.
Figure 1: Usage Cube Schema
One of the primary business questions that needed an answer was: "How many distinct users used the service for <a time period>?" To get this answer, we focused on a single simple MDX statement and sought to make it run as fast as possible from a cold cache (that is, we cleared all existing caches to force the data to load from the pertinent device).
SELECT [Measures].[Online Unique Users] ON 0,
       [Date].[Date].[Date] ON 1
FROM [Usage]
WHERE [Date].[Calendar Month].[Calendar Month].&[…]
The measure "Online Unique Users" is a distinct count measure that, for the time period of the month selected, scans through 300 million SQL records in the MOLAP cube dataset.
We utilized a number of servers to perform our query tests:

User Acceptance Testing Server: HP DL580 G5, 4-socket, quad-core, 2.4 GHz, 64 GB RAM, connected to a high-end SAN array; Chart Label: "UAT SAN"

Dell R710, 2-socket, quad-core, 2.93 GHz, 72 GB RAM

SQL Customer Advisory Team Server: Dell R905, 4-socket, quad-core AMD server with 4 Fusion-io ioDrive PCI-E SSD devices; Chart Label: "SQLCAT SSD"
As originally described in Analysis Services Distinct Count Optimization, distinct count calculations are Storage-Engine-heavy, resulting in heavy disk I/O utilization. Therefore, the original hypothesis for this case study was modeled after the reasonable argument: "If we are disk I/O bound, and Analysis Services provides a 100 percent random read load, SSDs should drastically improve performance."
Let’s start by comparing the query performance between SSDs and spinning disks (local hard drives and SAN).
Figure 2: Initial Performance Optimization (Lower Is Better)
As can be seen in Figure 2, the use of Fusion-io SSDs resulted in dramatically faster query performance when compared to local drives or SAN drives. For example, for the seven-month query, SSDs were 5.5 times faster than the SAN hardware.
But as with any successful BI performance improvement, Analysis Services users also asked more resource-intensive questions such as: "For the last year, how many distinct people did <X>?" That is, these questions resulted in distinct count queries that covered a year’s worth of data across a substantially larger dataset. In an attempt to find further performance improvements, the engineering team profiled the cube during query time, in production, and found a very odd thing: Disk usage and CPU usage were nowhere near the limits of what the system was able to handle.
This finding was counterintuitive, because all tests were run on a system that was dedicated to this exercise (that is, the system had no other users or processes running concurrently); it was not a matter of resource contention. We then worked with the SQLCAT team and other experts to find out why Analysis Services was not using more system resources to query the cube, starting with an investigation of how the partitions were defined.
As noted in Analysis Services Distinct Count Optimization, partitioning significantly improves distinct count query performance. By creating distinct buckets based on distinct value and time, you can significantly improve distinct count query performance by forcing the Analysis Services Storage Engine to fire off many more threads – one for each partition – and therefore more quickly calculate the distinct value. But if partitions are designed with an uneven number of distinct values (such as in Figure 3), the query may ultimately become single-threaded (even though all four partitions are being queried) because the Analysis Services Storage Engine is waiting for the largest partition (data file with values from 1,500 to 20,000) to complete its calculations. This behavior explained the puzzling results around disk and CPU consumption in the earlier tests.
Figure 3: Unevenly Distributed Partitions
A more evenly distributed set of distinct values within the partitions results in all four threads completing at approximately the same time, resulting in minimal spinning and wait time while the calculations are completed. Clearly, parallel queries across the partitions had to be part of our solution.
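The effect of partition skew can be sketched with a small back-of-the-envelope model (the record counts and throughput figure are purely illustrative, not a model of Analysis Services internals): with one scan thread per partition, the Storage Engine finishes only when the slowest partition finishes, so an even distribution is what converts four partitions into a real 4x speedup.

```python
# Toy model: with one scan thread per partition, the storage engine
# is done only when the slowest (largest) partition finishes scanning.
# Partition "cost" here is simply its record count (hypothetical numbers).

def parallel_scan_time(partition_sizes, records_per_second=1_000_000):
    """Wall-clock scan time when every partition gets its own thread."""
    return max(partition_sizes) / records_per_second

# Figure 3 style: four partitions, one holding most of the records.
uneven = [1_500, 2_000, 3_000, 18_500_000]
# Figure 4 style: the same total volume spread evenly across partitions.
even = [4_626_625] * 4

print(parallel_scan_time(uneven))  # dominated by the one huge partition
print(parallel_scan_time(even))    # all four threads finish together
```

Even though both layouts scan the same total number of records, the skewed layout is roughly four times slower because three threads sit idle while the largest partition completes.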
Figure 4: Evenly Distributed Partitions
Figure 4 shows the even distribution of data among the partitions, which is a key concept for distinct count query optimization. The Analysis Services Storage Engine will initially query the header file to determine which data partitions to query for the range of distinct count values. This way, the storage engine queries only partitions that have the values required to complete the calculation.
After extensive testing, we rediscovered some important rules and added them to the distinct count partitioning strategy to ensure all distinct count queries are optimally parallel:
The distinct count measure must be directly contained in the query.
If you partition your cube by the hash of a UserID distinct value, it is important that your query perform a distinct count of the hash of the UserID – not the distinct count of the UserID itself. For fast distinct count query performance, it is important for the distinct count value itself to be placed in its own periodic partition (for example, User 1 repeatedly shows up in only month 1 partition 1, User 100 in month 1 partition 2, and so on) and for the values to be non-overlapping (for example, Users 1-99 in month 1 partition 1, Users 100-199 in month 1, partition 2, and so on). The hashing will cause the records in the same range to be distributed across multiple partitions, therefore losing the non-overlapping behavior. Even if the UserID and the hash of the UserID have the same distribution of data, and even if you partition data by the latter, the header files contain only the range of values associated with the hash of the UserID. This ultimately means that the Analysis Services Storage Engine must query all of the partitions to perform the distinct on the UserID. For more information about these concepts, see the white paper Analysis Services Distinct Count Optimization.
The distinct count values need to be continuous.
As implied in Figures 3 and 4, each partition holds a continuous range of values (drawn from the 100 – 20,000 range in this example). Based on the empirical evidence we gathered in our testing for this case, it appears that distinct count query performance improves when the values within the partitions are continuous.
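Both rules can be illustrated with a small Python sketch (a simplification of the header-file pruning idea, not the actual Analysis Services on-disk format; the modulo "hash" and user counts are hypothetical): partitioning directly on UserID yields contiguous, non-overlapping header ranges that let partitions be skipped, while partitioning on a hash of UserID makes every partition span nearly the full UserID range, so none can ever be eliminated.

```python
# Simplified header-file pruning: each partition "header" records the
# (min, max) of the value the partition was sliced on; a partition can be
# skipped only when the queried range does not overlap its header range.

def bucket(user_id):
    return user_id % 4           # stand-in for a hash / bucketing function

users = list(range(1, 401))      # hypothetical UserIDs 1..400

# Partition on UserID directly: contiguous, non-overlapping ranges.
by_id = [users[i:i + 100] for i in range(0, 400, 100)]
print([(min(p), max(p)) for p in by_id])
# -> [(1, 100), (101, 200), (201, 300), (301, 400)]

# Partition on the hash of UserID: every partition now spans almost the
# whole UserID range, so a distinct count on UserID can skip nothing.
by_hash = [[u for u in users if bucket(u) == b] for b in range(4)]
print([(min(p), max(p)) for p in by_hash])
# -> [(4, 400), (1, 397), (2, 398), (3, 399)]
```

With the range-partitioned layout, a query touching UserIDs 150 – 180 reads one partition; with the hash-partitioned layout, the same query must read all four.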
After we followed these two rules, we were easily able to improve query parallelism with very few changes.
More specifically, we analyzed our data size, selected a month as the coarse time grain for the distinct count measure group partitions, and then sub-divided the data, per month, into <n> partitions, where n is the number of physical CPU cores on the OLAP hardware. We made this decision after we identified a number of options, tested them, and found this particular grain to be the best for our set of data. Other than the partition changes, the cube design stayed the same, and we did not alter any aggregations for this cube. Note that we had already followed the established guidelines of the SQLCAT white paper Analysis Services Distinct Count Optimization.
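As a sketch of that partitioning grain (the ID range, months, and core count below are all hypothetical), the per-month split into n contiguous, non-overlapping sub-ranges might be computed like this:

```python
# Sketch: one coarse partition slice per month, sub-divided into n
# contiguous, non-overlapping UserID ranges, where n is the number of
# physical CPU cores on the OLAP server.

def monthly_partitions(min_id, max_id, months, cores):
    """Yield (month, lo, hi) tuples covering [min_id, max_id] per month."""
    span = max_id - min_id + 1
    step = -(-span // cores)                 # ceiling division
    for month in months:
        for i in range(cores):
            lo = min_id + i * step
            hi = min(lo + step - 1, max_id)
            yield (month, lo, hi)

parts = list(monthly_partitions(1, 20_000, ["2009-06", "2009-07"], cores=4))
print(parts[:4])
# -> [('2009-06', 1, 5000), ('2009-06', 5001, 10000),
#     ('2009-06', 10001, 15000), ('2009-06', 15001, 20000)]
```

Because the sub-ranges within a month never overlap, every partition's header range stays disjoint, preserving the pruning and parallelism behavior discussed above.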
Note: To allow for more repeatable distinct count query comparison, the cube used here contained no aggregations on distinct count measures.
The following lists various combinations of measure group slicer queries and distinct count measures.
The results were stunning, as shown by the "V2" lines for both SSDs and hard disk drives (HDDs) (where "V2" is the version 2 cube, which follows the optimizations discussed earlier in this paper).
Figure 5: Performance after enabling "multi-threaded mode".
(Note that in Figure 5, the V2 cube performed at the same speed, on SSDs, in all environments. We show only the SQLCAT line for simplicity.)
The conclusion is that by adding enough I/O (through SSDs so that I/O was no longer a bottleneck), we were able to find and resolve the algorithmic issues, enabling incredible performance. The key is that we never would have found these algorithmic issues without first removing the I/O channel bottlenecks by the use of SSDs.
At this point, due to operational issues, we were initially unable to go live with the SSD servers. Given these results, it was agreed that we should go with the easier-to-implement DAS HDD solution, which offered similar performance. Specifically, the obstacle to going live was that, due to a lack of SSD enterprise standards as of mid-2009, the supportability story was too complicated to sustain across a large number of servers.
In retrospect, not waiting for SSDs was a serious mistake. The cube went live and user satisfaction plummeted. What went wrong? The results shown in Figure 5 were correct, but somehow, performance was awful.
Figure 5, though accurate, shows query response times for a single user only. It accurately shows that, with a solidly parallel cube, good DAS can be nearly as fast as SSDs for reasonably large MOLAP cubes. But a more in-depth analysis revealed that our initial testing had failed to consider a highly parallel user load; we never benchmarked a sufficiently large multi-user, parallel Analysis Services query load.
Had this been done sooner, the patterns illustrated in Figures 6 and 7 would have been found.
Figure 6: Distinct count querying comparison of different access patterns
Figure 7: Distinct count query comparison of different access patterns (time to respond to first query)
To execute the parallelization scenario, the engineering team used a tool to execute multiple Analysis Services queries in parallel, with the following characteristics:
After we captured and analyzed this data, our conclusion was clear: Single-user queries benefited from the use of SSDs, and as the number of concurrent business users increased, the benefit of SSDs became even more apparent.
From a technical perspective, SSDs allow many more threads of execution to run in parallel without incurring huge I/O wait times, because their random I/O throughput is essentially the same as their sequential I/O throughput. This benefit is relevant because multiple independent queries, serviced simultaneously, implicitly cause random I/O at the disk level, and unlike rotating disks, SSD devices do not slow down under random I/O. Though rotating disks slow to a three-digit number of I/O operations per second when access is random, high-end SSD devices continue to deliver a five-digit number of I/O operations per second, sequential or random. This directly translates into more parallel queries, and therefore more concurrent users, per server when its I/O system is based on high-end SSD technology.
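A back-of-the-envelope model makes the concurrency argument concrete (the IOPS figures below are assumptions in line with the three-digit/five-digit ranges just mentioned, not measurements from this case study):

```python
# Toy queueing model: concurrent queries interleave their reads, so the
# device sees random I/O; all users share the device's random-read IOPS.

def seconds_per_query(concurrent_users, reads_per_query, device_random_iops):
    """Average response time when users share one device's random IOPS."""
    total_reads = concurrent_users * reads_per_query
    return total_reads / device_random_iops

hdd_iops, ssd_iops = 200, 50_000   # assumed random-read IOPS per device
for users in (1, 10, 50):
    print(users,
          seconds_per_query(users, 1_000, hdd_iops),
          seconds_per_query(users, 1_000, ssd_iops))
```

In this model a single user already waits 5 seconds on the rotating disk, and at 50 concurrent users the wait grows to over 4 minutes, while the SSD still answers in about a second, mirroring the multi-user patterns in Figures 6 and 7.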
Here is a list of conclusions we drew from the work we did for the Microsoft Entertainment and Devices Data Warehousing Team. Most have been discussed in this paper, but some are simply general best practices we want to share with you:
The results of our analysis are clear: SSD technology has significant benefits for MOLAP-based Analysis Services solutions. Because concurrent users implicitly create random I/O patterns, solid-state devices enable greater scalability and per-user performance. In the past, before the advent of SSD technology, getting very high end parallel random I/O performance required a complex and very expensive solution. SSDs offer these benefits without the prohibitively high cost.
Great article! Big time saver for us. We are building a retail solution with lots of distinct customers. Thanks very much!
@bobdoss Glad to hear it! Hope all is going well with the solution and don't hesitate to ping with your experiences - good & bad, eh?!
We are using SSDs with distinct customer counts and it is working well. However, there is one thing we're still trying to optimize. Has anyone worked out how to get Windows to not use Standby File System Cache and Windows File System Cache for distinct count partition files resident on SSDs? We just upgraded to Windows Server 2008 R2 64 Bit and I was hoping that the SuperFetch and other optimizations would be smarter about (not) loading the SSD files into Windows Cache. Early days of testing, but so far it still seems about the same as Windows 2008. Might be something subtle that further testing will reveal?
For Windows Server 2008 R2, I thought SuperFetch itself was already disabled - here's an MSDN forum thread on the topic: social.technet.microsoft.com/.../ca451db2-762b-44b7-a100-9d0d6a4230d7
Anything else revealing from your tests?
Thanks Denny. Running now for a week on Windows Server 2008 R2. It's good. The management of Standby File System Cache for SSD files appears to be much improved - Windows seems to clean the cache out regularly and more aggressively for SSD files and only caches what Analysis Services needs. We had SuperFetch turned off on Windows 2008, and considerable memory was used to cache the SSD files. Preliminary results are that Windows 2008 R2 is a big improvement.