SQL Server Customer Advisory Team - SQL Server Best Practices

Enabling SQL Server customers to navigate the most challenging frontiers of large scale data management.

Technical Notes

Analysis Services: Should you use many-to-many dimensions?

Author: Denny Lee
Reviewers: Mosha Pasumansky, Carl Rabeler, Richard Tkachuk

Introduction

For starters, please note that I am not saying not to use many-to-many (m:n)  dimensions in Analysis Services. This is an excellent feature that enables you to relate multiple leaf-level values to multiple parents.  Many-to-many relationships are used heavily in many business scenarios ranging from sales to accounting to healthcare. But at times there may be query performance issues when dealing with a large number of many-to-many relationships and perceived accuracy issues. The purpose of this technical note is to provide background and logical business scenarios of how many-to-many dimensions work and their potential perceived accuracy issues. 

Background

The example used in my previous (but all too brief) blog, How to handle many-to-many relationships in Yukon?, was that of a geographic dimension. But this is a later, more current technical note so let's use the Adventure Works DW OLAP and SQL database included in the Codeplex SQL Server 2005 SP2 Sample Databases.

In the AdventureWorks database, on the Dimension Usage tab of the [Adventure Works] cube, we see that the [Sales Reason] dimension is related to the measure group [Internet Sales] (heading not shown in Figure 1) by [Sales Reasons]. You can see by the infinity sign that this relationship is not that of a regular dimension (as is the [Source Currency] dimension and the [Source Currency Code] relationship) but that of a many-to-many dimension.  

image 
Figure 1: Dimension relationship view in Business Intelligence Development Studio (BIDS)

Clicking the [Sales Regions] dimension opens the Define Relationship dialog box (see Figure 2) where you will notice that the relationship type is Many-to-Many and the Intermediate measure group used to join the [Sales Reason] dimension to the [Internet Sales] measure group is [Sales Reasons].

image 
Figure 2: The Defining Relationship dialog in BIDS

The purpose of this many-to-many dimension in this scenario is graphically described in Figure 3. For this particular relationship, we want to associate multiple sales reasons (the reason a sale occurred such as the Price was low enough, it had a great demo, or some Other reason) to a single Sale event.

image
Figure 3: Dimension-to-fact relationship associating Sales Reason to Sale

For example, John purchased a new HDTV set (Fact: Sale) because there was a great sale (Sales Reason: Price) and the Super Bowl was about to start (Sales Reason: Other).

Scenario Examples

For this scenario, let's review the data in the [Adventure Works] SQL and OLAP database to better understand how many-to-many dimensions work. The [Adventure Works DW] OLAP database we looked at previously, contains the [Internet Sales] and [Sales Reasons] measure groups and the [Sales Reason] dimension. In the Adventure Works SQL database, these are the FactInternetSales, FactInternetSalesReason, and DimSalesReason tables respectively.

Continuing with our example, let's narrow the data set to just (SalesOrderNumber = 'SO69868', SalesOrderLineNumber = 2) and (SalesOrderNumber = 'SO75088', SalesOrderLineNumber = 3). 

The SQL query to create the result set in Table 1 below is:

select SalesOrderNumber, SalesOrderLineNumber, SalesAmount
  from FactInternetSales
 where (SalesOrderNumber = 'SO69868' and SalesOrderLineNumber = 2) or
       (SalesOrderNumber = 'SO75088' and SalesOrderLineNumber = 3)

When reviewing these two tuples in the FactInternetSales table, you get the following results:

SONbr SOLineNbr SalesAmount
SO69868 2 $2.29
SO75088 3 $34.99
Table 1: Two Sales Orders

When you take into account the many-to-many relationship from FactInternetSales to FactInternetSalesReasons that is used to include the SalesReason table, you get the following results:

SONbr SOLineNbr SalesAmount SRKey SRReasonName SRReasonType
SO69868 2 $2.29 1 Price Other
SO75088 3 $34.99 1 Price Other
SO75088 3 $34.99 10 Other Other
Table 2: Two Sales Orders expanded

The SQL query used to create the Table 2 result set is:

select a.SalesOrderNumber, a.SalesOrderLineNumber, a.SalesAmount,
       c.SalesReasonKey, c.SalesReasonName, c.SalesReasonReasonType
  from FactInternetSales a
    inner join FactInternetSalesReason b
        on b.SalesOrderNumber = a.SalesOrderNumber
       and b.SalesOrderLineNumber = a.SalesOrderLineNumber
      inner join DimSalesReason c
        on c.SalesReasonKey = b.SalesReasonKey
 where (a.SalesOrderNumber = 'SO69868' and a.SalesOrderLineNumber = 2) or
       (a.SalesOrderNumber = 'SO75088' and a.SalesOrderLineNumber = 3)

As you can see from these two tables, the many-to-many relationship created by the intermediary table FactInternetSalesReasons allows more than one SalesReason to be associated with a specific sale. In the case of SO75088, which has a sales amount of $34.99, both the Price and the Other reasons were associated with that sale.

 

Advantage of the Many-to-Many Dimension

The advantage of this particular approach is that by using an intermediary table, you can associate more than one dimension member to a single fact event. This is quite a powerful advantage when you start aggregating all of the values, such as when you query the [Internet Sales] measure group by the [Sales Reason] dimension as shown in Table 3.

Sales Reason Internet Sales Amount
All Sales Reasons $29,358,677.22
Demo Event (null)
Magazine Advertisement (null)
Manufacturer $5,998,122.10
On Promotion $6,361,828.95
Other $248,483.34
Price $10,975,842.56
Quality $5,549,896.77
Review $1,694,881.98
Sponsorship (null)
Television  Advertisement $27,475.82
Table 3: Internet Sales Amount values by Sales Reason (dimension)

You can get the result set shown in Table 3 by querying the [Adventure Works] cube with the following MDX statement:

select {
     [Measures].[Internet Sales Amount]
} on columns, {
     [Sales Reason].[Sales Reason].members
} on rows
from [Adventure Works]

The underlying details of the [Price] and [Other] values includes the SO75088 sale, which is associated with both of these [Sales Reason] values. What is also great is that you have the [All Sales Reasons] member, which provides the total sales amount of the $29,358,677.22. After all, if you were to run the following MDX query to get the total Internet sales:

select {
      [Measures].[Internet Sales Amount]
} on columns
from [Adventure Works]

you would get the same value. Together, this simple implementation of a many-to-many dimension enables you to associate multiple dimension values to a single event.

Perceived Accuracy Issues

A potential issue with using many-to-many dimensions is that of double counting, which can occur if users do not understand how they work. For example, if we were to query the cube by using the following MDX statement:

select {
      [Measures].[Internet Sales Amount]
} on columns, {
      [Sales Reason].[Sales Reason].children
} on rows
from [Adventure Works]

we would get the following results:

Sales Reason Internet Sales Amount
Demo Event (null)
Magazine Advertisement (null)
Manufacturer $5,998,122.10
On Promotion $6,361,828.95
Other $248,483.34
Price $10,975,842.56
Quality $5,549,896.77
Review $1,694,881.98
Sponsorship (null)
Television  Advertisement $27,475.82
Table 4: Internet Sales Amount values by Sales Reason (dimension) [repeated]

From here, some users may be tempted to calculate the total amount of sales based on the data in Table 4 by summing up the [Internet Sales Amount] column, which results in: 

Sales Reason Internet Sales Amount
Demo Event (null)
Magazine Advertisement (null)
Manufacturer $5,998,122.10
On Promotion $6,361,828.95
Other $248,483.34
Price $10,975,842.56
Quality $5,549,896.77
Review $1,694,881.98
Sponsorship (null)
Television  Advertisement $27,475.82
Total $30,856,531.52

Table 5: Summing up the Internet Sales Amount values by Sales Reason (dimension) table

Notice that the Total value of $30,856,531.52 is quite different from the previous [All Sales Reasons] value of $29,358,677.22. This is a difference of $1,497,854.30, which can be quite noticeable if you are doing financial reconciliation.

This is compounded by the fact that, with a many-to-many dimension, you do not know which members will be double counted because any sale (Internet Sales Amount) can be associated with any Sales Reason. For example, it may be tempting for some users to total the [Other] and [Price] members and conclude that they received $11,224,325.90 in sales because of the [Price] and [Other] reasons. Yet that conclusion would be incorrect because there are sales that are tagged with both the [Price] and [Other] sales reasons, so the total is less than $11,224,325.90.

Is There Another Way?

The two primary reasons to ask the question "Should I use many-to-many dimensions?" are performance and accuracy.

Performance

As noted in the Analysis Services Performance Guide, many-to-many dimensions come with performance considerations. One way to think about a many-to-many dimension is that it is a generalization of the distinct count measure. The use of many-to-many dimensions enables you to apply distinct count logic to other Analysis Services measures such as sum, count, max, min, and so on. To calculate these distinct count or sums, the Analysis Services storage engine must parse through the lowest level of granularity of data. This is because, when a query includes a many-to-many dimension, the query calculation is performed at query-time between the measure group and intermediate measure group at the attribute level. The result is a processor- and memory-intensive process to provide the calculation.

Performance issues concerning many-to-many dimensions include the following:

  • Because queries involving many-to-many dimensions result in a join between the measure group and an intermediate measure group, reduce the size of your intermediate measure group (a general rule is <1 million rows) for optimal performance. For additional techniques, see the Analysis Services Many-to-Many Dimensions: Query Performance Optimization Techniques white paper.
  • The join between the measure group and intermediate measure group is a hash join, hence it is very memory-intensive to perform this operation.
  • Unlike distinct counts, many-to-many relationships cannot be aggregated (although it generally is not very easy to create general purpose aggregates for distinct counts as well). Therefore, queries involving many-to-many dimensions cannot use aggregations or aggregate caches—only a direct hit will work. There are specific situations where many-to-many relationships can be aggregated; you can find more information in the Analysis Services Many-to-Many Dimensions: Query Performance Optimization Techniques whitepaper.
  • Since they cannot be aggregated, there are various MDX calculation issues including VisualTotals, subselects, and CREATE SUBCUBE.

 

Accuracy

As you can see from the earlier "Perceived Accuracy Issues" section, you may get undesired results (such as double counting) by using many-to-many dimensions. So, while this is a very powerful feature, if a user is not aware of the many-to-many relationships, he or she may come to the wrong conclusion as in the example where the result was a sales value that was $1,497,854.30 more than the actual value.

Analysis

You can refer to Marco Russo’s Many-to-many dimensional modeling on the many ways build solutions can be based on many-to-many dimension relationships. As with many data warehousing and business intelligence problems, there is no "one-answer-fits-all" strategy to address whether you should use many-to-many dimensions. What is important is to make sure you understand the implications of using them first.

In the [Sales Reason] scenario, there are multiple business approaches to deciding whether to use a many-to-many dimension:

You want the many-to-many relationship but will always filter by a sales reason

If you will always filter by the Sales Reason dimension, you do not actually need an intermediate measure group table. In this scenario, you can just merge the [FactInternetSales] and [FactInternetSalesReason] tables into a new fact table, and then build a new cube that will be built on top of this table. Since you will always filter by a sales reason, when you query by the Price Sales Reason, the highlighted rows shown in Table 6 will always be chosen.

SONbr SOLineNbr SalesAmount SRKey SRReasonName SRReasonType
SO69868 2 $2.29 1 Price Other
SO75088 3 $34.99 1 Price Other
SO75088 3 $34.99 10 Other Other
Table 6: Merging of the measure group and intermediate measure group—Sales Reason = Price

On the other hand, if you choose the Sales Reason of Other, only the following highlighted row is chosen.

SONbr SOLineNbr SalesAmount SRKey SRReasonName SRReasonType
SO69868 2 $2.29 1 Price Other
SO75088 3 $34.99 1 Price Other
SO75088 3 $34.99 10 Other Other
Table 7: Merging of the measure group and intermediate measure group—Sales Reason = Other

When you want to obtain a total value, such as the total sales amount, you can still create a cube that is against only the [FactInternetSales] table without the use of an intermediary measure group.

SONbr SOLineNbr SalesAmount
SO69868 2 $2.29
SO75088 3 $34.99
Table 8: "Totals" measure groups

You do not need a many-to-many relationship

In our scenario, it is possible that you actually do not need a many-to-many relationship at all. When you think about it, you may want only one reason to be associated with a sale.

SONbr SOLineNbr SalesAmount SRKey SRReasonName SRReasonType
SO75088 3 $34.99 1 Price Other

or

SONbr SOLineNbr SalesAmount SRKey SRReasonName SRReasonType
SO75088 3 $34.99 10 Other Other
Table 9: Choose one Sales Reason — not BOTH

Another approach is to split the sales amount value evenly between the two sales reasons since they are both associated with the sale.

SONbr SOLineNbr SalesAmount SRKey SRReasonName SRReasonType
SO75088 3 $17.50 1 Price Other
SO75088 3 $17.49 10 Other Other
Table 10: Split the [Sales Amount] among the two Sales Reasons

 


Discussion

Note that all of this is not to say that you should not use many-to-many dimensions. This is an excellent feature that enables users to do exactly what it was intended to do—associate more than one dimension member to an event and have the values roll up without double counting within the hierarchy of that dimension. But, it is important to note the effects of many-to-many dimensions to the overall values of the cube (for example, totaling the cells produces a value that is not actually equal to the total) and its impact on any other dimensions that may be included as well. There are also performance implications concerning the use of many-to-many dimension relationships although there are optimization techniques as described in the Analysis Services Many-to-Many Dimensions: Query Performance Optimization Techniques white paper. There may be other approaches to solving your many-to-many problems as well, so make sure to examine all options and understand the potential effects of using this cool feature.

Comments

No Comments

About denny.lee

Denny Lee is a Senior Program Manager based out of Redmond, WA in the SQLCAT Best Practices Team. He has more than 12 years experience as a developer and consultant implementing software solutions to complex OLTP and data warehousing problems. His industry experience includes accounting, human resources, automotive, retail, web analytics, telecommunications, and healthcare. He had helped create the first OLAP Services reporting application in production at Microsoft and is a co-author of “Professional Microsoft SQL Server Analysis Services 2008 with MDX”, “Professional Microsoft SQL Server 2000 Data Warehousing with Analysis Services” and “Transforming Healthcare through Information [Ed. Joan Ash] (2009)”. In addition to contributing to the SQLCAT Blog, SQL Server Best Practices, and SQLCAT.com, you can also review Denny's Space (http://denster.spaces.live.com). Denny specializes in developing solutions for Enterprise Data Warehousing, Analysis Services, and Data Mining; he also has focuses in the areas of Privacy and Healthcare.
sql, server, best practices, whitepapers, analysis services, data mining, olap, datawarehouse, datawarehousing, availability, clustering, capacity, collation, data types, data warehouse, database, design, index, mirroring, optimization, partitions, performance, precision, processing, querying, scalability, security, reporting services, integration services
Copyright 2008 Microsoft Corporation. All Rights Reserved.