Len Wyatt, SQL Server Performance Team
There is a new transform component available for SQL Server Integration Services. It’s called the Balanced Data Distributor (BDD) and the download is available here. The BDD provides an easy way to amp up your usage of multi-processor and multi-core servers by introducing parallelism in the data flow of an SSIS package.
The functionality of the BDD is very simple: It takes its input data and routes it in equal proportions to its outputs, however many there are. If you have four outputs, roughly ¼ of the input rows will go to each output. Instead of routing individual rows, the BDD operates on buffers of data, so it’s very efficient.
Some of you will already be noticing that there is no transformational value in the BDD, and no control over which data rows go to which output. You may be wondering, what the heck is the value of that?
The value of the BDD comes from the way modern servers work: Parallelism. When there are independent segments of an SSIS data flow, SSIS can distribute the work over multiple threads. BDD provides an easy way to create independent segments.
This diagram gives a trivial example:
If you run this data flow on a laptop, there probably won’t be any speed advantage, and there may even be a speed cost. But suppose you run this on a server with multiple cores and many disk spindles supporting the destination database. Then there might be a substantial speed advantage to using this data flow.
Using the BDD requires an understanding of the hardware you will be running on, the performance of your data flow and the nature of the data involved. Therefore it won’t be for everyone, but for those who are willing to think through these things there can be significant benefits. Here is my summary description of when to use BDD:
1. There is a large amount of data coming in.
2. The data can be read faster than the rest of the data flow can process it, either because there is significant transformation work to do or because the destination is the bottleneck. If the destination is the bottleneck, it must be parallelizable.
3. There is no ordering dependency in the data rows. For example if the data needs to stay sorted, don’t go and split it up using BDD.
Let’s talk about bottlenecks, since changing bottlenecks is what BDD is all about. A bottleneck is whatever limits the performance of the system. In general there are three places that could be the bottleneck in an SSIS data flow: The source, the transformations, or the destination.
If the limiting factor is the rate at which data can be read from the source, then the BDD is not going to help. It would be better to look for ways to parallelize right from the source.
If the limiting factor is the transformation work being done in the data flow, BDD can help. Imagine that there are some lookups, derived columns, fuzzy lookups and so on: These could easily be the components limiting performance. Make two or four or eight copies of the transformations, and split the data over them using the BDD. Let the processing run in parallel. If there are several transformations in the data flow, put as much as you can after the BDD, to get more things running in parallel.
If the limiting factor is the destination, BDD might be able to help - you need to determine whether the destination can be run in parallel. You might be surprised at some times when it can. One example is when loading data into a simple heap (table with no indexes) in SQL Server. With the database properly distributed over a number of disks, it is quite possible to load in parallel with good performance. When working on the ETL World Record a while ago, we used a heap for a side experiment and found that loading 56 streams concurrently into a single heap was almost as fast as loading 56 streams into 56 independent tables. Many sites already drop or disable their indexes during data loading, so this could be more of a freebie than you would expect. More recently we saw a benefit from parallel loading into SQL Server Parallel Data Warehouse (PDW). PDW is an architecture designed for parallelism!
A final case to consider is when the limiting factor is the transformation work being done in the data flow but the destination cannot receive data in parallel for some reason. In this case, consider using BDD to parallelize the transforms followed by a Union All to rejoin the data into a single flow; then a single destination can be used. Here is an illustration:
One final note: Whatever you put behind the BDD, be sure the same work is being done on all paths. It doesn’t make logical sense to have the paths be different, and from a performance point of view, you want them all to be the same speed. Remember, the “B” in BDD stands for “Balanced”.
Someday maybe SSIS will be able to do the work of the BDD automatically, but for now you have an easy way to amp up your usage of multi-processor and multi-core servers by introducing parallelism in the data flow of an SSIS package.
Very nice write-up. This will be a very handy feature in SSIS.
Good article. Great to see this functionality eventually emerging. I take it that in the second example there is a bit of code duplication? If BDD does become an automatic task do you think we could see it as a container so each task is applied across all available processors/ thread and the number of processors/ threads used can be a variable or automatically deduced by SSIS?
To wildh: Your comments are spot-on: Yes, there is duplication of the pipeline components to make this work with the current version of the BDD. Yes, having some way to demark the transforms that can be run in parallel would make sense rather than duplicating them. Yes it would be nice to then be able to specify the level of parallelism or let SSIS auto-configure it. This is all functionality we have to hope will be in a future version of SSIS. The BDD is simply a tool we can use in the interim.
Thanks for the feedback!
For a demo of the Balanced Data Distributor, go to technet.microsoft.com/.../hh369962 .
Ok..when do we use BDD and When to use Multi-Cast ?
Good info, I want to know if the task work in win server 2003 sp2?
Is there any new feature added in BBD's 2012 version?
Very nice, very t the point. Thanks for this
Enjoyed this information, we may be able to use this, but more importantly, I hope it opens the door for great products and the way those products work to create maximum efficiency.
Very useful component.
This is an useful component and I have recommended this to many SSIS developer.