Recently, we performed a lab test with a large OLTP workload in the Microsoft Enterprise Engineering Center. The purpose of this lab was to take an intensive Microsoft SQL Server workload and see what happened when we scaled it up from 64 processors to 128 processors. (Note: This configuration is supported as part of the Microsoft SQL Server 2008 R2 release.) The workload had highly concurrent insert operations going to a few large tables.
As we began to scale this workload up to 128 cores, the wait statistics were dominated by PAGELATCH_UP and PAGELATCH_EX. The average wait times were tens of milliseconds, and there were a great many waits. We had not expected these waits at all - or, at most, we had expected them to be a few milliseconds long.
In this TechNote we will describe how we first diagnosed the problem and how we then used table partitioning to work around it.
When you see large waits for PAGELATCH in sys.dm_os_wait_stats, start your investigation with sys.dm_os_waiting_tasks and locate a task waiting on PAGELATCH, as sketched below:
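A minimal sketch, using the documented columns of sys.dm_os_waiting_tasks:

SELECT session_id, wait_type, wait_duration_ms, resource_description
FROM sys.dm_os_waiting_tasks
WHERE wait_type LIKE 'PAGELATCH%';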
The resource_description column lists the exact page being waited for in the format: <database_id>:<file_id>:<page_id>.
Using the resource_description column, you can write a rather complex query that looks up all these waiting pages. A sketch of such a query follows (reconstructed here; it must be run in the database that owns the pages, and sys.dm_os_buffer_descriptors only sees pages currently in the buffer pool):

SELECT wt.session_id
     , wt.wait_type
     , wt.wait_duration_ms
     , s.name AS schema_name
     , o.name AS object_name
     , i.name AS index_name
FROM sys.dm_os_waiting_tasks wt
CROSS APPLY ( SELECT CHARINDEX(':', wt.resource_description) AS c1
                   , CHARINDEX(':', wt.resource_description,
                                CHARINDEX(':', wt.resource_description) + 1) AS c2 ) x
JOIN sys.dm_os_buffer_descriptors bd
  ON bd.database_id = SUBSTRING(wt.resource_description, 1, x.c1 - 1)
 AND bd.file_id = SUBSTRING(wt.resource_description, x.c1 + 1, x.c2 - x.c1 - 1)
 AND bd.page_id = SUBSTRING(wt.resource_description, x.c2 + 1, 20)
JOIN sys.allocation_units au ON au.allocation_unit_id = bd.allocation_unit_id
JOIN sys.partitions p ON p.hobt_id = au.container_id AND au.type IN (1, 3) -- in-row / row-overflow data
JOIN sys.indexes i ON i.object_id = p.object_id AND i.index_id = p.index_id
JOIN sys.objects o ON o.object_id = i.object_id
JOIN sys.schemas s ON o.schema_id = s.schema_id
WHERE wt.wait_type LIKE 'PAGELATCH%';
The query shows that the page we are waiting for is in a clustered index, enforcing the primary key, of a table with this structure:
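The original table definition from the lab is not reproduced here, but a minimal sketch of its shape (with illustrative names, not the customer's) looks like this:

CREATE TABLE dbo.HotTable
(
    ID BIGINT IDENTITY(1,1) NOT NULL,
    Payload VARCHAR(50) NOT NULL,
    CONSTRAINT PK_HotTable PRIMARY KEY CLUSTERED (ID)
);

The important property is the ever-increasing, single-column primary key enforced by the clustered index.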
What is going on here? Why are we waiting to get access to a data page in the index?
To diagnose what was happening in our large OLTP workload, it's important to understand how SQL Server handles the insertion of a new row into an index. When a new row is inserted into an index, SQL Server uses roughly the following algorithm to execute the modification:

1. Traverse the B-tree to locate the leaf page where the new row belongs.
2. Acquire an exclusive latch (PAGELATCH_EX) on that page, which prevents any other thread from modifying it at the same time.
3. Add the row to the page and mark the page dirty.
4. Release the latch on the page.
Eventually, the page will also have to be flushed to disk by a checkpoint or lazy write operation.
However, what happens if all the inserted rows go to the same page? In that case, you can see a queue building up on that page. Even though a latch is a very lightweight semaphore, it can still be a contention point if the workload is highly concurrent. In this customer case, the first, and only, column in the index was a continuously increasing key. Because of this, every new insert went to the same page at the end of the B-tree, until that page was full. Workloads that use IDENTITY or other sequentially increasing value columns as primary keys may run into this same issue at high concurrency too.
Whenever many threads need synchronized access to a single resource, contention can occur. The solution is typically to create more of the contended resource. In this case, the contended resource is the last page in the B-tree.
One way to avoid contention on a single page is to choose a leading column in the index that is not continually increasing. However, this would have required an application change in the customer's system. We had to look for a solution that could be implemented entirely within the database.
Remember that the contention point is a single page in a single B-tree. If only there were a way to get more B-trees into the table. Fortunately, there IS a way: partition the table. The table can be partitioned in such a way that the new rows are spread over multiple partitions.
First, create the partition function and scheme:
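A minimal sketch, assuming four hash buckets numbered 0 through 3; the object names (pf_hash, ps_hash) are illustrative:

CREATE PARTITION FUNCTION pf_hash (TINYINT)
AS RANGE LEFT FOR VALUES (0, 1, 2);

CREATE PARTITION SCHEME ps_hash
AS PARTITION pf_hash ALL TO ([PRIMARY]);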
This example uses four partitions. The number of partitions you need depends on the amount of INSERT activity happening on the table. There is a drawback to hash-partitioning the table like this: Whenever you select rows from the table, you have to touch all partitions. This means that you need to access more than one B-tree – you will not get partition elimination. There is a CPU cost and latency cost to this, so keep the number of partitions as small as possible (while still avoiding PAGELATCH). In our particular customer case, we had plenty of spare CPU cycles, so we could afford to sacrifice some time on SELECT statements, as long as it helped us increase the INSERT rate.
Second, you need a column to partition on, one that spreads the inserts over the four partitions. There was no column available in the table for this in the Microsoft Enterprise Engineering Center scenario. However, it is easy to create one. Taking advantage of the fact that the ID column is constantly increasing in increments of one, here is a simple hash function of the row:
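For example, a persisted computed column that buckets each row by ID modulo the number of partitions (a sketch, continuing the illustrative names from above):

ALTER TABLE dbo.HotTable
ADD HashID AS CAST(ID % 4 AS TINYINT) PERSISTED NOT NULL;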
With the HashID column, you can cycle the inserts between the four partitions. Create the clustering index in this way:
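A sketch, again with the illustrative names: drop the old key and rebuild it as a clustered index on (ID, HashID), placed on the partition scheme so each of the four partitions gets its own B-tree:

ALTER TABLE dbo.HotTable DROP CONSTRAINT PK_HotTable;

ALTER TABLE dbo.HotTable
ADD CONSTRAINT PK_HotTable PRIMARY KEY CLUSTERED (ID, HashID)
ON ps_hash (HashID);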
By using this new, partitioned table instead of the original table, we got rid of the PAGELATCH contention and increased the insertion rate, because the high concurrency was spread out across many pages and across several partitions, each having its own B-tree structure. We managed to increase the INSERT rate by 15 percent for this customer, with the PAGELATCH waits on the hot index in one table going away. But even then, we had CPU cycles to spare, so we could have optimized further by applying a similar trick to other tables with high insert rates.
Strictly speaking, this optimization trick is a logical change to the primary key of the table. However, because the new key is just the original key extended with a hash value computed from that key, uniqueness of (ID, HashID) still implies uniqueness of ID - duplicates in the ID column remain impossible.
Single-column unique indexes are typically the worst offenders when you are experiencing PAGELATCH contention, because every insert of a sequentially increasing key ends up on the same page. But even after you eliminate contention on the clustered index, other nonclustered indexes on the table may suffer from the same problem. You can apply the partitioning trick to them too, using the same hash key as the primary key, as sketched below.
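A sketch with the illustrative names from above, placing a hot nonclustered index on the same partition scheme:

CREATE NONCLUSTERED INDEX IX_HotTable_Payload
ON dbo.HotTable (Payload, HashID)
ON ps_hash (HashID);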
Not all applications can be modified, which is a challenge for ISVs. However, if you DO have the option of modifying the queries in the system, you can add an additional filter on the hash column to queries that seek on the primary key.
Example: To get partition elimination on a primary key seek, change a query like the first statement below into the second:
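A sketch with the illustrative names from above:

DECLARE @id BIGINT = 42;

-- Before: the seek has to probe all four partitions.
SELECT Payload FROM dbo.HotTable WHERE ID = @id;

-- After: the redundant hash predicate lets the optimizer eliminate the
-- three partitions that cannot contain the row.
SELECT Payload
FROM dbo.HotTable
WHERE ID = @id
  AND HashID = CAST(@id % 4 AS TINYINT);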
With partition elimination, the hash-partitioning trick is almost free. The remaining cost is the one byte the TINYINT hash column adds to each row of the clustered index.
It's funny to see this new technical note posted by the SQLCAT team today because we just recently had
I understand the changes you made improved the INSERT throughput. But what kind of throughput did you get when you ran the workload without any change on 128 cores as compared to running it on 64 cores? Was it worse on 128 cores than on 64 cores? Or was it better than on 64 cores, but wasn't as good as you had expected?
Linchi: We did not have the opportunity to test a completely unchanged 64 core vs a fully partitioned 128. Your mileage will vary depending on how many tables you have that need this trick. We have seen throughput increases of up to 50% in some customer cases. I am currently lab testing a 32-core box with this trick and have so far gone from around 17K SQL tx/sec to 26K SQL tx/sec (and the box is not at 100% CPU load yet).
Note that removing PAGELATCH contention on one active table/index will often just expose another one in line to have the same optimization done.
> We did not have the opportunity to test a completely unchanged 64
> core vs a fully partitioned 128.
Absolutely no issue with the partitioning and hashing trick. Just want to make sure that we are not talking past each other. I was curious about the INSERT throughput of the >unchanged< original workload on 64 cores vs the unchanged original workload on 128 cores. In particular, I'm curious about whether going from 64 cores to 128 cores in itself presented any issue to the original workload. Or, was it the case that the INSERT throughput did increase with the unchanged workload, but the increase was much smaller than what you had expected?
I know the increased PAGELATCH waits caught your attention as you scaled to 128 cores. But as much as one would want to reduce the PAGELATCH waits, the waits were really secondary as the real measure of interest to the customer was the throughput.
Linchi: Unfortunately, we did not get to do a 1:1 comparison. One of the reasons was that we hit the PAGELATCH waits BEFORE reaching 64 cores - so we had to apply these tricks already at that point.
Last week, I did a lab on a 32-core box, and before we reached 50% CPU load we had to apply this trick to push throughput higher (it is now around 30K SQL tx/sec). There was a very hot table in that workload that received at least one insert with every transaction in the system.
I have a table with 2+ billion records. The table has a clustered index (so the inserts cannot be minimally logged), and the new records are (by multiple orders of magnitude) almost always inserted on the last pages at the end of the B-tree.
So, in reading this, if I run a massively large insert (millions of records per statement) against a table with a similar index structure, the question comes to mind: could insert performance within a single statement be adversely affected by PAGELATCH contention when the query is 'parallelized' and the table is not partitioned (in a SQL Server 2000 environment)?
I eagerly await your reply
Dpetrancuri: There is a limit to how fast you can insert data with a single statement. Note that the insert part of the query plan is single-threaded. This also means that there is no need for coordination between threads, and hence no contention on page latches.
But if you had two or more insert statements, each adding many rows at the same time, you could see this problem earlier than we did in the lab case.
tkejser: Is that statement about single threading also true of partitioned tables with an aligned partitioned index, or does parallelization take place during inserts to partitioned tables because of the true physical separation of the data and indexes?
Thank you for your learned reply.
Dpetrancuri: Yes, it is also true for a partitioned table. There is still a single thread feeding the rows to the insert operator - even though it hits multiple partitions.
But you have the right idea.
However, with a partitioned table you can switch out the individual partitions and run one INSERT statement per partition. After that, you can switch the partitions back in again. Actually, I just did this on a 32-core box today. With 32 (switched-out) partitions, the aggregate insert throughput was around 1GB/sec (heap tables, using the new bulk-logged INSERT in SQL 2008). Even at that speed, I did not run into PAGELATCH contention, because each statement touched its own partition. A sketch of the pattern follows:
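A minimal sketch of that pattern, with hypothetical names (the staging table must match the partitioned table's schema, sit on the partition's filegroup, and carry a check constraint matching the partition's boundaries before switching back in):

-- Move partition 1 out to a staging table.
ALTER TABLE dbo.HotTable SWITCH PARTITION 1 TO dbo.HotTable_Stage;

-- ... run the big INSERT against dbo.HotTable_Stage here ...

-- Move the loaded staging table back in as partition 1.
ALTER TABLE dbo.HotTable_Stage SWITCH TO dbo.HotTable PARTITION 1;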
Since partitioning is an Enterprise-only feature, I think it's worth pointing out that a similar effect can be produced using a local partitioned view, though it does require INSTEAD OF triggers on the view.
CREATE TABLE dbo.HeavyInsert1
(
    id INTEGER NOT NULL,
    ptn TINYINT NOT NULL CHECK (ptn = 1),
    col1 VARCHAR(50) NOT NULL,
    PRIMARY KEY (id, ptn)
);

CREATE TABLE dbo.HeavyInsert2
(
    id INTEGER NOT NULL,
    ptn TINYINT NOT NULL CHECK (ptn = 2),
    col1 VARCHAR(50) NOT NULL,
    PRIMARY KEY (id, ptn)
);
GO
CREATE VIEW dbo.HeavyInsert AS
SELECT id, ptn, col1 FROM dbo.HeavyInsert1
UNION ALL
SELECT id, ptn, col1 FROM dbo.HeavyInsert2;
GO
-- The trigger computes ptn from id and re-issues the INSERT through the
-- view, which routes each row to the correct member table (an INSTEAD OF
-- trigger is not fired recursively by its own statements).
CREATE TRIGGER [trg dbo.HeavyInsert IOI]
ON dbo.HeavyInsert
INSTEAD OF INSERT
AS
BEGIN
    SET NOCOUNT ON;
    INSERT dbo.HeavyInsert (id, ptn, col1)
    SELECT id, id % 2 + 1, col1 FROM inserted;
END;
GO
INSERT dbo.HeavyInsert (id, col1) VALUES (1, 'TestRecord1');
INSERT dbo.HeavyInsert (id, col1) VALUES (2, 'TestRecord2');
INSERT dbo.HeavyInsert (id, col1) VALUES (3, 'TestRecord3');
INSERT dbo.HeavyInsert (id, col1) VALUES (4, 'TestRecord4');

-- Adding the ptn predicate gives partition elimination on SELECTs.
SELECT id, ptn, col1
FROM dbo.HeavyInsert
WHERE id = 4
AND ptn = (4 % 2 + 1);

DROP VIEW dbo.HeavyInsert;
DROP TABLE dbo.HeavyInsert1;
DROP TABLE dbo.HeavyInsert2;
Would you expect a page-latching problem on inserts to a table with an identity key when the rows consume more than 4K bytes each? Could you expect a hot-spot insertion problem when rows consume an average of 2K bytes?