As you may be aware, SQL Server, through SQLOS, implements its own scheduling mechanism on top of the Windows operating system. It does this to spend as much CPU time as possible in user mode, relying on cooperative yielding rather than on preemptive scheduling by Windows. SQLOS can also exercise very fine-grained control over threads by providing an abstraction on top of Windows threads.
Central to the SQLOS scheduling mechanism is the scheduler object. A SQLOS scheduler is an abstraction of a CPU or, on multi-core machines, of a CPU core. Schedulers are grouped into nodes; a node corresponds either to a hardware NUMA node on the host machine or to the soft-NUMA configuration of SQL Server. For example, an 8-CPU, dual-core machine with 2 hardware NUMA nodes and no soft NUMA configured has 2 nodes with 8 schedulers in each node.
You can view the SQLOS schedulers and nodes by using the DMV sys.dm_os_schedulers. Notice that there are some extra, special schedulers in this DMV; these are for SQL Server internal use. Also, you will see the dedicated admin connection (DAC) scheduler here (scheduler_id = 255). The schedulers that do the actual query execution work are marked as VISIBLE ONLINE in the status column.
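For example, the following query lists the schedulers that execute user work, grouped by node (the column names are as they appear in the DMV):

-- List the schedulers that run user queries, grouped by NUMA node
SELECT parent_node_id,
       scheduler_id,
       status,
       current_tasks_count,
       runnable_tasks_count
FROM sys.dm_os_schedulers
WHERE status = N'VISIBLE ONLINE'
ORDER BY parent_node_id, scheduler_id;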
At connection time, the user’s session is assigned to a specific node. SQL Server uses a round-robin assignment mechanism to assign the connection to the node. Once a session is on a specific node, it will not move from it for the duration of the connection.
Whenever a session sends a batch request or RPC to the server, a scheduler is assigned to handle the full execution of the request; in SQLOS, this unit of work is known as a task. The scheduler is chosen by identifying the least busy scheduler within the session's node, although the scheduler used for a prior task on the same connection is somewhat favored. The assigned scheduler is used for the duration of the task; SQLOS cannot perform a scheduler switch inside a task.
There are cases where SQL Server's assignment of schedulers is less than optimal. Remember that a scheduler is a SQLOS-internal abstraction of a CPU core. If two long-running, CPU-bound queries end up on the same scheduler, they compete for the same CPU resources while another, idle scheduler may have CPU time to spare. Ideally, the idle scheduler would take some of the load from the scheduler executing the two long-running queries. But remember, a scheduler switch cannot happen in the middle of a task.
Multiple BULK INSERT tasks fit this scenario precisely: a bulk insert, assuming good I/O, is CPU bound and typically runs for a long time. If you execute more than one of them concurrently, you can end up in a situation where two bulk inserts run on the same scheduler, while another scheduler is idle.
If this situation occurs, your wait stats will show signal waits even though the system is not actually under CPU pressure. When two CPU-bound queries share a scheduler, each periodically yields to the other, and these yields show up in sys.dm_os_wait_stats as SOS_SCHEDULER_YIELD waits.
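A quick way to check for this accumulation is to query the wait directly; a minimal example:

-- High SOS_SCHEDULER_YIELD totals on a box with idle CPU can indicate
-- tasks stacked on the same scheduler rather than real CPU pressure
SELECT wait_type,
       waiting_tasks_count,
       wait_time_ms,
       signal_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type = N'SOS_SCHEDULER_YIELD';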
Summing up, if you see the following pattern on your server:
… you may have a less-than-optimal scheduler distribution.
In a well-planned, long-running BULK INSERT workload, you may be able to do better than the SQL Server scheduler assignment method by leveraging the fact that a task stays on the same scheduler for its entire run. The DMV sys.dm_exec_requests contains information about every executing request, including the scheduler that is currently running your connection's work. Using this view, you can check whether the current scheduler is busy; if it is, you can have your client application drop the connection and retry.
Using this “terminate and reconnect” trick, we were able to increase the throughput of a CPU-bound batch run by more than 25% on a large, 64-core computer. Before we applied the trick, we might have 16 cores almost idle and still observe SOS_SCHEDULER_YIELD waits. Your mileage may vary depending on the type of work you do and how “lucky” you get with the assignment of schedulers.
The following snippet sketches the probe at the heart of a "terminate and reconnect" wrapper; the retry loop that drops and reopens the connection when the probe reports a busy scheduler is assumed to live in your client application. Because a prior task's scheduler is favored for the next task on the same connection, the scheduler the probe lands on is a good predictor of where the real work will run:
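-- Find the scheduler this session's current request landed on
DECLARE @my_scheduler INT;

SELECT @my_scheduler = scheduler_id
FROM sys.dm_exec_requests
WHERE session_id = @@SPID;

-- Check how busy that scheduler is; runnable_tasks_count > 0 means other
-- tasks are already queued for CPU there, so the client should reconnect
SELECT runnable_tasks_count,
       current_tasks_count
FROM sys.dm_os_schedulers
WHERE scheduler_id = @my_scheduler;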
The probe-then-run approach has an inherent race condition: two connections can both observe the same scheduler as free and then start work on it together. One way to avoid this race condition is to add a semaphore to your batch control system; the semaphore ensures that only one connection at a time probes for, and claims, a scheduler.
An example of this implementation, sketched with an application lock standing in for the semaphore (the resource name below is illustrative):
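DECLARE @rc INT;

-- Serialize the probe-and-claim step across connections
EXEC @rc = sp_getapplock
    @Resource    = N'SchedulerProbe',   -- illustrative resource name
    @LockMode    = N'Exclusive',
    @LockOwner   = N'Session',
    @LockTimeout = 5000;                -- wait up to 5 seconds for the lock

IF @rc >= 0
BEGIN
    -- Run the scheduler probe shown earlier here; if the scheduler is
    -- free, start the work before releasing the lock so that no other
    -- connection can claim the same scheduler in the meantime
    EXEC sp_releaseapplock @Resource = N'SchedulerProbe', @LockOwner = N'Session';
END;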
I could have given this post the title "Resolving scheduler contention for concurrent BULK INSERT".
Since partitioning is an Enterprise-only feature, I think it's worth pointing out that a similar effect can be produced using a local partitioned view, though it does require INSTEAD OF triggers on the view.
CREATE TABLE dbo.HeavyInsert1
(
    id INTEGER NOT NULL,
    ptn TINYINT NOT NULL CHECK (ptn = 1),
    col1 VARCHAR(50) NOT NULL,
    PRIMARY KEY (id, ptn)
);
CREATE TABLE dbo.HeavyInsert2
(
    id INTEGER NOT NULL,
    ptn TINYINT NOT NULL CHECK (ptn = 2),
    col1 VARCHAR(50) NOT NULL,
    PRIMARY KEY (id, ptn)
);
GO
CREATE VIEW dbo.HeavyInsert
AS
SELECT id, ptn, col1 FROM dbo.HeavyInsert1
UNION ALL
SELECT id, ptn, col1 FROM dbo.HeavyInsert2;
GO
-- The trigger computes ptn from id, so inserts through the view
-- never have to supply the partitioning value themselves
CREATE TRIGGER [trg dbo.HeavyInsert IOI]
ON dbo.HeavyInsert
INSTEAD OF INSERT
AS
BEGIN
    SET NOCOUNT ON;
    INSERT dbo.HeavyInsert (id, ptn, col1)
    SELECT id, id % 2 + 1, col1 FROM inserted;
END;
GO
INSERT dbo.HeavyInsert (id, col1) VALUES (1, 'TestRecord1');
INSERT dbo.HeavyInsert (id, col1) VALUES (2, 'TestRecord2');
INSERT dbo.HeavyInsert (id, col1) VALUES (3, 'TestRecord3');
INSERT dbo.HeavyInsert (id, col1) VALUES (4, 'TestRecord4');
-- Supplying the ptn predicate lets the optimizer touch only one base
-- table (the original statement header was lost; a SELECT is assumed)
SELECT id, ptn, col1
FROM dbo.HeavyInsert
WHERE id = 4
  AND ptn = (4 % 2 + 1);
DROP VIEW dbo.HeavyInsert;
DROP TABLE dbo.HeavyInsert1;
DROP TABLE dbo.HeavyInsert2;
Given the size of the system, and especially the core count per NUMA node, was trace flag 8048 used? I would imagine this would eliminate spinlock contention and CMEMTHREAD waits by promoting the NUMA-node-level resources to core level.
If the scheduler a task is on has other tasks while other schedulers are free, those free schedulers must be on another NUMA node. So the point of the disconnect and reconnect is to get onto another NUMA node where there are idle schedulers.
Wouldn't trace flag 8015 (ignore NUMA and manage as if one big NUMA node) achieve these benefits without the overhead of the disconnects and reconnects?
For a system where multiple bulk inserts run as a batch workload, there are also other possible benefits to trace flag 8015. Managing a single memory pool can reduce buffer churn in the busiest NUMA node, and during ramp-up some of the challenges associated with poor memory placement can be avoided.
It seems that the costs of trace flag 8048 plus trace flag 8015 on such a system would be giving up local memory access (although that's not always much of a benefit on such a system) and going from one lazy writer per NUMA node down to a single lazy writer.
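For what it's worth, both of these are startup trace flags, so they are normally supplied as -T8048 and -T8015 startup parameters rather than enabled with DBCC TRACEON. A quick way to verify whether they are active on a running instance:

-- Returns the status of the two flags; both take effect only at startup
DBCC TRACESTATUS (8015, 8048);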