Author: Thomas Kejser
Reviewers: Stuart Ozer, Kun Cheng, Lindsey Allen, Mike Ruthruff and Mark Souza, Prem Mehra, Alex Verbidski
Background information
As you may be aware, SQL Server, through SQLOS, implements its own scheduling mechanism on top of the Windows operating system. This is done to spend the maximum amount of CPU time in user mode by using yielding instead of preemptive scheduling. Also, SQLOS can exercise very fine control over the threads by providing an abstraction on top of Windows threads.
Central to the SQLOS scheduling mechanism is the scheduler object. A SQLOS scheduler is an abstraction of a CPU or, in the case of multi-core machines, a CPU-core. Schedulers are grouped into nodes; a node corresponds either to the hardware NUMA nodes on the host machine or to the soft NUMA configuration of SQL Server. For example, an 8 CPU dual core machine with 2 hardware NUMA nodes and no soft NUMA configured has 2 nodes with 8 schedulers in each node.
You can view the SQLOS schedulers and nodes by using the DMV sys.dm_os_schedulers. Notice that there are some extra, special schedulers in this DMV; these are for SQL Server internal use. Also, you will see the dedicated admin connection (DAC) scheduler here (scheduler_id = 255). The schedulers that do the actual query execution work are marked as VISIBLE ONLINE in the status column.
At connection time, the user’s session is assigned to a specific node. SQL Server uses a round-robin assignment mechanism to assign the connection to the node. Once a session is on a specific node, it will not move from it for the duration of the connection.
Whenever a session sends a batch request or RPC to the server, a scheduler is assigned to handle the full execution of the request. In SQLOS, this call is known as a task. The assignment of a scheduler is done by identifying the least busy scheduler within the session’s node, although the scheduler that was used for a prior task on the same connection is somewhat favored. The assigned scheduler will be used for the duration of the task—it is not possible for SQLOS to do a scheduler switch inside a task.
BULK INSERT commands
There are cases when SQL Server assignment of schedulers is less than optimal. Remember that the scheduler is an SQLOS internal abstraction of a CPU Core. So, if two long-running, CPU-bound queries end up on the same scheduler, they will compete for the same CPU resources. At the same time, another idle scheduler may have CPU time available. Ideally, the idle schedulers would take some of the load from the scheduler that is executing two long-running queries. But remember, the scheduler switch cannot happen in the middle of a task.
Multiple BULK INSERT tasks fit this scenario precisely: a bulk insert, assuming good I/O, is CPU bound and typically runs for a long time. If you execute more than one of them concurrently, you can end up in a situation where two bulk inserts run on the same scheduler, while another scheduler is idle.
If the above situation occurs, your wait stats will show signal waits, even though the system is not actually under CPU pressure. If two CPU-bound queries share a scheduler, each one will periodically yield to the other. These yields show up in sys.dm_os_wait_stats as waits for SOS_SCHEDULER_YIELD.
Summing up, if you see the following pattern on your server:
-
Total CPU load less than 100%
-
More than one long-running, CPU-bound query is executing on the same scheduler
-
Your queries are individually CPU bound, but you are still not able to push CPU load to 100%
-
Many signal waits in sys.dm_os_wait_stats
-
Many waits for SOS_SCHEDULER_YIELD in sys.dm_os_wait_stats
… You may have less than optimal scheduler distribution.
The solution: terminate and reconnect
In a well-planned, long-running BULK INSERT workload you may be able to do better than the SQL Server scheduler assignment method. You can actually leverage the fact that a task stays on the same scheduler during its run. The DMV sys.dm_exec_requests contains information about all sessions, including the scheduler that is currently executing your connection. Using this view, you can check to see if the current scheduler is busy. If it is, you can have your client application retry the connection.
Using this “terminate and reconnect” trick we were able to increase the throughput of a CPU-bound batch run by more than 25% on a large 64-core computer. Before we applied this trick, we might have 16 cores almost idle and still we observed SOS_SCHEDULER_YIELD waits. Your mileage may vary depending on the type of work you do and how “lucky” you get with the assignment of schedulers.
The following code snippet can be used to build a "terminate and reconnect" wrapper:
CREATE PROCEDURE Batch.Wrapper_DoWork
AS /* Get my scheduler */
DECLARE @my_scheduler_id INT
SELECT @my_scheduler_id = scheduler_id
FROM sys.dm_exec_requestsWHERE session_id = @@SPID
/* Check if someone else is doing long running work on my scheduler */
IF EXISTS (SELECT *
FROM sys.dm_exec_requests
WHERE scheduler_id <> @my_scheduler_id
AND command LIKE 'BULK%' /* replace with your specific check for long running query*/
)
BEGIN
RETURN 0 /* Failed to get a non busy scheduler, let client try again */
END
ELSE BEGIN
/* Do long running query here */
RETURN 1
END
Remember that your client application must perform a reconnect if the wrapper returns 0.
The above stored procedure is subject to some race conditions. You can end up in a situation where the scheduler check shows that your scheduler is not busy but in the meantime, before you start your long-running query, another query connects to your scheduler and starts its long running work.
One way to avoid this race condition is to add a semaphore to your batch control system. This semaphore can then be used to ensure exclusive access to a scheduler.
An example of this implementation is:
/* Create table to act as semaphore */
CREATE TABLE Batch.ClaimedSchedulers(
scheduler_id INT PRIMARY KEY
, session_id INT
)
CREATE PROCEDURE Batch.DoWork_WithClaim
AS
/* Get my scheduler */
DECLARE @my_scheduler_id INT
SELECT @my_scheduler_id = scheduler_id
FROM sys.dm_exec_requests
WHERE session_id = @@SPID
DECLARE @RC INT /* hold the row count of inserted data */
/* Claim the scheduler as my own */
INSERT INTO Batch.ClaimedSchedulers (scheduler_id, session_id)
SELECT @my_scheduler_id, @@SPID
FROM sys.dm_os_schedulers s
LEFT JOIN Batch.ClaimedSchedulers cs WITH (TABLOCKX) /* ensure serialization */
ON cs.scheduler_id = s.scheduler_id
WHERE cs.scheduler_id IS NULL
AND s.status = 'VISIBLE ONLINE'
AND s.scheduler_id = @my_scheduler_id
SET @RC = @@ROWCOUNT/* Did my scheduler claim fail? (no row will be inserted) */
IF @RC = 0 BEGIN
RETURN 0 /* Could not get a non busy scheduler, let client try again */
END
ELSE BEGIN
/* DO LONG RUNNING QUERY HERE */
/* Release my scheduler again */
DELETE FROM Batch.ClaimedSchedulers
WHERE session_id = @@SPID
RETURN 1
END