Resolving scheduler contention for concurrent BULK INSERT

Rate This
  • Comments 5

 

Author: Thomas Kejser 

Reviewers: Stuart Ozer, Kun Cheng, Lindsey Allen, Mike Ruthruff and Mark Souza, Prem Mehra, Alex Verbidski

Background information

As you may be aware, SQL Server, through SQLOS, implements its own scheduling mechanism on top of the Windows operating system. This is done to spend the maximum amount of CPU time in user mode by using yielding instead of preemptive scheduling. Also, SQLOS can exercise very fine control over the threads by providing an abstraction on top of Windows threads.

Central to the SQLOS scheduling mechanism is the scheduler object. A SQLOS scheduler is an abstraction of a CPU or, in the case of multi-core machines, a CPU-core. Schedulers are grouped into nodes; a node corresponds either to the hardware NUMA nodes on the host machine or to the soft NUMA configuration of SQL Server. For example, an 8 CPU dual core machine with 2 hardware NUMA nodes and no soft NUMA configured has 2 nodes with 8 schedulers in each node.

You can view the SQLOS schedulers and nodes by using the DMV sys.dm_os_schedulers. Notice that there are some extra, special schedulers in this DMV; these are for SQL Server internal use. Also, you will see the dedicated admin connection (DAC) scheduler here (scheduler_id = 255). The schedulers that do the actual query execution work are marked as VISIBLE ONLINE in the status column.

At connection time, the user’s session is assigned to a specific node. SQL Server uses a round-robin assignment mechanism to assign the connection to the node. Once a session is on a specific node, it will not move from it for the duration of the connection.

Whenever a session sends a batch request or RPC to the server, a scheduler is assigned to handle the full execution of the request. In SQLOS, this call is known as a task.  The assignment of a scheduler is done by identifying the least busy scheduler within the sessions node, although the scheduler that was used for a prior task on the same connection is somewhat favored. The assigned scheduler will be used for the duration of the task—it is not possible for SQLOS to do a scheduler switch inside a task.

BULK INSERT commands

There are cases when SQL Server assignment of schedulers is less than optimal. Remember that the scheduler is an SQLOS internal abstraction of a CPU Core. So, if two long-running, CPU-bound queries end up on the same scheduler, they will compete for the same CPU resources. At the same time, another idle scheduler may have CPU time available. Ideally, the idle schedulers would take some of the load from the scheduler that is executing two long-running queries. But remember, the scheduler switch cannot happen in the middle of a task.

Multiple BULK INSERT tasks fit this scenario precisely: a bulk insert, assuming good I/O, is CPU bound and typically runs for a long time. If you execute more than one of them concurrently, you can end up in a situation where two bulk inserts run on the same scheduler, while another scheduler is idle.

If the above situation  occurs, your wait stats will show signal waits, even though the system is not actually under CPU pressure. If two CPU-bound queries share a scheduler, each one will periodically yield to the other. These yields show up in sys.dm_os_wait_stats as waits for SOS_SCHEDULER_YIELD.

Summing up, if you see the following pattern on your server:

  • Total CPU load less than 100%
  • More than one long-running, CPU-bound query is executing on the same scheduler
  • Your queries are individually CPU bound, but you are still not able to push CPU load to 100%
  • Many signal waits in sys.dm_os_wait_stats
  • Many waits for SOS_SCHEDULER_YIELD  in sys.dm_os_wait_stats

… You may have less than optimal scheduler distribution.

The solution: terminate and reconnect

In a well-planned, long-running BULK INSERT workload you may be able to do better than the SQL Server scheduler assignment method. You can actually leverage the fact that a task stays on the same scheduler during its run. The DMV sys.dm_exec_requests contains information about all sessions, including the scheduler that is currently executing your connection. Using this view, you can check to see if the current scheduler is busy. If it is, you can have your client application retry the connection.

Using this “terminate and reconnect” trick we were able to increase the throughput of a CPU-bound batch run by more than 25% on a large 64-core computer. Before we applied this trick, we might have 16 cores almost idle and still we observed SOS_SCHEDULER_YIELD waits. Your mileage may vary depending on the type of work you do and how “lucky” you get with the assignment of schedulers.

The following code snippet can be used to build a "terminate and reconnect" wrapper:

 
CREATE PROCEDURE Batch.Wrapper_DoWork
AS /* Get my scheduler */

DECLARE
@my_scheduler_id INT
SELECT @my_scheduler_id = scheduler_id
FROM sys.dm_exec_requestsWHERE session_id = @@SPID 

/* Check if someone else is doing long running work on my scheduler */
IF EXISTS (SELECT *   
  FROM sys.dm_exec_requests 
 
 
WHERE scheduler_id <> @my_scheduler_id 
   
AND command LIKE 'BULK%' /* replace with your specific check for long running query*/
 
 
)
BEGIN 
   
RETURN 0 /* Failed to get a non busy scheduler, let client try again */
END
ELSE BEGIN 
   
/* Do long running query here */
 
   
RETURN 1
END

Remember that your client application must perform a reconnect if the wrapper returns 0.
 
The above stored procedure is subject to some race conditions. You can end up in a situation where the scheduler check shows that your scheduler is not busy but in the meantime, before you start your long-running query, another query connects to your scheduler and starts its long running work.

One way to avoid this race condition is to add a semaphore to your batch control system. This semaphore can then be used to ensure exclusive access to a scheduler.

An example of this implementation is:


/* Create table to act as semaphore */
CREATE TABLE Batch.ClaimedSchedulers( 
  scheduler_id INT PRIMARY KEY
 
  , session_id INT
) 

CREATE PROCEDURE Batch.DoWork_WithClaim
AS

/* Get my scheduler */
DECLARE @my_scheduler_id INT
SELECT @my_scheduler_id = scheduler_id
FROM sys.dm_exec_requests
WHERE session_id = @@SPID 

DECLARE @RC INT /* hold the row count of inserted data */

/* Claim the scheduler as my own */
INSERT INTO Batch.ClaimedSchedulers (scheduler_id, session_id)
SELECT @my_scheduler_id, @@SPID
FROM sys.dm_os_schedulers s
LEFT JOIN Batch.ClaimedSchedulers cs WITH (TABLOCKX)  /* ensure serialization */ 
  ON cs.scheduler_id = s.scheduler_id
WHERE cs.scheduler_id IS NULL 
  AND s.status = 'VISIBLE ONLINE'
 
  AND s.scheduler_id = @my_scheduler_id
SET @RC = @@ROWCOUNT/* Did my scheduler claim fail? (no row will be inserted) */

IF @RC = 0 BEGIN 
  RETURN 0 /* Could not get a non busy scheduler, let client try again */
END
ELSE BEGIN 
  /* DO LONG RUNNING QUERY HERE */
  

  /* Release my scheduler again */
 
  DELETE FROM Batch.ClaimedSchedulers
 
  WHERE session_id = @@SPID
  

  RETURN 1
END
Your comment has been posted.   Close
Thank you, your comment requires moderation so it may take a while to appear.   Close
Leave a Comment
  • Post
  • I could have given this post the title &quot;Resolving scheduler contention for concurrent BULK INSERT&quot;

  • Pingback from  Do-it-yourself scheduling in SQL Server &laquo; A Data Professional&#039;s Blog

  • По материалам технической статьи, посвящённой SQL Server: The Data Loading Performance Guide Авторы:

  • Since partitioning is an Enterprise-only feature, I think it's worth pointing out that a similar effect can be produced using a local partitioned view, though it does require INSTEAD OF triggers on the view.

    USE tempdb;

    GO

    CREATE  TABLE

           dbo.HeavyInsert1

           (

           id      INTEGER NOT NULL,

           ptn     TINYINT NOT NULL

                   CHECK (ptn = 1),

           col1    VARCHAR(50) NOT NULL,

           PRIMARY KEY (id, ptn)

           );

    GO        

    CREATE  TABLE

           dbo.HeavyInsert2

           (  

           id      INTEGER NOT NULL,

           ptn     TINYINT NOT NULL

                   CHECK (ptn = 2),

           col1    VARCHAR(50) NOT NULL,

           PRIMARY KEY (id, ptn)

           );

    GO

    CREATE  VIEW

           dbo.HeavyInsert

    WITH    SCHEMABINDING

    AS      

    SELECT  id, ptn, col1

    FROM    dbo.HeavyInsert1

    UNION   ALL

    SELECT  id, ptn, col1

    FROM    dbo.HeavyInsert2

    GO

    CREATE  TRIGGER [trg dbo.HeavyInsert IOI]

    ON      dbo.HeavyInsert

    INSTEAD OF INSERT

    AS

    BEGIN

       SET NOCOUNT ON

       INSERT dbo.HeavyInsert (id, ptn, col1)

       SELECT id, id % 2 + 1, col1 FROM inserted

    END

    GO

    INSERT  dbo.HeavyInsert (id, col1) VALUES (1, 'TestRecord1')

    INSERT  dbo.HeavyInsert (id, col1) VALUES (2, 'TestRecord2')

    INSERT  dbo.HeavyInsert (id, col1) VALUES (3, 'TestRecord3')

    INSERT  dbo.HeavyInsert (id, col1) VALUES (4, 'TestRecord4')

    GO

    SELECT  *

    FROM    dbo.HeavyInsert

    WHERE   id = 4

    AND     ptn = (4 % 2 + 1);

    GO

    DROP    VIEW dbo.HeavyInsert;

    DROP    TABLE dbo.HeavyInsert1;

    DROP    TABLE dbo.HeavyInsert2;

  • Due to the size of the system, and especially the core count per NUMA node, was trace flag 8048 used?  I would imagine this would eliminate spinlock contention and cmemthread waits by promoting the NUMA node level resources to core level.  

    If the scheduler the task is on has other tasks and there are other free schedulers, they must be on another NUMA node.  So the point of disconnect and reconnect is to get on another NUMA node where there are idle schedulers.

    Wouldn't trace flag 8015 (ignore NUMA and manage as if one big NUMA node) achieve these benefits without the overhead of the disconnects and reconnects?

    For a system with multiple batch bulk inserts occurring in batch, there are also other possible benefits to trace flag 8015.  Managing a single memory pool can reduce buffer churn in the busiest NUMA node.  During ramp up, some of the challenges associated with bad memory sort can be avoided.

    It seems that the costs of trace flag 8048 plus trace flag 8015 on such a system would be giving up local memory access(although that's not always much of a benefit on such a system), and the reduction from one lazy writer per NUMA node to one lazy writer.

Page 1 of 1 (5 items)
Your comment has been posted.   Close
Thank you, your comment requires moderation so it may take a while to appear.   Close
Leave a Comment
  • Post