By: Jared Westover | Updated: 2023-03-08 | Comments (6) | Related: 1 | 2 | 3 | 4 | 5 | 6 | 7 | > TSQL
Problem
Whenever someone mentions the MERGE statement, at least one person points out that the performance is sub-optimal compared to the same basic T-SQL INSERT, UPDATE, and DELETE statements. You can find a wide variety of articles either supporting or condemning MERGE. Like sushi or exercise, it's one of those SQL statements that people love or hate. Full disclosure, I like using MERGE. So, I'm putting the claims of it performing sub-optimally to the test. Perhaps by the end of this tutorial, I'll change my mind and stop using MERGE.
Solution
In this tutorial, we'll start by looking at the MERGE statement and why you would want to use it. I'll then look at alternatives to MERGE. I'll present the primary functionality MERGE provides that other options don't. We'll set up a decent size dataset and perform our comparison test. By the end of this tutorial, I hope to answer the question: Is MERGE slow? Perhaps, more importantly: Should new code projects include the MERGE statement?
Exploring the SQL Server MERGE Statement
Let's start by answering, "What is MERGE?" Microsoft defines MERGE as "a statement that runs insert, update, or delete operations on a target table from the results of a join with a source table." It seems simple, right? You have a source table, aka the source of truth. Then you have a destination table that you'll update based on the source of truth. The MERGE statement exists in other RDMS such as Oracle, MySQL, and Azure Synapse.
Data engineers commonly use MERGE for loading or syncing data into an environment, perhaps for a data warehouse. For example, if your database has an employee table synced nightly from an origination system, you could use MERGE to get the job done. MERGE inserts new employees, updates existing ones, and deletes ones missing from the source. Below is a simple example of using the MERGE statement.
MERGE dbo.DestinationTable AS dest USING dbo.SourceTable AS src -- Source Clauses ON (dest.SpecialKey = src.SpecialKey) WHEN MATCHED THEN -- Matched Clauses UPDATE SET Column1 = src.Column1, Column2 = src.Column2, Column3 = src.Column3 WHEN NOT MATCHED BY TARGET THEN INSERT ( Column1, Column2, Column3 ) VALUES (src.Column1, src.Column2, src.Column3); GO
Don't forget to include the semicolon at the end of MERGE or SQL yells at you.
Unique Advantage of MERGE
One of the main reasons I advocate MERGE is that it offers the ability to use a special OUTPUT clause to reference columns not part of the inserted or deleted tables. The standard OUTPUT clause works with INSERT, UPDATE, and DELETE; however, you can't reference columns not being inserted or deleted. Daniel Hutmacher wrote a short blog post on the topic that's worth the time investment. This technique helps replace the logic of cursors or WHILE loops.
It's impossible to perform a search without coming across articles pointing out known issues with the MERGE statement. Aaron Bertrand has written several of them, and I recommend reading them. MERGE is not a perfect statement, but this article focuses on performance. When I say performance, I care primarily about the time something takes to complete.
Alternative to MERGE
What's the alternative to MERGE? It's straightforward—you individually perform INSERT, UPDATE, and DELETE statements. Microsoft recommends not using MERGE if you perform any of those individual operations. Of course, this would also mean you don't need access to the non-inserted and deleted columns mentioned above.
Building a Test Dataset
Now, let's set up the base for our test. It's important to mimic a real-world dataset versus a simplistic one with a single integer column. The basic idea is to update one table based on the source of another. Let's name our first table "EmployeeSource" and the second "EmployeeDestination." We'll also create a trusty numbers table. I rarely delete data in the real world due to compliance regulations. So, let's include a BIT column indicating a soft delete instead of deleting. Remember, you'll need about 7 GB of free space.
USE [master]; GO IF DATABASEPROPERTYEX('MergeDemo', 'Version') IS NOT NULL BEGIN ALTER DATABASE MergeDemo SET SINGLE_USER WITH ROLLBACK IMMEDIATE; DROP DATABASE MergeDemo; END; GO CREATE DATABASE MergeDemo; GO ALTER DATABASE MergeDemo SET RECOVERY SIMPLE; GO USE MergeDemo; GO DROP TABLE IF EXISTS dbo.Numbers; GO DECLARE @UpperBound INT = 10010000; ;WITH cteN (Number) AS (SELECT ROW_NUMBER() OVER (ORDER BY s1.[object_id]) FROM sys.all_columns AS s1 CROSS JOIN sys.all_columns AS s2) SELECT [Number] INTO dbo.Numbers FROM cteN WHERE [Number] <= @UpperBound; CREATE UNIQUE CLUSTERED INDEX CIX_Number ON dbo.Numbers ([Number]) WITH (FILLFACTOR = 100); CREATE TABLE dbo.EmployeeDestination ( Id INT IDENTITY(1, 1) NOT NULL, EmployeeNumber CHAR(20) NOT NULL, FirstName NVARCHAR(256) NOT NULL, LastName NVARCHAR(256) NOT NULL, MiddleIntial CHAR(1) NULL, YearlySalary DECIMAL(36, 2) NOT NULL, DepartmentCode CHAR(5) NOT NULL, StartDate DATE NOT NULL, IsActive BIT NOT NULL DEFAULT 1, IsDeleted BIT NOT NULL DEFAULT 0, CONSTRAINT PK_Merge_Employee_Id PRIMARY KEY CLUSTERED (Id) ); CREATE TABLE dbo.EmployeeSource ( RowNumber INT NOT NULL, EmployeeNumber CHAR(20) NOT NULL, FirstName NVARCHAR(256) NOT NULL, LastName NVARCHAR(256) NOT NULL, MiddleIntial CHAR(1) NULL, YearlySalary DECIMAL(36, 2) NOT NULL, DepartmentCode CHAR(5) NOT NULL, StartDate DATE NOT NULL, IsActive BIT NOT NULL, IsDeleted BIT NOT NULL ); GO
The next code block performs the following actions:
- The first statement inserts 10 million rows of data into the "EmployeeDestination" table.
- The second statement inserts 1 million rows of data into the "EmployeeSource" table.
- The third statement updates 100,000 rows of data already in the "EmployeeSource" table.
- The last statement inserts 10,000 new rows of data into the "EmployeeSource" table.
After performing these actions, the goal is to insert 10,000 new rows of data and update 100,000 existing rows of data in the "EmployeeDestination" table.
INSERT INTO dbo.EmployeeDestination ( EmployeeNumber, FirstName, LastName, MiddleIntial, YearlySalary, DepartmentCode, StartDate, IsActive ) SELECT TOP (10000000) CONCAT('E00', n.Number) AS EmployeeNumber, SUBSTRING('ABCDEFGHIJKLMNOPQRSTUVWXYZ', (ABS(CHECKSUM(NEWID())) % 26) + 1, 6) AS FirstName, SUBSTRING('ABCDEFGHIJKLMNOPQRSTUVWXYZ', (ABS(CHECKSUM(NEWID())) % 26) + 1, 7) AS LastName, SUBSTRING('ABCDEFGHIJKLMNOPQRSTUVWXYZ', (ABS(CHECKSUM(NEWID())) % 26) + 1, 1) AS MiddleInitial, ABS(CHECKSUM(NEWID()) % 100000) + 45000 AS YearlySalary, CONCAT('M000', ABS(CHECKSUM(NEWID()) % 3) + 1) AS DepartmentCode, DATEADD(DAY, RAND(CHECKSUM(NEWID())) * (1 + DATEDIFF(DAY, '2020-01-01', '03-31-2023')), '2020-01-01') AS StartDate, CASE WHEN (n.Number % 1000) = 0 THEN 0 ELSE 1 END AS IsActive FROM dbo.Numbers n; GO INSERT INTO dbo.EmployeeSource ( RowNumber, EmployeeNumber, FirstName, LastName, MiddleIntial, YearlySalary, DepartmentCode, StartDate, IsActive, IsDeleted ) SELECT ROW_NUMBER() OVER (ORDER BY EmployeeNumber) AS RowNumber, EmployeeNumber, FirstName, LastName, MiddleIntial, YearlySalary, DepartmentCode, StartDate, IsActive, IsDeleted FROM dbo.EmployeeDestination WHERE (Id % 10) = 0; UPDATE dbo.EmployeeSource SET YearlySalary = ABS(CHECKSUM(NEWID()) % 100000) + 45000, IsDeleted = 1, DepartmentCode = 'M0004' WHERE (RowNumber % 10) = 0; INSERT INTO dbo.EmployeeSource ( RowNumber, EmployeeNumber, FirstName, LastName, MiddleIntial, YearlySalary, DepartmentCode, StartDate, IsDeleted, IsActive ) SELECT TOP (10000) ROW_NUMBER() OVER (ORDER BY n.Number) AS RowNumber, CONCAT('E00', n.Number) AS EmployeeNumber, SUBSTRING('ABCDEFGHIJKLMNOPQRSTUVWXYZ', (ABS(CHECKSUM(NEWID())) % 26) + 1, 6) AS FirstName, SUBSTRING('ABCDEFGHIJKLMNOPQRSTUVWXYZ', (ABS(CHECKSUM(NEWID())) % 26) + 1, 7) AS LastName, SUBSTRING('ABCDEFGHIJKLMNOPQRSTUVWXYZ', (ABS(CHECKSUM(NEWID())) % 26) + 1, 1) AS MiddleInitial, ABS(CHECKSUM(NEWID()) % 100000) + 45000 AS YearlySalary, CONCAT('M000', ABS(CHECKSUM(NEWID()) % 3) + 1) AS DepartmentCode, DATEADD(DAY, RAND(CHECKSUM(NEWID())) * (1 + DATEDIFF(DAY, '2020-01-01', '03-31-2023')), '2020-01-01') AS StartDate, 0 AS IsDeleted, 1 AS IsActive FROM dbo.Numbers n WHERE n.Number > 10000000; GO
At this point in the process, I made a copy-only backup that I'll restore after each test. I wanted to ensure that certain factors like statistics or cache didn't affect the results.
Executing the SQL MERGE Statement
With either of these methods, I want to ensure SQL Server only updates rows where a modification exists. Said another way, the row in the source data and destination data are different. This technique is beneficial when dealing with temporal tables. Now that we have our dataset let's first run the MERGE statement:
MERGE dbo.EmployeeDestination AS dest USING dbo.EmployeeSource AS src ON (dest.EmployeeNumber = src.EmployeeNumber) WHEN MATCHED AND EXISTS ( SELECT FirstName, LastName, MiddleIntial, YearlySalary, DepartmentCode, StartDate, IsActive, IsDeleted FROM dbo.EmployeeSource EXCEPT SELECT FirstName, LastName, MiddleIntial, YearlySalary, DepartmentCode, StartDate, IsActive, IsDeleted FROM dbo.EmployeeDestination ) THEN UPDATE SET FirstName = src.FirstName, LastName = src.LastName, MiddleIntial = src.MiddleIntial, YearlySalary = src.YearlySalary, DepartmentCode = src.DepartmentCode, StartDate = src.StartDate, IsActive = src.IsActive, IsDeleted = src.IsDeleted WHEN NOT MATCHED BY TARGET THEN INSERT ( EmployeeNumber, FirstName, LastName, MiddleIntial, YearlySalary, DepartmentCode, StartDate, IsActive, IsDeleted ) VALUES (src.EmployeeNumber, src.FirstName, src.LastName, src.MiddleIntial, src.YearlySalary, src.DepartmentCode, src.StartDate, src.IsActive, src.IsDeleted); GO
After each iteration, I executed the above statement five times with a restore of the database. This process was laborious, but I wanted consistent comparisons.
Performing INSERT Statements and UPDATE Statements
Below is the code for performing the UPDATE and INSERT in separate statements. Like the MERGE, I only want to update rows where a modification occurred.
UPDATE dest SET dest.FirstName = src.FirstName, dest.LastName = src.LastName, dest.MiddleIntial = src.MiddleIntial, dest.YearlySalary = src.YearlySalary, dest.DepartmentCode = src.DepartmentCode, dest.StartDate = src.StartDate, IsActive = src.IsActive, IsDeleted = src.IsDeleted FROM dbo.EmployeeDestination_NonMerge dest INNER JOIN dbo.EmployeeSource src ON dest.EmployeeNumber = src.EmployeeNumber -- JOIN Condition WHERE EXISTS ( SELECT FirstName, LastName, MiddleIntial, YearlySalary, DepartmentCode, StartDate, IsActive, IsDeleted FROM dbo.EmployeeSource EXCEPT SELECT FirstName, LastName, MiddleIntial, YearlySalary, DepartmentCode, StartDate, IsActive, IsDeleted FROM dbo.EmployeeDestination ); INSERT INTO dbo.EmployeeDestination ( EmployeeNumber, FirstName, LastName, MiddleIntial, YearlySalary, DepartmentCode, StartDate, IsActive, IsDeleted ) SELECT EmployeeNumber, FirstName, LastName, MiddleIntial, YearlySalary, DepartmentCode, StartDate, IsActive, IsDeleted FROM dbo.EmployeeSource src WHERE NOT EXISTS ( SELECT src.EmployeeNumber FROM dbo.EmployeeDestination dest WHERE src.EmployeeNumber = dest.EmployeeNumber ); GO
As with MERGE, I executed the two statements five times with a restored database after each run.
Reviewing the Results
I captured the performance markers below with Extended Events. It's a helpful tool that allows for comprehensive and efficient monitoring and collection of system information from SQL Server. You could also use STATISTICS TIME & IO. Now back to the results.
The numbers reflect an average of five runs. There wasn't a significant difference between individual executions.
Method | Logical Reads | Writes | CPU Time | Duration | Duration % |
---|---|---|---|---|---|
MERGE Statement | 5992820 | 99404 | 838904667 | 79088005 | 18.9% |
UPDATE/INSERT Operations | 5829529 | 99316 | 720380333 | 66536671 |
Before comparing the statements, I suspected MERGE might be slower. According to the results, MERGE is about 19% slower. The outcome surprised me. I would have guessed 3-5%. Does this change my mind? Yea, kind of. When starting a new project requiring a larger data sync, I'll likely choose the individual INSERT operations and UPDATE operations. A 19% difference is huge if your sync process takes hours. Dwain Camps found similar results when comparing the methods in this article.
Do you prefer the MERGE syntax? Then please keep on using it. I drive a particular route to my office because of the traffic pattern, even though it takes a bit longer sometimes.
I look forward to hearing about your experiences with MERGE in the comments below. If you feel the test was somehow skewed, please let me know. I based the original criteria on my specific needs.
Key Takeaways
- A handy feature of MERGE is the ability to reference columns not in the inserted or deleted tables in the OUTPUT clause.
- If you only perform individual INSERT, UPDATE, or DELETE statements, consider skipping MERGE.
- Do you like using MERGE? Keep doing so but be aware that the command might add performance overhead.
Next Steps
- Would you like to learn more about capturing performance metrics using Extended Events? Check out this link for a list of tips.
- Arshad Ali wrote a detailed article on using the MERGE statement. I highly recommend giving it a read.
- Aaron Bertrand wrote a massive collection of articles on proceeding with caution when using the MERGE statement.
About the author
This author pledges the content of this article is based on professional experience and not AI generated.
View all my tips
Article Last Updated: 2023-03-08