Find and Remove Duplicate Rows from a SQL Server Table

By: Sergey Gigoyan | Updated: 2021-07-20 | Comments (13) | Related: > Database Design

Problem

According to database design best practices, a SQL Server table should not contain duplicate rows. During the database design process primary keys should be created to eliminate duplicate rows. However, sometimes we need to work with databases where these rules are not followed or exceptions are possible (when these rules are bypassed knowingly). For example, when a staging table is used and data is loaded from different sources where duplicate rows are possible. When the loading process completes, table should be cleaned or clean data should be loaded to a permanent table, so after that the duplicates are no longer needed. Therefore, an issue concerning the removal of duplicates from the loading table arises. In this tutorial let's examine some ways to solve data de-duplication needs.

Solution

We will consider two cases in this tip:

The first case is when a SQL Server table has a primary key (or unique index) and one of the columns contains duplicate values which should be removed.
The second case is that table does not have a primary key or any unique indexes and contains duplicate rows which should be removed. Let's discuss these cases separately.

How to remove duplicate rows in a SQL Server table

Duplicate records in a SQL Server table can be a very serious issue. With duplicate data it is possible for orders to be processed numerous times, have inaccurate results for reporting and more. In SQL Server there are a number of ways to address duplicate records in a table based on the specific circumstances such as:

Table with Unique Index - For tables with a unique index, you have the opportunity to use the index to order identify the duplicate data then remove the duplicate records. Identification can be performed with self-joins, ordering the data by the max value, using the RANK function or using NOT IN logic.
Table without a Unique Index - For tables without a unique index, it is a bit more challenging. In this scenario, the ROW_NUMBER() function can be used with a common table expression (CTE) to sort the data then delete the subsequent duplicate records.

Check out the examples below to get real world examples on how to delete duplicate records from a table.

Removing duplicates rows from a SQL Server table with a unique index

Test Environment Setup

To accomplish our tasks, we need a test environment which we create with the following statement:

USE master
GO

CREATE DATABASE TestDB
GO

USE TestDB
GO

CREATE TABLE TableA
(
 ID INT NOT NULL IDENTITY(1,1),
 Value INT,
 CONSTRAINT PK_ID PRIMARY KEY(ID)  
)

Now let's insert data into our new table - 'TableA' with the following statement:

USE TestDB
GO

INSERT INTO TableA(Value)
VALUES(1),(2),(3),(4),(5),(5),(3),(5)

SELECT *
FROM TableA

SELECT Value, COUNT(*) AS DuplicatesCount
FROM TableA
GROUP BY Value

As we can see in the result set below the values 3 and 5 exists in the 'Value' column more than once:

Identify Duplicate Rows in a SQL Server Table

Our task is to enforce uniqueness for the 'Value' column by removing duplicates. Removing duplicate values from table with a unique index is a bit easier than removing the rows from a table without it. First of all, we need to find duplicates. There are many different ways to do that. Let's investigate and compare some common ways. In the following queries below there are six solutions to find that duplicate values which should be deleted (leaving only one value):

----- Finding duplicate values in a table with a unique index
--Solution 1
SELECT a.* 
FROM TableA a, (SELECT ID, (SELECT MAX(Value) FROM TableA i WHERE o.Value=i.Value GROUP BY Value HAVING o.ID < MAX(i.ID)) AS MaxValue FROM TableA o) b
WHERE a.ID=b.ID AND b.MaxValue IS NOT NULL

--Solution 2
SELECT a.* 
FROM TableA a, (SELECT ID, (SELECT MAX(Value) FROM TableA i WHERE o.Value=i.Value GROUP BY Value HAVING o.ID=MAX(i.ID)) AS MaxValue FROM TableA o) b
WHERE a.ID=b.ID AND b.MaxValue IS NULL

--Solution 3
SELECT a.*
FROM TableA a
INNER JOIN
(
 SELECT MAX(ID) AS ID, Value 
 FROM TableA
 GROUP BY Value 
 HAVING COUNT(Value) > 1
) b
ON a.ID < b.ID AND a.Value=b.Value

--Solution 4
SELECT a.* 
FROM TableA a 
WHERE ID < (SELECT MAX(ID) FROM TableA b WHERE a.Value=b.Value GROUP BY Value HAVING COUNT(*) > 1)

--Solution 5 
SELECT a.*
FROM TableA a
INNER JOIN
(SELECT ID, RANK() OVER(PARTITION BY Value ORDER BY ID DESC) AS rnk FROM TableA ) b 
ON a.ID=b.ID
WHERE b.rnk > 1

--Solution 6 
SELECT * 
FROM TableA 
WHERE ID NOT IN (SELECT MAX(ID) 
                 FROM TableA 
                 GROUP BY Value)

As we can see the result for all cases is the same as shown in the screenshot below:

Different techniques to identify duplicate rows

Only rows with ID=3, 5, 6 need to be deleted. Looking at the execution plan we can see that latest - the most 'compact' solution ('Solution 6') has a highest cost (in our example there is a primary key on the 'ID' column, so 'NULL' values are not possible for that column, therefore 'NOT IN' will work without any problem), and the second has the lowest cost:

Execution plans for the duplicate row code

Deleting Duplicate Rows in a SQL Server Table

Now by using the following queries, let's delete duplicate values from the table. To simplify our process, we will use only the second, the fifth and the sixth queries:

USE TestDB
GO

--Initializing the table
TRUNCATE TABLE TableA

INSERT INTO TableA(Value)
VALUES(1),(2),(3),(4),(5),(5),(3),(5)

--Deleting duplicate values
DELETE t
FROM TableA t
WHERE ID IN ( SELECT a.ID FROM TableA a, (SELECT ID, (SELECT MAX(Value) FROM TableA i WHERE o.Value=i.Value GROUP BY Value HAVING o.ID=MAX(i.ID)) AS MaxValue FROM TableA o) b
    WHERE a.ID=b.ID AND b.MaxValue IS NULL) 

--Initializing the table
TRUNCATE TABLE TableA

INSERT INTO TableA(Value)
VALUES(1),(2),(3),(4),(5),(5),(3),(5)

--Deleting duplicate values
DELETE a
FROM TableA a
INNER JOIN
(SELECT ID, RANK() OVER(PARTITION BY Value ORDER BY ID DESC) AS rnk FROM TableA ) b 
ON a.ID=b.ID
WHERE b.rnk>1

--Initializing the table
TRUNCATE TABLE TableA

INSERT INTO TableA(Value)
VALUES(1),(2),(3),(4),(5),(5),(3),(5)

--Deleting duplicate values
DELETE FROM TableA 
WHERE ID NOT IN (SELECT MAX(ID) 
                 FROM TableA 
                 GROUP BY Value)

Deleting the data and looking into the execution plans again we see that the fastest is the first DELETE statement and the slowest is the last as expected:

Query plans for deleting the duplicate data

Removing duplicates from table without a unique index in ORACLE

As a means to help illustrate our final example in this tip, I want to explain some similar functionality in Oracle. Removing duplicate rows from the table without a unique index is a little easier in Oracle than in SQL Server. There is a ROWID pseudo column in Oracle which returns the address of the row. It uniquely identifies the row in the table (usually in the database also, but in this case, there is an exception - if different tables store data in the same cluster they can have the same ROWID). The query below creates and inserts data into table in the Oracle database:

CREATE TABLE TableB (Value INT);

INSERT INTO TableB(Value) VALUES(1);
INSERT INTO TableB(Value) VALUES(2);
INSERT INTO TableB(Value) VALUES(3);
INSERT INTO TableB(Value) VALUES(4);
INSERT INTO TableB(Value) VALUES(5);
INSERT INTO TableB(Value) VALUES(5);
INSERT INTO TableB(Value) VALUES(3);
INSERT INTO TableB(Value) VALUES(5);

Now we are selecting the data and ROWID from the table:

SELECT ROWID, Value FROM TableB;

The result is below:

SELECT ROWID in Oracle to identify duplicates

Now using ROWID, we will easily remove duplicate rows from table:

DELETE TableB
WHERE  rowid not in (
                      SELECT MAX(rowid)
                      FROM  TableB
                      GROUP  BY Value
                    );

We can also remove duplicates using the code below:

DELETE from TableB o
WHERE  rowid < (
                      SELECT MAX(rowid)
                      FROM  TableB i
                      WHERE i.Value=o.Value
                      GROUP  BY Value
                    );

Removing duplicates from a SQL Server table without a unique index

Unlike Oracle, there is no ROWID in SQL Server, so to remove duplicates from the table without a unique index we need to do additional work for generating unique row identifiers:

USE TestDB
GO

CREATE TABLE TableB (Value INT)

INSERT INTO TableB(Value) 
VALUES(1),(2),(3),(4),(5),(5),(3),(5)

SELECT * FROM TableB

; WITH TableBWithRowID AS
(
 SELECT ROW_NUMBER() OVER (ORDER BY Value) AS RowID, Value
 FROM TableB
)

DELETE o
FROM TableBWithRowID o
WHERE RowID < (SELECT MAX(rowID) FROM TableBWithRowID i WHERE i.Value=o.Value GROUP BY Value)

SELECT * FROM TableB

In the code above, we are creating a table with duplicate rows. We are generating unique identifiers using the ROW_NUMBER() function and by using common table expression (CTE) we are deleting duplicates:

Removing duplicates from a SQL Server table without unique index

This code, however, can be replaced with more compact and optimal one:

USE TestDB
GO

--Initializing the table
TRUNCATE TABLE TableB

INSERT INTO TableB(Value) 
VALUES(1),(2),(3),(4),(5),(5),(3),(5)

--Deleting duplicate values
; WITH TableBWithRowID AS
(
 SELECT ROW_NUMBER() OVER (PARTITION BY Value ORDER BY Value) AS RowID, Value
  FROM TableB
)

DELETE o
FROM TableBWithRowID o
WHERE RowID > 1

SELECT * FROM TableB

Having said that, it is possible to identify the physical address of the row in SQL Server as well. Despite the fact that it is practically impossible to find official documentation about this feature, it can be used as an analog to ROWID pseudo column in Oracle. It is called %%physloc%% (since SQL Server 2008) and it is a virtual binary(8) column which shows the physical location of the row. As the value of %%physloc%% is unique for each row, we can use it as a row identifier while removing duplicate rows from a table without a unique index. Thus, we can remove duplicate rows from a table without a unique index in SQL Server like in Oracle as well as like in the case when the table has a unique index.

The first two queries below are the equivalent versions of removing duplicates in Oracle, the next two are queries for removing duplicates using %%physloc%% similar to the case of the table with a unique index, and in the last query, %%physloc%% is not used just for comparing performance of all of these options:

-- option 1
--Initializing the table
TRUNCATE TABLE TableB

INSERT INTO TableB(Value)
VALUES(1),(2),(3),(4),(5),(5),(3),(5)

DELETE o
FROM
(SELECT %%physloc%% as RowID, value FROM TableB) o
WHERE o.RowID < (
                   SELECT MAX(%%physloc%%)
                   FROM TableB i
                   WHERE i.Value=o.Value
                   GROUP BY Value
                                  )

-----------
-- option 2
--Initializing the table
TRUNCATE TABLE TableB

INSERT INTO TableB(Value)
VALUES(1),(2),(3),(4),(5),(5),(3),(5)

DELETE TableB
WHERE %%physloc%% not in (
                   SELECT MAX(%%physloc%%)
                   FROM TableB
                   GROUP BY Value
                                   )

-----------
-- option 3
--Initializing the table
TRUNCATE TABLE TableB

INSERT INTO TableB(Value)
VALUES(1),(2),(3),(4),(5),(5),(3),(5)

DELETE b1
FROM 
(SELECT %%physloc%% as RowID, value FROM TableB) b1
INNER JOIN
(SELECT %%physloc%% as RowID, RANK() OVER(PARTITION BY Value ORDER BY %%physloc%% DESC) AS rnk FROM TableB ) b2 
ON b1.RowID=b2.RowID
WHERE b2.rnk>1

-----------
-- option 4
--Initializing the table
TRUNCATE TABLE TableB

INSERT INTO TableB(Value)
VALUES(1),(2),(3),(4),(5),(5),(3),(5)

DELETE b1
FROM Tableb b1 
WHERE %%physloc%% < (SELECT MAX(%%physloc%%) 
                     FROM Tableb b2 
                     WHERE b1.Value=b2.Value 
                     GROUP BY Value 
                     HAVING COUNT(*) > 1
                    )

-----------
-- option 5
--Initializing the table
TRUNCATE TABLE TableB

INSERT INTO TableB(Value)
VALUES(1),(2),(3),(4),(5),(5),(3),(5)

; WITH 
TableBWithRowID AS
(
   SELECT ROW_NUMBER() OVER (partition by Value ORDER BY Value) AS RowID, Value
   FROM TableB
)

DELETE o
FROM TableBWithRowID o
WHERE RowID > 1

Analyzing the Execution Plans, we can see that the first and the last queries are the fastest when compared to the overall batch times:

Hence, we can conclude that in general, using %%physloc%% does not improve the performance. While using this approach, it is very important to realize that this is an undocumented feature of SQL Server and, therefore, developers should be very careful.

There are other ways to remove duplicates which is not discussed in this tip. For example, we can store distinct rows in a temporary table, then delete all data from our table and after that insert distinct rows from temporary table to our permanent table. In this case DELETE and INSERT statements should be included in one transaction.

Conclusion

During our experience we face situations when we need to clean duplicate values from SQL Server tables. The duplicate values can be in the column which will be de-duplicated based on our requirements or the table can contain duplicate rows. In either case we need to exclude the data to avoid data duplication in the database. In this tip we explained some techniques which hopefully will be helpful to solve these types of problems.

Next Steps

Review this related information:

About the author

Sergey Gigoyan is a database professional with more than 10 years of experience, with a focus on database design, development, performance tuning, optimization, high availability, BI and DW design.

This author pledges the content of this article is based on professional experience and not AI generated.

View all my tips

Article Last Updated: 2021-07-20

Thursday, November 11, 2021 - 10:13:55 AM - AliceTeacher	Back To Top (89433)
"Remove duplicates" becomes much more complex (for me) when I have to consider both of these 2 items: 1. I want to decide what I consider a "duplicate". (Only check these 3 columns, not all of them, determines my defined "duplicate") 2. When a duplicate is found, I want to decide which of the 2 records to display. (The record that has a value in a certain column)

Wednesday, April 21, 2021 - 7:00:04 AM - Bui Tan Duoc	Back To Top (88574)
Removing duplicates rows from a SQL Server table with a unique index. I have column ID (unique index) and 4 more data columns. All data columns are the same but ID. I want to keep one? What is the sql script? Thanks.

Tuesday, September 11, 2018 - 1:28:23 PM - Sanjay Yadav	Back To Top (77508)
Thanks , very helpful to crack sql interview