Removing Duplicates from Strings in SQL Server

By: Aaron Bertrand | Updated: 2016-01-14 | Comments (15) | Related: > TSQL

Problem

I have seen many systems that accept or store multi-valued strings in parameters, variables, or columns, with no process set up for de-duplication. Typically these are comma-separated; but really, the delimiter isn't important, the problem is that there are multiple facts embedded in a single value.

Let's say we have a string like this:

 Bravo/Alpha/Bravo/Tango/Delta/Bravo/Alpha/Delta

And we want to turn it into this (distinct values ordered by original appearance):

 Bravo/Alpha/Tango/Delta

Or this (distinct values ordered alphabetically):

 Alpha/Bravo/Delta/Tango

Assuming that we can't fix the application to remove duplicates before SQL Server ever sees them, and we can't fix the schema so that the data can be separated and normalized, how would we accomplish this cleanup in Transact-SQL?

Solution

In this tip, I will show you my approach to this kind of problem, which (if the application can't be fixed and the data can't be normalized) will almost always try to lean toward a set-based solution. The typical response to such a problem will often point toward some kind of looping strategy; SQL Server is not very efficient with loops.

First, we need a Numbers table in order to support string splitting. There are multiple approaches to breaking strings apart, but the Numbers table is my favorite, because it is simple to implement, has a small disk/memory footprint, performs well enough, and is more flexible than some of the other common approaches. One way to create a numbers table is as follows:

DECLARE @UpperLimit INT = 1000000;

;WITH n(rn) AS
(
  SELECT ROW_NUMBER() OVER (ORDER BY s1.[object_id])
  FROM sys.all_columns AS s1
  CROSS JOIN sys.all_columns AS s2
)
SELECT [Number] = rn
INTO dbo.Numbers FROM n
WHERE rn <= @UpperLimit;

CREATE UNIQUE CLUSTERED INDEX n ON dbo.Numbers([Number])
-- WITH (DATA_COMPRESSION = PAGE);

This will create a Numbers table with 1,000,000 rows, which by default will take about 17MB (with page compression, it will only require about 11MB). You may need more or fewer numbers depending on your usage and requirements; for this tip, you probably really only need 4,000 or 8,000 numbers to deal with the most common maximum string lengths.

As an aside, I typically create this table in a Utility database, so that a single copy can be used centrally from any other database. You can use synonyms to point to it so that your code does not need to use three-part names.

Now with the Numbers table in place, you can create a function that uses the table to break strings apart:

CREATE FUNCTION dbo.SplitString
(
  @List  NVARCHAR(MAX),
  @Delim NVARCHAR(32)
)
RETURNS TABLE
WITH SCHEMABINDING
AS
  RETURN
  (
    SELECT 
      rn, 
      vn = ROW_NUMBER() OVER (PARTITION BY [Value] ORDER BY rn), 
      [Value]
    FROM 
    ( 
      SELECT 
        rn = ROW_NUMBER() OVER (ORDER BY CHARINDEX(@Delim, @List + @Delim)),
        [Value] = LTRIM(RTRIM(SUBSTRING(@List, [Number],
                  CHARINDEX(@Delim, @List + @Delim, [Number]) - [Number])))
      FROM dbo.Numbers
      WHERE Number <= LEN(@List)
      AND SUBSTRING(@Delim + @List, [Number], LEN(@Delim)) = @Delim
    ) AS x
  );

As a quick test, we can see how this splitting function outputs our original string (and how we can use the rn and vn output columns to sort and remove duplicates, respectively):

DECLARE @List NVARCHAR(MAX), @Delim NVARCHAR(32);

SELECT @List  = N'Bravo/Alpha/Bravo/Tango/Delta/Bravo/Alpha/Delta', 
       @Delim = N'/';

SELECT rn, vn, [Value] FROM dbo.SplitString(@List, @Delim);

Results:

  rn      vn      Value
  ------  ------  -----
  2       1       Alpha
  7       2       Alpha
  1       1       Bravo
  3       2       Bravo
  6       3       Bravo
  5       1       Delta
  8       2       Delta
  4       1       Tango
  
--^ any vn <> 1 can be filtered out as a duplicate
        
--^ row number can be used to preserve original order

Finally, we need a function that will put the strings back together, using FOR XML PATH:

CREATE FUNCTION dbo.ReassembleString
(
  @List  NVARCHAR(MAX),
  @Delim NVARCHAR(32),
  @Sort  NVARCHAR(32)
)
RETURNS NVARCHAR(MAX)
WITH SCHEMABINDING
AS
BEGIN
  RETURN 
  ( 
    SELECT newval = STUFF((
     SELECT @Delim + x.[Value] 
     FROM dbo.SplitString(@List, @Delim) AS x
     WHERE (x.vn = 1) -- filter out duplicates
     ORDER BY CASE @Sort
       WHEN N'OriginalOrder' THEN CONVERT(INT, x.rn)
       WHEN N'Alphabetical'  THEN CONVERT(NVARCHAR(4000), x.value)
       ELSE CONVERT(SQL_VARIANT, NULL) END
     FOR XML PATH, TYPE).value(N'.[1]',N'nvarchar(max)'),1,1,N'')
  );
END

With these in place, removing duplicates from our original string and reassembling it in the desired order becomes quite trivial:

DECLARE @List NVARCHAR(MAX), @Delim NVARCHAR(32);

SELECT @List  = N'Bravo/Alpha/Bravo/Tango/Delta/Bravo/Alpha/Delta', 
       @Delim = N'/';

SELECT OriginalOrder = dbo.ReassembleString(@List, @Delim, N'OriginalOrder'),
       Alphabetical  = dbo.ReassembleString(@List, @Delim, N'Alphabetical');

Results:

OriginalOrder              Alphabetical
-----------------------    -----------------------
Bravo/Alpha/Tango/Delta    Alpha/Bravo/Delta/Tango

Now, this is not going to be very efficient at scale, but should be quite acceptable for smaller sets, or one-time mass updates - certainly fast compared to looping constructs you may be using now.

Next Steps

Implement a Numbers table in a central utility database, and create the two functions to assist with string splitting and assembling.
See these related tips and other resources:

About the author

Aaron Bertrand (@AaronBertrand) is a passionate technologist with industry experience dating back to Classic ASP and SQL Server 6.5. He is editor-in-chief of the performance-related blog, SQLPerformance.com, and also blogs at sqlblog.org.

This author pledges the content of this article is based on professional experience and not AI generated.

View all my tips

Article Last Updated: 2016-01-14

Thursday, November 8, 2018 - 3:38:18 PM - Dan Carollo	Back To Top (78194)
The new STRING_SPLIT() function in SQL 2016 also handles dedups!

Wednesday, January 20, 2016 - 11:09:03 AM - Aaron Bertrand	Back To Top (40453)
Steve, I think you will find that it is over-engineered because it will prove to give better performance than a looping/iteration function; I also missed how yours guarantees that the returned string will be generated in the same order (or how a user could call the function and ask for alphabetical instead).

Removing Duplicates from Strings in SQL Server

Problem

Solution

Next Steps

About the author

Comments For This Article