Removing Duplicates Rows with SSIS Sort Transformation

By:   |   Updated: 2013-09-20   |   Comments (14)   |   Related: More > Integration Services Data Flow Transformations


Problem

I have multiple duplicate records in my SQL Server database. What is a quick and easy way to remove them using SSIS?

Solution

There are multiple ways to remove duplicate records in SQL Server. In this tip, I'll use the SSIS Sort Transformation to remove records and show you how easy it can be.

The SSIS Sort Transformation task is useful when you need to sort data into a certain sort order. You can compare it to the ORDER BY clause in a SELECT statement. Books Online explains it as:

"The Sort transformation sorts input data in ascending or descending order and copies the sorted data to the transformation output. You can apply multiple sorts to an input; each sort is identified by a numeral that determines the sort order. The column with the lowest number is sorted first, the sort column with the second lowest number is sorted next, and so on"

SSIS Sort Task in Action

First, open Visual Studio (or Business Intelligence Dev Studio if you're using pre SQL Server 2012) and create an SSIS project. Next, we can go ahead and make a connection to our database. Right click Connection Managers in Solution Explorer and choose New Connection Manager:

There are multiple ways to remove duplicate records in SQL Server

Choose your Connection Manager type. In this example, we'll use OLEDB. Next, configure the Connection Manager to point to your dataset. In this example, I'll use localhost and my Dev database:

Choose your Connection Manager type

Test the connection and click OK. Next, drag a Data Flow task from the SSIS toolbox onto the design screen:

drag a Data Flow task from the SSIS toolbox onto the design screen

Right click the Data Flow task and choose Edit. You are now inside the data flow task. This is where all the action happens. Drag an OLEDB source task from the SSIS toolbox to the design screen:

Right click the Data Flow task and choose Edit

Right click the OLEDB task and choose Edit. This screen is where we will define the connection manager we created earlier. Under OLEDB connection manager choose the connection you created. Leave data access mode as Table or view. Change the name of the table or the view to the table that has duplicate data that needs to be removed. In this example, I'll use a table named Teams:

Right click the OLEDB task and choose Edit

To preview the data click Preview. In my example, you can see I have duplicates in the Team, City and State columns:

you can see I have duplicates in the Team, City and State columns

Click OK to close the OLEDB Source task. Drag the Sort Transformation task onto the design screen. Connect the OLEDB Source task to the Sort task:

Connect the OLEDB Source task to the Sort task

Right click the Sort task and choose Edit. Here is where we can sort our data. Let's say I want to sort my data by State. Under Available Input Columns, I'll choose State:

Right click the Sort task and choose Edit

Click OK. Drag the Derived Column task from the SSIS toolbox onto the design screen. Connect the Sort task to the Derived Column task:

Drag the Derived Column task from the SSIS toolbox onto the design screen

Right click on the precedence constraint between Sort and Derived column and click Enable Date Viewer. This will allow us to view the data as it passes through the constraint:

Right click on the precedence constraint between Sort and Derived column and click Enable Date Viewer

Let's view our data sorted by State. Click the play button on the toolbar to debug:

Click the play button on the toolbar to debug

Tada! You can see the data has been sorted by State:

You can see the data has been sorted by State

Removing Duplicate Rows Using SSIS Sort

But wait....what does this have to do with removing duplicates? Close the Data Viewer and click the stop button on the toolbar to stop debugging. Right click the Sort task again and you'll notice down at the bottom, "Remove rows with duplicate values". I know, I know, you're thinking no way that it's this easy. Click the remove rows option and choose OK:

Click the remove rows option and choose OK

Click the play button on the toolbar again to view the results. On the design screen, you can see that I passed 20 rows to the sort column but the sort column only passed 11 rows to the next task. This means the transformation removed 9 duplicates based on the column state:

Click the play button on the toolbar again to view the results

The package worked the way I designed it but I don't want to remove State duplicates. I want to remove Team, City and State duplicates. Back in design view, right click the Sort task and choose Edit. Add Team and City to the input columns and click OK:

The package worked the way I designed it but I don't want to remove State duplicates

Now, click play one more time...

You can now see I have "Distinct" Team, City and State columns:

You can now see I have

Next Steps
  • To move the new dataset to a location just add a destination task in place of the derived column task.
  • View more SSIS Data Flow Transformation tips courtesy of MSSQLTips.com here


sql server categories

sql server webinars

subscribe to mssqltips

sql server tutorials

sql server white papers

next tip



About the author
MSSQLTips author Brady Upton Brady Upton is a Database Administrator and SharePoint superstar in Nashville, TN.

This author pledges the content of this article is based on professional experience and not AI generated.

View all my tips


Article Last Updated: 2013-09-20

Comments For This Article




Wednesday, July 24, 2019 - 11:25:41 AM - Christopher White Back To Top (81854)

Extremly helpful and best of all, it works! Thanks - You have saved me a bunch of hassle.


Friday, June 29, 2018 - 3:54:14 PM - Carlos Back To Top (76466)

thanks to Scott!  Your answer fits what I am doing.  If this somehow gets to you four years later, thanks! 


Tuesday, October 24, 2017 - 5:30:10 PM - Noha Back To Top (68754)

 

 Hello

Thank you for that nicely layout tutorial I wanted to ask is this option cheaper than distinct or there is no difference between the two? In my package I can add any of them but can't find out which option is effecient and cheaper.

 

thank you


Thursday, June 2, 2016 - 1:01:46 PM - Maggie Back To Top (41597)

 Awesome explanation.  Thank you!

 


Thursday, December 17, 2015 - 11:43:35 AM - PatrioticInfidelInPA Back To Top (40269)

 Excellent tutorial.  Just finished a class in Microsoft Virtual Acadamy on using SSIS Transformations and this was the perfect tutorial to step-by-step through them.  Great job and thank you.


Thursday, May 14, 2015 - 2:33:47 AM - Prudence M Back To Top (37170)

Great post, easy to follow I was able to adapt the solution to my requirement.

Thank you.

 


Monday, March 30, 2015 - 4:19:46 PM - Gary Back To Top (36763)

 

Hi

 

I may have missed something but when you say :-

"The package worked the way I designed it but I don't want to remove State duplicates. I want to remove Team, City and State duplicates. Back in design view, right click the Sort task and choose Edit. Add Team and City to the input columns and click OK:"

the screen pic below is the same as the first one

it needs to show Team and City ticked

good article though

 

thanks

 

 

Gary

 

 

 

 


Saturday, November 1, 2014 - 10:09:23 AM - rick willemain Back To Top (35150)

Nice, simple solution. Well presentef. Thank you


Tuesday, October 14, 2014 - 2:43:10 PM - Scott Back To Top (34957)

To select a "best" record from among duplicates, you need to define "best".  The most recent? (ORDER BY DateTime DESC).  The one with the fewest NULL values? (ORDER BY CASE WHEN ColA IS NULL THEN 1 ELSE 0 END + CASE WHEN ColB IS NULL THEN 1 ELSE 0 END + ...).  Extending the table used in this article, let's assume there is also a DateEntered column and you want to keep the most recent rows.  And to answer the second question, let's assume you want the discarded duplicate rows to go to another table.

In the data source component, use a query with a ROW_NUMBER() column instead of just the table.

SELECT ID, Team, City, State, DateEntered,
       Choice = ROW_NUMBER() OVER (PARTITION BY Team, City, State ORDER BY DateEntered DESC)
FROM dbo.Teams

Drop the Sort Transformation, because the ROW_NUMBER() function has already done all the sorting.  Add a Conditional Split transformation instead.  Send the rows with Choice=1 to the main output, and Choice>1 rows to a second output.  The main output has the unique rows you want to keep, and the second output has the duplicates.  The Choice column should be ignored in the destination components, there is no reason to save it in any tables.


Monday, July 14, 2014 - 4:48:03 AM - VanteGud Back To Top (32688)

Hello Brady,

Thanks for a nice a clean post.

Could you clarify something for me: If I have a table with, say, three columns and I do a "remove duplicates" on 'Key' And 'Value1' columns and lets say I have the following values in my columns:

Key Value1 Value2
1 Cat White
1 Cat Black
2 Parrot Red
3 Owl Blue

What would be my output of Value2 (Key=1)? White or Black? And can I add a sorting or something to control which one I get? (Time would be a good example of a needed sorting).

 

Thanks in advance,


Monday, June 2, 2014 - 11:58:36 AM - bhuvaneswara reddy Back To Top (32038)

Hi sir,

 your sended only eliminate the duplicate values, but i want eliminate duplicated values also going another table. please send the information how to do that.


Friday, May 9, 2014 - 4:12:27 AM - back Back To Top (30694)

 

hi bro if am have  a 2 rows 
such as

:id:name     :L_name:

 1  :  calos  :  NULL

 1  : NULL   :  LA

 

and i want to my data in ssis is  

:id:name     :L_name:

 1  :  calos  :  LA
 can you help me 
 

 


Monday, January 20, 2014 - 8:37:52 AM - Koen Verbeeck Back To Top (28143)

I believe it is important to notice that the sort component is a blocking transformation: it needs to load all of the source rows into memory before it even outputs one row. This package is absolutely not scalable and will eat available memory for large data sets until it comes to a grinding halt when it starts swapping out to disk.

On topic: very nice Ktutorial :)


Saturday, September 21, 2013 - 3:40:30 PM - Jesse Duncan Back To Top (26885)

That was great thanks for the in depth steps.

 















get free sql tips
agree to terms