Implement a Slowly Changing Type 2 Dimension in SQL Server Integration Services - Part 1

By:   |   Updated: 2018-08-09   |   Comments (8)   |   Related: 1 | 2 | 3 | 4 | More > Integration Services Development


Problem

We’d like to keep history in our data warehouse for several dimensions. We use SQL Server Integration Services (SSIS) to implement the ETL (Extract Transform and Load). We tried the built-in Slowly Changing Dimension wizard, but the performance seems poor. How can we implement the desired functionality with regular SSIS components?

Solution

Introduction to Slowly Changing Dimensions

A slowly changing dimension (SCD) keeps track of the history of its individual members. There are several methods proposed by Ralph Kimball in his book The Datawarehouse Toolkit:

  • Type 1 – Overwrite the fields when the value changes. No history is kept.
  • Type 2 – Create a new line with the new values for the fields. Extra columns indicate when in time a row was valid.
  • Type 3 – Keep the old value of a column in a separate column.
  • There are more types of SCDs, but they are mostly a hybrid combination of the above.

In this tip, we’ll focus on the type 2 situation. Let’s illustrate with an example. We have a simple table storing customer data.

scd2 example step 1

The SK_Customer column is a column with an identity property which will generate a new value for every row. We’d like to keep history for the Location attribute. When the location changes from Antwerp to Brussels, we don’t update the row, but we insert a new record:

scd example 2

Using the ValidFrom and ValidTo fields, we indicate when a record was valid in time. A new surrogate key is generated, but the business key – CustomerName – remains the same. When a fact table is loaded, a lookup will be done on the customer table. Depending on the timestamp of the fact record, one of the two rows will be returned. For example:

scd example fact

All facts are for the same customer. When you would ask the total sales amount for CustomerA, the result is 31. The total sales per location is 12.75 for Antwerp and 18.25 for Brussels, even though the data is for the same customer. Using SCD Type 2, we can analyze our data with historical attributes.

Implementation Methods

There are several methods for loading a Slowly Changing Dimension of type 2 in a data warehouse. You could opt for a pure T-SQL approach, either with multiple T-SQL statements or by using the MERGE statement. The latter is explained in the tip Using the SQL Server MERGE Statement to Process Type 2 Slowly Changing Dimensions.

With SSIS, you can use the built-in Slowly Changing Dimension wizard, which can handle multiple scenarios. This wizard is described in the tips Loading Historical Data into a SQL Server Data Warehouse and Handle Slowly Changing Dimensions in SQL Server Integration Services. The downside of this wizard is performance: it uses the OLE DB Command for every update, which can result in poor performance for larger data sets. If you make changes to the data flow to solve these issues, you can’t run the wizard again as you would lose all changes.

The last option – aside from using 3rd party components – is building the SCD Type 2 logic yourself in the data flow, which we’ll describe in the next section.

Implementation in SSIS

The solution proposed in this tip works for any version of SSIS. We’ll reprise the example of the customer dimension, but an extra field has been added: the email attribute. We don’t keep history of the email addresses, so any new value will overwrite all other values.

First, we read the data from a source, most likely a staging environment. When using a relational source, you can use the OLE DB Source component with a SQL query to read the data. Only select columns you actually need for your dimension. The ValidFrom field is calculated here as well. Since you typically load data from the previous day, ValidFrom is set to the date of yesterday. If you don’t have a relational source, you can add this column using a Derived Column transformation.

read data from source

In the next step, we’re doing a lookup against the dimension. Here we’ll verify if the incoming rows are either an insert or an update. If no match is found, the row is a new dimension member and it needs to be inserted. If a match is found, the dimension member already exists and we’ll need to check for SCD Type 2 changes. Unless you have a very large dimension, you can use the full cache:

ssis lookup transformation editor

Configure the lookup transformation to send non-matching rows to the no match output. In SSIS 2005 this option doesn’t exist yet, so you can either use the error output, or set the transformation to ignore failures and split out inserts and updates using a conditional split.

In the Connection pane, the following SQL query fetches the surrogate key, the business key (CustomerName) and the SCD Type 2 columns. For each member, only the most recent row is retrieved, by filtering on the ValidTo field.

retrieve dimension data

The Location field is renamed to Location_OLD for clarity. In the Columns pane, match on the business key and select all other columns.

matching and retrieval

Now we can add an OLE DB Destination on the canvas. This destination will write all new rows to the dimension. Connect the Lookup No Match Output of the lookup transformation to the destination.

writing new rows

On the mapping pane, map the columns of the data flow with the columns of the dimension table.

map new rows

The SK_Customer column is left unmapped, as it’s an IDENTITY column and its values are generated by the database engine. The ValidTo column is also left blank. New rows have no value for this column.

In part 2 of this tip we’ll continue our configuration of the data flow, where we’ll check if a row is a type 2 update or not.

Next Steps


sql server categories

sql server webinars

subscribe to mssqltips

sql server tutorials

sql server white papers

next tip



About the author
MSSQLTips author Koen Verbeeck Koen Verbeeck is a seasoned business intelligence consultant at AE. He has over a decade of experience with the Microsoft Data Platform in numerous industries. He holds several certifications and is a prolific writer contributing content about SSIS, ADF, SSAS, SSRS, MDS, Power BI, Snowflake and Azure services. He has spoken at PASS, SQLBits, dataMinds Connect and delivers webinars on MSSQLTips.com. Koen has been awarded the Microsoft MVP data platform award for many years.

This author pledges the content of this article is based on professional experience and not AI generated.

View all my tips


Article Last Updated: 2018-08-09

Comments For This Article




Monday, October 23, 2023 - 3:25:51 AM - saif Back To Top (91695)
datasets are not available ?

Wednesday, January 4, 2023 - 5:58:00 AM - Gmanunta Back To Top (90795)
I am very happy to share this article on my LinkedIn

Monday, February 22, 2021 - 5:06:08 AM - Koen Verbeeck Back To Top (88274)
Hi Dom,

it seems indeed the create table statement is not included for the customer dimension.
It's a simple table though: one int column with an identity, two string columns and two date columns.

Regards,
Koen

Wednesday, February 17, 2021 - 5:29:59 AM - Dom Back To Top (88252)
Hi, really good article. I'm in the process of building a similar process, I'm slightly confused by the customer dimension. No reference to how this is created or where it is coming from?

Thursday, June 11, 2020 - 8:18:27 AM - Koen Verbeeck Back To Top (85951)

Hi Muhammed,

I would load the CSV file into a staging table. Then I would just follow the guidelines of this tip. Make sure you also grab part 2 (https://www.mssqltips.com/sqlservertip/5639/optimizing-type-2-slowly-changing-dimensions-in-integration-services--part-2/) and part 3 (https://www.mssqltips.com/sqlservertip/5640/additional-optimizations-for-type-2-slowly-changing-dimensions-in-sql-server-integration-services--part-3/) of this tip.


Tuesday, June 9, 2020 - 12:11:20 PM - Muhammad Shakir Back To Top (85879)

I have a csv file and want to load in sql server database using ssis package and want to apply SCD 2 please help me.


Friday, November 30, 2018 - 2:02:29 AM - Koen Verbeeck Back To Top (78366)

Hi Bhargav,

with an IDENTITY constraint.

Regards,
Koen


Thursday, November 29, 2018 - 2:51:47 PM - Bhargav Kandala Back To Top (78362)

Thanks for such a nice article. How do you generate the surrogate key?















get free sql tips
agree to terms