By: Christian Allaire | Updated: 2012-07-19
Problem
Change Data Capture (CDC) is a fundamental part of ETL, especially in a data warehousing context, and one way to achieve it is with hashing algorithms. Even though SSIS is a great ETL tool, it lacks such a hashing function. The scenario presented here is rather simple, but the same approach can be applied to more complex ETL logic. Basically, we have a source table with no reliable timestamp to identify which data has changed since the last load (as is often the case in operational systems). The table data is replicated to a staging area, and we want to store history in the data warehouse only for rows that have actually changed. One possible solution is to compare the staging and warehouse tables field by field, but that is obviously tedious and time consuming. We need an efficient and simple way to detect the changes.
Solution
The solution presented here is to use the SQL Server HashBytes function to generate a hash code for every source record. HashBytes can use algorithms such as MD5 or SHA1. These hash functions are very sensitive to change and will generate a completely different hash value for a single character change.
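As a quick illustration of this sensitivity (the sample strings below are made up), changing a single character produces an entirely different hash:

SELECT HASHBYTES('MD5', 'John Smith;123 Main St.')   -- one 16-byte varbinary value
SELECT HASHBYTES('MD5', 'John Smith;124 Main St.')   -- a completely different 16-byte value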
To begin with, here is a very simple SSIS staging package:
Figure 1:
The first SQL task simply truncates the staging table before the complete data set from the source table is inserted. The data flow task is nothing more than a one-to-one mapping from source to staging, with a few technical fields added (load_dts, load_cycle_id, ...).
The sql_update_checksum SQL task is where it starts to get interesting. It will bulk update a "checksum" field for all the records inserted in the staging table. Here are the configuration screenshots:
Figure 2:
This SQL task uses a direct input SQL statement, but with an input parameter.
Figure 3:
The parameter is a user-defined variable named "target_object_name" which contains the schema and staging table name we are feeding (for example "staging.customer").
Figure 4:
Next is the SQL statement with the "?" placeholder for the parameter. I use an UPDATE statement to bulk update a "checksum" field. This is where the get_hash_fields UDF is invoked; it returns the list of field names, which is then passed to HashBytes. The return value of HashBytes is a VARBINARY, which I convert to a hex string such as "0xEAD2DD130555A749BB128517DAA4C912" (a VARCHAR(34): the "0x" prefix plus the 32 hex characters of an MD5 hash). I then clean up the string by removing the first 2 characters ("0x") and putting it in lowercase.
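The conversion and cleanup can be tried on their own; the input string below is just a stand-in for the concatenated column values:

DECLARE @hash varbinary(16) = HASHBYTES('MD5', 'concatenated column values go here')
SELECT CONVERT(varchar(34), @hash, 1)                            -- '0x...' : style 1 keeps the 0x prefix
SELECT substring(lower(CONVERT(varchar(34), @hash, 1)), 3, 32)   -- 32 lowercase hex characters, prefix removed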
The problem with HashBytes is that it is not very user friendly, since it only accepts a single varchar or nvarchar value. This means that all the fields of a source record need to be concatenated into a single string, which also means that all non-string columns need to be converted to varchar.
So to achieve this and to make our code reusable, what better way than to encapsulate this conversion logic inside a UDF? Figure 5 shows the get_hash_fields function definition.
Figure 5:
CREATE FUNCTION [dbo].[get_hash_fields]
(
    @p_table_name  VARCHAR(100),
    @p_schema_name VARCHAR(20)
)
RETURNS VARCHAR(MAX)
AS
BEGIN
    DECLARE @sqlString AS VARCHAR(MAX)
    SET @sqlString = ''

    -- Build a '+'-separated list of expressions, one per column,
    -- each converting the column to varchar and replacing NULLs with ''
    SELECT @sqlString = @sqlString +
        CASE DATA_TYPE
            WHEN 'int'       THEN 'ISNULL(RTRIM(CONVERT(varchar(10),' + COLUMN_NAME + ')),'''')'
            WHEN 'datetime'  THEN 'ISNULL(RTRIM(CONVERT(varchar(10),' + COLUMN_NAME + ',112)),'''')'
            WHEN 'datetime2' THEN 'ISNULL(RTRIM(CONVERT(varchar(10),' + COLUMN_NAME + ',112)),'''')'
            WHEN 'date'      THEN 'ISNULL(RTRIM(CONVERT(varchar(10),' + COLUMN_NAME + ',112)),'''')'
            WHEN 'bit'       THEN 'ISNULL(RTRIM(CONVERT(varchar(1),' + COLUMN_NAME + ')),'''')'
            WHEN 'decimal'   THEN 'ISNULL(RTRIM(CONVERT(varchar(' + CONVERT(varchar(2),NUMERIC_PRECISION) + '),' + COLUMN_NAME + ')),'''')'
            ELSE 'ISNULL(RTRIM(' + COLUMN_NAME + '),'''')'
        END + '+'
    FROM INFORMATION_SCHEMA.COLUMNS
    WHERE TABLE_SCHEMA = @p_schema_name
      AND TABLE_NAME = @p_table_name
      -- technical fields added by the ETL are excluded from the hash
      AND COLUMN_NAME NOT IN ('load_dts','load_cycle_id','period','checksum')

    -- drop the trailing '+' before returning the expression list
    RETURN LEFT(ISNULL(@sqlString,''), LEN(@sqlString) - 1)
END
The function receives 2 arguments: @p_table_name, which is the table for which we want to generate the column list, and @p_schema_name, which is the schema of that table. As you can see, I am using a single SELECT statement to generate a concatenated string of all the fields in the table. The INFORMATION_SCHEMA.COLUMNS system view is used to retrieve all column names for the table. I am also filtering out a list of technical fields which should not be part of the hash code, since they are not received from the source but added by the ETL logic.
The CASE statement checks the data type of each column (as exposed by INFORMATION_SCHEMA.COLUMNS) and uses a corresponding CONVERT to turn the column into a varchar. (Of course, you could add entries to this CASE statement if you are using other data types.) So basically, the function builds a series of concatenated CONVERT expressions that is then used in the dynamic SQL statement of the SSIS SQL task.
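For example, assuming CUSTOMER_ID is an int column and CUSTOMER_NAME a plain varchar column, the int and ELSE branches of the CASE would emit fragments like these, each followed by the '+' concatenation operator:

ISNULL(RTRIM(CONVERT(varchar(10),CUSTOMER_ID)),'')+
ISNULL(RTRIM(CUSTOMER_NAME),'')+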
Example tables:
Suppose we have a staging and a data warehouse customer table such as:
staging.customer
dwh.customer:
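As a rough sketch, based on the columns used in the INSERT statement later in this tip and with assumed data types, the staging table might look like this:

CREATE TABLE staging.customer (
    load_dts        datetime,        -- technical field added by the ETL
    load_cycle_id   int,             -- technical field added by the ETL
    period          int,             -- technical field added by the ETL
    CUSTOMER_ID     int,
    CUSTOMER_NAME   varchar(100),
    CUSTOMER_ADDR1  varchar(100),
    CUSTOMER_ADDR2  varchar(100),
    CUSTOMER_ADDR3  varchar(100),
    BUSINESS_TYPE   varchar(50),
    BALANCE_DUE     decimal(10,2),
    DUE_DATE        date,
    PAY_FREQ        varchar(20),
    [checksum]      varchar(32)      -- populated by the sql_update_checksum task
)

-- dwh.customer would carry the same columns, plus the history-tracking
-- fields mentioned at the end of this tip (valid_from_dt, valid_to_dt, row_is_current)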
Here is a function call and its output:
function call:
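For the staging.customer example used throughout this tip, the call would be the following (the table name is the first parameter, the schema name the second):

SELECT [dbo].[get_hash_fields]('customer','staging')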
output string from the function:
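With the assumed data types from the sketch above (CUSTOMER_ID int, BALANCE_DUE decimal(10,2), DUE_DATE date, the remaining columns varchar), the function would return a string along these lines, with line breaks added here for readability:

ISNULL(RTRIM(CONVERT(varchar(10),CUSTOMER_ID)),'')+
ISNULL(RTRIM(CUSTOMER_NAME),'')+
ISNULL(RTRIM(CUSTOMER_ADDR1),'')+
ISNULL(RTRIM(CUSTOMER_ADDR2),'')+
ISNULL(RTRIM(CUSTOMER_ADDR3),'')+
ISNULL(RTRIM(BUSINESS_TYPE),'')+
ISNULL(RTRIM(CONVERT(varchar(10),BALANCE_DUE)),'')+
ISNULL(RTRIM(CONVERT(varchar(10),DUE_DATE,112)),'')+
ISNULL(RTRIM(PAY_FREQ),'')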
So here is the SSIS SQL task code from Figure 4, with the get_hash_fields call included. I hard coded @SchemaTableName as "staging.customer"; in the package this value is supplied by the SSIS parameter, which is why it shows as "?" in Figure 4.
DECLARE @sqlFieldString     as varchar(max)
DECLARE @sqlStatementString as varchar(max)
DECLARE @TableName          as varchar(100)
DECLARE @SchemaName         as varchar(20)
DECLARE @SchemaTableName    as varchar(121)   -- schema + '.' + table

SET @SchemaTableName = 'staging.customer'
SET @SchemaName = LTRIM(RTRIM(LEFT(@SchemaTableName, charindex('.',@SchemaTableName) - 1)))
SET @TableName  = LTRIM(RTRIM(RIGHT(@SchemaTableName, len(@SchemaTableName) - charindex('.',@SchemaTableName))))

-- build the concatenated field list for the target table
SELECT @sqlFieldString = [dbo].[get_hash_fields](@TableName, @SchemaName)

-- build the bulk update of the checksum column
-- (varchar(34) = the '0x' prefix plus the 32 hex characters of the MD5 hash)
SET @sqlStatementString =
    'UPDATE ' + @SchemaName + '.' + @TableName +
    ' SET [checksum] = substring(lower(convert(varchar(34), HASHBYTES(''MD5'',' + @sqlFieldString + '),1)),3,32)'

EXEC (@sqlStatementString)
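Expanded with the field list shown earlier, the statement handed to EXEC ends up looking like this (again with the assumed data types):

UPDATE staging.customer
SET [checksum] = substring(lower(convert(varchar(34), HASHBYTES('MD5',
        ISNULL(RTRIM(CONVERT(varchar(10),CUSTOMER_ID)),'')+
        ISNULL(RTRIM(CUSTOMER_NAME),'')+
        ISNULL(RTRIM(CUSTOMER_ADDR1),'')+
        ISNULL(RTRIM(CUSTOMER_ADDR2),'')+
        ISNULL(RTRIM(CUSTOMER_ADDR3),'')+
        ISNULL(RTRIM(BUSINESS_TYPE),'')+
        ISNULL(RTRIM(CONVERT(varchar(10),BALANCE_DUE)),'')+
        ISNULL(RTRIM(CONVERT(varchar(10),DUE_DATE,112)),'')+
        ISNULL(RTRIM(PAY_FREQ),'')
    ),1)),3,32)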
If we run the above code, the final result is that the checksum field has been updated for every row of our "staging.customer" table. It is now ready to be compared to the checksum field of the "dwh.customer" table in the data warehouse, which will allow us to detect any new changes arriving from the source system.
Inserting New Data
Once the checksum has been calculated for our staging table, the SQL statement to INSERT the changes into the warehouse can be as simple as the one below. Here I am using a LEFT JOIN to find any records that don't already exist in the "dwh.customer" table.
INSERT INTO [dwh].[customer]
SELECT stg.[load_dts]
      ,stg.[load_cycle_id]
      ,stg.[period]
      ,stg.[CUSTOMER_ID]
      ,stg.[CUSTOMER_NAME]
      ,stg.[CUSTOMER_ADDR1]
      ,stg.[CUSTOMER_ADDR2]
      ,stg.[CUSTOMER_ADDR3]
      ,stg.[BUSINESS_TYPE]
      ,stg.[BALANCE_DUE]
      ,stg.[DUE_DATE]
      ,stg.[PAY_FREQ]
      ,stg.[checksum]
FROM [staging].[customer] stg
LEFT JOIN [dwh].[customer] dwh ON stg.[checksum] = dwh.[checksum]
WHERE dwh.[checksum] IS NULL
Of course I am simplifying things: the SSIS package would need some lookup components and logic to manage technical fields such as valid_from_dt, valid_to_dt and row_is_current.
What is most interesting about this design pattern is that it makes the hashing code reusable by hiding all the hashing logic inside a SQL function which dynamically builds the code. It also makes the CDC logic much simpler and uses the power of the SQL engine to bulk update the hashing codes for all rows.
Next Steps
- In the case of more complex logic such as a type 2 dimension table being fed by multiple source tables, the pattern would be slightly different. In a nutshell, the ETL would involve the following steps:
- First data flow - business logic: Use a data source with a SQL command on your multiple source tables, and apply all business rules. Insert all records to a working table.
- SQL Task - hashing logic: Bulk update the checksum field for every record in the table, using the method described in this article.
- Second data flow - dimension logic: Use the working table as the source and apply the normal logic you would apply to a type 2 dimension, using the business key and the checksum field (a minimal change-detection query is sketched after this list).
(Note: ETL logic for a type 2 dimension is outside the scope of this article, but is very well documented in many books and on the web.)
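As a rough sketch of the change detection in that second data flow, and assuming CUSTOMER_ID is the business key and the dimension carries a row_is_current flag (using the staging.customer table from this tip in place of the working table), the changed records could be identified along these lines; the actual expire-and-insert handling of the type 2 dimension is left out, as noted above:

-- staging rows whose checksum differs from the current dimension row
SELECT stg.*
FROM staging.customer stg
INNER JOIN dwh.customer dim
    ON dim.CUSTOMER_ID = stg.CUSTOMER_ID
   AND dim.row_is_current = 1
WHERE dim.[checksum] <> stg.[checksum]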