How To Implement Batch Processing in SQL Server Integration Services

By:   |   Updated: 2008-05-22   |   Comments (14)   |   Related: More > Integration Services Development


Problem

We routinely load data warehouses with multiple years worth of fact rows at a time. We'd like to perform this process in batches and be able to restart at the point of failure when an error occurs. Can you give us an example of how we might implement this batching capability in an SSIS package?

Solution

SSIS supports batch processing very nicely with the existing components in the Toolbox. A simple approach to implementing batch processing in SSIS is to come up with a way to group the rows to be processed into batches, process each batch, then update each group as processed. Let's begin by describing a scenario then implement an SSIS package to perform the work.

A common requirement in developing reporting applications is to aggregate data allowing report queries to run very quickly. Let's assume that we want to aggregate data by month. We also want to have the capability to make adjustments to the aggregated data and only recalculate the month(s) that have adjustments.

We can envision an SSIS package with the following steps:

control flow
  • Get Batch List is an Execute SQL task that groups the source data to be processed into batches, creating a result set that contains a single row per batch
  • Process Batch Loop is a Foreach Loop container that iterates over the result set rows; i.e. executes once for each row in the result set
  • Transaction Container is a Sequence container that contains the tasks to be executed for each iteration of the loop; it controls the transaction used to commit if successful or rollback on error
  • Append Batch to Sales History is an Execute SQL task that extracts a batch of rows and inserts them to a history table
  • Compute Aggregation is an Execute SQL task that performs the aggregations on the batch and updates an aggregation table
  • Mark Batch as Processed is an Execute SQL task that updates rows in the source table to indicate that they have been processed

In the following sections we will discuss each step in the SSIS package in detail. Let's begin with the setup then proceed through the steps.

Setup

For simplicity sake we'll get our source data from the AdventureWorks sample database that comes with SQL Server 2005. Use the following script to copy the SalesOrderHeader and SalesOrderDetail tables from AdventureWorks into a database called mssqltips (create this database if it doesn't exist):

USE mssqltips
GO

SELECT *
INTO dbo.imp_SalesOrderHeader
FROM AdventureWorks.Sales.SalesOrderHeader

SELECT *
INTO dbo.imp_SalesOrderDetail
FROM AdventureWorks.Sales.SalesOrderDetail

ALTER TABLE dbo.imp_SalesOrderHeader
ADD Processed bit not null default 0
GO

The Processed column will be updated to 1 as the rows are processed.

In the SSIS package the following variables will be used:

variables

We'll describe the variable usage in the sections below.

Get Batch List

Get Batch List executes a stored procedure that groups the source data into batches. While there are many ways to accomplish this task, in this case we simply group on year and month in the stored procedure stp_CreateOrderBatchList:

SELECT 
  DATEPART(YYYY,OrderDate) OrderYear
 ,DATEPART(MONTH,OrderDate) OrderMonth
FROM dbo.imp_SalesOrderHeader
WHERE Processed = 0
GROUP BY 
  DATEPART(YYYY,OrderDate)
 ,DATEPART(MONTH,OrderDate)
ORDER BY 
  DATEPART(YYYY,OrderDate)
 ,DATEPART(MONTH,OrderDate)

Note that the stored procedure only gets rows where the Processed column is equal to zero.

The Execute SQL task in our SSIS package executes the above stored procedure. The General properties page is shown below:

GetBatchList General

Note the ResultSet property is set to Full result set. The stored procedure just does a SELECT and the Execute SQL task stores the result set in a package variable. The Result Set properties page maps the result set to the package variable User::v_BatchList. The variable type must be System.Object. The Result Name of 0 (zero) is required.

GetBatchList ResultSet

Process Batch Loop

Process Batch Loop is a Foreach Loop container that iterates over the result set created in Get Batch List, one time for each row in the result set. There are two property pages to be configured - Collection and Variable Mappings. The Collection property page has the following settings:

foreach loop editor

In order to iterate through the result set created by Get Batch List, the Enumerator is set to Foreach ADO Enumerator and the ADO object source variable is set to User::v_BatchList. Get Batch List mapped the User::v_BatchList variable to the result set. The Enumeration mode is set to Rows in the first table (there is only one table in the result set).

The Variable Mappings property page has the following settings:

foreach loop mappings

The stored procedure executed in Get Batch List returns a result set that has two columns - OrderYear and OrderMonth. The Variable Mappings property page maps the columns in each row of the result set to the package variables based on the ordinal position of the column (the first column is 0).

Transaction Container

The Transaction Container is a Sequence container. The tasks inside of the container are all executed in a transaction. They either all succeed and are committed or they are rolled back on error. Set the TransactionOption property of the Sequence container to Required; this setting executes all tasks inside the container in the context of a transaction. A new transaction is created each time through the loop.

Append Batch to Sales History

Append Batch to Sales History is an Execute SQL task that calls a stored procedure to extract a single batch of data from the source table and append it to the sales history table. If transformations were required we would use a Data Flow task. The sales history table and stored procedure are as follows:

CREATE TABLE dbo.SalesHistory (
  OrderYear int not null,
  OrderMonth int not null,
  ProductID int not null,
  OrderQty smallint not null,
  LineTotal money not null
)
CREATE PROCEDURE dbo.stp_AppendSalesHistory 
 @OrderYear int
,@OrderMonth int
AS
BEGIN
  SET NOCOUNT ON;

  INSERT INTO dbo.SalesHistory (
    OrderYear
   ,OrderMonth
   ,ProductID
   ,OrderQty
   ,LineTotal 
  )
  SELECT
    DATEPART(YYYY,m.OrderDate)
   ,DATEPART(MONTH,m.OrderDate)
   ,d.ProductID
   ,d.OrderQty
   ,d.LineTotal
  FROM dbo.imp_SalesOrderHeader m
  JOIN dbo.imp_SalesOrderDetail d ON d.SalesOrderID = m.SalesOrderID
  WHERE Processed = 0 
    AND DATEPART(YYYY,m.OrderDate) = @OrderYear
    AND DATEPART(MONTH,m.OrderDate) = @OrderMonth
END
GO

Note that the stored procedure only gets rows where the Processed column is equal to zero.

The General property settings for the Execute SQL task are as follows:

append general

The Parameter Mapping property settings for the Execute SQL task are as follows:

append mappings

In the above settings the SQLStatement is set to execute the stored procedure, with placeholders for the required parameters. SSIS package variables are mapped to the parameters based on the ordinal number of the parameters in the stored procedure.

Compute Aggregation

Compute Aggregation is an Execute SQL task that recalculates the summary data in the sales history summary table for the order year and order month batch being processed. The sales history summary table and stored procedure are as follows:

CREATE TABLE dbo.SalesHistorySummary (
  OrderYear int not null,
  OrderMonth int not null,
  ProductID int not null,
  OrderQty smallint not null,
  LineTotal money not null
)
CREATE PROCEDURE dbo.stp_CalcSalesHistorySummary 
 @OrderYear int
,@OrderMonth int
AS
BEGIN
  SET NOCOUNT ON;

  DELETE FROM dbo.SalesHistorySummary
  WHERE OrderYear = @OrderYear
    AND OrderMonth = @OrderMonth;

  INSERT INTO dbo.SalesHistorySummary (
    OrderYear
   ,OrderMonth
   ,ProductID
   ,OrderQty
   ,LineTotal 
  )
  SELECT
    OrderYear
   ,OrderMonth
   ,ProductID
   ,SUM(OrderQty)
   ,SUM(LineTotal)
  FROM dbo.SalesHistory
  WHERE OrderYear = @OrderYear
    AND OrderMonth = @OrderMonth
  GROUP BY 
    OrderYear
   ,OrderMonth
   ,ProductID
END
GO

The above stored procedure first deletes any rows in the summary table for the order year and month being processed, then performs the aggregation and insert. The Execute SQL task property settings are the same as in Append Batch to Sales History except for the name of the stored procedure to execute; we'll skip showing the screen shots.

Mark Batch as Processed

Mark Batch as Processed is an Execute SQL task that updates the Processed column in the source table for the rows that have been processed in the current batch. It invokes the following stored procedure:

CREATE PROCEDURE dbo.stp_MarkOrdersProcessed 
 @OrderYear int
,@OrderMonth int
AS
BEGIN
  SET NOCOUNT ON;
 
  UPDATE dbo.imp_SalesOrderHeader
  SET Processed = 1
  WHERE DATEPART(YYYY,OrderDate) = @OrderYear
    AND DATEPART(MONTH,OrderDate) = @OrderMonth;
END
GO

The Execute SQL task property settings are also the same as before except for the name of the stored procedure; we will skip the screen shots.

Summary

Let's highlight the key points in our sample SSIS package that implements batch processing:

  • Group the source data into batches; use the Execute SQL task and create a Full result set which saves the result set in a package variable.
  • Iterate over the result set using a Foreach Loop Container.
  • Use a Sequence Container to define a transaction and add the appropriate tasks inside the Sequence Container.
  • The package design supports restarting the package at the beginning in the event of any error; the Sequence Container commits all work if successful or does a rollback if there are any errors. The source data is flagged when processed so it will not be processed again if the package is restarted.
Next Steps
  • Download the sample code and experiment with implementing batch processing in an SSIS package. The sample code was tested using SQL Server 2005 Developer Edition with SP2.
  • The Sequence container transaction makes use of the Microsoft Distributed Transaction Coordinator (MSDTC) service. Although this service is on by default, make sure that it is running in your environment. You can check the status of the MSDTC service under Services; the name of the service is Distributed Transaction Coordinator.


sql server categories

sql server webinars

subscribe to mssqltips

sql server tutorials

sql server white papers

next tip



About the author
MSSQLTips author Ray Barley Ray Barley is a Principal Architect at IT Resource Partners and a MSSQLTips.com BI Expert.

This author pledges the content of this article is based on professional experience and not AI generated.

View all my tips


Article Last Updated: 2008-05-22

Comments For This Article




Tuesday, March 19, 2024 - 3:32:16 PM - Swetha Back To Top (92094)
Thank you - would u be able to provide best approach
an example where we have load over billion row worth data load every day

Friday, March 15, 2024 - 5:07:48 PM - RAYMOND H BARLEY JR Back To Top (92077)
I've never worked with a batch size of 1 billion rows. I would expect you would have to significantly alter my example, particularly the Get Batch List which does a GROUP BY on DATEPART(YYYY,OrderDate) and ,DATEPART(MONTH,OrderDate). For instance every row in your batch might be the same year and month which would not work as I intended.

Thursday, March 14, 2024 - 10:50:25 AM - swetha sachdeva Back To Top (92071)
what if the each batch contains over billion rows every time .
will this approach work ? what are the limitations.

Friday, May 9, 2014 - 1:03:03 AM - CoolDbGuy Back To Top (30691)

Thanks Ray for the solution, 

Since the transaction happens after each row. Don't you think it'll be performance hit when we have huge amount of records and each row woulbe going through the transaction.

 

Thanks

CoolDbGuy


Thursday, January 17, 2013 - 4:59:01 PM - CK Back To Top (21536)

Thanks, Ray. Well Presented. Works fine in 2008R2 without any problems.


Thursday, October 11, 2012 - 10:35:13 AM - Ray Barley Back To Top (19878)

yes you should add and Process = 0


Thursday, October 11, 2012 - 2:59:38 AM - Thura Back To Top (19863)
Dear Ray,
 
Thank you for your article. It really helps me. Juz to clarify that in your last script:
CREATE PROCEDURE dbo.stp_MarkOrdersProcessed 
 @OrderYear int
,@OrderMonth int
AS
BEGIN
 SET NOCOUNT ON;
 UPDATE dbo.imp_SalesOrderHeader
 SET Processed = 1
 WHERE DATEPART(YYYY,OrderDate) = @OrderYear
 AND DATEPART(MONTH,OrderDate) = @OrderMonth;
END
GO
 
Should it include "AND Processed = 0" filter?

Friday, July 13, 2012 - 7:10:21 PM - Dan P Back To Top (18509)

Ray, thanks for posting this logical and helpful tip. your examples make it clear to understand SSIS batch processing.


Wednesday, June 30, 2010 - 3:57:51 PM - jgolde05 Back To Top (5764)

I just wanted to thank you for posting.  It was the most complete explanation of how to setup the process that I have found.  It still took me some time to fumble through and re-read what I missed several times but in the end it worked.

I am using a slightly different variation in that I am using a DELETE ... OUTPUT INTO script instead of an update.  Works like a champ. Thanks.


Monday, January 11, 2010 - 6:00:36 AM - raybarley Back To Top (4697)

If you have loaded the data into a full resultset object, I think your only choice would be to add a script component where you write VB.NET code to insert the data into another table.  Rather than do that you should be able to use a data flow component which will do exactly what you want.

Here is a reference to the SSIS tutorial page that covers the data flow component: http://www.mssqltips.com/tutorial.asp?tutorial=211

 

 


Monday, January 11, 2010 - 4:01:55 AM - 1974 Back To Top (4696)

Hi all,

         This is really a very good post. I would like to have little more help from you. I execute sql load all the data from a table in a Full Resultset object. Now i want to insert all those retrieved data into another table, how can I do that? Please help me on this.

 

Thanks a lot in advance


Sunday, January 25, 2009 - 4:43:30 PM - raybarley Back To Top (2608)

Here is the connection string for the mssqltips database in the SSIS package:

Data Source=localhost;Initial Catalog=mssqltips;Provider=SQLNCLI.1;Integrated Security=SSPI;Auto Translate=False;

This assumes that the database is on your local machine and is the default instance.  Is your database on a different machine or is it a named instance? 

 


Friday, January 23, 2009 - 6:43:55 AM - admin Back To Top (2598)

Did you create a "mssqltips" database?

If not create a new database called "mssqltips" and then run the "scripts.sql" code in the zip file to create all of the objects.

See if that works.


Thursday, January 22, 2009 - 8:57:55 PM - yych13 Back To Top (2595)

Hi,I just downloaded the sample code , but when I ran it in visual studio 2005, an error was reported as follows: 

[Execute SQL Task] Error: Executing the query "EXEC dbo.stp_CreateOrderBatchList" failed with the following error: "Could not find stored procedure 'dbo.stp_CreateOrderBatchList'.". Possible failure reasons: Problems with the query, "ResultSet" property not set correctly, parameters not set correctly, or connection not established correctly.

 

I am a rookie in the field of  SSIS, Coulde u tell me the reason and how to solve the problem?

Environment:

XP sp3 + SQLServer 2005 +VS 2005

AdventureWorks has been installed and data was imported into DB.

ran the sql script and then pressed F5 to execute the sample project in VS2005

USE mssqltips
GO
SELECT *
INTO dbo.imp_SalesOrderHeader
FROM AdventureWorks.Sales.SalesOrderHeader
SELECT *
INTO dbo.imp_SalesOrderDetail
FROM AdventureWorks.Sales.SalesOrderDetail
ALTER TABLE dbo.imp_SalesOrderHeader
ADD Processed bit not null default 0
GO

 















get free sql tips
agree to terms