SQL Server Integration Services Performance Best Practices

By: Arshad Ali   |   Updated: 2009-10-27   |   Comments (8)   |   Related: Integration Services Best Practices


Problem

In the first tip (SQL Server Integration Services (SSIS) - Best Practices - Part 1) of this series I wrote about SSIS design best practices. To continue down that path, this tip is going to cover recommendations related to the SQL Server Destination Adapter, asynchronous transformations, DefaultBufferMaxSize and DefaultBufferMaxRows, BufferTempStoragePath and BLOBTempStoragePath as well as the DelayValidation property.

Solution

In this tip my recommendations cover the different kinds of transformations and their impact on overall SSIS package performance, how SSIS manages memory by creating buffers, how it handles insufficient memory and spools to disk under memory pressure, and the significance of the DelayValidation property.

Best Practice #5 - SQL Server Destination Adapter

It is recommended to use the SQL Server Destination adapter if your target is a local SQL Server database. It provides a level of data-insertion performance similar to the Bulk Insert task, along with some additional benefits. With the SQL Server Destination adapter you can transform the data before uploading it to the destination, which is not possible with the Bulk Insert task. Apart from the options available with the OLE DB Destination adapter, the SQL Server Destination adapter offers some more options, as depicted in the images below. For example, you can specify whether insert triggers on the target table should fire. By default this option is set to false, which means no triggers on the destination table will fire. Enabling it may cause an additional performance hit because the triggers have to execute, but the trigger logic may be needed to enforce data or business rules. Additional options include specifying the first and last row numbers of the input to load, the maximum number of errors that will cause the bulk load operation to be cancelled, and the insert column sort order to be used during the upload process.

sqlserver destination

ole db destination

Remember that if your SQL Server database is on a remote server, you cannot use the SQL Server Destination adapter; use the OLE DB Destination adapter instead. In addition, if it is likely that the destination will change from a local to a remote instance, or from one SQL Server instance to another, it is better to use the OLE DB Destination adapter to minimize future changes.
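
Under the covers the SQL Server Destination performs a bulk load, so its extra options roughly correspond to the WITH options of the T-SQL BULK INSERT statement. The following is a sketch for illustration only; the table, file path and column names are hypothetical, and your delimiters and sort columns will differ.

-- Rough T-SQL equivalent of the adapter options described above
-- (hypothetical table, file path and columns).
BULK INSERT dbo.SalesStaging
FROM 'D:\Feeds\DailySales.txt'
WITH
(
    FIELDTERMINATOR = '|',             -- flat file column delimiter
    ROWTERMINATOR   = '\n',
    FIRSTROW        = 2,               -- first row of the input to load (skip header)
    LASTROW         = 0,               -- 0 = load through the last row
    MAXERRORS       = 10,              -- cancel the bulk load after this many errors
    FIRE_TRIGGERS,                     -- fire INSERT triggers on the target (off by default)
    ORDER (SaleDate ASC, SaleID ASC),  -- declared sort order of the incoming data
    TABLOCK                            -- bulk-update table lock for a faster load
);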

Best Practice #6 - Avoid asynchronous transformations (such as the Sort transformation) wherever possible

Before I talk about the different kinds of transformations and their impact on performance, let me briefly explain how SSIS works internally. The SSIS runtime engine executes the package. It executes every task other than the data flow task in the defined sequence. Whenever the SSIS runtime engine encounters a data flow task, it hands over execution of that task to the data flow pipeline engine. The data flow pipeline engine breaks the execution of a data flow task into one or more execution trees and may execute two or more execution trees in parallel to achieve higher performance. If you are wondering what an execution tree is, here is the answer.

An execution tree, as the name implies, has a structure similar to a tree. It starts at a source or an asynchronous transformation and ends at the destination or at the first asynchronous transformation in the hierarchy. Each execution tree has a set of allocated buffers, and the scope of these buffers is tied to that execution tree. Each execution tree is also allocated an OS thread (worker thread); unlike buffers, this thread may be shared with other execution trees, in other words an OS thread might execute one or more execution trees. In SSIS 2008, the process of breaking a data flow task into execution trees has been enhanced to create execution paths and sub-paths so that your package can take advantage of high-end multi-processor systems.

Synchronous transformations get a record, process it and pass it to the next transformation or destination in the sequence. The processing of a record is not dependent on the other incoming rows. Because a synchronous transformation outputs the same number of records as it receives, it does not require new buffers to be created (processing is done in the same incoming buffers, i.e. in the same allocated memory), and because of this it is normally faster. For example, the Derived Column transformation adds a new column to each incoming row, but it does not add any additional rows to the output.

Unlike synchronous transformations, an asynchronous transformation might output a different number of records than it receives, which requires new buffers to be created. Because an output row can depend on one or more input rows, it is also called a blocking transformation. Depending on the type of blocking, it can be either a partially blocking or a fully blocking transformation. For example, the Sort transformation is a fully blocking transformation, as it requires all the incoming rows to arrive before processing.

As discussed above, an asynchronous transformation requires additional buffers for its output and does not reuse the incoming input buffers. It also waits for all incoming rows to arrive before processing; that is why asynchronous transformations perform more slowly and should be avoided wherever possible. For example, instead of using the Sort transformation you can get sorted results from the source itself by using an ORDER BY clause, as illustrated below.
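
As a minimal sketch (the table and column names are hypothetical), the sort below is pushed into the source query so the data arrives ordered and no fully blocking Sort transformation is needed in the data flow:

-- Source query for the OLE DB Source: the database engine does the sorting,
-- so the pipeline does not have to hold all rows in memory to sort them.
SELECT  CustomerID,
        OrderID,
        OrderDate,
        TotalDue
FROM    dbo.Orders
ORDER BY CustomerID, OrderID;

If a downstream component such as a Merge Join needs to know the data is sorted, also open the source's Advanced Editor and set IsSorted to True on the source output and SortKeyPosition on the ORDER BY columns; SSIS does not infer the sort order from the query text.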

Best Practice #7 - DefaultBufferMaxSize and DefaultBufferMaxRows

As I said in the "Best Practices #6", the execution tree creates buffers for storing incoming rows and performing transformations. So how many buffers does it create? How many rows fit into a single buffer? How does it impact performance?

The number of buffers created depends on how many rows fit into a buffer, and how many rows fit into a buffer depends on a few other factors. The first consideration is the estimated row size, which is the sum of the maximum sizes of all the columns in the incoming records. The second consideration is the DefaultBufferMaxSize property of the data flow task, which specifies the default maximum size of a buffer. Its default value is 10 MB, and its upper and lower bounds are constrained by two internal SSIS properties, MaxBufferSize (100 MB) and MinBufferSize (64 KB). This means the size of a buffer can be as small as 64 KB and as large as 100 MB. The third factor is DefaultBufferMaxRows, again a property of the data flow task, which specifies the default maximum number of rows in a buffer; its default value is 10,000.

Although SSIS does a good job of tuning these properties to create an optimum number of buffers, if the estimated row size multiplied by DefaultBufferMaxRows exceeds DefaultBufferMaxSize, it reduces the number of rows per buffer. For better buffer performance you can do two things. First, remove unwanted columns from the source and set the data type of each column appropriately, especially if your source is a flat file; this lets you fit as many rows as possible into each buffer. Second, if your system has sufficient memory available, you can tune these properties to have a small number of large buffers, which could improve performance. Beware: if you change the values of these properties to a point where page spooling (see Best Practice #8) begins, it adversely impacts performance. So before you set values for these properties, test thoroughly in your environment and then set the values appropriately. A rough calculation is sketched below.
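
As a rough sketch (the table name is hypothetical, LOB columns are excluded, and SSIS's own per-row overhead is ignored), you can estimate the incoming row size from the catalog and work out approximately how many rows fit into a buffer under the default settings:

-- Estimate the row size of a table-based source and the approximate
-- rows per buffer under the default DefaultBufferMaxSize/DefaultBufferMaxRows.
DECLARE @EstimatedRowSize     INT,
        @DefaultBufferMaxSize INT,
        @DefaultBufferMaxRows INT;

SET @DefaultBufferMaxSize = 10485760;   -- 10 MB default
SET @DefaultBufferMaxRows = 10000;      -- default

SELECT @EstimatedRowSize =
       SUM(CASE WHEN max_length = -1 THEN 0 ELSE max_length END)  -- skip (MAX)/LOB columns
FROM   sys.columns
WHERE  object_id = OBJECT_ID('dbo.Orders');   -- hypothetical source table

SELECT @EstimatedRowSize AS EstimatedRowSizeBytes,
       CASE WHEN @EstimatedRowSize * @DefaultBufferMaxRows > @DefaultBufferMaxSize
            THEN @DefaultBufferMaxSize / @EstimatedRowSize   -- buffer size is the limit
            ELSE @DefaultBufferMaxRows                       -- row count is the limit
       END AS ApproxRowsPerBuffer;

For example, at an estimated 500 bytes per row, 10,000 rows need only about 5 MB, so DefaultBufferMaxRows is the limiting setting; at 2,000 bytes per row, 10 MB / 2,000 bytes gives roughly 5,242 rows per buffer and DefaultBufferMaxSize becomes the limit.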

You can enable logging of the BufferSizeTuning event to learn how many rows a buffer contains, and you can monitor the "Buffers spooled" performance counter to see if SSIS has begun spooling buffers to disk. I will talk more about event logging and performance counters in the next tips of this series.

Best Practice #8 - BufferTempStoragePath and BLOBTempStoragePath

If there is a shortage of memory, i.e. Windows triggers a low memory notification event, or there is memory overflow or memory pressure, SSIS spools the incoming records (except BLOBs) to the file system. The file system location is set by the BufferTempStoragePath property of the data flow task. By default its value is blank, in which case the location is based on the value of the TEMP/TMP environment variables.

Likewise, SSIS may choose to write BLOB data to the file system before sending it to the destination, because BLOB data is typically large and cannot be stored in the SSIS buffer. Once again, the file system location for spooling BLOB data is set by the BLOBTempStoragePath property of the data flow task. By default its value is blank, in which case the location will be the value of the TEMP/TMP environment variables. As I said, if you don't specify values for these properties, the TEMP and TMP environment variables will be used as the spooling locations. The same information is recorded in the log if you enable logging of the PipelineInitialization event of the data flow task, as shown below.

User:PipelineInitialization,ARSHADALI-LAP,FAREAST\arali,Data Flow Task,{C80814F8-51A4-4149-8141-D840C9A81EE7},{D1496B27-9FC7-4760-821E-80285C33E74D},10/11/2009 1:38:10 AM,10/11/2009 1:38:10 AM,0,0x,No temporary BLOB data storage locations were provided. The buffer manager will consider the directories in the TEMP and TMP environment variables.

So far so good. What is important here is to change the default values of the BufferTempStoragePath/BLOBTempStoragePath properties and specify locations that the user executing the package (or, if the package is executed by a SQL Server Agent job, the SQL Server Agent service account) has access to. Preferably both locations should refer to separate fast drives (with separate spindles) to maximize I/O throughput and improve performance.
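
If your package logs to the SQL Server log provider, you can check after the fact which data flow tasks fell back to the TEMP/TMP locations for spooling. A minimal sketch, assuming the default SSIS 2008 log table dbo.sysssislog (dbo.sysdtslog90 on SQL Server 2005) and that the event is recorded under the name shown in the log line above:

-- Find data flow tasks that reported no explicit BLOB/buffer spooling locations.
SELECT  starttime,
        source,        -- the data flow task that raised the event
        message
FROM    dbo.sysssislog
WHERE   event = 'User:PipelineInitialization'
  AND   message LIKE '%TEMP and TMP environment variables%'
ORDER BY starttime DESC;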

Best Practice #9 - How the DelayValidation property can help you

SSIS uses validation to determine whether the package is likely to fail at runtime. SSIS uses two types of validation. The first is package validation (early validation), which validates the package and all its components before starting execution of the package. The second is component validation (late validation), which validates the components of the package once execution has started.

Let's consider a scenario where the first component of the package creates an object, e.g. a temporary table, which is referenced by the second component of the package. During package validation the first component has not yet executed, so no object has been created, causing a validation failure when the second component is validated. SSIS will throw a validation exception and will not start the package execution. So how will you get this package running in this common scenario?

To help you in this scenario, tasks, containers and connection managers have a DelayValidation property (the default is False). If you set it to True, early validation is skipped for that object and it is validated only during package execution (late validation), as sketched below.
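
As a minimal sketch of the temp-table scenario (object and column names are hypothetical), the two statements below would sit in two separate Execute SQL Tasks, or the second could be the query of a data flow source. The task or data flow that references the table is the one that needs DelayValidation = True; with a session-scoped #temp table the connection manager also needs RetainSameConnection = True so both tasks run on the same session.

-- Execute SQL Task 1: creates and loads the working table at run time.
CREATE TABLE #DailyOrders
(
    OrderID   INT,
    OrderDate DATETIME,
    TotalDue  MONEY
);

INSERT INTO #DailyOrders (OrderID, OrderDate, TotalDue)
SELECT OrderID, OrderDate, TotalDue
FROM   dbo.Orders
WHERE  OrderDate >= DATEADD(DAY, -1, GETDATE());

-- Execute SQL Task 2 (or a data flow source query): references the object
-- created above. This is the piece that fails early validation unless
-- DelayValidation is set to True on the task that contains it.
SELECT OrderID, OrderDate, TotalDue
FROM   #DailyOrders;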




About the author
Arshad Ali is a SQL and BI Developer focusing on Data Warehousing projects for Microsoft.


Comments For This Article




Friday, October 11, 2013 - 6:13:54 AM - Srikanth Kota

I am getting the below error message when the package fails, but when I look at the data everything looks good to me. If I rerun the job it succeeds without any changes. Is this something related to buffers? If so, can you please help me understand how this affects the package?

Error Message:

SSIS Error Code DTS_E_PROCESSINPUTFAILED. The ProcessInput method on component "Trk_StgHHResume_Count" (1753) failed with error code 0x80004005 while processing input "Row Count Input 1" (1755). The identified component returned an error from the ProcessInput method. The error is specific to the component, but the error is fatal and will cause the Data Flow task to stop running. There may be error messages posted before this with more information about the failure.

The attempt to add a row to the Data Flow task buffer failed with error code 0xC0047020.

SSIS Error Code DTS_E_PRIMEOUTPUTFAILED. The PrimeOutput method on component "Get Yesterday's recs from HHResume" (1420) returned error code 0xC02020C4. The component returned a failure code when the pipeline engine called PrimeOutput(). The meaning of the failure code is defined by the component, but the error is fatal and the pipeline stopped executing. There may be error messages posted before this with more information about the failure.

 

Thank  you

Srikanth K


Thursday, January 17, 2013 - 2:18:02 PM - Steve Stombaugh

Arshad,

Thanks again for your quick response.  Your example is quite obvious as to the gains of early validation.  If you were working with a SQL source and destination only, and really the only thing that changes is the server and database name, meaning all the sources are in the same database and all the targets are in the same database, it would be impossible for the 4th DFT to fail if the first 3 executed.  This assumes that the schemas in Development and Production are always the same; in my case that is controlled.  Would you then see any benefits to execution speed in leaving off the early validation?

 


Thursday, January 17, 2013 - 12:40:42 PM - Arshad

Hi Steve,

Let me give you one scenario which actually depicts why early validation is essential.

Suppose there is a package with 5 DFTs in sequential order, each one taking almost 10 minutes, and hence almost 50 minutes for package execution.

Case 1: When early validation is ON - Now suppose the source table referenced by the 4th DFT has changed or cannot be connected to. In this case, as early validation is ON, the package will fail in the initial validation phase and package execution will not start at all.

Case 2: When early validation is OFF - Now suppose the source table referenced by the 4th DFT has changed or cannot be connected to. In this case, as early validation is OFF, the package execution will start, but when it reaches the 4th DFT it will fail (after running for almost 30 minutes), which is not desirable in most cases.

The intent here is to catch the problem as early as possible, at the cost of an additional level of validation, rather than have the package fail during execution after wasting time on unnecessary work. But again, it's up to you; based on your requirements you can configure DelayValidation to avoid early validation.

Hope it helps.


Thursday, January 17, 2013 - 9:55:22 AM - Steve Stombaugh

Thank you Arshad,

I have a question related to your best practice #9.  After successful unit testing, would it not be a good idea to always turn off early validation for all the components of a package, in order to speed things up?  Most of the time, the only things that could affect the execution would be settings in my package configuration, which might hold a source connection and a destination connection, as well as security information.  When these things are changed, like when moving from Development to Production, a connection could fail, but it would fail for every package.  I understand that for a single package, early validation might save the time that it takes for a single package to execute, but it also takes time to validate the same connections redundantly in each successive package.  What are your thoughts?


Tuesday, February 28, 2012 - 5:57:40 AM - Jiri Jakoubek

Hi,

I have only technical note that is related to paragraph Best Practice #8 - BufferTempStoragePath and BLOBTempStoragePath.

A short time ago I experienced continual job failures. After some troubleshooting I realised the problem was with BufferTempStoragePath, so I granted the SQL Agent service account rights to access C:\Windows\Temp (according to the system TEMP/TMP variables), but that didn't help. Why? Because the BufferTempStoragePath property unfortunately uses the USER, not the SYSTEM, TEMP/TMP variables, which point to C:\Documents and Settings\UserProfileName\Local Settings\Temp.

This was tested on SQL Server 2005 Enterprise

Thanks.

Jiri


Wednesday, October 28, 2009 - 8:54:02 AM - admin

Arshad,

Thank you for sharing those experiences with the community.  They help to provide additional context to the best practices.

Thank you,
The MSSQLTips Team


Wednesday, October 28, 2009 - 2:35:57 AM - arshad0384

Well, are you asking about a particular best practice from this list of best practices?

On an overall basis, the exact figures vary purely with the package design and the different data flow tasks inside it. The factors affecting performance are the number of rows coming through the pipeline, the number of columns, the size of each column, indexes on the source and destination tables, buffer size, commit size, parallel execution, checkpoint restartability from a previous point of failure, the kinds of transformations you are using inside the DFT, and several other factors which I will be discussing in detail in this series, mostly with practical examples.

Scenario example from Best Practice #1 (where we saw a 60%-70% gain)

Recently we had to pull data from a source table which had 300 million records to a new target table. Initially when the SSIS package started, everything looked fine and data was being transferred as expected, but gradually the performance degraded and the data transfer rate went down dramatically. During analysis we found that the target table had a primary clustered key and two non-clustered keys. Because of the high volume of data inserts into the target table, these indexes became heavily fragmented, up to 85%-90%. We used the online index rebuilding feature to rebuild/defrag the indexes, but the fragmentation level was back up to 90% after every 15-20 minutes during the load. This whole process of data transfer and parallel online index rebuilds took almost 12-13 hours, which was much more than our expected time for the data transfer.  Then we came up with an approach: make the target table a heap by dropping all the indexes on it at the beginning, transfer the data to the heap, and on completion of the data transfer recreate the indexes. With this approach, the whole process (dropping indexes, transferring data and recreating indexes) took just 3-4 hours, which was what we were expecting.

Scenario example from Best Practice #4 (getting rid of consistent failures by avoiding a tempdb full exception)

We had almost 500 million records in a source table. Almost 3 years back we did a full pull from this table, and since then we had been doing incremental pulls. The package had been running fine, but for some reason we needed to do a full pull again, and this time the same package failed several times with a tempdb full exception. The reasons were that the number of rows had grown to 900 million and the server was also being shared by some other packages running in parallel. So we applied changes to the Rows Per Batch and Maximum Insert Commit Size settings.

These two settings are very important to understand to improve the performance of tempdb and the transaction log. For example, if you leave 'Max insert commit size' at its default, the transaction log and tempdb will keep growing during the extraction process, and if you are transferring a high volume of data, tempdb will soon run out of space, and as a result your extraction will fail. So it is recommended to set these values to an optimum value based on your environment.

Yes, there are some practices which should be used only when required, and I have explicitly mentioned this in the tips. For example, event logging (Best Practice #11) helps a lot in diagnosing failures or performance problems, but it also puts overhead on the runtime engine. So you should log as much as needed to identify problems and as little as possible to avoid hurting performance, or even keep logging disabled and dynamically enable it when required.

Hope it helps, feel free to contact me if you have any doubts.


Tuesday, October 27, 2009 - 4:14:49 PM - admin

Arshad,

Thank you for these best practices.

Based on your experience, about how much, generally speaking, of a performance gain have you seen with these changes?

Are there any times when you do not follow these best practices?

Thank you,
The MSSQLTips Team














