Data Sampling in SQL Server Integration Services

By:   |   Updated: 2016-12-20   |   Comments (2)   |   Related: More > Integration Services Development


Problem

When developing SSIS packages, sometimes it is easier to work with a sample dataset instead of the entire dataset.  In this tip we look at a couple of different ways to do data sampling transformations in SSIS.

Solution

There are two transformations in SSIS that will help us generate a smaller subset of data based on sampling. They are row sampling and percentage sampling transformations.

Usage of SSIS sampling transformation

  • Instead of using the full production dataset, a sample can be used to test the package multiple times for different conditions. This can be very helpful to replicate production bugs on a development server.
  • Based on the environment capacity (Dev, Test & UAT servers), the test data size can be adjusted easily.
  • Allows you to test data mining models with different test data.
  • Allows for data profiling activities – to understand the quality of the source data.

SSIS Row Sampling Transformation

The SSIS Row Sampling transformation generates a sample dataset based on a random selection from the input dataset. The expected number of sample records must be entered in the transformation. By default the transformation generates a random number to select the input rows. So every time you run the transformation you may a get different set of records. This is very useful, as you may want to test the data flow for different sets of representative data.

However if you prefer to have the same set of data every time, then you can define a seed. This can be done by enabling the option "Use the following random seed".

This transformation generates two outputs. The "sampling selected output" will provide the sample dataset based on the number of rows keyed in and the rest of the records will be available thru the output "Sampling unselected output".

The sampling seed and the sampling value properties can also be set using the properties window.

In the below image, we are extracting data from Fact InternetSales and would like to sample 100 records. The row sampling transformation has been added as below.

SQL Server Integration Services Row Sampling Transformation

The sampling value (number of rows) has been set to 100. Also the sampling seed has been set to 1, so that we get the same records each time we this is run.

SQL Server Integration Services Row Sampling Transformation Editor

After successful execution, we can see the sampling records thru the "Sampling Selected Output".

SQL Server Integration Services Sampling Selected Output

The data viewer shows the sample output.

SQL Server Integration Services Data Viewer

SSIS Percentage Sampling

The SSIS Percentage Sampling transformation works the same way as the Row Sampling transformation except we provide the percentage of sample records instead of a specific record count.

The Percentage Sampling transformation uses an algorithm to determine whether a row should be included in the sample output. Because of the functionality of the algorithm, the number of output rows will not exactly match the specified percentage.  Let's assume we have a source dataset with 10,000 records and you want to inspect the data quality by sampling 10% of the dataset. Ideally you expect to see 1000 records, but the actual records sampled by SSIS may be greater or less than 1000 records.

In the below image, we are extracting data from Fact InternetSales and would like to sample 10% of the records. The percentage sampling transformation has been added as below.

SQL Server Integration Services Percentage Sampling

The sampling value (percentage of rows) has been set to 1.

SQL Server Integration Services Percentage Sampling Transformation Editor

After successful execution, we can see the sampling records thru the "Sampling Selected Output". 

SQL Server Integration Services Sampling Selected Output

Summary

The Sampling transformations has a wide variety of usage. In data mining analysis, both the selected and unselected outputs can be used to design and test the model. Both the row and percentage sampling transformations are easy to learn and configure.

Next Steps


sql server categories

sql server webinars

subscribe to mssqltips

sql server tutorials

sql server white papers

next tip



About the author
MSSQLTips author Nat Sundar Nat Sundar is working as an independent SQL BI consultant in the UK with a Bachelors Degree in Engineering.

This author pledges the content of this article is based on professional experience and not AI generated.

View all my tips


Article Last Updated: 2016-12-20

Comments For This Article




Thursday, December 22, 2016 - 8:03:03 AM - Nat Sundar Back To Top (45031)

 

 Thanks Javid for the nice comments


Wednesday, December 21, 2016 - 4:28:51 PM - javid khan Back To Top (45026)

 Thnaks Nat for small but very helpful artical.

 Keep good work of sharing.

 

 















get free sql tips
agree to terms