By: Nat Sundar | Updated: 2016-12-20 | Comments (2) | Related: More > Integration Services Development
Problem
When developing SSIS packages, sometimes it is easier to work with a sample dataset instead of the entire dataset. In this tip we look at a couple of different ways to do data sampling transformations in SSIS.
Solution
There are two transformations in SSIS that will help us generate a smaller subset of data based on sampling. They are row sampling and percentage sampling transformations.
Usage of SSIS sampling transformation
- Instead of using the full production dataset, a sample can be used to test the package multiple times for different conditions. This can be very helpful to replicate production bugs on a development server.
- Based on the environment capacity (Dev, Test & UAT servers), the test data size can be adjusted easily.
- Allows you to test data mining models with different test data.
- Allows for data profiling activities – to understand the quality of the source data.
SSIS Row Sampling Transformation
The SSIS Row Sampling transformation generates a sample dataset based on a random selection from the input dataset. The expected number of sample records must be entered in the transformation. By default the transformation generates a random number to select the input rows. So every time you run the transformation you may a get different set of records. This is very useful, as you may want to test the data flow for different sets of representative data.
However if you prefer to have the same set of data every time, then you can define a seed. This can be done by enabling the option "Use the following random seed".
This transformation generates two outputs. The "sampling selected output" will provide the sample dataset based on the number of rows keyed in and the rest of the records will be available thru the output "Sampling unselected output".
The sampling seed and the sampling value properties can also be set using the properties window.
In the below image, we are extracting data from Fact InternetSales and would like to sample 100 records. The row sampling transformation has been added as below.
The sampling value (number of rows) has been set to 100. Also the sampling seed has been set to 1, so that we get the same records each time we this is run.
After successful execution, we can see the sampling records thru the "Sampling Selected Output".
The data viewer shows the sample output.
SSIS Percentage Sampling
The SSIS Percentage Sampling transformation works the same way as the Row Sampling transformation except we provide the percentage of sample records instead of a specific record count.
The Percentage Sampling transformation uses an algorithm to determine whether a row should be included in the sample output. Because of the functionality of the algorithm, the number of output rows will not exactly match the specified percentage. Let's assume we have a source dataset with 10,000 records and you want to inspect the data quality by sampling 10% of the dataset. Ideally you expect to see 1000 records, but the actual records sampled by SSIS may be greater or less than 1000 records.
In the below image, we are extracting data from Fact InternetSales and would like to sample 10% of the records. The percentage sampling transformation has been added as below.
The sampling value (percentage of rows) has been set to 1.
After successful execution, we can see the sampling records thru the "Sampling Selected Output".
Summary
The Sampling transformations has a wide variety of usage. In data mining analysis, both the selected and unselected outputs can be used to design and test the model. Both the row and percentage sampling transformations are easy to learn and configure.
Next Steps
- Read more about Row Sampling
- Read more about Percentage Sampling
About the author
This author pledges the content of this article is based on professional experience and not AI generated.
View all my tips
Article Last Updated: 2016-12-20