By: Hadi Fadlallah | Updated: 2024-07-17 | Comments | Related: More > Integration Services Development
Problem
Microsoft SQL Server Integration Services (SSIS) is a very important Extract, Transform, and Load (ETL) tool widely used in corporate data management and business intelligence. This ETL tool provides a powerful platform for coordinating and automating data workflows, including operational tasks, such as logging and scheduling, fault-tolerant error handling, support for big data sets scaling, and user-friendliness through its intuitive graphical interface. Built specifically for corporate data pipelines and analytics, SSIS gives data scientists the power to create, run, and monitor complicated processes of integrating information into one chain effectively.
Solution
The goal of this tutorial is to enable ETL developers to obtain practical knowledge to exploit SSIS in transforming and combining data through hands-on exploration to be ready for real-world challenges in managing information.
We'll examine and discuss some of the most essential components and functionalities. We will begin by reviewing the Microsoft SSIS interface and ways to navigate it. We will then address connection managers and set up connections to various data sources. Also, we will explore what makes control flow tasks different from data flow tasks and how they fit into data workflows.
In this demonstration, we'll import a CSV file into SQL Server, including derived column creation, conditional splitting, and the need for data conversion. Immediately after, we'll look at variables and expressions in SSIS for dynamic data processing. Additionally, we will have another demonstration where multiple CSV files are looped over until they reach SQL Server to show how well it handles iterative copies of data via SSIS.
Creating an SSIS Project
First, we must create a new "Integration Services Project" using Visual Studio. Make sure you have already installed the SSIS extension from Visual Studio Market Place:
- Visual Studio 2019: SQL Server Integration Services Projects - Visual Studio Marketplace
- Visual Studio 2022: SQL Server Integration Services Projects 2022 - Visual Studio Marketplace
Next, we will configure the project name and directory.
SSIS Development Interface
The integration services development interface will appear once the SSIS project is created using Visual Studio. As shown in the screenshot below, it is mainly composed of four areas:
- SSIS Toolbox
- Package Designer (five tabs)
- Control Flow
- Data Flow
- Parameters
- Event Handlers
- Package Explorer
- Connection Managers
- Properties
More tools and options are available. However, they are outside the scope of this beginner-level tutorial.
Connection Managers
Since data is the main component of any data integration and analysis solution, the first thing to do in the ETL package is to define all data connections using connection managers. SSIS supports a wide variety of connection managers. For this tutorial, we will use an OLE DB connection to connect to an SQL Server database and a Flat File connection to connect to CSV files.
We will start by adding a Flat File connection to read from a source CSV file. Right-click in the connection managers panel and click "New Flat File Connection…"
Side Note: If you are interested in checking all available connection managers, click on the "New Connection" menu item to select from an extended list.
The Flat File connection manager editor will appear. Let's first change the connection manager name to Source File.
Now, we need to specify the source file path by clicking on "Browse..." Once we select a file, SSIS tries to detect the file metadata: encoding, text qualifier, row delimiter, etc. Sometimes, we need to edit those values manually as SSIS may not detect them accurately. After selecting the file, a warning appears mentioning that "Columns are not defined…"
Click on the Columns tab to let the connection manager detect the existing columns tab based on the first rows in the flat file.
If these columns are not defined accurately or their data types need to be changed, they can be edited manually in the Advanced tab.
After defining the first data connection, click OK.
Now, let's define the SQL Server data connection. We need to create an OLE DB connection manager.
In the Connection Manager editor, specify the SQL Server instance, the authentication parameters, and the database name. Other configurations are available in the All tab.
To rename the OLE DB connection manager, right-click on its icon in the connection manager area and click Rename.
Adding Tasks from the SSIS Toolbox
After adding and configuring the connection managers, we need to design our package. SSIS supports a wide variety of tasks, which can be found in the SSIS toolbox along with a brief description of each task.
Let's begin by adding an "Execute SQL Task," which executes SQL statements using a connection manager that points to a relational data source.
Double-click on the Execute SQL Task icon or drag and drop it within the control flow area. Next, double-click on the Execute SQL Task component in the control flow to open the editor.
In the Execute SQL Task Editor, set the connection type to "OLE DB" since we are using an OLE DB connection manager. Then, select the connection manager we previously created.
Next, add our SQL statement in the "SQLStatement" property, as shown below.
CREATE TABLE [InternationalSales] ( [ID] INT IDENTITY(1,1) PRIMARY KEY, [ProductID] INT, [Date] Date, [Zip] varchar(50), [Units] INT, [Revenue] NUMERIC(18,6), [Country] varchar(50), [SourceFileName] nvarchar(4000))
Click OK.
Click OK to finish configuring the task.
Now, to test this task, right-click on it and click "Execute Task" to execute it separately.
A green icon appears on the upper-right corner of the task to show that it was executed successfully.
More details about the task execution can be seen under the Progress tab that appears after executing the task.
Now, click the Stop button to exit debug mode.
We can rename the task by right-clicking on it and clicking the "Rename" menu item.
Also, a task can be disabled by clicking on the "Disable" menu item.
Let's try to execute the SQL task for a second time. It fails, as shown below.
To check the cause of the failure, you can navigate to the Progress bar and read the package log.
Since the error message is long and truncated, right-click on the error line and click on "Copy Message Text."
Now, we can paste the error message into a new Notepad window to read the full error message.
It looks like the task failed because the table was already created. To avoid this error, we can edit the SQL statement to handle this situation, i.e., create the table if it does not exist, and truncate it if it exists.
IF NOT EXISTS ( SELECT 1 FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'dbo' AND TABLE_NAME = 'InternationalSales') CREATE TABLE [InternationalSales] ( [ID] INT IDENTITY(1,1) PRIMARY KEY, [ProductID] INT, [Date] Date, [Zip] varchar(50), [Units] INT, [Revenue] NUMERIC(18,6), [Country] varchar(50), [SourceFileName] nvarchar(4000) ) ELSE TRUNCATE TABLE [dbo].[InternationalSales]
Working with Data Flows
Control Flow manages the workflow and execution order of tasks within an SSIS package. It supports executing command line operators, web services, XML tasks, file system operations, and more tasks. A data flow handles the extraction, transformation, and loading of data within an SSIS package. A package has only one control flow and can have multiple data flows.
For instance, we can create a destination table in the control flow and use a data flow to load the data into it.
Let's add a "Data Flow Task" from the SSIS toolbox.
Click on the "Execute SQL Task" and drag the arrow into the new "Data Flow Task."
This will make both tasks connected and execute sequentially. This connector is called a "Precedence Constraint." We can configure it by double-clicking on the connector.
For example, let's change the precedence constraint to completion. This will tell the package to execute the data flow task even if the precedent "Execute SQL Task" fails.
Connectors example with "completion" label.
Designing the Data Flow
After adding the data flow, double-click on the "Data Flow Task" component to navigate to the "Data Flow" area. Note that the SSIS toolbox changes at this level.
To read data from the Flat File connection manager, we need to add a "Flat File Source" component and select the connection manager.
Then, click on the "Columns" tab to read and define the component metadata as well as configure the output columns; we can ignore some columns or assign others their aliases.
Moreover, we can add a column that contains the source file path. After closing the editor, right-click on the "Flat File Source" and click on "Show Advanced Editor…"
Navigate to the "Component Properties" tab. Assign a name to the "FileNameColumnName" property. In this tutorial, we will name it "SourceFileName."
Now, when we recheck the input columns, we can see that a new column has been added.
Data Transformation
After reading the data from the flat file, we need to convert the columns' data types from text to the appropriate types. The search bar located at the top can be used to search SSIS toolbox items. First, add a "Data Conversion" component.
Now, let's connect the "Flat File" component to it.
Double-click to open the data conversion editor and check three columns:
- ProductID (Output Alias: INT_ProductID, Type: DT_I4)
- Units (Output Alias: INT_Units, Type: DT_I4)
- Revenue (Output Alias: NUM_Revenue, Type: DT_NUMERIC, Precision: 18, Scale: 6)
Now, let's add a "Derived Column" component to apply some expression to the data.
Let's connect the output of the Data Conversion to the Derived Column component.
We need to add two derived columns transformation:
- Add a new column DT_DATE using the following expression: (DT_DBDATE)[Date]
- Replace the "SourceFileName" column with the following expression: RIGHT(SourceFileName,FINDSTRING(REVERSE(SourceFileName),"\\",1) - 1)
Now, let's add a "Conditional Split" component to ignore all records having a date before "01 January 2020."
We need to connect the derived column output to the conditional split component input.
Double-click to open the conditional split editor. We need to define our filter condition: YEAR([DT_DATE]) < 2020.
We need to edit the output names. The filter we added will be named "Old records," and the default output name will be changed to "New records" since it will output the records that do not match the filters defined in the conditional split component.
Finally, let's add an "OLE DB Destination" component to insert the data into the destination table.
Once we connect the conditional split output to the OLE DB destination, a prompt will appear to select the component output that will be linked to the OLE DB destination component. In this case, select "New records."
Note that after connecting the components, we can still use the other output, "Old records," and link it to another component, if needed.
Double-click the OLE DB destination component to select the connection manager and destination table, as well as configure the data load options.
Next, configure the column mapping. Navigate to the "Mappings" tab and map the input columns to the SQL destination columns, as shown in the image below.
Adding Data Viewers
To monitor the data during the package execution, double-click on any data flow path (connectors) and enable the data viewer to select the columns to be monitored.
After executing the package, a window appears showing the data. Click on the Play button to continue fetching the data or the Detach button to stop monitoring the data and continue the package execution.
After executing the package, check the "Progress" tab to read the package execution log.
Inserting Multiple Flat Files into the Same Table
Consider that we need to loop over several flat files stored within a directory with the same structure. Go back to the control flow and add a "ForEach Loop Container." Move the data flow task into the container (as shown in the screenshot below) and connect the execute SQL task to the container instead of the data flow task.
Working with Variables and Expressions
Right-click in the control flow area and click on "Variables."
SSIS variables store values that can be used and updated during the execution of an SSIS package, allowing for dynamic control and configuration of tasks and data flows. In the Variables windows, click on "Add variable."
Add a new string variable named SourceFilePath and set the default value to the "Australia.csv" file.
Now, double-click on the ForEach Loop container. Change the Enumerator type to "Foreach File Enumerator." Select the Source directory and change the Files filter to *.csv to only read the CSV files. Make sure that file names are retrieved in a fully qualified format (full path).
Next, go to the Variable Mappings tab and map the file name retrieved by each loop to the "SourceFilePath" variable we created previously. This allows the full path of each file to be read using this variable on each iteration.
Now, we should configure the flat file connection manager to read the input files. Click on the "Source File" connection manager. In the Properties tab, click on the button in the Expressions property.
Add a new expression to the "ConnectionString" property as follows: @[User::SourceFilePath]
This will make the flat file connection manager link to the file path stored within the SourceFilePath variable.
Let's try to execute the package. Note that the data flow task is executed several times.
After the package execution finishes, go to the Progress tab and note that several files were manipulated.
Checking the Package Structure
To check how the SSIS package is structured, click on the "Package Explorer" tab to see the package tasks and components tree.
Finally, we can run the following query from SSMS to check how many records were imported from each file into the destination table.
SELECT [SourceFileName], COUNT(*) as RecordCount FROM [Test].[dbo].[InternationalSales] GROUP BY [SourceFileName]
We can note that six files were mentioned in the table.
Next Steps
- SQL Server Integration Services SSIS 2016 Tutorial
- SQL Server Integration Services Development Tips, Tutorials, Webinars and More
About the author
This author pledges the content of this article is based on professional experience and not AI generated.
View all my tips
Article Last Updated: 2024-07-17