By: Tim Smith | Updated: 2017-06-23
Problem
Most of our ETL tasks cleanse data after we import it into our staging table, and sometimes these cleansing tasks consume too much time and disrupt our flow. When should we cleanse data, and what should we consider as we add new data sources to our environment?
Solution
For an appropriate solution, we have at least three possibilities here regarding "bad" data:
- Some data don't fit the data schema and we would prefer to avoid importing them.
- Some data are invalid records that we cannot use for analysis.
- Some data may be invalid for some analyses, but still provide value for others.
For this solution, we will look at the first two cases, since in the third case the data still provide value and we should avoid removing them.
In the case where we have data that we don't want to import because it doesn't fit the data schema, one way we can save time is to pre-validate each record before inserting it into a database, or into a data table that we will bulk copy into the database. To demonstrate how this works, let's create a CSV file with two bad records:
2017-04-30 22:36:41,Input31,32.51
2017-04-30 22:35:07,Input32,38.60
2017-04-30 22:29:01,Input33,0.0001
2017-04-19 00:29:17,Input34,12.5478
OOPS,Input35,0.0000
2017-04-25 10:00:44,Input36,1358.145
2017-04-29 00:29:17,Input37,DogsAreFunny
2017-05-04 10:29:17,Input38,14.17821
2017-05-06 11:07:14,Input39,17821
Next, we'll test the functionality of declaring an object to save each value after we split each line by comma into an array. We can see from these data that the first item in the array ([0]) should be a datetime, the second item ([1]) should be a string, and the third item ([2]) should be a decimal. In the below PowerShell script, we have the T-SQL ready, but we aren't directly inserting it, as we are only testing that this approach will remove data that don't fit our pre-defined schema:
$nl = [Environment]::NewLine
$readline = Get-Content "C:\Files\commafile.csv"

### Start each run with a fresh log file:
if(!(Test-Path "C:\Files\commafile.log"))
{
    $logfile = New-Item -Path "C:\Files\commafile.log" -ItemType File
}
else
{
    Remove-Item "C:\Files\commafile.log"
    $logfile = New-Item -Path "C:\Files\commafile.log" -ItemType File
}

[int]$i = 1
[int]$errorcount = 0

foreach ($line in $readline)
{
    try
    {
        $linearray = $line.Split(",")
        ### Fails before the insert can happen:
        [datetime]$date = $linearray[0]
        [string]$key = $linearray[1]
        [decimal]$value = $linearray[2]
        $tsql = "INSERT INTO tblTest VALUES ('$date','$key',$value)"
        Write-Host $tsql
    }
    catch [Exception]
    {
        ### Record the line number, the line itself, and the conversion error:
        $saveerror = "Line number: $i" + $nl
        $saveerror += $line + $nl
        $saveerror += $_.Exception.Message + $nl
        Add-Content $logfile $saveerror
        $errorcount++;
    }
    $i++;
}

Write-Host "Total errors: $errorcount"
Add-Content $logfile "Total errors: $errorcount"
When we run this test on our file, we can see the output and that two records are invalidated. These records are also saved to the log file with details, specifying the line number, the line data, and the error. The final line of the log file specifies the total error count.
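If we later want this script to perform the inserts rather than only test them, we could replace the Write-Host call with a database call. Below is a minimal sketch of one way to do that with Invoke-Sqlcmd; the server name, database name, and tblTest table are assumptions for illustration, and the cmdlet requires the SqlServer (or older SQLPS) module:

### Hypothetical instance, database and table - adjust for your environment:
Import-Module SqlServer
$tsql = "INSERT INTO tblTest VALUES ('2017-04-30 22:36:41','Input31',32.51)"
Invoke-Sqlcmd -ServerInstance "localhost" -Database "TestDb" -Query $tsql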
Let's review this approach:
- We create a predefined schema for the data by using objects of specific types - in this case, datetime, string and decimal.
- We attempt to convert each data value on each line into its appropriate type.
- We capture the data conversion failures for tracking purposes.
Would this be faster than importing the data into a SQL Server varchar staging table and cleansing the data there? It depends. In the above code, we're using the built-in Get-Content and Add-Content cmdlets. If I were parsing a file of several gigabytes or more, I would use StreamReader instead, as it's much faster. In addition, I might consider creating a data table, reading each data value on each row into the data table's columns, and finishing with a single bulk copy. Both approaches will work, but the latter might be much faster with a larger data set.
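As a rough illustration of that second approach, below is a minimal sketch that reads the file with StreamReader, loads each valid line into a typed data table, and finishes with a single bulk copy. The column names, destination table (tblTest), and connection string are assumptions for illustration:

### Read the file with StreamReader and build a typed DataTable;
### rows that fail type conversion are skipped (or could be logged as above).
$reader = New-Object System.IO.StreamReader("C:\Files\commafile.csv")
$table = New-Object System.Data.DataTable

### The typed columns act as the pre-defined schema (names are illustrative):
[void]$table.Columns.Add("EventDate",[datetime])
[void]$table.Columns.Add("EventKey",[string])
[void]$table.Columns.Add("EventValue",[decimal])

while (($line = $reader.ReadLine()) -ne $null)
{
    try
    {
        $linearray = $line.Split(",")
        ### Rows.Add throws if a value cannot convert to its column's type:
        [void]$table.Rows.Add([datetime]$linearray[0],[string]$linearray[1],[decimal]$linearray[2])
    }
    catch [Exception]
    {
        ### Invalid row - skip it here, or log it as in the earlier script.
    }
}
$reader.Close()

### One bulk copy at the end instead of row-by-row inserts;
### the connection string and destination table are illustrative only.
$bulkcopy = New-Object System.Data.SqlClient.SqlBulkCopy("Server=localhost;Database=TestDb;Integrated Security=true")
$bulkcopy.DestinationTableName = "tblTest"
$bulkcopy.WriteToServer($table)
$bulkcopy.Close()

With typed columns in the data table, conversion failures still surface per row, but the write to SQL Server happens once at the end rather than once per record.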
What I have found is that on very large data sets, .NET will perform much faster at the object level (in this case, the line and the data values on the line) than T-SQL will on a varchar staging table, since the import then only passes data values that already match the table's schema.
Another common approach is an automated import using PowerShell and T-SQL: the data are imported into a staging table of VARCHAR values, and these values are transformed into their appropriate types after the import step. Depending on our environment and the amount of data, this can be a useful technique, or we may prefer the above method. Below are some considerations.
If we have data scaled horizontally by table, with each table holding a small amount of data, this may save us more time, since we can use the built-in T-SQL functions for parsing or handling exceptional data (see the example below). These functions execute very slowly on large data sets, but on smaller data sets their performance may not interfere with the rest of our ETL process.
Similar to the above point, if we have many flat files to import (thousands) and these data sets are small, this may be a faster setup.
If our data seldom have errors, this approach may be best overall. I've worked with data vendors who never provide bad data, or, if a file fails, I can always submit it back to them and they'll re-issue the file. In these situations, one procedure can follow the import to convert all the data, since I seldom see data issues. The "cleanse" in this case is the vendor re-submitting the data.
In the below code, we use the TRY_PARSE function in T-SQL to replace invalid dates and integers with NULL values, and on smaller data sets this works well. Because we have only a few records here (10,004), these try-parses execute quickly (less than a second). However, if we had more data and more potential for bad data, they might take hours. I've seen try-parses run for hours on a table with millions of records and hundreds of columns.
CREATE TABLE test_BadData(
 TestDate VARCHAR(20),
 TestInt VARCHAR(20)
)
GO

---- Execute this separately and after the above step, otherwise it will try to create 10000 tables
---- with the same name
INSERT INTO test_BadData VALUES ('2017-01-01',100)
GO 10000

INSERT INTO test_BadData VALUES ('NotADate',100)
INSERT INTO test_BadData VALUES ('22.33',100)
INSERT INTO test_BadData VALUES ('2017-01-01','NotAnInt')
INSERT INTO test_BadData VALUES ('2017-01-01','2017-01-01')

---- Bad values become NULLs
SELECT
 TRY_PARSE(TestDate AS DATE) AS TestDate
 , TRY_PARSE(TestInt AS INT) AS TestInt
FROM test_BadData
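For reference, getting raw values into a VARCHAR staging table like this from a flat file is typically just a bulk copy with no validation at all. Below is a minimal sketch of that load; the file path (a hypothetical two-column file of date and integer values), the connection string, and the use of test_BadData as the destination are assumptions for illustration:

### Load raw, unvalidated text values into the VARCHAR staging table.
### File path and connection string are illustrative only.
$staging = New-Object System.Data.DataTable
[void]$staging.Columns.Add("TestDate",[string])
[void]$staging.Columns.Add("TestInt",[string])

foreach ($line in (Get-Content "C:\Files\baddata.csv"))
{
    $linearray = $line.Split(",")
    ### No type checks here - bad values land in the staging table as text
    ### and are handled later by TRY_PARSE:
    [void]$staging.Rows.Add($linearray[0],$linearray[1])
}

$bulkcopy = New-Object System.Data.SqlClient.SqlBulkCopy("Server=localhost;Database=TestDb;Integrated Security=true")
$bulkcopy.DestinationTableName = "test_BadData"
$bulkcopy.WriteToServer($staging)
$bulkcopy.Close()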
Both methods, from pre-cleansing data before import to cleansing data after import, will work well in certain circumstances. When considering either, I would weigh, in order of importance, automation, speed and performance in most environments (if performance is a bottleneck, it would come first).
Next Steps
- When considering data cleansing, start by defining what makes a record bad. From there, we'll know some of the best points in the process for data cleansing.
- If performance is a major concern and the data set is large, consider cleansing the data prior to import.
- If data sets are small or can be scaled, consider data cleansing post import.