By: Jeffrey Yao | Updated: 2016-12-12 | Comments (4) | Related: > Security
Problem
We usually need to restore our production databases to our QA or other environment for various reasons, but due to data security requirements we need to mask sensitive data, like customer name, addresses, email addresses, phones numbers, etc.
To meet our various requirements for data quality, we want the following rules for our data during the masking transformation:
- A character, i.e. a to z or A to Z, should still be a character in the same case after masking the transaction
- Within the context of the 1st rule, a vowel, i.e. [aeiou] should still be a vowel [aeiou], while a consonant should still be a consonant, after masking.
- A number, i.e. 0 to 9, should still be a number after masking
- A non alphanumeric letter, like space, comma etc., remains as it is without change.
- Some part of a string can be required to not change, for example the first 3 digits (i.e. area code) of the phone number. We may need data like this, so we can calculate geographic statistics.
We do not want to simply update the data to be the same value of 'xxxxx' for all records, because that will compromise our test cases or reporting quality.
Solution
The solution is straight-forward, we need to process each string character by one character and do the transformation according to the rules.
Here is the detailed algorithm design:
- We will create a function that accepts a string as input (with a few other control input parameters) and return a masked string.
- We will create three arrays, one for vowels, one for consonants, and another for numbers
- We will parse each character of the input string and process as per rules. For example, if the character is a vowel, we will generate a random number and use this number as the index for the array of the vowels (character case will be taken into consideration). Similar logic will be used for consonants and numbers.
- If the character is not alphanumeric, there is no change.
This is the core of the algorithm and more will be discussed after all parameters are introduced.
So here is the function signature design:
string RandomizeString (string SourceString, bool KeepDigit, StartNoChange int, StopNoChange int, Direction char) where SourceString: the source string we want to do the conversion KeepDigit: if the source string contains a numeric digits, do we want to keep it as it is or change to other random digit? StartNoChange: we may want to keep a part of the source string unchanged, so this is the start position of that no-change part. This is 0 based EndNochange: we may want to keep a part of the source string unchanged, so this is the end position of that no-change part. Direction: only accepts two values 'L' = counting from Left, 'R'= counting from Right. This is used when we want to keep part of the string unchanged if we want to keep the last 3 characters unchanged, the parameters will be StartChange=0, EndNoChange=2, and Direct = 'R'
So with these parameters, the algorithm needs some more logic.
- If KeepDigit = true, if a character is a numeric digit, no transformation, otherwise, use a random digit to replace it
- If Direction = 'R', we need to re-calculate the StartNoChange and EndNoChange to make sure the same position if we parse from the left, for example, if a string is 'abcdef', if Direction='R', StartNoChange=0 and EndNoChange=2, we want to re-calculate to make StartNoChange=3 and EndNoChange=5 as we are parsing the string from left to right.
- If a character's position in the source string is between StartNoChange and EndNoChange, no transformation is done.
There are two practical implementations, one is to create a CLR function and another is to use a non-CLR function like a PowerShell solution. Each has its own unique strengths.
For the CLR function, the strength is performance and the weakness is maintenance cost and possibly company policy restrictions.
For the PowerShell approach, the strength is flexibility, i.e. you do not need to install anything inside SQL Server instance. The weakness is performance.
I will provide source code for both solutions, so readers can choose according to their needs.
PowerShell Solution
The PowerShell solution can be best used against small tables with tens of thousands records.
<#.Synopsis Mask a string to hide the real value of the string .DESCRIPTION Mask a string by changing each character randomly. .EXAMPLE Randomize-String -SourceData "Hello World, good morning, it is 8:10am!" This will return a masked string like the following Xecke Muwwf, fuot cutjisk, oh ef 7:95ez! .EXAMPLE Randomize-String -SourceData "8:30 am, Hello World, good morning, it is 8:10am, nice2!" -StartNoChange 7 no change should happen starting at the 8th letter (StartNoChange is 0 based), you will get a string similar like 7:08 un, Hello World, good morning, it is 8:10am, nice2! .INPUTS SourceData: The source string to be masked KeepDigit: switch parameter, if present, numeric digits will not be masked. StartNoChange: int value, starting at this position, no masking needed. Default is 0 EndNochange: int value, ending at this position, no masking needed. Default is -1, meaning the end of the string. Direction: can be 'L' or 'R', 'L', 'L' means parsing string from Left. 'R' means parsing string from Right. .OUTPUTS A randomly masked string. #> #requires -version 3.0 function Randomize-String { [CmdletBinding()] [Alias()] [OutputType([String])] Param ( # Param1 help description [Parameter(Mandatory=$true, ValueFromPipelineByPropertyName=$true)] [AllowEmptyString()] [string]$SourceData, [Parameter(Mandatory=$false, ParameterSetName='Position')] [ValidateScript( {if ($_ -ge 0) { $true } else {throw "$_ must be 0 or greater"} } )] [int]$StartNoChange=0, [Parameter(Mandatory=$false, ParameterSetName='Position')] [int]$EndNoChange=-1, [Parameter(Mandatory=$false, ParameterSetName='Position' )] [ValidateSet('L', 'R')] [char]$Direction = 'L', #default from left [Parameter(Mandatory=$false )] [switch] $KeepDigit ) Begin { [char[]]$vowels = 'A', 'E', 'I', 'O', 'U', 'a','e', 'i','o','u'; [char[]]$consonants = 'BCDFGHJKLMNPQRSTVWXYZbcdfghjklmnpqrstvwxyz'.ToCharArray(); [char[]]$digits ='0123456789'.ToCharArray(); [string]$target = ""; [int]$len = 0; [bool] $rtn_source = $false; [int]$seed = 0; #processing parameters if (($sourceData -eq $null) -or ($SourceData -eq [System.DBNull]::Value) -or $SourceData -eq '') { $rtn_source = $true; return; } if ($PSBoundParameters.ContainsKey('StartNoChange') ) { $len = $SourceData.Length; if ($EndNoChange -le 0) { if ($Direction -eq 'L') #counting from left to right { $EndNoChange = $len -1; #$NoChange = $SourceData.Substring($StartNoChange); } else { $EndNoChange = $len-$StartNoChange-1; $StartNoChange = 0; } } if ($Direction -eq 'R') #counting from left to right # counting from right to left { $StartNoChange = $len -$StartPostion-1; $EndNoChange = $len - $EndNoChange -1; #swap $StartNoChange, $EndNoChange [int]$tmp = $StartNoChange; $StartNoChange = $EndNoChange ; $EndNoChange = $tmp; } }# para StartNoChange is used } #begin Process { if($rtn_source) { return; } $r = New-Object System.Random; [int]$n=0; [int] $i = 0; [char[]]$ac= $SourceData.ToCharArray(); for ($i=0; $i -lt $ac.length; $i++) { $c = $ac[$i]; if ( ($i -ge $StartNoChange) -and ($i -le $EndNoChange) ) { $target += $c; continue; } if ($c -in $consonants) { $seed = [System.DateTime]::UtcNow.Ticks % 100000000; $n = $r.Next($seed)%21; #$r.Next(0,21); if (($c - [system.char]::ToUpper($c)) -eq 32) { $target += $consonants[21+$n] } else { $target += $consonants[$n]; } } elseif ($c -in $vowels) { $seed = [System.DateTime]::UtcNow.Ticks % 100000000; $n = $r.Next($seed)%5; if (($c - [system.char]::ToUpper($c)) -eq 32) { #$n = $r.Next(0,5); $target += $vowels[5+$n] } else { #$n = $r.Next(0,5); $target += $vowels[$n]; } } elseif ($c -in $digits) { if ($KeepDigit) { $target += $c; } else { $seed = [System.DateTime]::UtcNow.Ticks % 100000000; $n = $r.Next($seed)%10; $target += $digits[$n]; } } else # for non-alpha-numeric characters, such as space, comma, bracket etc { $target += $c; } }#for loop } End { if ($rtn_source) { if ($SourceData -eq [System.DBNull]::Value) { write-output ([System.DBNull]::Value) } else {write-out $SourceData; } } else { Write-Output $target; } } } #randomize-string
Let's look at a few examples here.
Randomize-String -Source "Hello World, it is 8:10:33 am now !" Randomize-String -Source "Hello World, it is 8:10:33 am now !" -KeepDigit
The result looks like this:
For the 1st command, every alphanumeric character is changed, but if the character is upper/lower case, it remains upper/lower case after the change. The non-alphanumeric characters, such as comma, space, etc. are not changed.
For the 2nd command, the numeric characters are not changed, i.e. 8:10:33 remains the same, while other characters are changed.
Now let's look at examples using the other parameters.
Randomize-String -Source "Hello World, good morNing" -StartNoChange 0 -EndNoChange 3andomize-String -Source "Hello World, good morNing" -StartNoChange 0 -EndNoChange 3 -Direction R
The result looks like this:
For the 1st command, the first 4 characters are not changed (because Direction = 'L', the default value), the rest are changed.
For the 2nd command, the last 4 characters are not changed, while the rest are changed.
SQL CLR Function Solution
SQL CLR function is well-known for its great performance, especially related to string operations.
using System; using System.Data; using System.Data.SqlClient; using System.Data.SqlTypes; using Microsoft.SqlServer.Server; using System.Linq; public partial class UserDefinedFunctions { [Microsoft.SqlServer.Server.SqlFunction] public static SqlString RandomizeString(SqlString SourceString, bool KeepDigit, Int32 StartNoChange, Int32 EndNoChange, char Direction) { char[] consonants = "BCDFGHJKLMNPQRSTVWXYZbcdfghjklmnpqrstvwxyz".ToCharArray(); char[] vowels = "AEIOUaeiou".ToCharArray(); char[] digits = "0123456789".ToCharArray(); char[] src_array; //Stopwatch sw = Stopwatch.StartNew(); //while (sw.ElapsedTicks <= 1) { }; //sw.Stop(); int seed = (int) DateTime.Now.Ticks%10000000; Random rdm = new Random(seed); int n = 0, i = 0, len = 0; string rtn = string.Empty; try { Direction = char.ToUpper(Direction); if (Direction != 'L' && Direction != 'R') { throw new Exception("Invalid Parameter [Direction], which needs to be 'L' or 'R'"); } if (SourceString.IsNull) { return new SqlString(); } src_array = SourceString.Value.ToCharArray(); len = src_array.Length; if (EndNoChange <= 0 && StartNoChange > 0) { if (Direction == 'L') { EndNoChange = len; } else { EndNoChange = len - StartNoChange - 1; StartNoChange = 0; } } if (Direction == 'R' ) { StartNoChange = len - StartNoChange - 1; EndNoChange = len - EndNoChange - 1; int tmp = StartNoChange; StartNoChange = EndNoChange; EndNoChange = tmp; } for (i = 0; i <= len - 1; i++) { char c = src_array[i]; if (i >= StartNoChange && i <= EndNoChange) { rtn += c; continue; } if (vowels.Contains(c)) { seed = (int)(DateTime.UtcNow.Ticks % 1000000000); n = rdm.Next(seed) % 5; if ((c - System.Char.ToUpper(c)) == 32) { n += 5; } rtn += vowels[n]; } else if (consonants.Contains(c)) { seed = (int) (DateTime.UtcNow.Ticks % 1000000000); n = rdm.Next(seed) % 21; if ((c - System.Char.ToUpper(c)) == 32) { n += 21; } rtn += consonants[n]; } else if (digits.Contains(c)) { if (KeepDigit) { rtn += c; } else { seed = (int)(DateTime.UtcNow.Ticks % 1000000000); n = rdm.Next(seed) % 10; rtn += digits[n]; } } else // for all other non-alphanumeric letters, we just keep it { rtn += c; } } //for loop } //try catch (Exception ex) { throw new Exception("\n\n" + ex.Message); } // SqlString ReturnString; // Put your code here return new SqlString(rtn); } // RandomizeString }
To create the function in SQL Server, we first need to compile this into a DLL file. I used Visual Studio 2015 Community version and set the function to SQL Server 2012 type. Once the DLL file is ready, we can do the following.
use mssqltips; -- change to your own alter database mssqltips set trustworthy on; go exec sp_configure 'clr enabled', 1; reconfigure; go CREATE ASSEMBLY [clrFunction] AUTHORIZATION [dbo] from 'C:\Users\jeffrey\Documents\Visual Studio 2015\Projects\clrFunction\clrFunction\bin\Debug\clrFunction.dll'; --change to your location go CREATE FUNCTION [dbo].[fn_RandomizeString] (@SourceString nvarchar(max), @KeepDigit bit=0, @StartNoChange int =0, @EndNoChange int=-1, @Direction NChar(1)='L') RETURNS nvarchar(max) AS EXTERNAL NAME [clrFunction].[UserDefinedFunctions].[RandomizeString]; go
We can run a few T-SQL statements quickly to test the function.
use MSSQLTipsselect [dbo].[fn_RandomizeString]('Hello World, it is 8:10:33 am now !', default, default, default, default); waitfor delay '00:00:00.020'; select [dbo].[fn_RandomizeString]('Hello World, good morNing', 1, default,default, default);
Here is the result.
Another test for the other parameters.
use MSSQLTipselect [dbo].[fn_RandomizeString]('Hello World, good morNing', default, 0, 3, default); waitfor delay '00:00:00.020'; select [dbo].[fn_RandomizeString]('Hello World, good morNing', default, 0, 3, 'R');
Here is the result.
Real Test on Masking an Entire Table
We first prepare the test environment, I am using SQL Server 2012 here.>
We are going to create an Address table generated with data from [AdventureWorks2012].[person].[Address].
use MSSQLTips; if object_id('dbo.Address', 'U') is not null drop table dbo.Address; go create table dbo.address (AddressID int identity primary key , AddressLine1 nvarchar(120) , AddressLine2 nvarchar(128) , PostalCode nvarchar(30)) go insert into dbo.Address (AddressLine1, AddressLine2, PostalCode) select AddressLine1, AddressLine2, PostalCode from AdventureWorks2012.person.Address -- assume your have AdventureWorks2012 installed go 5 -- generate about 98,000 records
Now let's prepare the PowerShell test code.
#this is to mask a sql server table with random values for its varchar type of columns #change connection string as per your environment $conn = new-object "System.Data.SqlClient.SqlConnection" ("server=$env:computername; database=mssqltips; trusted_connection=$true"); $da = new-object "System.Data.SqlClient.SqlDataAdapter" ("select * from dbo.address", $conn); $conn.Open(); $tbl = New-Object System.Data.datatable; [System.Data.SqlClient.SqlCommandBuilder] $cb = New-Object "System.Data.Sqlclient.sqlCommandBuilder"($da); $da.Fill($tbl) | Out-Null; write-host "starting to update..." -ForegroundColor Yellow; $dt = get-date; for ($i = 0; $i -lt $tbl.rows.count; $i++) { $tbl.Rows[$i].AddressLine1 = Randomize-String -SourceData $tbl.Rows[$i].AddressLine1 -KeepDigit; if ($tbl.Rows[$i].AddressLine2 -ne [System.DBNull]::Value) { $tbl.Rows[$i].AddressLine2 = Randomize-String -SourceData $tbl.Rows[$i].AddressLine2;} $tbl.Rows[$i].PostalCode = Randomize-String -SourceData $tbl.Rows[$i].PostalCode -StartNoChange 0 -EndNoChange 2; } $da.update($tbl) | out-null; (get-date) - $dt; # display how long it takes $conn.close(); $da.Dispose();
On my 5 year old laptop, it takes about 3 minutes to do the work.
To better understand the PS test code, you may need to know how to update a data source via DataAdapters as shown in this MSDN link.
We will test the performance of CLR function approach. This is actually much easier and more concise as we only need a few lines of T-SQL code.
use MSSQLTipsdeclare @dt datetime = getdate(); update dbo.Address set [AddressLine1]=dbo.fn_randomizestring(AddressLine1, 1, default, default, default) , [AddressLine2]= dbo.fn_randomizestring(AddressLine2, default, default, default, default) , [PostalCode] = dbo.fn_RandomizeString(PostalCode, default, 0, 2, default); select [ElapsedTime_seconds]= datediff(second, @dt, getdate());
The result is amazing, total time to do the same amount of updates takes only 5 seconds.
Do a quick select and we can see the updated as shown below.
Summary
In this tip, we provided two approaches to mask sensitive data in a SQL Server table, this is often called data obfuscation after setting up the data from production in the lower level environments, like QA, UAT, PRE or Dev environments. The PowerShell approach is flexible by using ADO.Net to update data without needing to install anything on the SQL Server, but its performance is not as good to handle large amounts of data. The SQL CLR function is more complex for maintenance and needs to be installed on the target server, but the performance is super.
Actually there is a 3rd option, i.e. using native T-SQL to create a UDF (User-Defined Function). The performance of this T-SQL UDF is better than PowerShell solution, but still much worse (i.e. 7 to 9 times longer) than the CLR solution.
If you want to have a more sophisticated solution, you can use third party tools as mentioned in the Next Steps section below.
Next Steps
You can read the following articles to better understand data masking and other methods to do this masking work:
- Masking Personal Identifiable SQL Server Data
- Protect and Mask Sensitive SQL Server Data for Development and Test Environments
- Dynamic Data Masking in SQL Server 2016
About the author
This author pledges the content of this article is based on professional experience and not AI generated.
View all my tips
Article Last Updated: 2016-12-12