SQL Server Functions to Mask Confidential Data using CLR or PowerShell

By:   |   Updated: 2016-12-12   |   Comments (4)   |   Related: > Security


Problem

We usually need to restore our production databases to our QA or other environment for various reasons, but due to data security requirements we need to mask sensitive data, like customer name, addresses, email addresses, phones numbers, etc.

To meet our various requirements for data quality, we want the following rules for our data during the masking transformation:

  1. A character, i.e. a to z or A to Z, should still be a character in the same case after masking the transaction
  2. Within the context of the 1st rule, a vowel, i.e. [aeiou] should still be a vowel [aeiou], while a consonant should still be a consonant, after masking.
  3. A number, i.e. 0 to 9, should still be a number after masking
  4. A non alphanumeric letter, like space, comma etc., remains as it is without change.
  5. Some part of a string can be required to not change, for example the first 3 digits (i.e. area code) of the phone number. We may need data like this, so we can calculate geographic statistics.

We do not want to simply update the data to be the same value of 'xxxxx' for all records, because that will compromise our test cases or reporting quality.

Solution

The solution is straight-forward, we need to process each string character by one character and do the transformation according to the rules.

Here is the detailed algorithm design:

  1. We will create a function that accepts a string as input (with a few other control input parameters) and return a masked string.
  2. We will create three arrays, one for vowels, one for consonants, and another for numbers
  3. We will parse each character of the input string and process as per rules. For example, if the character is a vowel, we will generate a random number and use this number as the index for the array of the vowels (character case will be taken into consideration). Similar logic will be used for consonants and numbers.
  4. If the character is not alphanumeric, there is no change.

This is the core of the algorithm and more will be discussed after all parameters are introduced.

So here is the function signature design:

string RandomizeString (string SourceString, bool KeepDigit, StartNoChange int, StopNoChange int, Direction char)

where
SourceString: the source string we want to do the conversion

KeepDigit: if the source string contains a numeric digits, do we want to keep it as it is or change to other random digit?

StartNoChange: we may want to keep a part of the source string unchanged, so this is the start position of that no-change part. This is 0 based
EndNochange: we may want to keep a part of the source string unchanged, so this is the end position of that no-change part.

Direction: only accepts two values 'L' = counting from Left, 'R'= counting from Right. This is used when we want to keep part of the string unchanged
if we want to keep the last 3 characters unchanged, the parameters will be StartChange=0, EndNoChange=2, and Direct = 'R'

So with these parameters, the algorithm needs some more logic.

  1. If KeepDigit = true, if a character is a numeric digit, no transformation, otherwise, use a random digit to replace it
  2. If Direction = 'R', we need to re-calculate the StartNoChange and EndNoChange to make sure the same position if we parse from the left, for example, if a string is 'abcdef', if Direction='R', StartNoChange=0 and EndNoChange=2, we want to re-calculate to make StartNoChange=3 and EndNoChange=5 as we are parsing the string from left to right.
  3. If a character's position in the source string is between StartNoChange and EndNoChange, no transformation is done.

There are two practical implementations, one is to create a CLR function and another is to use a non-CLR function like a PowerShell solution. Each has its own unique strengths.

For the CLR function, the strength is performance and the weakness is maintenance cost and possibly company policy restrictions.

For the PowerShell approach, the strength is flexibility, i.e. you do not need to install anything inside SQL Server instance. The weakness is performance.

I will provide source code for both solutions, so readers can choose according to their needs.

PowerShell Solution

The PowerShell solution can be best used against small tables with tens of thousands records.

<#.Synopsis
   Mask a string to hide the real value of the string
.DESCRIPTION
   Mask a string by changing each character randomly.
.EXAMPLE
   Randomize-String -SourceData "Hello World, good morning, it is 8:10am!" 
   
   This will return a masked string like the following
   
   Xecke Muwwf, fuot cutjisk, oh ef 7:95ez!

.EXAMPLE
   Randomize-String -SourceData "8:30 am, Hello World, good morning, it is 8:10am, nice2!"  -StartNoChange 7

    no change should happen starting at the 8th letter (StartNoChange is 0 based), you will get a string similar like

    7:08 un, Hello World, good morning, it is 8:10am, nice2!

.INPUTS
   SourceData: The source string to be masked
   KeepDigit: switch parameter, if present, numeric digits will not be masked. 
   StartNoChange: int value, starting at this position, no masking needed. Default is 0
   EndNochange: int value, ending at this position, no masking needed. Default is -1, meaning the end of the string.
   Direction: can be 'L' or 'R', 'L', 'L' means parsing string from Left. 'R' means parsing string from Right.

.OUTPUTS
   A randomly masked string.
#>
#requires -version 3.0
function Randomize-String
{
    [CmdletBinding()]
    [Alias()]
    [OutputType([String])]
    Param
    (
        # Param1 help description
        [Parameter(Mandatory=$true, ValueFromPipelineByPropertyName=$true)]
        [AllowEmptyString()]
        [string]$SourceData,
        [Parameter(Mandatory=$false, ParameterSetName='Position')]
        [ValidateScript( {if ($_ -ge 0) { $true } else {throw "$_ must be 0 or greater"}  }  )]
        [int]$StartNoChange=0,
        [Parameter(Mandatory=$false, ParameterSetName='Position')]
        [int]$EndNoChange=-1,
        [Parameter(Mandatory=$false, ParameterSetName='Position' )]
        [ValidateSet('L', 'R')]
        [char]$Direction = 'L', #default from left
        [Parameter(Mandatory=$false )]
        [switch] $KeepDigit
    )
    Begin
    {
        [char[]]$vowels = 'A', 'E', 'I', 'O', 'U', 'a','e', 'i','o','u';
        [char[]]$consonants = 'BCDFGHJKLMNPQRSTVWXYZbcdfghjklmnpqrstvwxyz'.ToCharArray();
        [char[]]$digits ='0123456789'.ToCharArray();
        [string]$target = "";
        [int]$len = 0;
        [bool] $rtn_source = $false;
        [int]$seed = 0;
        #processing parameters
        if (($sourceData -eq $null) -or ($SourceData -eq [System.DBNull]::Value) -or $SourceData -eq '')
        {
            $rtn_source = $true;
            return;
         }
      
 
        if ($PSBoundParameters.ContainsKey('StartNoChange') )
        {
            $len = $SourceData.Length;
            if   ($EndNoChange -le 0)
            { if ($Direction -eq 'L') #counting from left to right
              { $EndNoChange = $len -1;
                #$NoChange = $SourceData.Substring($StartNoChange);
              }
              else
              { $EndNoChange = $len-$StartNoChange-1;
                $StartNoChange = 0;

              }
          
            }
            
              
            if ($Direction -eq 'R') #counting from left to right # counting from right to left
            {
               $StartNoChange = $len -$StartPostion-1;
               $EndNoChange = $len - $EndNoChange -1;
              
               #swap $StartNoChange, $EndNoChange
               [int]$tmp = $StartNoChange;
               $StartNoChange = $EndNoChange ;
               $EndNoChange = $tmp;
            }
          

        }# para StartNoChange is used

    } #begin
    Process
    {
        if($rtn_source)
        { return; }
        $r = New-Object System.Random;
        [int]$n=0;
        [int] $i = 0;
        [char[]]$ac= $SourceData.ToCharArray();
        for ($i=0; $i -lt $ac.length; $i++)
        {
 
          $c = $ac[$i];
          if ( ($i -ge $StartNoChange) -and ($i -le $EndNoChange) )
          { $target += $c; continue; }
 
          if ($c -in $consonants)
          {  
             $seed = [System.DateTime]::UtcNow.Ticks % 100000000;
             $n = $r.Next($seed)%21; #$r.Next(0,21);
    
             if (($c - [system.char]::ToUpper($c)) -eq 32)
             {
                $target += $consonants[21+$n]
             }
             else
             {
                $target += $consonants[$n];
             }
          }
          elseif ($c -in $vowels)
          {
             $seed = [System.DateTime]::UtcNow.Ticks % 100000000;
             $n = $r.Next($seed)%5;
             if (($c - [system.char]::ToUpper($c)) -eq 32)
             {
                #$n = $r.Next(0,5);
                $target += $vowels[5+$n]
             }
             else
             {
                #$n = $r.Next(0,5);
                $target += $vowels[$n];
             }
          }
          elseif ($c -in $digits)
          {
               if ($KeepDigit)
               { $target += $c; }
               else
               {
                 
                $seed = [System.DateTime]::UtcNow.Ticks % 100000000;
                $n = $r.Next($seed)%10;
                $target += $digits[$n];
               }
          }
          else # for non-alpha-numeric characters, such as space, comma, bracket etc
          {    $target += $c; }
        }#for loop
    }
    End
    {
        if ($rtn_source)
        { if ($SourceData -eq [System.DBNull]::Value)
          { write-output ([System.DBNull]::Value) }
          else 
          {write-out $SourceData; }
        }
        else
        { Write-Output $target; }
    }
} #randomize-string

Let's look at a few examples here.

Randomize-String -Source "Hello World, it is 8:10:33 am now !" 
Randomize-String -Source "Hello World, it is 8:10:33 am now !" -KeepDigit

The result looks like this:

Example 1

For the 1st command, every alphanumeric character is changed, but if the character is upper/lower case, it remains upper/lower case after the change. The non-alphanumeric characters, such as comma, space, etc. are not changed.

For the 2nd command, the numeric characters are not changed, i.e. 8:10:33 remains the same, while other characters are changed.

Now let's look at examples using the other parameters.

Randomize-String -Source "Hello World, good morNing" -StartNoChange 0 -EndNoChange 3andomize-String -Source "Hello World, good morNing" -StartNoChange 0 -EndNoChange 3 -Direction R

The result looks like this:

Example 2

For the 1st command, the first 4 characters are not changed (because Direction = 'L', the default value), the rest are changed.

For the 2nd command, the last 4 characters are not changed, while the rest are changed.

SQL CLR Function Solution

SQL CLR function is well-known for its great performance, especially related to string operations.

using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
using System.Linq;


public partial class UserDefinedFunctions
{
   [Microsoft.SqlServer.Server.SqlFunction]

   public static SqlString RandomizeString(SqlString SourceString, bool KeepDigit, Int32 StartNoChange, Int32 EndNoChange, char Direction)
    {
        char[] consonants = "BCDFGHJKLMNPQRSTVWXYZbcdfghjklmnpqrstvwxyz".ToCharArray();
        char[] vowels = "AEIOUaeiou".ToCharArray();
        char[] digits = "0123456789".ToCharArray();
        char[] src_array;
        //Stopwatch sw =  Stopwatch.StartNew();
        //while (sw.ElapsedTicks <= 1) { };
        //sw.Stop();
  
        int seed = (int) DateTime.Now.Ticks%10000000;

        Random rdm = new Random(seed);
        int n = 0, i = 0, len = 0;
        string rtn = string.Empty;


        try
        {
            Direction = char.ToUpper(Direction);

            if (Direction != 'L' && Direction != 'R')
            {
                throw new Exception("Invalid Parameter [Direction], which needs to be 'L' or 'R'");
                
            }

            if (SourceString.IsNull)
            {
                return new SqlString();
            }
            src_array = SourceString.Value.ToCharArray();

            len = src_array.Length;
            if (EndNoChange <= 0 && StartNoChange > 0)
            {
                if (Direction == 'L')
                { EndNoChange = len; }
                else
                {
                    EndNoChange = len - StartNoChange - 1;
                    StartNoChange = 0;
                }
            }
            if (Direction == 'R' )
            {
                StartNoChange = len - StartNoChange - 1;
                EndNoChange = len - EndNoChange - 1;
                int tmp = StartNoChange;
                StartNoChange = EndNoChange;
                EndNoChange = tmp;
            }

            for (i = 0; i <= len - 1; i++)
            {
                char c = src_array[i];
                if (i >= StartNoChange && i <= EndNoChange)
                {
                    rtn += c;
                    continue;
                }
                if (vowels.Contains(c))
                {
                    seed = (int)(DateTime.UtcNow.Ticks % 1000000000);
                    n = rdm.Next(seed) % 5;
                    if ((c - System.Char.ToUpper(c)) == 32)
                    { n += 5; }
                    rtn += vowels[n];
                }
                else if (consonants.Contains(c))
                {
                    seed =  (int) (DateTime.UtcNow.Ticks % 1000000000);
                    n = rdm.Next(seed) % 21;
                    if ((c - System.Char.ToUpper(c)) == 32)
                    { n += 21; }
                    rtn += consonants[n];
                }
                else if (digits.Contains(c))
                {
                    if (KeepDigit)
                    {
                        rtn += c;
                    }
                    else
                    {
                        seed = (int)(DateTime.UtcNow.Ticks % 1000000000);
                        n = rdm.Next(seed) % 10;
                        rtn += digits[n];
                    }
                }
                else // for all other non-alphanumeric letters, we just keep it
                {
                    rtn += c;
                }
            } //for loop
        } //try
        catch (Exception ex)
        {
            throw new Exception("\n\n" + ex.Message);
        }

        // SqlString ReturnString;

        // Put your code here
        return new SqlString(rtn);
    } // RandomizeString
 }   

To create the function in SQL Server, we first need to compile this into a DLL file. I used Visual Studio 2015 Community version and set the function to SQL Server 2012 type. Once the DLL file is ready, we can do the following.

use mssqltips; -- change to your own
alter database mssqltips set trustworthy on;
go
exec sp_configure 'clr enabled', 1;
reconfigure;
go
CREATE ASSEMBLY [clrFunction]
    AUTHORIZATION [dbo]
 from 'C:\Users\jeffrey\Documents\Visual Studio 2015\Projects\clrFunction\clrFunction\bin\Debug\clrFunction.dll'; --change to your location
go
CREATE FUNCTION [dbo].[fn_RandomizeString]
(@SourceString nvarchar(max), @KeepDigit bit=0, @StartNoChange int =0, @EndNoChange int=-1, @Direction NChar(1)='L')
RETURNS nvarchar(max)
AS EXTERNAL NAME [clrFunction].[UserDefinedFunctions].[RandomizeString];
go

We can run a few T-SQL statements quickly to test the function.

use MSSQLTipsselect [dbo].[fn_RandomizeString]('Hello World, it is 8:10:33 am now !', default, default, default, default);
waitfor delay '00:00:00.020';
select [dbo].[fn_RandomizeString]('Hello World, good morNing', 1, default,default, default);

Here is the result.

Example 3

Another test for the other parameters.

use MSSQLTipselect [dbo].[fn_RandomizeString]('Hello World, good morNing', default, 0, 3, default);
waitfor delay '00:00:00.020';
select [dbo].[fn_RandomizeString]('Hello World, good morNing', default, 0, 3, 'R');

Here is the result.

Example 4

Real Test on Masking an Entire Table

We first prepare the test environment, I am using SQL Server 2012 here.>

We are going to create an Address table generated with data from [AdventureWorks2012].[person].[Address].

use MSSQLTips;
if object_id('dbo.Address', 'U') is not null 
 drop  table dbo.Address;
go
create table dbo.address (AddressID int identity primary key
, AddressLine1 nvarchar(120)
, AddressLine2 nvarchar(128)
, PostalCode nvarchar(30))
go 

insert into dbo.Address (AddressLine1, AddressLine2, PostalCode)
select AddressLine1, AddressLine2, PostalCode 
from AdventureWorks2012.person.Address -- assume your have AdventureWorks2012 installed
go 5 -- generate about 98,000 records

Now let's prepare the PowerShell test code.

#this is to mask a sql server table with random values for its varchar type of columns

#change connection string as per your environment
$conn = new-object "System.Data.SqlClient.SqlConnection" ("server=$env:computername; database=mssqltips; trusted_connection=$true");

$da = new-object "System.Data.SqlClient.SqlDataAdapter" ("select * from dbo.address", $conn);
$conn.Open();
 
$tbl = New-Object System.Data.datatable;
 
[System.Data.SqlClient.SqlCommandBuilder] $cb = New-Object "System.Data.Sqlclient.sqlCommandBuilder"($da);
$da.Fill($tbl) | Out-Null;

write-host "starting to update..." -ForegroundColor Yellow;
$dt = get-date;
 
for ($i = 0; $i -lt $tbl.rows.count; $i++)
{
  $tbl.Rows[$i].AddressLine1 = Randomize-String -SourceData $tbl.Rows[$i].AddressLine1 -KeepDigit;
  if ($tbl.Rows[$i].AddressLine2 -ne [System.DBNull]::Value) { $tbl.Rows[$i].AddressLine2 = Randomize-String -SourceData $tbl.Rows[$i].AddressLine2;}
  $tbl.Rows[$i].PostalCode = Randomize-String -SourceData $tbl.Rows[$i].PostalCode -StartNoChange 0 -EndNoChange 2;
}

$da.update($tbl) | out-null;
 
(get-date) - $dt; # display how long it takes
 
$conn.close();
$da.Dispose();

On my 5 year old laptop, it takes about 3 minutes to do the work.

PS duration

To better understand the PS test code, you may need to know how to update a data source via DataAdapters as shown in this MSDN link.

We will test the performance of CLR function approach. This is actually much easier and more concise as we only need a few lines of T-SQL code.

use MSSQLTipsdeclare @dt datetime = getdate();
update dbo.Address
set [AddressLine1]=dbo.fn_randomizestring(AddressLine1, 1, default, default, default)
, [AddressLine2]= dbo.fn_randomizestring(AddressLine2, default, default, default, default)
, [PostalCode] = dbo.fn_RandomizeString(PostalCode, default, 0, 2, default);
select [ElapsedTime_seconds]= datediff(second, @dt, getdate());

The result is amazing, total time to do the same amount of updates takes only 5 seconds.

clr duration

Do a quick select and we can see the updated as shown below.

query result

Summary

In this tip, we provided two approaches to mask sensitive data in a SQL Server table, this is often called data obfuscation after setting up the data from production in the lower level environments, like QA, UAT, PRE or Dev environments. The PowerShell approach is flexible by using ADO.Net to update data without needing to install anything on the SQL Server, but its performance is not as good to handle large amounts of data. The SQL CLR function is more complex for maintenance and needs to be installed on the target server, but the performance is super.

Actually there is a 3rd option, i.e. using native T-SQL to create a UDF (User-Defined Function). The performance of this T-SQL UDF is better than PowerShell solution, but still much worse (i.e. 7 to 9 times longer) than the CLR solution.

If you want to have a more sophisticated solution, you can use third party tools as mentioned in the Next Steps section below.

Next Steps

You can read the following articles to better understand data masking and other methods to do this masking work:



sql server categories

sql server webinars

subscribe to mssqltips

sql server tutorials

sql server white papers

next tip



About the author
MSSQLTips author Jeffrey Yao Jeffrey Yao is a senior SQL Server consultant, striving to automate DBA work as much as possible to have more time for family, life and more automation.

This author pledges the content of this article is based on professional experience and not AI generated.

View all my tips


Article Last Updated: 2016-12-12

Comments For This Article




Tuesday, December 13, 2016 - 12:50:29 PM - jeff_yao Back To Top (44960)

 

@Carlos, thanks for your kind words.


Tuesday, December 13, 2016 - 12:49:29 PM - jeff_yao Back To Top (44959)

 @Jeff Moden, I cannot help LOL when I imagine the scenario you mentioned. :-) But hey, we know they are randomly generated, so who cares. 

As for T-SQL UDF, it is actually using the same alogrithm as mentioned in the tip. However, since T-SQL UDF cannot allow any non-deterministic function (like Rand()), we have to create a view first, in which we use Rand() function.


Tuesday, December 13, 2016 - 8:51:15 AM - Jeff Moden Back To Top (44955)

Be very aware that this method could cause problems with "sensitive" people because, any time you randomly generate letters, you can and eventually will spell out some pretty offensive words.  My recommendation would be to at least replace all vowels with nothing.

It would also be interesting to see the code for the 3rd option (T-SQL UDF) that was used but not posted.


Tuesday, December 13, 2016 - 4:31:03 AM - Carlos Malaca Back To Top (44953)

 Dear Jeffrey. Thank you for tour tip. Carlos 

 















get free sql tips
agree to terms