SQL Server Regular Expressions for Data Validation and Cleanup

By:   |   Updated: 2011-02-09   |   Comments (10)   |   Related: 1 | 2 | > Functions User Defined UDF


Problem

You need to provide data validation at the server level for complex strings like phone numbers, email addresses, etc. You may also need to do data cleanup / standardization before moving it from source to target. Although SQL Server provides a fair number of string functions, the code developed with these built-in functions can become complex and hard to maintain or reuse.

Solution

.NET Base Class Library provides the System.Text.RegularExpressions namespace which contains classes with access to the .NET Framework regular expressions engine. Using these classes you can use .NET languages in addition to the T-SQL programming language to create database objects like stored procedures and functions to operate on strings to retrieve and update data in SQL Server databases.

For example, a regular expression can be applied to a text field, to search for a first pattern, then modify the text field to conform to a second pattern. Regular expressions use a language of their own, specifically designed for string processing, with some of the most popular listed below:

  • Internet Email: \w+([-+.']\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*
  • Internet URL: http(s)?://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?
  • US Phone Number: ((\(\d{3}\) ?)|(\d{3}-))?\d{3}-\d{4}
  • US Social Security Number: \d{3}-\d{2}-\d{4}
  • US ZIP Code: \d{5}(-\d{4})?
  • URI: \b(\S+)://([^:]+)(?::(\S+))?\b

Describing how these regular expressions are constructed and work is beyond the scope of this article.

 

In the tip presented here we will show the code necessary to develop a SQL CLR C# function that validates an input string, and in the case of finding a match, outputs a transformed string. (To learn more about creating a CLR function, refer to this tip.) To be more concrete, the function takes an input string believed to contain a character representation of a phone number; if the assumption is true, the function will output a string in a compact form, containing only the digits.

Although pretty straightforward, the code sample below has comments to improve clarity. 

This function takes three parameters:

  • InputString
  • MatchPattern
  • ReplacementPattern
using Microsoft.SqlServer.Server;
using System.Data.SqlTypes;
// namespace to work with regular expressions
using System.Text.RegularExpressions;
public class cls_RegularExpressions
{
  [SqlFunction]
  public static SqlString ReplaceMatch(
    SqlString InputString,
    SqlString MatchPattern,
    SqlString ReplacementPattern)
  {
    try
    {
      // input parameters must not be NULL
      if (!InputString.IsNull &&
          !MatchPattern.IsNull &&
          !ReplacementPattern.IsNull)
      {
      // check for first pattern match
      if (Regex.IsMatch(InputString.Value,
                        MatchPattern.Value))
        // match found, replace using second pattern and return result
        return Regex.Replace(InputString.Value,
                            MatchPattern.Value,
                            ReplacementPattern.Value);
      else
        // match not found, return NULL
        return SqlString.Null;
      }
      else
        // if any input paramater is NULL, return NULL
        return SqlString.Null;
    }
    catch
    {
      // on any error, return NULL
      return SqlString.Null;
    }
  }
};

A sample SQL Server run is shown below, where we pass in the phone number, the match pattern and the replacement pattern.  The first two examples pass the conversion, but the third example fails because of two "))" after the area code.

select dbo.ReplaceMatch
('(129).673-4192', '^\(?([0-9]{3})\)?[-. ]?([0-9]{3})[-. ]?([0-9]{4})$', '$1$2$3')
go;
returned value: 1296734192
--------------------------
select dbo.ReplaceMatch
('(129.673-4192', '^\(?([0-9]{3})\)?[-. ]?([0-9]{3})[-. ]?([0-9]{4})$', '$1$2$3')
go;
returned value: 1296734192
--------------------------
select dbo.ReplaceMatch
('(129)).673-4192', '^\(?([0-9]{3})\)?[-. ]?([0-9]{3})[-. ]?([0-9]{4})$', '$1$2$3')
go;
returned value: NULL
--------------------

The advantages of the ReplaceMatch function are straightforward as it can be used inline of SQL statements and with a whole spectrum of regular expressions within your SQL Server (including Express/Compact) environment. You could also create a second function that will only check for a match and return a boolean value, without doing any pattern replacement.

Regular expressions provide a very powerful text-processing technology that every Developer/DBA should be at least familiar with.

Our tip here just attempts to get you started writing your own SQL CLR objects embedding this technology.

Next Steps
  • Compile, deploy, and use the ReplaceMatch function; enhance it to suit your needs.
  • Explore other uses of the ReplaceMatch function with different regular expressions.
  • Include using regular expressions and CLR in your SQL Server development toolkit.
  • Check other SQL CLR tips on this site.


sql server categories

sql server webinars

subscribe to mssqltips

sql server tutorials

sql server white papers

next tip



About the author
MSSQLTips author Alex Tocitu Alex Tocitu is an engineer with an MSEE in Automation and MBA in Finance.

This author pledges the content of this article is based on professional experience and not AI generated.

View all my tips


Article Last Updated: 2011-02-09

Comments For This Article




Wednesday, December 30, 2015 - 11:01:13 AM - Oded Dror Back To Top (40330)

Hi there,

 I created the DLL but can you send the function dbo.ReplaceMatch you created

Thanks,

Ed Dror

 

 

 


Monday, March 14, 2011 - 4:30:11 PM - Eric Back To Top (13210)

Alex,

That is an informative article.  Regular expressions can be very useful for validation and even information extraction.  The nice thing about SQL Server is that the .NET framework can be utilized in so many different places.  I wrote a regular expression component for SQL Server Integration Services (SSIS) that can validate fields and extract backreferences from within a SSIS data flow.   If you are interested check out:

http://regextractor.codeplex.com

The RegExtractor is free and open source.  Enjoy!  But please provide me with feedback!

Eric


Thursday, March 3, 2011 - 4:41:03 AM - amol Back To Top (13103)

Nice article ...

Thanks,
-Amol
- http://www.sqlsuperfast.com

 


Thursday, March 3, 2011 - 3:09:28 AM - Dishdy Back To Top (13102)

AdamK,
This is a very nice web site regarding regular expressions and it is now in my favorites.
Do you know of any sources which does a similar job for SQL Server's PATINDEX function?
Thnks.


Thursday, March 3, 2011 - 2:56:23 AM - AdamK Back To Top (13101)

Regexes for email and URL are wrong. Full email regex is much more complicated (http://www.regular-expressions.info/email.html). URL does not have to start with http.


Thursday, March 3, 2011 - 2:42:48 AM - Dishdy Back To Top (13100)

So you're saying there is no equivalent PATINDEX of your list for emails, phone numbers, etc.?


Wednesday, February 9, 2011 - 2:40:23 PM - Alex Tocitu Back To Top (12893)

John,

Thanks for your comments.

Do you have any benchmarks and regex patterns that you've used against large sets of data?

What were you trying to accomplish?

Alex


Wednesday, February 9, 2011 - 2:07:14 PM - Alex Tocitu Back To Top (12892)

Mbourdon,
Thanks for your question;
Please call the function with parameters below
InputString; your string,
MatchPattern: '[\x00-\x1F]',
ReplacementPattern: ''
The function will return NULL
or a cleaned output if InputString contains ASCII 0-31 or 00-1F (HEX)


Wednesday, February 9, 2011 - 12:08:31 PM - John Fox Back To Top (12887)
Regular expressions can be great tools for data validation and manipulation. But they can also be CPU hogs, it takes CPU horsepower to evaluate expressions, especially as they become more and more complex. I'm not suggesting to not use them, but to be careful using them and watch your CPU overhead if they are used against large sets of data. Remember that they will be evaluated on the database machine, not on the application server itself. Consider the resource requirements and the optimal server placement, sometimes it might make more sense to place them on an application or web server instead of in the database. Here is a good starting article on how to use them. Be forewarned, there are many different regex engines, and they don't all work the same. Be sure to check the CLR documentation for specifics. http://www.regular-expressions.info/tutorial.html

Wednesday, February 9, 2011 - 11:17:41 AM - Mbourgon Back To Top (12886)

Alex, thanks for that.  How would you change that code (or maybe just how it's called) so that it would strip out, say, any low-ascii characters (under ASCII(31))?  I'd like to take a large string, pass it in, and get out clean data.  Everything but low-ascii would count as good.  Thanks.















get free sql tips
agree to terms