By: Jeffrey Yao | Updated: 2017-03-29 | Comments (1) | Related: > SQL Server 2016
Problem
It is well-known that T-SQL does not have native Regular Expression (RegEx) support for string related queries. For example, you cannot search records with a regular expression in WHERE clause. I know I can write a CLR function / stored procedure with a RegEx do the work, but is there any other way that I can query a table with regular expressions in T-SQL code?
Solution
SQL Server 2016 has integrated the R language, i.e. T-SQL can call an R script directly and return the result set consumable by the SQL Server database engine directly. With the power of the R language and its native support for RegEx, we can now query a table with RegEx through R.
We will look at a few examples to demonstrate how to use RegEx in processing queries against tables.
Here are a few requirements:
- Find records with phone numbers which may have different formats
- Extract those phone numbers
- De-duplicate words that are in sequence, such as "hello hello world world" should be "hello world" after de-duplication.
- Merge multiple new lines to one line
Test Environment Setup
We will create a table with a few sample records as shown below:
use TestDB -- assume this is our testing database go if object_id('dbo.tblRegEx', 'U') is not null drop table dbo.tblRegEx create table dbo.tblRegEx (id int identity, a varchar(300), b varchar(300) ); go insert into dbo.tblRegEx (a, b) values ('hello hello hello world', 'my phone number is 321-111-1111') , ( 'this this is is really fun','this number (604) 991-9111 is my cell phone') , ( 'no duplicate here', 'no phone number here either, just my license# 111 111 2222') , ( 'multiple blank lines --this is 6th line', '222 333-4444 is my cell phone');
This is a very simple table with 3 columns, 2 of which are of varchar data type and will be used for our RegEx operation.
Quick Explanation of Test Data
- First two records (id=1, 2) have their [a] column with duplicate words.
- Three records have a phone number in the [b] column, but with different formats, i.e. 321-111-1111, (604) 991-9111 and 222 333-4444.
- For id=3 record, in its [b] column, we have a string of "111 111 2222", which looks like a phone number, but it is not as it misses a hyphen between 111 and 2222.
- The [a] column for the id = 4 record, has a value with 4 continuous empty lines, see the below screenshot with "Results to Text" turned on.
T-SQL Code with R
RegEx is a powerful tool yet hard to master, when used properly, it provides an elegant and efficient solution. In the R language, RegEx is a maturely implemented feature which we can take advantage of.
We will first list the T-SQL code and the execution results, and then a detailed explanation of each section of the script.
USE TestDB --1. Find the records that contains phone number exec sp_execute_external_script @language=N'R' , @script = N' pattern = "\\b\\(?\\d{3}\\)?[-\\s]\\d{3}-\\d{4}\\b" outData <- subset(inData, grepl(pattern, b, perl = T))' , @input_data_1 = N'select id, a, b from dbo.tblRegEx' , @input_data_1_name = N'inData' , @output_data_1_name=N'outData' with result sets ( as object dbo.tblRegEx); --2. Extract the phone number exec sp_execute_external_script @language=N'R' , @script = N' pattern = "(\\(?\\d{3}\\)?[-\\s]\\d{3}-\\d{4}\\b)"; m <- regexpr(pattern, inData$b, perl = T) n <- rep(NA,length(inData$b)) n[m != -1] <- regmatches(inData$b, m) inData$c <- n; outData <- inData;' , @input_data_1 = N'select id, a, b from dbo.tblRegEx' , @input_data_1_name = N'inData' , @output_data_1_name=N'outData' with result sets ( (id int, a varchar(300), b varchar(300), phone varchar(20)) ); -- 3. Remove duplicate words exec sp_execute_external_script @language=N'R' , @script = N' pattern <-"\\b(\\w+\\s*)(\\1\\s*)+"; inData$a <- gsub(pattern, "\\1", inData$a, perl = T ); outData <- inData;' , @input_data_1 = N'select id, a, b from dbo.tblRegEx' , @input_data_1_name = N'inData' , @output_data_1_name=N'outData' with result sets ( as object dbo.tblRegEx); -- 4. Merge multiple empty lines to one line exec sp_execute_external_script @language=N'R' , @script = N' pattern = "(?m)(\\s*\\r?\\n){2,}"; # m <- regexpr(pattern, x , perl = T) inData$a <- gsub(pattern, "\r\n\r\n", inData$a, perl = T ); outData <- inData;' , @input_data_1 = N'select id, a, b from dbo.tblRegEx' , @input_data_1_name = N'inData' , @output_data_1_name=N'outData' with result sets ( as object dbo.tblRegEx);
Each T-SQL statement does two things:
- Read the table dbo.tblRegEx and pass the result set into R via the inData variable.
- Inside the R script, do the RegEx operation on the dataset of inData.
RegEx Explained
The following is a brief explanation of RegEx (i.e. pattern variable) used in the script
Note, in R, for a regular backslash \ in other language, such as C#, it needs to be doubled in R, i.e. \\.
- \\b\\(?\\d{3}\\)?[-\\s]\\d{3}-\\d{4}\\b: phone number format
- \\b: word boundary
- \\(?: character ( once or none as ? means once or none
- \\d{3}: followed by 3 digits
- \\)?: followed by character ) once or none as ? means once or none
- [-\\s]: followed by either a hyphen - or a blank space, i.e. \\s
- \\d{3}: followed by 3 digits
- -: followed by a hyphen -
- \\d{4}: followed by 4 digits
- \\b: word boundary
- (\\(?\\d{3}\\)?[-\\s]\\d{3}-\\d{4}\\b): extract the phone number. This is almost identical to previous one, the outside () is to capture the match
- \\b(\\w+\\s*)(\\1\\s*)+: repeatable words format
- \\b: word boundary
- (\\w+\\s*): a word followed by one or more spaces
- (\\1\\s*): \\1 means the exact copy of previous finding by (\\w+\\s*)
- +: means repeating one or more times
- (?m)(\\s*\\r?\\n){2,}: two or more continuous blank lines
- (?m): set the string to multiple line mode, so each line is analyzed via the RegEx pattern
- (\\s*\\r?\\n): \\s*: zero or more blank spaces, \\r is carriage return, ? means 0 or 1 time, \\n is newline
- {2,}:the previous match will be repeated one or more times.
Query Result
The following is the result of running the four R embedded T-SQL statements
We can see four data results just as expected:
- First data result, we get exactly 3 records that have phone numbers.
- Second result, we indeed extract phone numbers out, and we add an additional column [phone] for these phone numbers.
- Third result, we eliminate the duplicate words.
- Fourth result, we merge the multiple empty lines to one, and this is not very clear, so we will switch to "Results to Text" to look at it
To look at the result of id=4 row for merge multiple empty lines to one line, we need to first dump the processed records into a temp table and then select the row with id = 4.
USE TestDB select * into #t from dbo.tblRegEx where 1=0 insert into #t (id, a, b) exec sp_execute_external_script @language=N'R' , @script = N' pattern = "(?m)(\\s*\\r?\\n){2,}"; inData$a <- gsub(pattern, "\r\n\r\n", inData$a, perl = T ); outData <- inData;' , @input_data_1 = N'select id, a, b from dbo.tblRegEx' , @input_data_1_name = N'inData' , @output_data_1_name=N'outData' --with result sets ( as object dbo.tblRegEx); go -- turn on "Results to Text" option select a from #t where id=4 select a from dbo.tblRegEx where id =4
The result is as follows
Summary
With the help of embedded R in SQL Server 2016, we can now take advantage of regular expressions to process some string related query and manipulation operations. This saves the overhead to write CLR objects to achieve the same result. This can be very handy for some business cases.
In this tip, we look at some examples about using RegEx in R, and hope this will help you to broaden your view on what R integration in SQL Server 2016 can help us in some niche scenarios.
Next Steps
Understanding RegEx can help solve lots of sophisticated issues that otherwise are too complicated to solve. It is worthwhile to spend some time to understand what RegEx is and what it can do.
- PowerShell and T-SQL Regular Expression Examples for SQL Server
- All other RegEx related articles here at mssqltips.com
- Regular Expressions with The R Language
- Regular Expressions as used in R
You may try to do the following exercise:
If I have a column that contains IP address, i.e. in the format like 10.211.19.111, or a column that contains email address, can you validate such column to see whether the value is valid with R regular expressions?
About the author
This author pledges the content of this article is based on professional experience and not AI generated.
View all my tips
Article Last Updated: 2017-03-29