By: Ian Fogelman | Updated: 2020-07-30 | Comments (1) | Related: > TSQL
Problem
As you may already know, in SQL Server 2017 and later, SQL Server has the ability to execute both R and Python scripts natively. This produces some interesting functionality and patterns that become available because of this feature.
Today we look at the ability to integrate regular expressions straight into our T-SQL statement result sets with a dynamic stored procedure using the Python native feature and a few pre-installed Python libraries.
Before we begin, regular expressions are somewhat supported in the WHERE clause of SQL Server, for instance saying where Field1 like '[0-9]' is using regex type capabilities. However, returning results on a regex match is not natively supported, this means you would have to develop a CLR procedure to do this. Integrating Python is much easier and does not require as much development time as we will see in this tip.
Solution
First off, let's enable external scripts:
--Advanced Options Enable External scripts sp_configure 'external scripts enabled', 1; RECONFIGURE WITH OVERRIDE;
Once that is enabled, we should be able to run a test python script:
EXEC sp_execute_external_script @language =N'Python', @script=N'import sys; print("\n".join(sys.path))'
This will confirm the location that our Python-SQL server environment is running, see subheading section for details of adding additional Python libraries to this environment. If Python is not installed, check out this tutorial.
Stored Procedure for String Evaluation
This stored procedure accepts two dynamic parameters, the regular expression pattern and the evaluation string.
Building the procedure like this gives a lot of flexibility to return our evaluation criteria and allows for a single procedure to handle all of our regex needs.
IF EXISTS(SELECT * FROM SYS.procedures WHERE NAME = 'RegexSelect') DROP PROC RegexSelect GO CREATE PROCEDURE DBO.RegexSelect ( @REGEXEXPRESSION VARCHAR(MAX), @STRING VARCHAR(MAX) ) AS DECLARE @PYTHON NVARCHAR(MAX) SET @PYTHON = ' import pandas as pd import re import sys from io import StringIO Test_String = "' + @STRING + '" Regex_Pattern = r"'+ @REGEXEXPRESSION + '" match = re.findall(Regex_Pattern, Test_String) out = "" for x in match: out += x df = pd.DataFrame({"results":[out]}) OutputDataSet = df' EXECUTE sp_execute_external_script @language = N'Python' , @script = @PYTHON , @params = N'@P1 VARCHAR(MAX),@P2 VARCHAR(MAX)' , @P1 = N' @REGEXEXPRESSION VARCHAR(MAX)' , @P2 = N'@STRING' WITH RESULT SETS((RESULT VARCHAR(MAX))); GO
After creating the stored procedure lets test the execution:
--SAMPLE EXECUTIONS FOR DIFFERENT PATTERNS EXEC DBO.RegexSelect '(?<=My)(.*?)(?=ABC)','My name ABC' --EXTRACT STRING BETWEEN TWO STRINGS EXEC DBO.RegexSelect '\d\d\D\d\d\D\d\d\d\d','XXXXXXXXXXXXXX06-11-2015YYYYYYYYY' --EXTRACT DATE PATTERN EXEC DBO.RegexSelect '[0-9]','ABC1DEF2GHI3' --EXTRACT DIGITS FROM STRING
Now that we verified that our new procedure is firing let's propose a use case for our newly forged procedure.
Let's imagine that you have a free field to collect phone numbers, that is of type varchar(12). This field will accept parenthesis for area code, hyphens if the user chooses and numeric characters. Using our new procedure, we can extract the numeric characters of these fields.
First, let's build the stage table that we will be processing:
--EXAMPLE PARSING A SINGLE FORMAT TO PHONE NUMBERS IF OBJECT_ID('TEMPDB..#TEMP_PROCESS') IS NOT NULL DROP TABLE #TEMP_PROCESS SELECT '911' AS PHONENUMBER --EMERGENCY NUMBER INTO #TEMP_PROCESS UNION SELECT '+1442041349987' --INTERNATIONAL NUMBER UNION SELECT '(123)-6782134' --US FORMAT WITH AREA CODE UNION SELECT '123-321-4412' --US FORMAT HYPHEN DELIMITED UNION SELECT '8762341414' --US FORMAT NOT DELIMITED
Next, we will build a results table:
--RESULTS TEMP TABLE IF OBJECT_ID('TEMPDB..#TEMP_CLEANED') IS NOT NULL DROP TABLE #TEMP_CLEANED CREATE TABLE #TEMP_CLEANED ( CLEANPHONENUM VARCHAR(MAX) )
Now let's process all the data in our stage table and push it to the results table:
DECLARE @PHONENUMBER VARCHAR(MAX) WHILE EXISTS(SELECT * FROM #TEMP_PROCESS) BEGIN SET @PHONENUMBER = (SELECT TOP 1 PHONENUMBER FROM #TEMP_PROCESS) INSERT INTO #TEMP_CLEANED EXEC dbo.RegexSelect '[0-9]',@PHONENUMBER DELETE FROM #TEMP_PROCESS WHERE PHONENUMBER = @PHONENUMBER END
Lastly, let's view the data from the results of our stored procedure:
SELECT * FROM #TEMP_CLEANED
And clean up our temp table:
DROP TABLE #TEMP_CLEANED,#TEMP_PROCESS
Managing the SQL Python Environment
As a quick note, if you ever want to add modules to your python – SQL Server specific environment here is how you can do that. As a small note this Python environment is completely different than if you have an Anaconda or Python stand alone install already on your workstation.
Inside of your command prompt:
cd C:\Program Files\Microsoft SQL Server\SERVERINSTANCE\PYTHON_SERVICES\Scripts pip list
pip is a CLI tool that helps manage Python libraries, if you run pip in this directory you will see exactly what your SQL python scripts have access to and can add to that if you wish with pip install.
If you wish to install an additional Python library you can do so with the pip.exe as follows:
pip.exe install text-tools
Conclusion
This is one way to get regex type features in SQL Server. It would be nice if this was a function, but it is not possible to create a function when you use sp_execute_external_script, so stay tuned for other tips on how to get regex features in SQL Server.
Next Steps
Check out these related articles:
- Using Python and Regex for SQL Server ETL Needs with Non-Standard Text Files
- Using Regular Expressions With T-SQL: From Beginner To Advanced
- Installing and Setting up SQL Server to use Python
About the author
This author pledges the content of this article is based on professional experience and not AI generated.
View all my tips
Article Last Updated: 2020-07-30