By: Diana Moldovan | Updated: 2011-06-23 | Comments (2) | Related: 1 | 2 | > Functions User Defined UDF
Problem
There are situations when you may need to remove HTML tags from your character string data. As an example, consider having to submit a product data feed to a search engine like Google. The detailed product description is mandatory in this case. It is recommended that you remove all special characters and HTML formatting. This task can be handled in TSQL code, however in this case I have the opportunity to use .NET and the power of the regular expressions to manage the string. In this tip, I'll build a CLR function which cleans up a string of HTML tags and special characters. I'll use Visual Studio 2010 with C# as the programming language. Check out this tip for my solution.
Solution
In one of my previous tips I've detailed the steps you need to build and deploy a CLR user defined function from Visual Studio 2010. Below are the steps that need to be followed.
First, make sure that the CLR integration is enabled. This can be accomplished by using the server facets in SQL Server 2008, the Surface Area Configuration tool in SQL Server 2005 or run the following code:
sp_configure 'clr enabled', 1 GO RECONFIGURE GO
Next, follow these steps:
- Open Visual Studio 2010
- Click on "New Project"
- Choose the Database ---> SQL Server ---> Visual C# SQL CLR Database Project template.
- Make sure that the project targets .NET 2 / .NET 3 / .NET 3.5.
- Set up a connection to your database, test the connection and click OK
- Right click on the project and add a user defined function as explained in the next section
Creating the user defined function in the SQL Server CLR
As you can see, before applying the regex pattern, which finds the HTML tags, I strip out the control characters from the final result. Afterwards, I find the HTML tags using the regex pattern and I replace them with an empty string.
Here is the code of the user defined function:
using System; using System.Data.Sql; using System.Data.SqlTypes; using System.Collections.Generic; using Microsoft.SqlServer.Server; using System.Text; using System.Text.RegularExpressions; public partial class UserDefinedFunctions { [Microsoft.SqlServer.Server.SqlFunction] public static SqlString CleanHTML(SqlString s){ if (s.IsNull) return String.Empty; string s1 = s.ToString().Trim(); if (s1.Length == 0) return String.Empty; StringBuilder tmpS = new StringBuilder(s1.Length); //striping out the "control characters" if (!Char.IsControl(s1[0])) tmpS.Append(s1[0]); for (int i = 1; i <= s1.Length - 1; i++) { if (Char.IsControl(s1[i])) { if (s1[i - 1] != ' ') tmpS.Append(' '); } else { tmpS.Append(s1[i]); } } string result = tmpS.ToString(); //finding the HTML tags and replacing them with an empty string string pattern = @"<[^>]*?>|<[^>]*>"; Regex rgx = new Regex(pattern); return rgx.Replace(result, String.Empty); } }
Comparing the T-SQL and C# CLR code
This task can be carried out using TSQL - for example I could use this piece of simple and elegant code provided by Pinal Dave. It uses the STUFF TSQL string function to replace the HTML tags with empty strings. Should I use CLR or TSQL? Which is the right choice in this case?
To answer this question, I've done a little bit of testing. I used a ~61,000 records ProductDescription table, consisting of an integer ProductID column (clustered primary key) and a Description varchar(max) column which contains HTML markup. I used a slightly different .NET code, which deals only with the HTML replacement as shown below:
using System; using System.Data; using System.Data.SqlClient; using System.Data.SqlTypes; using Microsoft.SqlServer.Server; using System.Text; using System.Text.RegularExpressions; public partial class UserDefinedFunctions { [Microsoft.SqlServer.Server.SqlFunction] public static SqlString CleanHTMLOnlyString(SqlString s) { if (s.IsNull) return String.Empty; string s1 = s.ToString().Trim(); if (s1.Length == 0) return String.Empty; string pattern = @"<[^>]*?>|<[^>]*>"; Regex rgx = new Regex(pattern); return rgx.Replace(s1, String.Empty); } }
In my test cases, I worked on a virtual machine with Windows Server Standard 2003 SP2, 2GB of RAM on SQL Server 2008 Developer Edition with SP2. This is a development machine with no other code running on this machine during the test.
I ran two pieces of code in separate SSMS windows, one using Pinal's function and the other using the CLR function as shown below:
--window 1 - CLR function SET STATISTICS IO ON SET STATISTICS TIME ON GO SELECT ProductDescriptionID, dbo.CleanHTMLOnlyString([Description]) FROM ProductDescription --window 2 - Pinal's TSQL function SET STATISTICS IO ON SET STATISTICS TIME ON GO SELECT ProductDescriptionID, dbo.udf_StripHTML([Description]) FROM ProductDescription
Based on the testing environment listed above, using the T-SQL code the CPU time and the elapsed time are greater. There is an average CPU time of ~9 seconds with the T-SQL code vs. ~4 seconds with the CLR code. In addition, there is an average elapsed time of ~10 seconds with the T-SQL code vs. ~5 seconds with the CLR Code. When I ran this same code in a similar hardware environment, but with SQL Server 2005 SP4 Developer Edition, differences are greater. The results for CPU time were ~31 seconds for T-SQL code vs. ~4 seconds for CLR code as well as elapsed time of ~33 seconds for T-SQL code vs. ~5 seconds for CLR code.
But what is the price for this? A little bit of research led me to this article about the SQL CLR memory usage. Checking out the sys.dm_os_memory_clerks DMV I could see that the MEMORYCLERK_SQLCLR reserves and commits memory. I've run the query recommended by Steven Hemingray on my SQL Server 2008 development virtual machine and I obtained 8548 KB. Depending on the operating system, how busy the environment is and how the CLR code is written, you may experience memory pressure. Searching more, I've found 2 extreme situations here and here. So these are something to consider as well.
Next Steps
- Explore here other situations when CLR might be a good choice. But...
- ...do not forget to follow this good advice
- If you decide to use CLR code, test thoroughly that code on your environment.
About the author
This author pledges the content of this article is based on professional experience and not AI generated.
View all my tips
Article Last Updated: 2011-06-23