Remove HTML tags from strings using the SQL Server CLR

Problem

There are situations when you may need to remove HTML tags from your character string data. As an example, consider having to submit a product data feed to a search engine like Google. The detailed product description is mandatory in this case. It is recommended that you remove all special characters and HTML formatting. This task can be handled in TSQL code, however in this case I have the opportunity to use .NET and the power of the regular expressions to manage the string. In this tip, I'll build a CLR function which cleans up a string of HTML tags and special characters. I'll use Visual Studio 2010 with C# as the programming language. Check out this tip for my solution.

Solution

In one of my previous tips I've detailed the steps you need to build and deploy a CLR user defined function from Visual Studio 2010. Below are the steps that need to be followed.

First, make sure that the CLR integration is enabled. This can be accomplished by using the server facets in SQL Server 2008, the Surface Area Configuration tool in SQL Server 2005 or run the following code:

sp_configure 'clr enabled', 1  
GO  
RECONFIGURE  
GO

Next, follow these steps:

Open Visual Studio 2010
Click on "New Project"
Choose the Database ---> SQL Server ---> Visual C# SQL CLR Database Project template.
Make sure that the project targets .NET 2 / .NET 3 / .NET 3.5.
Set up a connection to your database, test the connection and click OK
Right click on the project and add a user defined function as explained in the next section

Creating the user defined function in the SQL Server CLR

As you can see, before applying the regex pattern, which finds the HTML tags, I strip out the control characters from the final result. Afterwards, I find the HTML tags using the regex pattern and I replace them with an empty string.

Here is the code of the user defined function:

    using System;
    using System.Data.Sql;
    using System.Data.SqlTypes;
    using System.Collections.Generic;
    using Microsoft.SqlServer.Server;
    using System.Text;
    using System.Text.RegularExpressions;
    public partial class UserDefinedFunctions
    {
        [Microsoft.SqlServer.Server.SqlFunction]
        public static SqlString CleanHTML(SqlString s){
            if (s.IsNull) return String.Empty;
            string s1 = s.ToString().Trim();
            if (s1.Length == 0) return String.Empty;
            StringBuilder tmpS = new StringBuilder(s1.Length);
            //striping out the "control characters"
            if (!Char.IsControl(s1[0])) tmpS.Append(s1[0]);
            for (int i = 1; i <= s1.Length - 1; i++)
            {
                if (Char.IsControl(s1[i]))
                {
                    if (s1[i - 1] != ' ') tmpS.Append(' ');
                }
                else
                {
                    tmpS.Append(s1[i]);
                }
            }
            string result = tmpS.ToString();
            //finding the HTML tags and replacing them with an empty string
            string pattern = @"<[^>]*?>|<[^>]*>";
            Regex rgx = new Regex(pattern);
            return rgx.Replace(result, String.Empty);
        }
    }

Comparing the T-SQL and C# CLR code

This task can be carried out using TSQL - for example I could use this piece of simple and elegant code provided by Pinal Dave. It uses the STUFF TSQL string function to replace the HTML tags with empty strings. Should I use CLR or TSQL? Which is the right choice in this case?

To answer this question, I've done a little bit of testing. I used a ~61,000 records ProductDescription table, consisting of an integer ProductID column (clustered primary key) and a Description varchar(max) column which contains HTML markup. I used a slightly different .NET code, which deals only with the HTML replacement as shown below:

    using System;
    using System.Data;
    using System.Data.SqlClient;
    using System.Data.SqlTypes;
    using Microsoft.SqlServer.Server;
    using System.Text;
    using System.Text.RegularExpressions;
    public partial class UserDefinedFunctions
    {
        [Microsoft.SqlServer.Server.SqlFunction]
        public static SqlString CleanHTMLOnlyString(SqlString s)
        {
            if (s.IsNull) return String.Empty;
            string s1 = s.ToString().Trim();
            if (s1.Length == 0) return String.Empty;
            string pattern = @"<[^>]*?>|<[^>]*>";
            Regex rgx = new Regex(pattern);
            return rgx.Replace(s1, String.Empty);
        }
    }

In my test cases, I worked on a virtual machine with Windows Server Standard 2003 SP2, 2GB of RAM on SQL Server 2008 Developer Edition with SP2. This is a development machine with no other code running on this machine during the test.

I ran two pieces of code in separate SSMS windows, one using Pinal's function and the other using the CLR function as shown below:

    --window 1 - CLR function
    SET STATISTICS IO ON
    SET STATISTICS TIME ON
    GO
    SELECT ProductDescriptionID, dbo.CleanHTMLOnlyString([Description])
    FROM ProductDescription
    --window 2 - Pinal's TSQL function
    SET STATISTICS IO ON
    SET STATISTICS TIME ON
    GO
    SELECT ProductDescriptionID, dbo.udf_StripHTML([Description])
    FROM ProductDescription

Based on the testing environment listed above, using the T-SQL code the CPU time and the elapsed time are greater. There is an average CPU time of ~9 seconds with the T-SQL code vs. ~4 seconds with the CLR code. In addition, there is an average elapsed time of ~10 seconds with the T-SQL code vs. ~5 seconds with the CLR Code. When I ran this same code in a similar hardware environment, but with SQL Server 2005 SP4 Developer Edition, differences are greater. The results for CPU time were ~31 seconds for T-SQL code vs. ~4 seconds for CLR code as well as elapsed time of ~33 seconds for T-SQL code vs. ~5 seconds for CLR code.

But what is the price for this? A little bit of research led me to this article about the SQL CLR memory usage. Checking out the sys.dm_os_memory_clerks DMV I could see that the MEMORYCLERK_SQLCLR reserves and commits memory. I've run the query recommended by Steven Hemingray on my SQL Server 2008 development virtual machine and I obtained 8548 KB. Depending on the operating system, how busy the environment is and how the CLR code is written, you may experience memory pressure. Searching more, I've found 2 extreme situations here and here. So these are something to consider as well.

Next Steps

Explore here other situations when CLR might be a good choice. But...
...do not forget to follow this good advice
If you decide to use CLR code, test thoroughly that code on your environment.

About the author

Diana Moldovan is a DBA and data centric applications developer with 6 years of experience covering SQL 2000, SQL 2005 and SQL 2008.

This author pledges the content of this article is based on professional experience and not AI generated.

View all my tips

Article Last Updated: 2011-06-23