By: Simon Liew | Updated: 2024-10-04 | Comments (4) | Related: > TSQL
Problem
Full wild card searches (e.g., SQL LIKE '%search_term%') in Microsoft SQL Server can be slow and inefficient because they guarantee a scan on all the rows in the table. Are there any options to optimize SQL LIKE operator queries?
Solution
Optimizing case-insensitive, full wildcard searches with leading and trailing wildcards is challenging in SQL databases—these LIKE patterns do not benefit from indexing. This tutorial explores potential options to optimize these searches and examines common misconceptions.
Demo Setup
We will use the StackOverflow2013 database built on the SQL_Latin1_General_CP1_CI_AS collation as it fits our test requirement. The SQL tutorial will focus on the [dbo].[Comments] table, which contains 24,534,730 rows.
Benchmarking the Initial SQL LIKE Query
To establish the initial baseline, we will execute the SQL SELECT statement below that searches for the word "until" in the Text column of the [dbo].[Comments] table as specified in the WHERE clause.
SELECT COUNT(*) FROM [dbo].[Comments] WHERE Text LIKE '%until%' OPTION (MAXDOP 8) GO
This query is executed on a workstation with an Intel i7-13700 processor (2.1 GHz, 16 cores), 128 GB of RAM, and SQL Server 2019 Developer Edition. The Windows power option is set to Best Performance; otherwise, the query is seen to take double the benchmark duration.
The query is executed several times to ensure the table is loaded into the SQL Server buffer pool and the duration isn’t dependent on the storage subsystem.
The baseline query completes in 17 seconds consistently and returns a row count of 129,484 matching rows.
Wildcard Search Optimization Options
With the benchmark established, we will explore two options to improve the performance of this wildcard search. To ensure accuracy, the StackOverflow2013 database is restored from backup prior to each test to avoid interference from the previous test changes. Similarly, the query will be executed several times until the execution duration is consistent.
Utilizing Binary Collation
Binary collation affects how text data is compared and sorted. When binary collation is used, the comparison is done at the byte level rather than the character level, possibly leading to faster search performance.
Although the StackOverflow2013 database uses a non-binary, case-insensitive collation, we can achieve binary comparison by using the COLLATE clause in our queries without altering the table design.
To make the query case-insensitive, we apply the LOWER() or UPPER() function to both the Text column and the search term in the LIKE statement. Here is the SQL syntax:
SELECT COUNT(*) FROM [dbo].[Comments] WHERE LOWER(Text) COLLATE Latin1_General_100_BIN2 LIKE LOWER('%until%') OPTION (MAXDOP 8) GO
The execution duration drops to just 2 seconds and returns 129,484 rows, matching the baseline query result.
Outcome:
- The query completes in 2 seconds, returning the same row count as the benchmark query.
- No changes to the database collation are required; adding COLLATE to the column suffices.
Binary collations are based on the numeric values of characters (byte values) and thus avoid additional processing associated with linguistic collations. The documentation on the COLLATE clause also discusses how binary collations are case-sensitive and accent-sensitive, which can lead to faster comparisons due to their simplicity.
While binary collation improves performance, ensure that the case sensitivity requirement is maintained appropriately, as this could exclude relevant results if done incorrectly.
Utilizing Full-Text Search
Full-Text Search (FTS) is another good option in SQL Server for efficient text-based queries. FTS indexes are optimized for searching large text columns, dramatically speeding up wildcard searches.
The FTS query syntax has its own clauses, but is straightforward and easy to use. The asterisks are equivalent to the wildcard or percent sign as shown in this SQL statement:
SELECT COUNT(*) FROM [dbo].[Comments] WHERE CONTAINS(Text, '*until*') OPTION (MAXDOP 8) GO
The command below creates and populates the FTS index on the Text column of the [dbo].[Comments] table. You would want to allow a few minutes for the FTS population to complete before running the benchmark query.
USE StackOverflow2013 GO CREATE FULLTEXT CATALOG FC_MSSQLTips_Demo AS DEFAULT; GO CREATE FULLTEXT INDEX ON dbo.Comments (Text LANGUAGE 1033) KEY INDEX PK_Comments_Id ON FC_MSSQLTips_Demo WITH STOPLIST = SYSTEM; GO ALTER FULLTEXT INDEX ON dbo.Comments START FULL POPULATION; GO
The FTS query completes in a sub-second, a major improvement. However, it returns 127,935 rows. This is 1,549 fewer rows than the original query.
Outcome:
- The query execution finishes in 37 ms, a substantial improvement over the original 17 seconds.
- However, this speed came at a cost. The query returned 127,935 rows, 1,549 fewer than the original query.
FTS can significantly enhance performance; however, it may come at the cost of accuracy, depending on your requirements. FTS does not seem to get the leading wildcard unless the search term is a full word following the wildcard.
In the screenshot below, the search term "until" is not the leading word, causing the query to miss it and highlighting a limitation of FTS. This limitation may be acceptable in some cases. For example, if users want to search within product descriptions and expect results to contain the search word regardless of word placement, then FTS might not be a good option.
Wildcard Searches and Indexes
A common misconception is that indexes can help optimize wildcard searches. Let’s examine two types of indexes to see if they offer any performance benefits.
Non-clustered Index
A non-clustered index is a smaller, separate structure of the indexed column that is not part of the main table. There is a partial truth that indexes can help with wild card search performance. It will only work when the wildcard is placed at the end of the search term.
CREATE NONCLUSTERED INDEX IX_Comments_Text ON dbo.Comments (Text) GO
The ad-hoc query below is adjusted to maintain only the wild card at the tail of the search term:
SELECT COUNT(*) FROM [dbo].[Comments] WHERE Text LIKE 'until%') OPTION (MAXDOP 8) GO
Outcome:
- The non-clustered index offers no performance improvement for full wild card searches.
Column Store Index
Another test is to create a non-clustered column store index, which is often beneficial in large-scale data warehousing environments. The main benefit of column store index is on large aggregations, not so much on text columns, especially searches.
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCS_Comments_Text ON dbo.Comments (Text) GO
Outcome:
- The query took 4 seconds longer to complete execution than it did on the row store index.
- Column store indexes are generally not well-suited for wild card searches, especially on text-heavy columns.
Conclusion
Optimizing wildcard searches in SQL Server is complex and highly dependent on the specific use case. Techniques such as binary collation and Full-Text Search can yield significant performance improvements, but each has trade-offs. Testing and profiling SQL queries are crucial to determine the most effective approach.
While non-clustered and column store indexes are generally useful, they offer limited benefit for text-heavy wildcard searches.
Next Steps
- COLLATE
- CONTAINS
- SQL Server Collation Overview and Examples
- Collation and International Terminology
- Guidelines for Using BIN and BIN2 Collations
- Full-Text Search
- Collation and Unicode support
About the author
This author pledges the content of this article is based on professional experience and not AI generated.
View all my tips
Article Last Updated: 2024-10-04