By: Pablo Echeverria | Updated: 2022-05-20 | Comments | Related: > Full Text Search
Problem
As a follow up to "SQL Server vs Oracle: Data Loading", we want to perform sentiment analysis, but before doing that it's important to understand the data: common words, misspellings, stemming (root and derivatives of a word), etc. How is this performed in SQL Server, in Oracle, and what are the differences?
Solution
SQL Server Full Text Search and Oracle Text allows you to perform fast text lookup and fuzzy string matching in any character-type column without scanning the whole table on unstructured and semi-structured information, be it speech, documents, XML, bioinformatics, chemical or multimedia. You can query a BLOB data type where LIKE cannot be used, query suffixes where a LIKE performs a full table scan, and you can query over a linked server or database link. It allows you to specify the language you want (if supported) unlike the SOUNDEX and DIFFERENCE functions only supported in English, and it will handle appropriately languages that read from right to left and compound words.
Once created, it will allow you to perform the following types of searches:
- Search one or more words or phrases, a prefix, or a suffix. When the prefix is a phrase, each token is considered.
- Search inflectional forms of a word: "drive" will find "drives", "drove", "driving", "driven", etc. depending on the language specified both at the index creation time and at the query execution time.
- Search synonyms of a word: "car" will find "truck", but this is maintained manually, you can add your own custom business synonyms.
- Search one or more words or phrases close to other words or phrases, indicating the maximum number of non-search terms that separate the first and last search terms, and if the words must appear in order.
- Rank results, with specifying different weights for words or phrases.
- Search within the properties of a document.
- Statistical semantic (since a word can have multiple meanings or it changes within the context), but only single words are indexed, not multi-word phrases (N-grams), and inflectional forms and synonyms are indexed separately.
- To know what the output of a full text query will be without executing it.
Here are several considerations when creating a full text and/or semantic search index:
- This is a feature you choose to be installed separately from the database/engine.
- You need to account for enough memory when it is crawling the content and when executing a full text query.
- If you need to search multiple conditions, combining them in a single full text query outperforms multiple statements due to it uses special reverse index properties. You can use AND, OR and NOT within the same full text search predicate and combine different full text operators. For best performance, instead of including multiple words with OR, use the thesaurus specifying them as synonyms.
- Noise words are removed, but you can adjust them in the default stoplist or create your own stoplist and specify it when indexing and querying.
- You can adjust synonyms and replacements in the thesaurus file for the language used.
- For tables with high modify rates it is recommended to schedule manual updates instead of automatic.
- It suffers from parameter sniffing, and parameter sniffing solutions also apply with full text queries.
- If you need to handle multiple languages, don't mix them in the same column, or store the content as XML, HTML or a binary type which allow you to specify the language in different parts of the content.
Personally, I've used it in a document processing system which I developed, allowing the users to easily find the required template just by typing a few words contained in the template. In the next sections you will see how this is enabled for SQL Server and Oracle, and the differences.
Oracle Full Text Search
In the article "SQL Server vs Oracle: Data Loading" you learned how to generate data and insert into an Oracle database, but it didn't mention if your text contains commas the text is separated into fields, and because there was only one field in the table the results were trimmed at the first comma. To resolve that, the command to insert data needs to be as follows, adding TERMINATED_BY=EOF to not split fields on a comma and leaving the other parameters as described in the mentioned article:
sqlldr MYDB/MyPwd TABLE=MYTBL DATA='c:\setup\input.txt' DIRECT=TRUE MULTITHREADING=TRUE PARALLEL=TRUE TERMINATED_BY=EOF
The number of inserted rows will be the same as earlier, in my case it was 72. Once the data is loaded, you need to install and configure Oracle Text as follows (I used this guide as reference):
Step 1 - If your language won't be English, search for your language in $ORACLE_HOME/ctx/admin/defaults, for example to search for Spanish you can run the command below:
Get-ChildItem "C:\app\Oracle19c\ctx\admin\defaults" | Select-String "SPANISH" | Select Filename, LineNumber, Line, Path | Format-Table
Step 2 - Connect AS SYSDBA and run the install script $ORACLE_HOME/ctx/admin/catctx, replacing CTXSysPwd with the password you want to assign to a new schema named CTXSYS, replacing SYSAUX with the tablespace where you will create the new objects, replacing SYSTEM with your TEMP TABLESPACE (if you didn't specify it when creating the database it will be SYSTEM), and specifying NOLOCK to unlock the schema; once it completes verify there were no errors, in my case the command is as follows:
SQL> @C:\app\Oracle19c\ctx\admin\catctx.sql CTXSysPwd SYSAUX SYSTEM NOLOCK;
Step 3 - Change schema to CTXSYS (in earlier versions you had to login as this user) and run the default preferences script for your language (the one you found in step 1), for English the script is $ORACLE_HOME/ctx/admin/defaults/drdefus.sql as follows, then verify there were no errors:
SQL> ALTER SESSION SET CURRENT_SCHEMA=CTXSYS; SQL> @C:\app\Oracle19c\ctx\admin\defaults\drdefus.sql;
Step 4 - As SYS verify Oracle Text was installed successfully with the commands and output shown below:
SQL> ALTER SESSION SET CURRENT_SCHEMA=SYS; SQL> SELECT COMP_NAME, STATUS, SUBSTR(VERSION,1,10) AS VERSION FROM DBA_REGISTRY WHERE COMP_ID = 'CONTEXT'; COMP_NAME STATUS VERSION ------------ ----------- ---------- Oracle Text VALID 19.0.0.0.0 SQL> SELECT VER_DICT, VER_CODE FROM CTXSYS.CTX_VERSION; VER_DICT VER_CODE ----------- ---------- 19.0.0.0.0 19.0.0.0.0 SQL> SELECT CTXSYS.DRI_VERSION FROM DUAL; DRI_VERSION ----------- 19.0.0.0.0 SQL> SELECT COUNT(1) FROM DBA_OBJECTS WHERE OWNER='CTXSYS'; COUNT(1) ---------- 398 SQL> SELECT OBJECT_TYPE, COUNT(1) FROM DBA_OBJECTS WHERE OWNER='CTXSYS' GROUP BY OBJECT_TYPE; OBJECT_TYPE COUNT(1) ----------------------- ---------- INDEX 68 PACKAGE BODY 67 TYPE BODY 6 INDEXTYPE 5 PACKAGE 78 PROCEDURE 2 LIBRARY 1 FUNCTION 2 TYPE 22 OPERATOR 6 LOB 4 SEQUENCE 3 VIEW 81 TABLE 53 SQL> SELECT OBJECT_NAME, OBJECT_TYPE, STATUS FROM DBA_OBJECTS WHERE OWNER='CTXSYS' AND STATUS != 'VALID' ORDER BY OBJECT_NAME; no rows selected
Step 5 - Grant Oracle Text permissions to the schema owner of the table loaded earlier with the commands below, replacing MYDB with your schema owner:
GRANT CTXAPP TO MYDB; GRANT EXECUTE ON CTXSYS.CTX_CLS TO MYDB; GRANT EXECUTE ON CTXSYS.CTX_DDL TO MYDB; GRANT EXECUTE ON CTXSYS.CTX_DOC TO MYDB; GRANT EXECUTE ON CTXSYS.CTX_OUTPUT TO MYDB; GRANT EXECUTE ON CTXSYS.CTX_QUERY TO MYDB; GRANT EXECUTE ON CTXSYS.CTX_REPORT TO MYDB; GRANT EXECUTE ON CTXSYS.CTX_THES TO MYDB; GRANT EXECUTE ON CTXSYS.CTX_ULEXER TO MYDB;
Step 6 - Connect with your schema owner and create the index with the commands below, replacing IDX_TWEETS with your index name, MYTBL with the table loaded earlier, and TEXT with the column name of the table:
SQL> connect MYDB/MyPwd SQL> CREATE INDEX IDX_TWEETS ON MYTBL(TEXT) INDEXTYPE IS CTXSYS.CONTEXT;
Note there are other index types and you can specify additional properties when creating, but this is the simplest form.
Perform Test Searches in Oracle
Once Oracle Text is installed and configured, you can perform text searches as follow. To see how many records contain the exact word you extracted data for, see the command and output below, by default the word is case insensitive:
SQL> SELECT COUNT(1) FROM MYTBL WHERE CONTAINS(TEXT, 'sql') > 0; COUNT(1) ---------- 69
To see how many records sound like the word you extracted data for (soundex), see the command and output below:
SQL> SELECT COUNT(1) FROM MYTBL WHERE CONTAINS(TEXT, '!sql') > 0; COUNT(1) ---------- 72
To highlight (html bold) the words with a common root (stemming) run the commands and see its output below, in my case because I didn't create a primary key, I had to use ROWID but CTX_DOC.MARKUP accepts the index ID:
EXEC CTX_DOC.SET_KEY_TYPE('ROWID'); SET SERVEROUTPUT ON; DECLARE mklob CLOB; amt NUMBER := 4000; line VARCHAR2(4000); BEGIN FOR I IN (SELECT ROWID FROM MYTBL WHERE CONTAINS(TEXT, '$give') > 0) LOOP CTX_DOC.MARKUP('IDX_TWEETS', I.ROWID, '$give', mklob, tagset => 'HTML_DEFAULT'); DBMS_LOB.READ(mklob, amt, 1, line); DBMS_OUTPUT.PUT_LINE(line); END LOOP I; DBMS_LOB.FREETEMPORARY(mklob); END; / Sadly, a significant portion of dirty money has been <B>given</B> to political parties, such as X <B>gave</B> up, is on the horizon. He?s wholly owned by. WEAKNESS???When bounties on the heads of troops ignored it? <B>gave</B> classified? https://t Used to be the indefensible Capital that <B>gives</B> it's name? http://
Get Performance Stats in Oracle
You can get performance stats when using CONTAINS with the commands and output below:
EXPLAIN PLAN FOR SELECT TEXT FROM MYTBL WHERE CONTAINS(TEXT, '%BBC%') = 0; SELECT PLAN_TABLE_OUTPUT FROM TABLE(DBMS_XPLAN.DISPLAY()); ------------------------------------------------------------------------------------------ | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ------------------------------------------------------------------------------------------ | 0 | SELECT STATEMENT | | 1 | 121 | 1 (0)| 00:00:01 | | 1 | TABLE ACCESS BY INDEX ROWID| MYTBL | 1 | 121 | 1 (0)| 00:00:01 | |* 2 | DOMAIN INDEX | IDX_TWEETS | | | 1 (0)| 00:00:01 | ------------------------------------------------------------------------------------------ Predicate Information (identified by operation id): --------------------------------------------------- 2 - access("CTXSYS"."CONTAINS"("TEXT",'%BBC%')=0)
And then compare it against the performance of using LIKE with the commands and output below, note LIKE always performs a full table scan under these conditions:
EXPLAIN PLAN FOR SELECT TEXT FROM MYDB.MYTBL WHERE TEXT LIKE '%BBC%'; SELECT PLAN_TABLE_OUTPUT FROM TABLE(DBMS_XPLAN.DISPLAY()); --------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | --------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 4 | 484 | 3 (0)| 00:00:01 | |* 1 | TABLE ACCESS FULL| MYTBL | 4 | 484 | 3 (0)| 00:00:01 | --------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 1 - filter("TEXT" LIKE '%BBC%' AND "TEXT" IS NOT NULL)
You can also get the text index explained with the commands and output below, this can help you improve your search queries:
CREATE TABLE CTX_EXPLAIN( EXPLAIN_ID VARCHAR2(30), ID NUMBER, PARENT_ID NUMBER, OPERATION VARCHAR2(30), OPTIONS VARCHAR2(30), OBJECT_NAME VARCHAR2(64), POSITION NUMBER, CARDINALITY NUMBER); EXEC CTX_QUERY.EXPLAIN( - index_name => 'IDX_TWEETS', - text_query => '%BBC%', - explain_table => 'CTX_EXPLAIN', - sharelevel => 0); SELECT EXPLAIN_ID, ID, PARENT_ID, OPERATION, OPTIONS, OBJECT_NAME, POSITION FROM CTX_EXPLAIN ORDER BY ID; EXPLAIN_ID ID PARENT_ID OPERATION OPTIONS OBJECT_NAME POSITION ---------- ---------- ---------- --------------- ---------- --------------- ---------- 1 0 WORD BBCNEWS 1
Additional notes:
- CTXSYS.CONTEXT is not the only index type, there are other types for other purposes. For this index type, you need to synchronize the index every time you insert/update/delete records.
- You can tune everything in the word extraction process to accommodate your needs.
- You can add sub-indexes to improve the execution plan when searching/ordering using other columns, they can be any data type.
- You can find Oracle Text Application Developer Guide in this link.
- You can find Oracle Text Reference in this link.
SQL Server Full Text Search
In the article "SQL Server vs Oracle: Data Loading" you learned how to generate data and insert into a SQL Server database, but Full Text requires a unique key defined. If your data has been already loaded, you need to add a column which will serve as the unique key with the command below, it is recommended to be integer so SQL Server doesn't maintain a separate key mapping table:
ALTER TABLE MyDb.dbo.MyTbl ADD ID INT IDENTITY(1,1) NOT NULL;
Then you need to add a unique key with the command below, note if you need additional performance then create it as clustered:
CREATE UNIQUE INDEX ID_TWEET ON MyDb.dbo.MyTbl(ID);
Once that is done, you need to install and configure Full Text as follows:
Step 1 - Add the feature to SQL Server: this is accomplished by running the SQL Server installer as described in "Install SQL Server and Oracle using PowerShell and Windows Containers" adding the FullText feature; if it's already installed, this will only add the missing feature and everything else will be the same, if you remove a feature from the list then it will be deinstalled; if it succeeds you will see near the end "Setup result: 0":
& "c:\setup\setup.exe" /Q /Action=Install /IAcceptSQLServerLicenseTerms /IndicateProgress /Features=SQLEngine,Conn,FullText /InstanceName=MSSQLSERVER /TcpEnabled=1 /SecurityMode=SQL /SaPwd=@Sq1T3st /SqlSysAdminAccounts="ContainerAdministrator"
Step 2 -Verify there is now a service named MSSQLDFLauncher (but this name changes if it's a named instance):
PS C:\> get-service -name *sql* Status Name DisplayName ------ ---- ----------- Running MSSQLFDLauncher SQL Full-text Filter Daemon Launche...
Step 3 - Create a fulltext catalog in your database with the command below, changing TWEETS with the name of the catalog:
CREATE FULLTEXT CATALOG TWEETS;
Step 4 - Create a fulltext index in your table with the command below, changing MyTbl with your table name, Text with your column name, ID_TWEET with the index name you added earlier, TWEETS with the catalog name, and AUTO if you want the index population to be manual:
CREATE FULLTEXT INDEX ON dbo.MyTbl(Text) KEY INDEX ID_TWEET ON TWEETS WITH CHANGE_TRACKING AUTO;
You can also indicate the file type through its file extension in the same table, so you can have multiple document types together.
Perform Test Searches in SQL Server
Once Full Text is installed and configured, you can perform text searches as follow. To see how many records contain the exact word you extracted data for, see the command and output below, by default the word is case insensitive:
1> SELECT COUNT(1) FROM dbo.MyTbl WHERE CONTAINS(Text, 'sql'); 2> GO ----------- 64
To see how many records match the meaning but not the exact wording, see the command and output below:
1> SELECT COUNT(1) FROM dbo.MyTbl WHERE FREETEXT(Text, 'sql'); 2> GO ----------- 68
To see the words with a common root (stemming), see the command and output below:
1> SELECT Text FROM dbo.MyTbl WHERE FREETEXT(Text, 'FORMSOF(INFLECTIONAL, "give")'); 2> GO Sadly, a significant portion of dirty money has been given to political parties, such as X gave up, is on the horizon. He?s wholly owned by. WEAKNESS???When bounties on the heads of troops ignored it? gave classified? https://t Used to be the indefensible Capital that gives it's name? http://
Get Performance Stats in SQL Server
You can get the performance of using CONTAINS in SQL Server Management Studio:
And then compare it against the performance of using LIKE, note LIKE always performs a full table scan under these conditions:
You can also test the full text with the command below, where 1033 is the locale ID for English, the third parameter is the default stoplist ID, and the last parameter is the accent sensitivity (1 for sensitive); from the result, the expansion_type indicates it is a single word:
Additional notes:
- There is no way to perform a SOUNDEX within the Full Text index, but you can get the words in the index, run SOUNDEX on them, and then join the result back with the original table.
- There is no way to highlight the words that match the query.
- You can tune everything in the word extraction process to accommodate your needs.
- Named pipes needs to be enabled with the default value and must be started only by the SQL Server service. The service start type is Manual.
- You can only define one full text index per table.
- You need to update statistics on the unique index before a full text index population to generate good partitions.
- When upgrading your database, consider a full rebuild since newer versions introduce new word breakers and stemmers, and more advanced linguistic behavior.
- By default, noise words are not removed from full text queries and throw
a warning if you include them, but this can be changed with
sp_configure 'transform noise words', 1
. - You can adjust synonyms and replacements in the thesaurus file for the language
used, but it is stored in cache, so you need to reload it using
sp_fulltext_load_thesaurus_file
; for example, use 1033 to reload changes made in Tsenu.xml used for English. Note the thesaurus rules are not recursive, you must avoid special characters, and you need to save the file as Unicode specifying Byte Order Marks. - Changes made with
WRITETEXT
andUPDATETEXT
are not picked up by full text auto population, and you need to enable trace flag 7646 to alleviate blocking between DML and queries. For better performance, it is recommended to populate using incremental mode which requires a timestamp column and a secondary index on it. - It uses an algorithm to treat 50-word documents the same as 100-word documents, so if a word repeats the same number of times, a small document will have a higher rank than larger documents.
- You can view the available languages in
sys.fulltext_languages
. You can load additional OS languages withEXEC sp_fulltext_service @action='load_os_resources', @value=1
. - If you have plenty of memory and very complex full text queries, enable trace flag 7662 to disable slower processing mode when more than 20 MB of memory is required (fallback mechanism).
- The full text indexes are not stored and updated individually nor per-table; instead, the full text catalog is treated as a whole. Check this link for full text catalog recommendations and a script to distribute full text indexes across multiple catalogs.
- If the user queries are not returning data, you may need to implement a fix to repopulate full text indexes as described in this link.
- You can find the Full Text documentation in this link.
Conclusion
You can see SQL Server Full Text and Oracle Text are very similar in how they work, how they must be configured, and what their results are; also, both must be either installed or created using a script. In Oracle there are more usage features and it need less administration. In SQL Server it is managed per database, so it doesn't need to be configured per user, but has more caveats.
Next Steps
- Here is official documentation covering full text:
- Here are several links that you may find useful regarding full text:
- SQL Server Full Text Indexes
- Full Text Search – How to install, configure and use it with SQL Server (and Outsystems)
- Understanding Full-Text Indexing in SQL Server
- Full-Text Indexing Workbench
- Searching for Strings in SQL Server Databases
- Full Text Searches on Documents in FileTables
- SQL Server Full Text Search Language Features
- SQL Server Full Text Search Language Features Part 2
- Exploring Semantic Search Key Term Relevance
- Check out these tips on full text search.
About the author
This author pledges the content of this article is based on professional experience and not AI generated.
View all my tips
Article Last Updated: 2022-05-20