By: Pablo Echeverria | Updated: 2022-06-21 | Comments (1) | Related: > Full Text Search
Problem
As a follow up to SQL Server vs Oracle: Data Loading we want to perform sentiment analysis, but before doing that it's important to understand the data. In the article Comparing SQL Server Full Text Search and Oracle Text Search we were able to search for common words, misspellings, and stemming (root and derivatives of a word), and now it's time to let the database tell us automatically which common phrases it's able to identify and how often they repeat (not identical but very similar). How is this performed in SQL Server, in Oracle, and what are their differences?
Solution
SQL Server and Oracle Semantic Search allow you to perform the following natural language understanding and processing: derive user intents and act if they match a query, sentiment analysis against a pre-defined dictionary-based sentiment, text mining, rank documents, key tags extraction, and related content discovery.
It allows you to perform statistical semantic (since a word can have multiple meanings or it changes within the context or depending on the surrounding words): return a scored table with semantic key tags, return a scored table with documents like one specified, and return the key tags that make the documents semantically similar.
Remember we're analyzing tweets, by getting key tags you can see which of them are the most influential over others and can also help us identify and remove bot/botnet/netcenter tweets that can deviate perception.
Note: SQL Server Semantic Search only indexes single words, not multi-word phrases (N-grams), and inflectional forms and synonyms are indexed separately.
In the next sections you will see how this is enabled for SQL Server and Oracle, and the differences.
Oracle
To install Oracle Text Search, you must execute the following steps:
Step 1 - If your database is new, because it uses Java classes you need to increase the JAVA_POOL_SIZE from the default 4 MB to 256 MB, and because it uses cursors/stored procedures/control structures/parallel execution if you have not set SGA_TARGET you need to increase SHARED_POOL_SIZE from the default 172 MB to 512 MB, this is done with the PowerShell command below and after that you need to bounce the instance performing a "shutdown immediate" and a "startup"; in the end you will increase the memory usage to about 600 MB:
Clear-Content C:\app\Oracle19c\database\INITORCL.ORA $content=@" db_name=ORCL java_pool_size=256M shared_pool_size=512M "@ Add-Content C:\app\Oracle19c\database\INITORCL.ORA $content
Step 2 - If your database is new, because several objects will be created, you need to increase the datafiles for SYSTEM/SYSAUX/SYS_UNDOTS to prevent slowness and constant file increase operations, this is done with the SQL commands below connected as SYS:
ALTER DATABASE DATAFILE 1 RESIZE 800M; ALTER DATABASE DATAFILE 2 RESIZE 300M; ALTER DATABASE DATAFILE 3 RESIZE 500M;
Step 3 - Verify there are no invalid installed components; if there are, you need to reinstall it by running the appropriate script. This is verified running the SQL command below connected as SYS and confirmed with the output as shown, you need to run it every time you install a new component:
SET LINESIZE 300 COL COMP_ID FOR A7 COL COMP_NAME FOR A40 COL VERSION FOR A10 SELECT COMP_ID, COMP_NAME, VERSION, STATUS FROM DBA_REGISTRY; COMP_ID COMP_NAME VERSION STATUS ------- ---------------------------------------- ---------- ----------- CATALOG Oracle Database Catalog Views 19.0.0.0.0 VALID CATPROC Oracle Database Packages and Types 19.0.0.0.0 LOADED RAC Oracle Real Application Clusters 19.0.0.0.0 OPTION OFF JAVAVM JServer JAVA Virtual Machine 19.0.0.0.0 VALID XDB Oracle XML Database 19.0.0.0.0 VALID
Step 4 - Verify there are no invalid objects; if there are you need to change directory as "cd C:\app\Oracle19c" and run the script connected as SYS with "@?\rdbms\admin\utlrp.sql". This is verified by running the SQL command below connected as SYS and confirmed with the output as shown, and you need to run it every time you install a new component:
COL OWNER FOR A20 COL OBJECT_TYPE FOR A15 COL OBJECT_NAME FOR A30 SET PAGESIZE 500 SELECT OWNER, OBJECT_TYPE, OBJECT_NAME FROM DBA_OBJECTS WHERE STATUS <> 'VALID' ORDER BY OWNER, OBJECT_TYPE; no rows selected
Step 5 - Install Java Virtual Machine (JVM) component with the SQL commands below (it runs faster if you bounce the database before running it), otherwise you can't run SPARQL queries:
spool C:\setup\initjvm.out @?\javavm\install\initjvm.sql spool off
Once run, confirm there were no errors with the PowerShell command below; if there were errors you need to resolve them and reinstall the component:
Get-Content "C:\setup\initjvm.out" | Select-String "(ORA-)|(ERR)" | Select LineNumber, Line | Format-Table
There will be a new component installed as follows:
COMP_ID COMP_NAME VERSION STATUS ------- ---------------------------------------- ---------- ----------- JAVAVM JServer JAVA Virtual Machine 19.0.0.0.0 VALID
You also need to verify the Java option is enabled with the SQL command and output below:
SELECT * FROM V$OPTION WHERE PARAMETER='Java'; PARAMETER VALUE CON_ID ---------- ---------- ---------- Java TRUE 0
And verify it works with the SQL commands and output below:
SELECT DBMS_JAVA.LONGNAME('TEST') FROM DUAL; DBMS_JAVA.LONGNAME('TEST') -------------------------- TEST
Step 6 - Even if Oracle XML database (XDB) is installed, you need to install XML XDK component with the SQL commands below (it runs faster if you bounce the database before running it), otherwise the ORA-29549 error is returned:
spool C:\setup\xdk.out @?\xdk\admin\initxml.sql spool off
Once run, confirm there were no errors in the output file. There will be a new component installed as follows:
COMP_ID COMP_NAME VERSION STATUS ------- ---------------------------------------- ---------- ----------- XML Oracle XDK 19.0.0.0.0 VALID
Step 7 - Install Spatial and Graph (SDO) with the SQL commands below (it runs faster if you bounce the database before running it):
spool C:\setup\mdinst.out @?\md\admin\mdinst.sql spool off
Once run, confirm there were no errors in the output file. In my case the following 4 files contained invalid characters and returned ORA-29913 and ORA-30653 so I had to copy them to my host machine, open in Notepad++, make a change like add a white space and delete it, save the files, and copy them back to the original location in the Docker container, these files are used in external table imports:
C:\app\Oracle19c\md\admin\sdo_coord_ref_sys.txt C:\app\Oracle19c\md\admin\sdo_coord_ops.txt C:\app\Oracle19c\md\admin\sdo_coord_op_param_vals.txt C:\app\Oracle19c\md\admin\cs_srs.txt
There will be a new component installed as follows:
COMP_ID COMP_NAME VERSION STATUS ------- ---------------------------------------- ---------- ------------ SDO Spatial 19.0.0.0.0 VALID
And you can verify the installed SDO components with the SQL commands and its output below:
COL NAMESPACE FOR A10 COL ATTRIBUTE FOR A11 COL VALUE FOR A10 COL DESCRIPTION FOR A40 SELECT NAMESPACE, ATTRIBUTE, VALUE, DESCRIPTION FROM MDSYS.RDF_PARAMETER; NAMESPACE ATTRIBUTE VALUE DESCRIPTION ---------- ----------- ---------- ---------------------------------------- COMPONENT RDFCTX INSTALLED Semantic (Text) Search component COMPONENT RDFOLS INSTALLED RDF Optional component for OLS support MDSYS SEM_VERSION 19.1.0.0.0 VALID
Test Oracle Semantic Search
Once Semantic Search is installed, you can perform a test with the steps below, part of the instructions were taken from Oracle document "How To Use The GATE Extractor To Create SemContext Indexes (Doc ID 1550829.1)" which is outdated:
Step 1 - In your host machine, install General Architecture for Text Engineering (GATE) which is an open-source natural language processor and information extractor, it can be downloaded from here and the installer is named gate-developer-9.0.1-installer.exe with size 57.9 MB. Note two folders will be created, "C:\Program Files (x86)\GATE_Developer_9.0.1" and "C:\Users\pabechevb\.m2\repository\uk\ac\gate\plugins\annie\9.1". Once installed, open the new installed application "GATE 9.0.1", click on File > Manage CREOLE Plugins > select ANNIE (9.1) > click on the button "Extract Plugin Resources" and select the C:\temp\ folder, there will be a file named "ANNIE_with_defaults.gapp" and it is used when launching a GateListener described later.
Step 2 - Create a listener that will receive requests from the database and will reply with the GATE response. The sample can be downloaded from here with filename gatelistener.zip and size 3.18 KB. However, this is for version 5 and won't work for version 9, instead use the attached file which I modified to make it work in version 9 named GateListener.java, then to compile it I placed it at C:\temp.
First you need to compile the file with the CMD commands below, note it requires JDK for the compiler so in my case I'm using the executable from the Oracle installer, also note you need to reference the GATE jar file:
cmd cd C:\temp "C:\temp\WINDOWS.X64_193000_db_home\jdk\bin\javac.exe" -classpath "C:\Program Files (x86)\GATE_Developer_9.0.1\lib\gate-core-9.0.1.jar" GateListener.java
It compiles the class in GateListener.java and outputs a file named GateListener.class, then you can start the listener with the CMD commands below, note I have JRE installed locally so this is the executable I'm using, and Windows Firewall will ask you to open the port which I left as 12000:
cmd cd C:\temp mkdir ANNIE copy ANNIE_with_defaults.gapp .\ANNIE "C:\Program Files (x86)\Java\jre1.8.0_333\bin\java.exe" -classpath .;"C:\Program Files (x86)\GATE_Developer_9.0.1\bin\*";"C:\Program Files (x86)\GATE_Developer_9.0.1\lib\*" GateListener 12000
Once the listener is running, you can test it is reachable from the Docker container with the PowerShell command below replacing the IP with your host computer IP; note this will crash the program and you need to restart the listener:
Test-NetConnection -ComputerName 192.168.0.4 -InformationLevel "Detailed" -Port 12000
Step 3 - Connect to the database with the SYSTEM user (SYS doesn't work because it has too many privileges, MDSYS is configured to not allow connecting to it, and a normal user lacks several privileges needed) and configure the GATE host and port with the SQL commands below, replacing the IP with your host machine IP; note this is done once per database, and if it's not configured you'll get the ORA-13199 error later:
begin sem_rdfctx.set_extractor_param ( param_key => 'GATE_NLP_HOST', param_value => '192.168.0.4', param_desc => 'Host for GATE NLP Listener'); sem_rdfctx.set_extractor_param ( param_key => 'GATE_NLP_PORT', param_value => '12000', param_desc => 'Port for Gate NLP Listener'); end; /
Step 4 - Grant access to the SYSTEM user to connect to external hosts with the SQL commands below, otherwise you'll get the ORA-24247 error later:
begin dbms_network_acl_admin.create_acl ( acl => 'SEMINDEX', description => 'Allow query SPARQL endpoints', principal => 'SYSTEM', is_grant => true, privilege => 'connect'); dbms_network_acl_admin.assign_acl ( acl => 'SEMINDEX', host => '*'); end; /
You can verify it is created with the SQL commands below:
COL PRINCIPAL FOR A10 COL HOST FOR A12 COL LOWER_PORT FOR 99999 COL UPPER_PORT FOR 99999 COL ACL FOR A30 COL PRIVILEGE FOR A9 COL PRIVILEGE_STATUS FOR A16 SET LINESIZE 300 SET PAGESIZE 500 SELECT PRINCIPAL, HOST, LOWER_PORT, UPPER_PORT, ACL, 'CONNECT' AS PRIVILEGE, DECODE(DBMS_NETWORK_ACL_ADMIN.CHECK_PRIVILEGE_ACLID(aclid, PRINCIPAL, 'connect'), 1,'GRANTED', 0,'DENIED', NULL) PRIVILEGE_STATUS FROM DBA_NETWORK_ACLS JOIN DBA_NETWORK_ACL_PRIVILEGES USING (ACL, ACLID) UNION ALL SELECT PRINCIPAL, HOST, NULL lower_port, NULL upper_port, acl, 'resolve' AS PRIVILEGE, DECODE(DBMS_NETWORK_ACL_ADMIN.CHECK_PRIVILEGE_ACLID(aclid, PRINCIPAL, 'resolve'), 1,'GRANTED', 0,'DENIED', NULL) PRIVILEGE_STATUS FROM DBA_NETWORK_ACLS JOIN DBA_NETWORK_ACL_PRIVILEGES USING (ACL, ACLID);
And you can test with the SQL commands below (remember to restart the listener after it crashes):
DECLARE l_conn Sys.UTL_TCP.connection; BEGIN l_conn := utl_tcp.open_connection(remote_host =>'192.168.0.4',remote_port => 12000); utl_tcp.close_connection(l_conn); END; /
Step 5 - Create a semantic network with the SQL commands below, specifying the tablespace where to create it:
EXECUTE SEM_APIS.CREATE_SEM_NETWORK(TABLESPACE_NAME => 'SYSTEM');
Then create a Semantic Text policy to use GATE extractor with the SQL commands below, the extractor can be tuned and can be extended with additional ontologies but this is the simplest form:
begin sem_rdfctx.create_policy (policy_name => 'SEM_EXTR', extractor => mdsys.gatenlp_extractor()); end; /
Step 6 - After the data has been imported as shown in Comparing SQL Server Full Text Search and Oracle Text Search, create the semantic index with the SQL commands below, you will notice activity in the listener:
CREATE INDEX RDFINDEX ON MYDB.MYTBL(TEXT) INDEXTYPE IS MDSYS.SemContext PARAMETERS('SEM_EXTR');
Verify there were 0 errors during the index creation with the SQL command below:
select count(*) from mdsys.rdfctx_index_exceptions;
Now you need the semantic model name created with the SQL query below and its output as shown:
SELECT owner, model_id, model_name, table_name, column_name FROM MDSYS.SEM_MODEL$ WHERE model_name like 'RDFCTX%'; OWNER MODEL_ID MODEL_NAME TABLE_NAME COLUMN_NAM ---------- ---------- --------------- ---------- ---------- SYSTEM 1 RDFCTX_MOD_1
And you can start running SPARQL queries, replacing the model's name as returned from the previous query, to get every relation created:
SELECT s, p, o FROM TABLE(SEM_MATCH( '(?s ?p ?o)', SEM_Models('RDFCTX_MOD_1'), null, null, null));
And you can query which records match any of the relations with the SPARQL query below:
SELECT TEXT FROM MYDB.MYTBL WHERE SEM_CONTAINS(TEXT, 'SELECT ?s ?p ?o WHERE {?s ?p ?o}', 1) = 1;
There are additional operators, like SEM_CONTAINS_SELECT which returns additional information about each document, and SEM_CONTAINS_COUNT which counts matching subgraphs for the pattern specified. You can also add annotations to the documents using SEM_RDFCTX.MAINTAIN_TRIPLES procedure. And Oracle can extract plain text version from formatted documents like Word, RTF, PDF using filters and CTX_DOC.POLICY_FILTER which requires Oracle Text installed.
The official documentation can be found in this link, this link and this link.
SQL Server
In the article Comparing SQL Server Full Text Search and Oracle Text Search you learned how to install Full Text Search, which is a prerequisite for Semantic Search and allows you to classify and compare documents no matter their type. To install semantic search, you must execute the following additional steps:
Step 1 - Find the SemanticLanguageDatabase in the SQL Server installation media with the command and output below:
Get-ChildItem -Path C:\setup -Filter SemanticLanguageDatabase.msi -Recurse -ErrorAction SilentlyContinue -Force Directory: C:\setup\1033_ENU_LP\x64\Setup Mode LastWriteTime Length Name ---- ------------- ------ ---- -a---- 9/24/2019 10:53 PM 185946112 SemanticLanguageDatabase.msi Directory: C:\setup\x64\Setup Mode LastWriteTime Length Name ---- ------------- ------ ---- -a---- 9/24/2019 11:33 PM 185946112 SemanticLanguageDatabase.msi
Step 2 - Run the installer, which is going to decompress the database files with the command below, there is no output:
& C:\setup\x64\Setup\SemanticLanguageDatabase.msi
Step 3 - Search for the decompressed database files with the command and output below:
Get-ChildItem -Path C:\ -Filter semanticsdb.mdf -Recurse -ErrorAction SilentlyContinue -Force Directory: C:\Program Files\Microsoft Semantic Language Database Mode LastWriteTime Length Name ---- ------------- ------ ---- -a---- 9/24/2019 2:21 PM 226426880 semanticsDB.mdf
Step 4 - Attach the database to the server instance with the command and output below:
CREATE DATABASE semanticsdb ON ( FILENAME = 'C:\Program Files\Microsoft Semantic Language Database\semanticsdb.mdf' ) LOG ON ( FILENAME = 'C:\Program Files\Microsoft Semantic Language Database\semanticsdb_log.ldf' ) FOR ATTACH; GO Converting database 'semanticsdb' from version 855 to the current version 904. Database 'semanticsdb' running the upgrade step from version 855 to version 856. … Database 'semanticsdb' running the upgrade step from version 903 to version 904.
Step 5 - Register the semantic statistics database with the command below, there is no output:
EXEC sp_fulltext_semantic_register_language_statistics_db @dbname = N'semanticsdb';
Step 6 - Verify the semantic database is registered and check its version with the command and output below:
SELECT * FROM sys.fulltext_semantic_language_statistics_database; GO database_id register_date registered_by version ----------- ----------------------- ------------- --------------- 6 2022-04-08 13:47:43.470 1 11.0.1153.1.3
Step 7 - Verify the installed document parsers with the command below:
SELECT * FROM sys.fulltext_document_types;
Step 8 - By default Semantic Search doesn't have Office 2007-2010 parsers installed, you need to download them, the file is named "FilterPack64bit.exe" with size 3.87 MB and its version is 2010, you need to install it with the commands below, there is no output and the instructions are described in this link:
cmd c:\setup\FilterPack64bit.exe exit
Step 9 - You also need to install the PDF parser, you need to download it, the file is named "PDFFilter64Setup.msi" with size 19.6 MB and its version is 11.0.1.36, you need to install it with the command below, there is no output:
& c:\setup\PDFFilter64Setup.msi
Step 10 - Once the new filters are installed, you need to load them into Semantic Search with the commands below, there is no output:
EXEC sp_fulltext_service 'update_languages'; GO EXEC sp_fulltext_service 'load_os_resources', 1; GO EXEC sp_fulltext_service 'restart_all_fdhosts'; GO
Then you can query again the installed document parsers as before and see their location and version number.
Now you can alter the existing full text index, but in my case, I'm going to drop and recreate it with the commands below, note the only difference with a full text index is "STATISTICAL_SEMANTICS":
USE MyDb GO DROP FULLTEXT INDEX ON [dbo].[MyTbl]; GO CREATE FULLTEXT INDEX ON dbo.MyTbl(Text STATISTICAL_SEMANTICS) KEY INDEX ID_TWEET ON TWEETS WITH CHANGE_TRACKING AUTO;
Test SQL Server Semantic Search
You can find key phrases in all tweets with the query below, they are called phrases but in reality, those are single words, it is ordered by score so the most important words appear at the top:
SELECT column_id, document_key, keyphrase, score FROM SEMANTICKEYPHRASETABLE(MyTbl, *) ORDER BY score DESC;
You can find the most important key phrases in all tweets with the query below, it is ordered by count, so the most common words appear at the top:
SELECT keyphrase, COUNT(1) FROM SEMANTICKEYPHRASETABLE(MyTbl, *) GROUP BY keyphrase ORDER BY 2 DESC;
You can find similar or related tweets with the query below, because it is done per document ID we need to store intermediate results and this query may not work with large datasets so you will need to run it in batches:
CREATE TABLE #Info ([SourceDocumentKey] INT, [SourceColumnId] INT, [MatchedColumnId] INT, [MatchedDocumentKey] INT, [Score] REAL) DECLARE @cmd VARCHAR(MAX) SET @cmd = '' SELECT @cmd = @cmd + 'INSERT #Info EXEC('' SELECT '+CAST([ID] AS VARCHAR(10))+', * FROM SEMANTICSIMILARITYTABLE(MyTbl, *, '+CAST([ID] AS VARCHAR(10))+')'');' FROM [MyTbl]; EXEC (@cmd) SELECT s.Text, m.Text, Score FROM #Info [i] INNER JOIN MyTbl s on s.ID = i.SourceDocumentKey INNER JOIN MyTbl m on m.ID = i.MatchedDocumentKey WHERE Score > 0.5 ORDER BY Score DESC; DROP TABLE #Info
In my case there were no results because I'm selecting tweets that are at least 50% similar and there were none. This tells me in this dataset there are no tweets influencing others and they're not generated by bots/botnets/netcenters.
You can find the key phrases that make two documents similar or related, which means their words repeat often, with the query below; MyTbl is the loaded table, Text is the column for both comparisons, and 39 and 56 are the document IDs in the database:
SELECT keyphrase, score FROM SEMANTICSIMILARITYDETAILSTABLE(MyTbl, Text, 39, Text, 56);
The official documentation can be found here: Semantic search (contains several subtopics), Semantic search DDL, functions, stored procedures and views, Exploring Semantic Search Key Term Relevance.
Conclusion
You can see installing Semantic Search is a complex topic both in Oracle and SQL Server. In Oracle this doesn't depend on Oracle Text but having both installed allows you to perform additional actions, and SPARQL adds another level of difficulty. In SQL Server the queries remain as T-SQL and are easy to perform, but the database and Office extractors are a little out of date, so you need to test them first.
Next Steps
- Here is the official documentation on Oracle Semantic Search: this link, this link and this link.
- Here is the official documentation on SQL Server Semantic Search: Semantic search (contains several subtopics), Semantic search DDL, functions, stored procedures and views, Exploring Semantic Search Key Term Relevance.
About the author
This author pledges the content of this article is based on professional experience and not AI generated.
View all my tips
Article Last Updated: 2022-06-21