SQL Server Semantic Search to Find Text in External Files

By:   |   Updated: 2012-10-04   |   Comments (3)   |   Related: > Full Text Search


Problem

Sometimes it is necessary to search for specific content inside documents stored in a SQL Server database. Is it possible to do this in SQL Server? Can I run T-SQL queries and find content inside Microsoft Word files?  Yes, now with SQL Server 2012 you can do a semantic search. You can create a T-SQL query and look for information inside a Word document or other documents.  In this tip, we will look at how this works.

Solution

SQL Server 2012 introduces a new search feature beyond the relational platform called semantic search. This feature extends the semantic full-text search capabilities to external files.

Installation

The Semantic Search engine can be installed when you do the initial installation of SQL Server or it can be added later.

Here is a screenshot of the option when you do a normal installation:

install full text and semantic search

If you did not install this when you did the initial installation you can run the SQL Server installer for the SemanticLanguageDatabase as shown below. This can be found in the SQL Server setup folder.

Open the SQL Server Management Studio

Verify Features Are Installed

To verify if you have Full Text and Semantic Search installed you can run this query:

SELECT SERVERPROPERTY('IsFullTextInstalled')

The following query s used to verify that the semantic database is installed and registered:

SELECT * FROM sys.fulltext_semantic_language_statistics_database

Create Sample Database

To use this new feature we are going to use the FileTable feature in SQL Server 2012.  First I am going to create a folder where we will store our database files.  In this example I have created a folder called "C:\FileTable".

After this folder is created, open SQL Server Management Studio and create a new database called "FileSearchTest".  This is created just like any other database except we are going to also enable FILESTREAM for this database as well. 

The script below creates the database:

CREATE DATABASE FileSearchTest
ON PRIMARY
(
    NAME = N'FileSearchTest',
    FILENAME = N'C:\FileTable\FileSearchTest.mdf'
),
FILEGROUP FilestreamFG CONTAINS FILESTREAM
(
    NAME = MyFileStreamData,
    FILENAME= 'C:\FileTable\Data'
)
LOG ON
(
    NAME = N'FileSearchTest_Log',
    FILENAME = N'C:\FileTable\FileSearchTest_log.ldf'
)
WITH FILESTREAM
(
    NON_TRANSACTED_ACCESS = FULL,
    DIRECTORY_NAME = N'FileTable'
)

Create Sample Table

Create a new table as a FileTable inside the database.

USE FileSearchTest
go
CREATE TABLE DocumentSemantics AS FileTable
WITH
(
    FileTable_Directory = 'DocumentSemantics',
    FileTable_Collate_Filename = database_default
);
GO

We just created a FileTable! This is a special table that is used to store non-relational data like Word files, text files, ppt files, etc... The below image shows this new feature in SSMS.

Open the SQL Server Management Studio

There is also a feature in SQL Server 2012 to explore the FileTable directory from within SSMS.  In SSMS right click on the new FileTable you can see the "Explore FileTable Directory" option.  This will show you the files that exist in this folder.

Open the SQL Server Management Studio

You can now copy files into the folder "C:\FileTable\Data", using Windows Explorer and then if you run a SELECT against the DocumentSemantics FileTable you can see a list of these files.

You can just run this command to see all of the details:

SELECT * FROM dbo.DocumentSemantics

Open the SQL Server Management Studio

Deleting Files

You can also delete files from the folder using T-SQL as shown below:

USE FileSearchTest
DELETE FROM dbo.DocumentSemantics
WHERE dbo.DocumentSemantics.file_type <> 'docx'

Searching Content Within Files

In this example we are going to search for text and find the files with that content:

USE FileSearchTest
GO
SELECT name
FROM dbo.DocumentSemantics
WHERE FREETEXT (file_stream, 'After climbing a very high mountain, we discovered that there are many other mountains to climb')

The query will show the name of the documents that contain the following text: "After climbing a very high mountain, we discovered that there are many other mountains to climb". 

The results displayed are the following:

Open the SQL Server Management Studio

If we open these documents in WordPad we can see this text as outlined below.

Open the SQL Server Management Studio

Finding Documents By Filename

Another option we have is to find filenames that include some text.  In this example we are going to search for the word "system".

These are the files in folder "C:\FileTable\Data" that include the word "system":

Open the SQL Server Management Studio

We can run this T-SQL script to find the first 5 files that contain the word "system" in the filename:

 USE FileSearchTest
SELECT TOP (5)
 D.name,
 D.cached_file_size,
 KEYP_TBL.score
FROM dbo.DocumentSemantics D
INNER JOIN SEMANTICKEYPHRASETABLE
 (
 dbo.DocumentSemantics, 
 (name, file_stream)
 ) AS KEYP_TBL
 ON D.path_locator = KEYP_TBL.document_key
WHERE KEYP_TBL.keyphrase ='system'
ORDER BY KEYP_TBL.score DESC;

The results displayed by the query are as follows:

Open the SQL Server Management Studio
Next Steps

Semantic Search combined with FileTables is a powerful feature that will let us handle external files in SQL Server easily and we can query the content easily using T-SQL.

For more information, you can read these artilces:



sql server categories

sql server webinars

subscribe to mssqltips

sql server tutorials

sql server white papers

next tip



About the author
MSSQLTips author Daniel Calbimonte Daniel Calbimonte is a Microsoft SQL Server MVP, Microsoft Certified Trainer and 6-time Microsoft Certified IT Professional. Daniel started his career in 2001 and has worked with SQL Server 6.0 to 2022. Daniel is a DBA as well as specializes in Business Intelligence (SSIS, SSAS, SSRS) technologies.

This author pledges the content of this article is based on professional experience and not AI generated.

View all my tips


Article Last Updated: 2012-10-04

Comments For This Article




Friday, August 1, 2014 - 9:58:37 AM - Anonymous Back To Top (33975)

I want to do "searching content within files" for PDF documents. I searched a lot.According to that I found ifilter can do this.I tried that. ifilter installed path is also visible in sql server query.But when use query to search content, doc types are visible.But not pdf documents.

Can anybody help me.plzz  

 


Thursday, July 31, 2014 - 6:05:20 AM - Anonymous Back To Top (33954)

Excellent post !! It works nicely.Thanks a lot  :)

 


Wednesday, October 24, 2012 - 4:57:13 AM - Yuval Klein Back To Top (20065)

excelent post! 

the filetable feature with the FTS is really an excelent feature.

in the previous versions I needed to translate the word documents to text , load them into table and search inside. it was really a lot of job. now it looks so easy!















get free sql tips
agree to terms