By: Daniel Farina | Updated: 2013-12-24 | Comments (6) | Related: > In Memory OLTP
Problem
SQL Server 2014 introduces hash indexes for Memory-Optimized tables, but you may not know what they are and the differences between hash indexes and standard SQL Server indexes. Also you may be asking why there are not clustered indexes on Memory-Optimized tables. Keep reading to find the answers.
Solution
We all know the importance of proper indexing on tables. The introduction of hash indexes for Memory-Optimized tables on SQL Server 2014 brings us new possibilities to achieve maximum performance, but there are some considerations. After reading this tip you will understand the pros and cons of this new type of index.
What is a hash index?
Basically, a hash index is an array of N buckets or slots, each one containing a pointer to a row. Hash indexes use a hash function F(K, N) in which given a key K and the number of buckets N , the function maps the key to the corresponding bucket of the hash index. I want to emphasize that the buckets do not store the keys or its hashed value. They only store the memory address in which the data is placed.
What is a hash function?
A hash function is any algorithm that maps data of variable length to data of a fixed length in a deterministic and close to random way. A very simple hash function would be a string that returns its length, so F("John") = 4 and F("Ed") = 2. If we define a hash index of 5 buckets using this function then the pointers that point to "John" and "Ed" are stored at buckets 4 and 2 respectively. The following image will help you understand.
At this point everything seems clear, but what happens when we apply this simple hash function on "Bill"? The length of string "Bill" is 4, so F("Bill") = 4 which is the same as F("John") = 4. Don't get confused, the fact that two keys have the same Hash value doesn't mean the keys are the same. This is called a "collision" and is very common in hash functions. Of course the more collisions a function has the worse a function is because a large number of hash collisions can have a performance impact on read operations. Also collisions may be caused when the number of buckets is too small compared with the number of distinct keys.
SQL Server Hekaton Collision Management
There are many ways to solve hash collisions, but I will focus on how the Hekaton engine handles them.
The way Microsoft has chosen to handle hash collisions is chaining with linked lists. In order to explain this concept I need to show you how data and indexes are organized on Memory-Optimized Tables.
SQL Server Hekaton Hash Indexes and Row Addressing
In my previous tip Overcoming storage speed limitations with Memory-Optimized Tables for SQL Server I briefly explained Hekaton's structure and now it is time to go deeper.
Memory-Optimized tables consist of two structures: indexes and rows. See image below.
The only way rows are combined into a table is by linking them with index pointers. I mean rows aren't necessarily stored in an orderly fashion. That's why clustered index are nonsense. On Memory-Optimized tables space is not pre-allocated. Instead when a row is inserted or updated the Hekaton engine allocates space from RAM's heap on the fly. Also you won't find covering indexes because for Memory-Optimized tables all indexes are inherently covered.
SQL Server Hekaton Row Header
To make it simple to understand I will omit some fields of the row header and will focus only on those that are more relevant for this tip.
As the default isolation level for Memory-Optimized tables is SNAPSHOT, it is easy to understand why the row header keeps timestamps of the rows in order to handle optimistic multiversion concurrency.
For new transactions a row is valid if the end timestamp is set to the special value referred as "infinity".
Remember when I told you about chaining to solve collisions? That's when index pointers for a row header come to light. When a row is inserted and hashes to a non-empty bucket Hekaton links the new row to the existing one by setting a pointer to the new row on the existing one.
Coming back to the previous example, suppose that at timestamp T0 = 10 we insert a row R1 with "John" as the key that maps to bucket number 4.
Then at timestamp T1 = 20 we insert a new row R2 with "Bill" as the key that also maps to bucket number 4. To handle this hash collision Hekaton will set a pointer on R1 to R2. See the next image to get a clearer vision of this scenario.
SQL Server Hekaton Row Chaining and Multiversioning
Now consider that we have the same row R1 whose key hash maps to bucket 4 at timestamp T0 = 10. Next at timestamp T2 = 30 the row R1 is updated. As I told you in my previous tip about Memory-Optimized tables, Hekaton handles updates as a delete followed by an insert, so updates don't modify the existing row data; instead, the engine sets the row end timestamp to T2 which makes the row outdated for new transactions and inserts a new row, R3 containing the updated data with a begin timestamp equal to T3. Look at the image below for a graphical explanation.
Now putting it all together: whenever a row's key matches an occupied bucket, either if it is a new row or an updated one, it is linked at the end of the pointer chain of the matching bucket.
This brings up this question: How does the engine identify the proper row?
SQL Server Hekaton Row Identification
Suppose we perform the following SELECT statement on a Hash indexed table.
SELECT * FROM Customer WHERE CustomerId = 'John'
To identify this row in a Hash indexed table by an INDEX SEEK operation the Hekaton engine uses the following algorithm:
1 - Calculate the bucket in which the key, if exists, should be stored.
To do so, the Engine applies the Hash function H(x) to the search predicate "John", so H("John") = h. Then, to get the bucket number (a number between 0 and BUCKET_COUNT − 1) Hekaton calculates the remainder of the division between h and the number of buckets. So:
Bucket = H("John") % BUCKET_COUNT
If you wonder why Hekaton rounds the number of buckets specified on BUCKET_COUNT to the next power of two, here lies the answer: if BUCKET_COUNT is a power of two then the modulo operation can be replaced by a masking operation. In computing, performing a division to obtain the modulo is more expensive (I mean it uses more cycles per instruction) than a masking operation.
Being BUCKET_COUNT is a power of two, the previous formula is rewritten as follows:
Bucket = H("John") & (BUCKET_COUNT -1)
Where & is a bitwise AND operation. You can test this math shortcut by yourself with the following SQL script.
DECLARE @BUCKET_COUNT INT = 8 SELECT 8458 % @BUCKET_COUNT SELECT 8458 & (@BUCKET_COUNT -1)
2 - After that the Engine checks at the given position if there is a pointer to a row. If so then it checks for a key value match and if the end timestamp isn't set.
3 - If row key doesn't match or end timestamp is set then the engine looks at the row's index pointer. If pointer is null then the searched key doesn't exist. Otherwise it repeats step 2 until the searched key is found or the end of pointer chain has been reached.
Here you have a flow chart showing these steps.
Pros and Cons of Hash Indexes in SQL Server Hekaton
Now I will show you with examples in which circumstances Hash indexes are the right choice. Also I will include screen captures with the execution plans so you can see the difference in query cost.
First we need to create our test database
CREATE DATABASE TestDB ON PRIMARY (NAME = TestDB_file1, FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL12.MSSQLSERVER\MSSQL\Data\TestDB_1.mdf', SIZE = 100MB, FILEGROWTH = 10%), FILEGROUP TestDB_MemoryOptimized_filegroup CONTAINS MEMORY_OPTIMIZED_DATA ( NAME = TestDB_MemoryOptimized, FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL12.MSSQLSERVER\MSSQL\Data\TestDB_MemoryOptimized') LOG ON ( NAME = TestDB_log_file1, FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL12.MSSQLSERVER\MSSQL\Data\TestDB_1.ldf', SIZE = 100MB, FILEGROWTH = 10%) GO
Then we create some sample tables pairs.
USE TestDB GO IF OBJECT_ID('dbo.sample_memoryoptimizedtable_Hash','U') IS NOT NULL DROP TABLE dbo.sample_memoryoptimizedtable_Hash GO CREATE TABLE dbo.sample_memoryoptimizedtable_Hash ( c1 int NOT NULL, c2 float NOT NULL, c3 decimal(10, 2) NOT NULL CONSTRAINT PK_sample_memoryoptimizedtable_Hash PRIMARY KEY NONCLUSTERED HASH ( c1 )WITH ( BUCKET_COUNT = 1024) )WITH ( MEMORY_OPTIMIZED = ON , DURABILITY = SCHEMA_AND_DATA ) GO ---------------------------------------------------------------- IF OBJECT_ID('dbo.sample_memoryoptimizedtable_Range','U') IS NOT NULL DROP TABLE dbo.sample_memoryoptimizedtable_Range GO CREATE TABLE dbo.sample_memoryoptimizedtable_Range ( c1 int NOT NULL, c2 float NOT NULL, c3 decimal(10, 2) NOT NULL CONSTRAINT PK_sample_memoryoptimizedtable_Range PRIMARY KEY NONCLUSTERED ( c1 ASC ) )WITH ( MEMORY_OPTIMIZED = ON , DURABILITY = SCHEMA_AND_DATA ) GO /*----------------------------------------------------------------*/ /*----------------------------------------------------------------*/ IF OBJECT_ID('dbo.StringTable_Hash','U') IS NOT NULL DROP TABLE dbo.StringTable_Hash GO CREATE TABLE dbo.StringTable_Hash ( PersonID nvarchar(50) COLLATE Latin1_General_100_BIN2 NOT NULL , Name nvarchar(50) COLLATE Latin1_General_100_BIN2 NOT NULL, CONSTRAINT PK_StringTable_Hash PRIMARY KEY NONCLUSTERED HASH (PersonID) WITH (BUCKET_COUNT = 1024) ) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA) GO ---------------------------------------------------------------- IF OBJECT_ID('dbo.StringTable_Range','U') IS NOT NULL DROP TABLE dbo.StringTable_Range GO CREATE TABLE dbo.StringTable_Range ( PersonID nvarchar(50) COLLATE Latin1_General_100_BIN2 NOT NULL, Name nvarchar(50) COLLATE Latin1_General_100_BIN2 NOT NULL, CONSTRAINT PK_StringTable_Range PRIMARY KEY NONCLUSTERED (PersonID ASC) ) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA) GO /*----------------------------------------------------------------*/ /*----------------------------------------------------------------*/ IF OBJECT_ID('dbo.Composite_Key_Hash','U') IS NOT NULL DROP TABLE dbo.Composite_Key_Hash GO CREATE TABLE dbo.Composite_Key_Hash ( c1 int NOT NULL, c2 float NOT NULL, c3 decimal(10, 2) NOT NULL CONSTRAINT PK_Composite_Key_Hash PRIMARY KEY NONCLUSTERED HASH ( c1, c2 )WITH ( BUCKET_COUNT = 1024) )WITH ( MEMORY_OPTIMIZED = ON , DURABILITY = SCHEMA_AND_DATA ) GO ---------------------------------------------------------------- IF OBJECT_ID('dbo.Composite_Key_Range','U') IS NOT NULL DROP TABLE dbo.Composite_Key_Range GO CREATE TABLE dbo.Composite_Key_Range ( c1 int NOT NULL, c2 float NOT NULL, c3 decimal(10, 2) NOT NULL CONSTRAINT PK_Composite_Key_Range PRIMARY KEY NONCLUSTERED ( c1 ASC, c2 ASC ) )WITH ( MEMORY_OPTIMIZED = ON , DURABILITY = SCHEMA_AND_DATA ) GO
As you can see each table in the pair has the same structure, but differs on index type.
Now we load some test data.
USE TestDB GO INSERT INTO dbo.sample_memoryoptimizedtable_Hash(c1, c2, c3) SELECT 1, 1, 1 UNION ALL SELECT 2, 1, 1 UNION ALL SELECT 8, 1, 1 UNION ALL SELECT 3, 1, 1 UNION ALL SELECT 4, 1, 1 UNION ALL SELECT 5, 1, 1 UNION ALL SELECT 6, 1, 1 UNION ALL SELECT 7, 1, 1 -------------------------------------------------- INSERT INTO dbo.sample_memoryoptimizedtable_Range(c1, c2, c3) SELECT 1, 1, 1 UNION ALL SELECT 2, 1, 1 UNION ALL SELECT 8, 1, 1 UNION ALL SELECT 3, 1, 1 UNION ALL SELECT 4, 1, 1 UNION ALL SELECT 5, 1, 1 UNION ALL SELECT 6, 1, 1 UNION ALL SELECT 7, 1, 1 /*----------------------------------------------*/ /*----------------------------------------------*/ INSERT INTO dbo.StringTable_Hash(PersonID, Name) SELECT 'A101054', 'John' UNION ALL SELECT 'A184848', 'Tim' UNION ALL SELECT 'C154887', 'Bill' UNION ALL SELECT 'B101054', 'Henry' UNION ALL SELECT 'B109854', 'Douglas' UNION ALL SELECT 'D101054', 'Charles' UNION ALL SELECT 'CA01054', 'Ben' -------------------------------------------------- INSERT INTO dbo.StringTable_Range(PersonID, Name) SELECT 'A101054', 'John' UNION ALL SELECT 'A184848', 'Tim' UNION ALL SELECT 'C154887', 'Bill' UNION ALL SELECT 'B101054', 'Henry' UNION ALL SELECT 'B109854', 'Douglas' UNION ALL SELECT 'D101054', 'Charles' UNION ALL SELECT 'CA01054', 'Ben' /*----------------------------------------------*/ /*----------------------------------------------*/ INSERT INTO dbo.Composite_Key_Hash(c1, c2, c3) SELECT 1, 1, 1 UNION ALL SELECT 2, 1, 1 UNION ALL SELECT 8, 1, 1 UNION ALL SELECT 3, 1, 1 UNION ALL SELECT 4, 1, 1 UNION ALL SELECT 5, 1, 1 UNION ALL SELECT 6, 1, 1 UNION ALL SELECT 7, 1, 1 -------------------------------------------------- INSERT INTO dbo.Composite_Key_Range(c1, c2, c3) SELECT 1, 1, 1 UNION ALL SELECT 2, 1, 1 UNION ALL SELECT 8, 1, 1 UNION ALL SELECT 3, 1, 1 UNION ALL SELECT 4, 1, 1 UNION ALL SELECT 5, 1, 1 UNION ALL SELECT 6, 1, 1 UNION ALL SELECT 7, 1, 1
Finally we are ready to perform the tests.
-
Hash indexes don't support range searches like <, > and BETWEEN. This is done by an index scan.
USE TestDB GO -- Range searches SELECT c1 FROM dbo.sample_memoryoptimizedtable_Hash WHERE c1 > 4 SELECT c1 FROM dbo.sample_memoryoptimizedtable_Range WHERE c1 > 4
-
Inequality searches are not supported. They perform a table scan.
USE TestDB GO -- Equality comparisons SELECT c1 FROM dbo.sample_memoryoptimizedtable_Hash WHERE c1 = 4 SELECT c1 FROM dbo.sample_memoryoptimizedtable_Range WHERE c1 = 4
-
Hash indexes aren't ordered, so an ORDER BY operation will add a SORT step on the execution plan.
USE TestDB GO -- ORDER BY operation SELECT c1 FROM dbo.sample_memoryoptimizedtable_Hash WHERE c1 = 4 OR c1 = 7 ORDER BY c1 SELECT c1 FROM dbo.sample_memoryoptimizedtable_Range WHERE c1 = 4 OR c1 = 7 ORDER BY c1
-
With Hash indexes only whole keys can be used to search for a row. For example if you have a table with a composite key like (C1, C2) then searching by C1 will result in an index scan.
USE TestDB GO -- Composite Key Behavior SELECT c1 FROM dbo.Composite_Key_Hash WHERE c1 = 4 SELECT c1 FROM dbo.Composite_Key_Range WHERE c1 = 4
-
A corollary of the above is that LIKE comparisons are not supported and will result in an index scan.
USE [TestDB] GO SELECT PersonID FROM dbo.StringTable_Hash WHERE PersonID LIKE 'C%' SELECT PersonID FROM dbo.StringTable_Range WHERE PersonID LIKE 'C%'
Conclusion
Although Hash indexes are a very powerful tool and can be very helpful in some situations, Hash indexing requires more planning than range indexing. In order to take full advantage of Memory-Optimized tables a SQL Server specialist must fully understand the difference between Hash and Range indexes
References
- Hekaton: SQL Server's Memory-Optimized OLTP Engine, Diaconu C., Freedman
C., Ismert E, Larson P., Mittal P., Stonecipher R., Verma N., Zwilling M. (http://research.microsoft.com/pubs/193594/Hekaton%20-%20Sigmod2013%20final.pdf).
- Delaney K. SQL Server In-Memory OLTP Internals Overview for CTP2 (http://download.microsoft.com/download/F/5/0/F5096A71-3C31-4E9F-864E-A6D097A64805/SQL_Server_Hekaton_CTP1_White_Paper.pdf).
Next Steps
- If you still don't have a version of SQL Server 2014, download a trial version here.
- Take a look at my previous tip about migrating to Memory-Optimized Tables: Overcoming storage speed limitations with Memory-Optimized Tables for SQL Server.
- This is a good tip to start getting in touch with indexes: SQL Server Indexing Basics.
- Also this one would give you more insight: Understanding SQL Server Indexing.
- After those tips you are ready for this one: SQL Server Index Checklist.
- In this tip you will learn how to deal with composite indexes: SQL Server Index Column Order - Does it Matter.
- To perform tests on your indexes you can read about Using Hints To Test SQL Server Indexes.
- For ORDER BY queries this is very important: Building SQL Server Indexes in Ascending vs Descending Order
- If you don't understand execution plans then this tutorial is for you: Graphical Query Plan Tutorial.
- Take a look at Indexing Tips Category.
- Also take a look at Performance Tuning Tips Category.
About the author
This author pledges the content of this article is based on professional experience and not AI generated.
View all my tips
Article Last Updated: 2013-12-24