Azure SQL Database - Table Partitioning

By: John Miner | Updated: 2015-01-29

Problem

The Gartner Group specializes in surveying leading companies and determining industry trends in Information Technology. It is not surprising that cloud computing and big data (information explosion) are on top of the 2015 technology trending list.

In December of 2014, Microsoft released the preview version of Azure SQL Database update V12. The main purpose of this version is to narrow the syntactical differences between the on-premises and in-cloud database engines. The hope is that more companies will migrate their data to this software as a service platform.

Given these trends, the main question a database administrator might have is "How can I manage larger tables in Azure SQL Database?".

Solution

The new version of Azure SQL Database has introduced table partitioning. On premises, this feature is available only in the Enterprise edition, but it is available in all editions in the cloud.

To demonstrate this new feature we need to have a fictitious business problem. Since one of my majors in college was applied mathematics, I am going to solve a math problem.

Business Problem

Calculate and store the prime numbers from 1 to 1 million with ten data partitions. Thus, the prime numbers will be hashed into buckets at every one hundred thousand mark.

The trial division algorithm that I am going to introduce is a brute force method for calculating prime numbers. It is great for comparing the computing power of two machines by looking at overall execution times.

This routine consists of dividing a number n by each integer m which is greater than 1 and less than or equal to the square root of n. If the result of any of these divisions is an integer, then n is not a prime; otherwise, it is a prime.
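As a small worked illustration of the rule above, the query below lists every trial divisor for a single candidate and flags the ones that divide evenly. The candidate 91 and the column names are my own choices for illustration; treat this as a sketch, not as part of the solution itself.

```sql
-- Trial division for a single candidate (91 is an arbitrary example).
-- SQRT(91) is roughly 9.54, so only the divisors 2 through 9 need to be tried.
DECLARE @N BIGINT = 91;

SELECT
  M AS TRIAL_DIVISOR,
  @N % M AS REMAINDER,
  CASE WHEN @N % M = 0 THEN 'divides evenly' ELSE '' END AS RESULT
FROM
(
  VALUES (2), (3), (4), (5), (6), (7), (8), (9)
) AS D (M)
WHERE M <= SQRT(@N);
```

Since 91 = 7 x 13, the row for divisor 7 shows a remainder of zero, so 91 is not prime. A candidate such as 97 would show a non-zero remainder on every row.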

Creating the database

I am using my MSDN ultimate license which comes with a free $150 per month Azure subscription. This is a great way to learn about what Azure has to offer without any real investment.

This demonstration assumes you have an Azure Database Server already created with a valid login. The server login I created is named jminer. It is important to record the web address of the server (connection string) since this will be used in SSMS. The image below shows the V12 preview has been enabled.

I will be using SQL Server Management Studio (SSMS) 2014 with cumulative update 5 installed to design and deploy the solution. I will be referring to the Azure Portal to review the results of our work.

To connect to our Azure Database server, enter the connection information using SQL Server standard authentication.

One statement that is still not supported is the USE statement. This limitation can be overcome by selecting the correct database in the Object Explorer and right-clicking to open a new query window. I will be leaving this statement in the code since it is a reminder of which database you should be in. Executing this statement in the wrong database generates an error.

To verify the server version and default database, we can use the DB_NAME() and @@VERSION system functions.
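A minimal sketch of that check is shown here. The column aliases are my own, and the exact version text returned depends on your server.

```sql
-- Show the database of the current connection and the engine version string.
SELECT
  DB_NAME() AS CURRENT_DATABASE,
  @@VERSION AS ENGINE_VERSION;
```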

The code below recreates the MATH database.

/*  
 Create a database to hold the prime numbers
*/

-- Which database to use.
USE [master]
GO

-- Delete existing database
IF  EXISTS (SELECT name FROM sys.databases WHERE name = N'MATH')
DROP DATABASE MATH
GO

-- Create new database
CREATE DATABASE MATH
(
MAXSIZE = 20GB,
EDITION = 'STANDARD',
SERVICE_OBJECTIVE = 'S2'
)
GO

It is interesting to note that two new keywords, EDITION and SERVICE_OBJECTIVE, have been introduced to describe the database type. I will be investigating this new syntax in my next tip.
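If you want to confirm the settings after the database is created, a query along the following lines should work when run while connected to the MATH database. This is a sketch that assumes the DATABASEPROPERTYEX properties Edition, ServiceObjective and MaxSizeInBytes are exposed on your server; the column aliases are my own.

```sql
-- Read back the edition, service objective and size limit of the MATH database.
SELECT
  DATABASEPROPERTYEX('MATH', 'Edition') AS DB_EDITION,
  DATABASEPROPERTYEX('MATH', 'ServiceObjective') AS DB_SERVICE_OBJECTIVE,
  DATABASEPROPERTYEX('MATH', 'MaxSizeInBytes') AS DB_MAX_SIZE_BYTES;
```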

Creating the Partition Function and Scheme

The key concept behind any type of horizontal partitioning is to group similar records into a single file group and/or file. In turn, this changes major record operations into file operations. For instance, DELETE all data with partition value Y turns into a remove file operation. Searching for data with partition value Y as part of the WHERE clause directs the storage engine to retrieve data from that one file.

The overall benefits should result in increased speed. However, like most things in life, your mileage may vary.

The main question that comes to mind is "How do we create a partition scheme in Azure since we have no control over file placement?"

The product team has assured me that mapping ALL the partitions to the PRIMARY file group will be optimized by the storage engine in Azure.

The diagram below is a conceptual view of how table partitioning works for our example in Azure SQL database.

The code below creates a partition function named PF_HASH_BY_VALUE and partition scheme named PS_HASH_BY_VALUE.

/*  
 Use table partitioning
*/

-- Which database to use.
USE [MATH]
GO

-- Create the partition function
CREATE PARTITION FUNCTION PF_HASH_BY_VALUE (BIGINT) AS RANGE LEFT 
FOR VALUES (100000, 200000, 300000, 400000, 500000, 600000, 700000, 800000, 900000)
GO

-- Show the function
SELECT * FROM sys.partition_functions
GO

-- Create the partition scheme
CREATE PARTITION SCHEME PS_HASH_BY_VALUE 
AS PARTITION PF_HASH_BY_VALUE
ALL TO ([PRIMARY]);
GO

-- Show the scheme
SELECT * FROM sys.partition_schemes
GO

The output from querying the system tables is shown below.
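To see the individual boundary values rather than just the function header, the catalog view sys.partition_range_values can be joined to sys.partition_functions. The query below is a sketch; the column aliases are my own.

```sql
-- List each boundary value defined for the partition function.
-- boundary_value_on_right = 0 indicates a RANGE LEFT function.
SELECT
  F.name AS FUNCTION_NAME,
  F.boundary_value_on_right,
  R.boundary_id,
  R.value AS BOUNDARY_VALUE
FROM sys.partition_functions AS F
JOIN sys.partition_range_values AS R
  ON F.function_id = R.function_id
WHERE F.name = N'PF_HASH_BY_VALUE'
ORDER BY R.boundary_id;
```

For the function defined above, this should return nine rows, one for each boundary from 100000 to 900000.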

A simple call to the $PARTITION system function can be used to test the hash index. The example below uses a derived table with key boundary values.

-- Test partition function
SELECT 
  MY_VALUE,
  $PARTITION.PF_HASH_BY_VALUE(MY_VALUE) AS HASH_IDX
FROM 
(
 VALUES 
   (1),
   (100001), 
   (200001), 
   (300001), 
   (400001), 
   (500001), 
   (600001), 
   (700001), 
   (800001), 
   (900001)
) AS TEST (MY_VALUE);
GO

The output from the test is shown below.
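Because the function is declared as RANGE LEFT, each boundary value belongs to the partition on its left. A second test that probes the values exactly on and just after a boundary makes this visible. The test values below are my own picks for illustration.

```sql
-- Probe values on and just past a boundary of the RANGE LEFT function.
-- Expected: 100000 -> 1, 100001 -> 2, 900000 -> 9, 900001 -> 10, 1000000 -> 10
SELECT
  MY_VALUE,
  $PARTITION.PF_HASH_BY_VALUE(MY_VALUE) AS HASH_IDX
FROM
(
  VALUES (100000), (100001), (900000), (900001), (1000000)
) AS TEST (MY_VALUE);
```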

Creating the Partitioned Table

The TBL_PRIMES table contains three columns. The first one is the value of the prime number. The second one is how many divisions were tried before the number was declared prime. The third one is the date and time the data was stored. The first column is chosen as the primary key for the table.

The code below creates the new table with the partitioning scheme implemented on the primary key.

/*  
 Create a table to hold the prime numbers
*/

-- Which database to use.
USE [MATH]
GO

-- Delete existing table
IF  EXISTS (SELECT * FROM sys.objects 
  WHERE object_id = OBJECT_ID(N'[DBO].[TBL_PRIMES]') AND type in (N'U'))
DROP TABLE [DBO].[TBL_PRIMES]
GO

-- Add new table
CREATE TABLE [DBO].[TBL_PRIMES] 
(
  [MY_VALUE] [bigint] NOT NULL,
  [MY_DIVISION] [bigint] NOT NULL CONSTRAINT [CHK_TBL_PRIMES] CHECK ([MY_DIVISION] > 0),
  [MY_TIME] [datetime] NOT NULL CONSTRAINT [DF_TBL_PRIMES] DEFAULT (GETDATE()),
  CONSTRAINT [PK_TBL_PRIMES] PRIMARY KEY CLUSTERED ([MY_VALUE] ASC)
) ON PS_HASH_BY_VALUE ([MY_VALUE])
GO

User defined stored procedures

First, we need a procedure that takes a number as a parameter and determines if it is prime. In this example we will use an old-fashioned WHILE loop. Some relational algebra purists might argue that we should use a TALLY table. However, this is only a simple example focused on table partitioning.

The code below creates the procedure named SP_IS_PRIME.

/*  
 Create a procedure to determine if number is prime
*/


-- Which database to use.
USE [MATH]
GO

-- Delete existing procedure
IF  EXISTS (SELECT * FROM sys.objects 
  WHERE object_id = OBJECT_ID(N'[dbo].[SP_IS_PRIME]') AND type in (N'P', N'PC'))
DROP PROCEDURE [dbo].[SP_IS_PRIME]
GO

-- Create the stored procedure from scratch
CREATE PROCEDURE [dbo].[SP_IS_PRIME]
    @VAR_NUM2 BIGINT
AS
BEGIN
    -- NO DISPLAY
    SET NOCOUNT ON
 
    -- LOCAL VARIABLES
    DECLARE @VAR_CNT2 BIGINT;
    DECLARE @VAR_MAX2 BIGINT;

    -- NOT A PRIME NUMBER
    IF (@VAR_NUM2 = 1)
        RETURN 0;            

    -- A PRIME NUMBER
    IF (@VAR_NUM2 = 2)
        RETURN 1;            

    -- SET UP COUNTERS    
    SELECT @VAR_CNT2 = 2;
    SELECT @VAR_MAX2 = SQRT(@VAR_NUM2) + 1;

    -- TRIAL DIVISION 2 TO SQRT(X)
    WHILE (@VAR_CNT2 <= @VAR_MAX2)
    BEGIN
        -- NOT A PRIME NUMBER
        IF (@VAR_NUM2 % @VAR_CNT2) = 0
            RETURN 0;            

        -- INCREMENT COUNTER
        SELECT @VAR_CNT2 = @VAR_CNT2 + 1;
        
    END;

    -- A PRIME NUMBER
    RETURN 1;
    
END
GO
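A quick way to exercise the procedure is to capture its return value for one known prime and one known composite. The test values and column aliases below are my own choices.

```sql
-- Capture the return code of SP_IS_PRIME: 1 means prime, 0 means not prime.
DECLARE @RESULT INT;

EXEC @RESULT = [dbo].[SP_IS_PRIME] 97;
SELECT @RESULT AS IS_97_PRIME;   -- expect 1, since 97 is prime

EXEC @RESULT = [dbo].[SP_IS_PRIME] 91;
SELECT @RESULT AS IS_91_PRIME;   -- expect 0, since 91 = 7 x 13
```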

Second, we need a procedure that takes a starting and ending value as input and calculates and stores the prime numbers between those two values as output. This procedure will allow us to run multiple calls in parallel against Azure SQL Database at the same time.

The code below creates the procedure named SP_STORE_PRIMES.

/*    
 Create a procedure to store primes from x to y.
*/

-- Which database to use.
USE [MATH]
GO

-- Delete existing procedure
IF  EXISTS (SELECT * FROM sys.objects 
  WHERE object_id = OBJECT_ID(N'[dbo].[SP_STORE_PRIMES]') AND type in (N'P', N'PC'))
DROP PROCEDURE [dbo].[SP_STORE_PRIMES]
GO

-- Create the stored procedure from scratch
CREATE PROCEDURE SP_STORE_PRIMES
    @VAR_ALPHA BIGINT,
    @VAR_OMEGA BIGINT
AS
BEGIN
    -- NO DISPLAY
    SET NOCOUNT ON
 
    -- DECLARE VARIABLES
    DECLARE @VAR_CNT1 BIGINT;
    DECLARE @VAR_RET1 INT;
    
    -- SET VARIABLES
    SELECT @VAR_RET1 = 0;
    SELECT @VAR_CNT1 = @VAR_ALPHA;

    -- CHECK EACH NUMBER FOR PRIMENESS
    WHILE (@VAR_CNT1 <= @VAR_OMEGA)
    BEGIN
        -- ARE WE PRIME?
        EXEC @VAR_RET1 = DBO.SP_IS_PRIME @VAR_CNT1;
        
        -- FOUND A PRIME
        IF (@VAR_RET1 = 1)
          INSERT INTO [DBO].[TBL_PRIMES] (MY_VALUE, MY_DIVISION) 
          VALUES (@VAR_CNT1, SQRT(@VAR_CNT1));
    
        -- INCREMENT COUNTER
        SELECT @VAR_CNT1 = @VAR_CNT1 + 1        
    END;
    
END
GO
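Before launching the full run, it can be worth smoke testing the procedure on a tiny range. The sketch below assumes the table is still empty and wraps the call in a transaction that is rolled back, so the test rows do not collide with the real load later. The range 1 to 30 is my own choice.

```sql
-- Smoke test on a small range; the rollback leaves the table unchanged.
BEGIN TRANSACTION;

EXEC [dbo].[SP_STORE_PRIMES] 1, 30;

-- Expect ten rows: 2, 3, 5, 7, 11, 13, 17, 19, 23 and 29.
SELECT MY_VALUE, MY_DIVISION
FROM [dbo].[TBL_PRIMES]
ORDER BY MY_VALUE;

ROLLBACK TRANSACTION;
```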

Parallel execution

Many of the SQL Server tools that come with the on-premises edition work the same way for the cloud edition. I am going to leverage the SQLCMD utility in a batch program. The command line interpreter has the start keyword that can be used to launch a program asynchronously. Putting all these concepts together with the right calls to SP_STORE_PRIMES, we can calculate the prime numbers in ten even batches.

The command file below calls our user defined stored procedure to solve our business problem.

REM
REM  Calculate prime numbers <= 1M asynchronously.
REM  Each command is split over two lines; the trailing caret (^) continues the line.
REM 

REM [Partition 1]
start cmd /c sqlcmd -S codf58h5ey.database.windows.net,1433 -U jminer -P SQLtip$2015 ^
  -d MATH -Q "EXEC SP_STORE_PRIMES 1, 100000;"

REM [Partition 2]
start cmd /c sqlcmd -S codf58h5ey.database.windows.net,1433 -U jminer -P SQLtip$2015 ^
  -d MATH -Q "EXEC SP_STORE_PRIMES 100001, 200000;"

REM [Partition 3]
start cmd /c sqlcmd -S codf58h5ey.database.windows.net,1433 -U jminer -P SQLtip$2015 ^
  -d MATH -Q "EXEC SP_STORE_PRIMES 200001, 300000;"

REM [Partition 4]
start cmd /c sqlcmd -S codf58h5ey.database.windows.net,1433 -U jminer -P SQLtip$2015 ^
  -d MATH -Q "EXEC SP_STORE_PRIMES 300001, 400000;"

REM [Partition 5]
start cmd /c sqlcmd -S codf58h5ey.database.windows.net,1433 -U jminer -P SQLtip$2015 ^
  -d MATH -Q "EXEC SP_STORE_PRIMES 400001, 500000;"

REM [Partition 6]
start cmd /c sqlcmd -S codf58h5ey.database.windows.net,1433 -U jminer -P SQLtip$2015 ^
  -d MATH -Q "EXEC SP_STORE_PRIMES 500001, 600000;"

REM [Partition 7]
start cmd /c sqlcmd -S codf58h5ey.database.windows.net,1433 -U jminer -P SQLtip$2015 ^
  -d MATH -Q "EXEC SP_STORE_PRIMES 600001, 700000;"

REM [Partition 8]
start cmd /c sqlcmd -S codf58h5ey.database.windows.net,1433 -U jminer -P SQLtip$2015 ^
  -d MATH -Q "EXEC SP_STORE_PRIMES 700001, 800000;"

REM [Partition 9]
start cmd /c sqlcmd -S codf58h5ey.database.windows.net,1433 -U jminer -P SQLtip$2015 ^
  -d MATH -Q "EXEC SP_STORE_PRIMES 800001, 900000;"

REM [Partition 10]
start cmd /c sqlcmd -S codf58h5ey.database.windows.net,1433 -U jminer -P SQLtip$2015 ^
  -d MATH -Q "EXEC SP_STORE_PRIMES 900001, 1000000;"

Solution validation

Even though I have been in the IT industry for a quarter century, I still test and re-test my solutions to make sure that my algorithms work correctly for both positive and negative testing.

One question that a tester might have is "How do I know the data was stored in the correct partition?"

The Azure preview V12 has exposed over 100 new dynamic management views that the database administrator can use for monitoring and troubleshooting. The sys.dm_db_partition_stats view can be used to answer such a question. However, I already introduced the $PARTITION system function that can obtain the same answer.

The code below shows how to investigate row counts by partition number.

/*  
 Validate data placement
*/


-- Use dmv to get partitions
SELECT 
  Partition_Number, Row_Count 
FROM sys.dm_db_partition_stats
WHERE object_id = object_id('TBL_PRIMES'); 


-- Using the $PARTITION function
SELECT 
    $PARTITION.PF_HASH_BY_VALUE([MY_VALUE]) as Partition_Number, 
    COUNT(*) as Row_Count
FROM 
    MATH.[dbo].[TBL_PRIMES]
GROUP BY 
    $PARTITION.PF_HASH_BY_VALUE([MY_VALUE]);
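
Placement is only half of the validation; the totals should also match known results. There are 78,498 primes below one million, the smallest being 2 and the largest 999,983. The sketch below, with column aliases of my own, checks the loaded table against those figures once all ten batches have finished.

```sql
-- Compare the loaded data against known facts about primes below one million.
-- Expect TOTAL_PRIMES = 78498, SMALLEST_PRIME = 2, LARGEST_PRIME = 999983.
SELECT
  COUNT(*) AS TOTAL_PRIMES,
  MIN([MY_VALUE]) AS SMALLEST_PRIME,
  MAX([MY_VALUE]) AS LARGEST_PRIME
FROM [dbo].[TBL_PRIMES];
```

A lower total usually means one of the ten command windows failed or is still running.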