By: John Miner | Updated: 2018-05-10 | Related: > Azure SQL Database
Problem
Over the next few years, companies will be moving to the cloud for the speed with which services can be deployed, the ability to scale services as demand changes, and the potential cost savings from eliminating data centers. Many companies have on-premises applications and/or operational data marts that use Microsoft SQL Server as a back-end database. How can you make some implementation changes to ease maintenance and reduce storage size in the cloud?
In the past, table partitioning and data compression might not have been incorporated into your database design due to the cost of an on-premises Enterprise license. With Azure SQL database, both of these techniques are available for your use in any pricing tier.
Today, I am going to talk about how to implement a sliding window to efficiently remove historical data. After writing a bunch of tips over the past three years, I realized I never went back to address this important technique. You should take a look at my first tip for a basic understanding of table partitioning.
Solution
The business users require N partitions of data to be online and available at all times. The value of N is a time period that can be expressed in months, quarters, or years. It all depends on how you want to partition the data.
A sliding window is a simple concept. At the beginning of a new time period, we remove the oldest partition from the front of the table and add a new partition to the end of the table. This type of maintenance job can be scheduled to execute automatically.
Business Problem
There are many financial companies that invest in the stock market via mutual funds. Our fictitious company named Big Jon Investments is such a company. They currently invest in the S&P 500 mutual fund but are thinking of investing in individual funds starting in 2017.
The investments department will need five years of historical data to make informed choices when picking stocks. Our boss has asked us to pull summarized daily trading data for the 505 stocks that make up the index. We completed the task of creating the comma delimited files by stock and year in a prior tip.
The current business requirement is to create and load a data mart in Azure that uses table partitioning. We want to partition the stock data by quarter.
How can we solve this business problem?
Overall Project Plan
We are going to use a combination of techniques that we learned in prior articles to accomplish the following tasks.
| Task | Description |
| --- | --- |
| 1 | Use PowerShell to combine CSV files. |
| 2 | Create Azure SQL server and database. |
| 3 | Create database schema with table partitioning. |
| 4 | Use BCP to load staging table. |
| 5 | Log processing details for audit trail. |
| 6 | Evaluate and implement data compression. |
| 7 | Create utility view. |
| 8 | Design add new partition procedure. |
| 9 | Code delete existing partition procedure. |
Combining text files
The current stock data that we have is broken out by stock and year. The S&P 500 stock list comprises 505 stocks. After doing the math, five years of data results in 2,525 comma separated value files. This occupies 39.7 MB of disk space.
Bulk Insert works really well with a small number of large files. The purpose of the PowerShell program is to combine the 505 files into one data file per year. The format of the resulting file will be a comma separated value file. Therefore, the program needs to remove all but the first header line when combining the files.
Since this program is only going to be used one time, I will be manually hard coding the execution parameters at the top. The input file directory and file extension pattern are required for identifying the files. The output file name is the final resting place for all the data.
The image below shows the file properties of the RAW directory shown from Windows Explorer. All the files are bucketed into sub-directories by year.
The PowerShell program below introduces only a handful of new cmdlets that I have not talked about in past articles. First, the Import-Csv cmdlet is used to read each file into a PowerShell object. Second, the Select-Object cmdlet can be used during debugging to work with a smaller set of files, or to skip the header line in the CSV file. Third, the Export-Csv cmdlet converts the PowerShell object back to a text file.
The Force parameter overwrites the existing file while the Append parameter adds content to the end of the file. Always use the NoTypeInformation parameter to suppress the type information from appearing in the final file.
```powershell
# Optional message
Write-Host "Start – Combine Csv Data"

# Set global variables
$SrcPath = "C:\CSV DATA\RAW\S&P-2013\"
$DstPath = "C:\CSV DATA\S&P-2013.CSV"
$FileExt = ".CSV"

# Get list of files
$List = @()
Get-ChildItem $SrcPath -recurse |
    Where-Object { $_.extension -eq "$FileExt" } |
    Foreach-Object { $List += $_.FullName }

# Debugging - first five files
#$List = $List | Select-Object -First 5

# Initialize counter
$Cnt = 0

# For each file
$List | Foreach-Object {

    # Change the path
    Set-Location -Path $SrcPath

    # Grab file name
    $FileName = $_

    # Optional message
    Write-Host "File - Start - $FileName"

    # Read in csv file
    $Data = Import-Csv $FileName

    # Append data 2 existing file
    If ($Cnt -ne 0)
    {
        $Data = $Data | Select-Object -skip 1
        $Data | Export-Csv -Path $DstPath -Append -NoTypeInformation
    }
    # Write new file
    else
    {
        $Data | Export-Csv -Path $DstPath -Force -NoTypeInformation
    };

    # Increment counter
    $Cnt += 1

    # Optional message
    Write-Host "File - End - $FileName"
    Write-Host " "
}

# Optional message
Write-Host "End – Combine Csv Data"
```
The image below is taken from two overlapping Explorer windows and shows the BCP directory containing five files. Each file represents all the S&P stock trading data for a given year. The combined data takes 44.0 MB of disk space, which is more than when the data was broken into separate files by stock symbol, most likely because Export-Csv wraps every field in double quotes.
In a nutshell, 2,525 files were combined by executing a PowerShell program five times. The output of the program is five files containing S&P 500 data for the years 2013 through 2017. The next step is to create the Azure SQL server and database.
Azure SQL database
I decided to use the Azure Portal to create a database named db4stocks that is associated with the logical server named svr4tips18. Like many other production servers, there are two existing databases. We are going to add a third database to the server.
I often use the Azure pricing calculator when working with clients. The two existing databases are set at the Basic pricing tier and cost me $4.90 per month per database. The newly created database is set at the S1 pricing tier, which provides 20 DTUs of processing power and a maximum storage size of 250 GB. This database costs me only $29.40 per month. Pricing varies by region, so always double check before you give a budget to the client.
If you have any questions about creating a logical Azure SQL server and Azure SQL database by using the Azure portal, please see Microsoft’s online documentation.
Stocks database schema
I am going to use SQL Server Management Studio (SSMS) to manage my Azure SQL server. Make sure you have the latest version of SSMS. Updates happen often and the newest version can be downloaded from here.
We need to connect to the server to start crafting Transact SQL (T-SQL) scripts. Choose the database engine as the server type. Enter the fully qualified name of the Azure SQL Server. Pick SQL Server authentication as the security option and supply the login/password of the server administrator. The image below shows a typical connect to server dialog box.
By default, you should be connected to the master database. Switch the database context to db4stocks, our newly created database. Execute the following script to create the database objects.
I think schemas are an underutilized object in most production databases. The power behind schemas is the ability to assign permissions to groups (users) at that level. For our solution, there will be an ACTIVE and STAGE schema.
There will be a STOCKS table in both schemas. The AUDIT table in the ACTIVE schema will keep track of the steps that were used to load the database.
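The complete script is included with this tip. The sketch below is a minimal, hypothetical version of the schemas and the staging table; the 32 character column sizes are assumptions taken from the BCP format file shown later.

```sql
-- Minimal sketch only; the enclosed script is the source of truth.
CREATE SCHEMA [STAGE] AUTHORIZATION [dbo];
GO

CREATE SCHEMA [ACTIVE] AUTHORIZATION [dbo];
GO

-- Staging table: every field is character data so that BCP never fails on type conversion.
CREATE TABLE [STAGE].[STOCKS]
(
    st_symbol    VARCHAR(32) NULL,
    st_date      VARCHAR(32) NULL,
    st_open      VARCHAR(32) NULL,
    st_high      VARCHAR(32) NULL,
    st_low       VARCHAR(32) NULL,
    st_close     VARCHAR(32) NULL,
    st_adj_close VARCHAR(32) NULL,
    st_volume    VARCHAR(32) NULL
);
GO
```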
The image below shows the output of the query that lists all user defined objects.
Of course, I did define a partition function named PF_HASH_BY_VALUE and a partition scheme called PS_HASH_BY_VALUE. The trading date of the stock is transformed into a surrogate key in the format YYYYQQ. I am using a persisted computed column to create this key.
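Again, the enclosed script has the exact definitions. The sketch below only shows the general shape of these objects; the RANGE LEFT style, the low guard boundary, and the abbreviated boundary list are my assumptions.

```sql
-- Hypothetical sketch: the boundary style (RANGE LEFT with a low guard value) and the
-- shortened value list are assumptions; the enclosed script defines the full boundary list.
CREATE PARTITION FUNCTION PF_HASH_BY_VALUE (INT)
AS RANGE LEFT FOR VALUES (201300, 201301, 201302, 201303, 201304 /* ... through 201704 */);
GO

-- Azure SQL Database exposes only the PRIMARY file group, so every partition maps there.
CREATE PARTITION SCHEME PS_HASH_BY_VALUE
AS PARTITION PF_HASH_BY_VALUE ALL TO ([PRIMARY]);
GO

-- Active table: the trading date is hashed into a YYYYQQ surrogate key by a persisted
-- computed column, and that key drives the partition scheme.
CREATE TABLE [ACTIVE].[STOCKS]
(
    st_symbol    VARCHAR(32) NOT NULL,
    st_date      DATE NOT NULL,
    st_open      REAL NULL,
    st_high      REAL NULL,
    st_low       REAL NULL,
    st_close     REAL NULL,
    st_adj_close REAL NULL,
    st_volume    BIGINT NULL,
    st_hash_key  AS (YEAR(st_date) * 100 + DATEPART(QUARTER, st_date)) PERSISTED NOT NULL
)
ON PS_HASH_BY_VALUE (st_hash_key);
GO
```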
See the image below that tests the partition function with sample values. The output of the query is the value and which partition it will be stored in. This is a good way to validate your function before using it with a table.
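A query along these lines performs the test; the sample hash keys are arbitrary.

```sql
-- Map a few sample YYYYQQ hash keys to the partition that would store them.
SELECT K.HASH_KEY,
       $PARTITION.PF_HASH_BY_VALUE(K.HASH_KEY) AS PARTITION_NUMBER
FROM (VALUES (201212), (201301), (201504), (201704), (201801)) AS K (HASH_KEY);
GO
```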
There is one important concept that you should implement when designing a partition scheme. For our data set, we have five years of data over four quarters. That means we need 20 partitions to store our data. Why are there 22 partitions shown above?
A hospital partition is located at both the front and the end of the partition scheme. These partitions collect invalid records whose surrogate keys fall below or above the range of defined valid values.
To round out the database object discussion, it is important to capture major events that affect the database. We will be interested in when and how data is loaded into the database.
The above image shows a sample INSERT command that passes along the command text as input. All the other fields use default constraints to pull in system values.
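A minimal sketch of such a call is shown below; the column name is my own placeholder since the real table definition lives in the enclosed script, where default constraints supply values such as the login name and load date.

```sql
-- Hypothetical sketch: au_cmd_text is a placeholder column name; the remaining audit
-- columns are filled in by default constraints (for example, SUSER_SNAME() and GETDATE()).
INSERT INTO [ACTIVE].[AUDIT] (au_cmd_text)
VALUES ('BCP load of S&P-2013.CSV into STAGE.STOCKS');
GO
```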
To recap, a good database design is the foundation of any successful project.
Bulk Copy Program (BCP)
The bulk copy program is a utility that can be called from the command line. Command line parameters govern how the program executes with different inputs.
Typical parameters such as server name, user name, user password and default database allow the program to make a valid connection to the Azure SQL database. The input file is specified after the in keyword. Batch size can be set with the -b parameter, which determines how many records are inserted into the table before a commit is issued. The -m parameter specifies the maximum number of errors that are allowed before a failure, and the error file is specified with the -e parameter. Last but not least, the format file contains the game plan on how to parse and load the data. Of course, we can’t forget about the destination table name.
The above image shows the BCP utility loading the 124 thousand plus records into the staging table in less than 13 seconds. This is extremely quick compared to the data load times using other technologies. Please note that the sqlcmd utility is called to insert notes into our audit table.
The snippet below shows the composition of our first format file. The field delimiters are defined by commas and the row delimiter is defined as a carriage return and line feed.
```text
14.0
8
1  SQLCHAR  0  32  ","     1  st_symbol     ""
2  SQLCHAR  0  32  ","     2  st_date       ""
3  SQLCHAR  0  32  ","     3  st_open       ""
4  SQLCHAR  0  32  ","     4  st_high       ""
5  SQLCHAR  0  32  ","     5  st_low        ""
6  SQLCHAR  0  32  ","     6  st_close      ""
7  SQLCHAR  0  32  ","     7  st_adj_close  ""
8  SQLCHAR  0  32  "\r\n"  8  st_volume     ""
```
The output from our first execution of the batch file is shown below. There are a couple of issues with the loaded data. First, the header line is showing up in the table. Second, each field still has quotes surrounding the data.
We did not receive any data type conversion errors since all the fields in the target table are defined as character. The header line can be removed by using the -F parameter, which defines the physical row number that is considered the first line. Another way to remove this line is to filter the imported data using a WHERE clause.
Removing the quotes around each data element will take more work. The \ character can be used to escape a double quote in the format file. We need to account for quotes before and after the commas. To achieve this, we need to add a fake field that is not mapped to the STAGE.STOCKS table.
The snippet below shows the composition of our second format file. The column position of zero tells the BCP program to not map the first_quote field to the database table.
```text
14.0
9
1  SQLCHAR  0  0   "\""      0  first_quote   ""
2  SQLCHAR  0  32  "\",\""   1  st_symbol     ""
3  SQLCHAR  0  32  "\",\""   2  st_date       ""
4  SQLCHAR  0  32  "\",\""   3  st_open       ""
5  SQLCHAR  0  32  "\",\""   4  st_high       ""
6  SQLCHAR  0  32  "\",\""   5  st_low        ""
7  SQLCHAR  0  32  "\",\""   6  st_close      ""
8  SQLCHAR  0  32  "\",\""   7  st_adj_close  ""
9  SQLCHAR  0  32  "\"\r\n"  8  st_volume     ""
```
The output below shows the data for year 2013 in the correct textual format. We will cast the data to the final data type before inserting into the ACTIVE.STOCKS table.
The T-SQL code below moves the data from the STAGE to ACTIVE table. A truncate table command should be used to clear the staging table after this action. If we change the year value in the batch file and re-run the process for years 2013 to 2017, we will have all the required data in the correct place.
```sql
--
-- Move data from stage to active
--

INSERT INTO ACTIVE.STOCKS
SELECT
    CAST(S.ST_SYMBOL AS VARCHAR(32)) AS ST_SYMBOL,
    CAST(S.ST_DATE AS DATE) AS ST_DATE,
    CAST(S.ST_OPEN AS REAL) AS ST_OPEN,
    CAST(S.ST_HIGH AS REAL) AS ST_HIGH,
    CAST(S.ST_LOW AS REAL) AS ST_LOW,
    CAST(S.ST_CLOSE AS REAL) AS ST_CLOSE,
    CAST(S.ST_ADJ_CLOSE AS REAL) AS ST_ADJ_CLOSE,
    CAST(S.ST_VOLUME AS BIGINT) AS ST_VOLUME
FROM STAGE.STOCKS AS S
WHERE (S.ST_VOLUME <> 'volume') AND (S.ST_VOLUME <> '0')
```
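The clean-up between yearly loads is a single statement.

```sql
-- Clear the staging table so the next yearly file starts from an empty table.
TRUNCATE TABLE [STAGE].[STOCKS];
GO
```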
The image below shows the contents of the audit table after completely loading the database.
In short, the BCP and SQLCMD utilities can be used in a batch file to automate the loading of comma separated value data into an Azure SQL database. We now have the data in our table. However, are we storing the data in the most efficient manner?
Data Compression
Microsoft SQL Server supports two types of table (data) compression: row and page.
The idea behind row compression is to reduce the storage overhead associated with each column. A small value in a large data type is wasteful, so storing the value in a format sized to fit saves space. For instance, a field defined as a big integer (using 8 bytes of space) which holds a value of 128 can be stored in a single byte, like a tiny integer. Thus, we are saving 7 bytes of space. This is not a big savings, but it adds up when you have a million records.
The algorithm used with page compression is a little more complicated. First, row compression is applied to the table. Second, prefix compression is applied to the values in the columns. Each column is examined for patterns or values that are repeated. These values are placed into the first row right after the page header, and each column entry is reduced to a pointer to the first row’s pattern. The third operation is to apply dictionary compression. Now, the whole page is searched for common patterns instead of individual columns. These values are stored in the second row right after the page header, and each entry in the data page is reduced further to a pointer to the second row’s pattern.
Not everything is perfect in life, and the same goes for compression. In rare cases, the original row will be smaller than a compressed one. Also, row compression might beat page compression or vice versa, depending upon the data.
One quick way to compare compression statistics is to capture the output from sp_spaceused for each version of the table. Let me demonstrate this technique.
The code below creates a temporary table to save our statistics.
```sql
--
-- Save statistics on compression
--

-- Remove temporary table
DROP TABLE IF EXISTS #TBL_SIZE
GO

-- Create temporary table
CREATE TABLE #TBL_SIZE
(
    id int identity(1,1),
    name varchar(64),
    rows varchar(64),
    reserved varchar(64),
    data varchar(64),
    index_size varchar(64),
    unused varchar(64)
)
GO
```
The ALTER TABLE command with the rebuild option can be used to change the data compression of an existing table while preserving contents. Let’s look at the STOCKS table in the STAGE schema. Please bulk copy one year’s worth of data into the staging table.
The Transact SQL code below sets the compression setting on the table to NONE.
```sql
-- Try no compression
ALTER TABLE [STAGE].[STOCKS] REBUILD WITH (DATA_COMPRESSION = NONE);
GO

-- No compression
INSERT INTO #TBL_SIZE
    exec sp_spaceused 'stage.stocks';
GO
```
The Transact SQL code below changes the compression setting on the table to ROW.
```sql
-- Try row compression
ALTER TABLE [STAGE].[STOCKS] REBUILD WITH (DATA_COMPRESSION = ROW);
GO

-- Row compression
INSERT INTO #TBL_SIZE
    exec sp_spaceused 'stage.stocks';
GO
```
The Transact SQL code below modifies the compression setting on the table to PAGE.
```sql
-- Try page compression
ALTER TABLE [STAGE].[STOCKS] REBUILD WITH (DATA_COMPRESSION = PAGE);
GO

-- Page compression
INSERT INTO #TBL_SIZE
    exec sp_spaceused 'stage.stocks';
GO
```
Now that we have statistics for each version, let us do some math. The relative percentage of saved space is the old value minus the new value, divided by the old value, and multiplied by 100. The image below shows the result of this calculation on the data we collected in the temporary table. We can see that PAGE compression saves us 37.7 percent in space.
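If you prefer to let the server do the math, a sketch like the one below works against the temporary table. It assumes the first captured row (id = 1) holds the uncompressed numbers and parses the "NNN KB" strings that sp_spaceused returns.

```sql
-- Compare each rebuild against the uncompressed baseline captured in the first row.
WITH SIZES AS
(
    SELECT id,
           name,
           CAST(REPLACE(data, ' KB', '') AS INT) AS data_kb
    FROM #TBL_SIZE
)
SELECT C.id,
       C.name,
       C.data_kb,
       CAST(100.0 * (B.data_kb - C.data_kb) / B.data_kb AS DECIMAL(5, 1)) AS pct_saved
FROM SIZES AS C
CROSS JOIN SIZES AS B
WHERE B.id = 1
ORDER BY C.id;
GO
```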
Since the STAGE.STOCKS table is truncated after every load, it is optional to compress this table. Please load the staging table with five years of stock data. We can see that 633,324 records are being stored in 47 MB of data pages. The table only has one clustered index. Therefore, we are not using any index page space.
I suggest you go ahead and rebuild the ACTIVE.STOCKS table to use PAGE compression. The pricing tiers for Azure SQL database are currently set with 250, 500, 1024, and 4096 GB storage limits. Please see this web page for details since capacity and cost change over time. Using table compression will keep your disk space requirements down. You might be able to save money by using a lower tier.
Data Placement
Testing is a valuable exercise when coding a bunch of new routines. I want to make sure that the row data is being correctly hashed into the correct partitions.
Please truncate the ACTIVE.STOCKS table and load one year of S&P 500 stock data. The image below shows how the data is spread across partitions 2 to 5 for stock data captured in 2013.
Let us repeat the process for a full five years’ worth of data. Because my data is clean, no records end up in the hospital partitions. In real life, you should write a daily report so that your data steward can fix any records with invalid dates that end up in these partitions.
The sys.dm_db_partition_stats dynamic management view can tell us the partition id and row count. However, this is not good enough if we want to dynamically create a sliding window for our partitioned table. We need a utility (helper) view that will tell us the value used in the range comparison. These values determine the partition boundaries of your table.
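For completeness, here is what the view gives us out of the box: row counts per partition, but no boundary values.

```sql
-- Row count per partition for the partitioned stocks table (heap or clustered index only).
SELECT partition_number, row_count
FROM sys.dm_db_partition_stats
WHERE object_id = OBJECT_ID('ACTIVE.STOCKS')
  AND index_id IN (0, 1)
ORDER BY partition_number;
GO
```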
Utility View
Instead of reinventing the wheel, it is sometimes quicker to reuse someone else’s work. I would like to thank Kalen Delaney for the base code I found on the internet. I just renamed it to my own coding standards.
The function called UFN_GET_INDEX_NAME returns the textual name of the index given the table id and index id. The view named UFN_GET_PARTITION_INFO is extremely useful since it returns detailed information about the partitioned table. Please see the completed sliding window script for details.
The T-SQL snippet below returns the information about the first three and last three partitions using this newly defined view.
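Since the snippet itself ships with the completed script, the query below is only a rough equivalent; the dbo schema and the partition number column name are my assumptions, not the actual definition.

```sql
-- Hypothetical sketch: schema and column names of the helper view are assumed.
SELECT TOP (3) *
FROM [dbo].[UFN_GET_PARTITION_INFO]
ORDER BY partition_number ASC;

SELECT TOP (3) *
FROM [dbo].[UFN_GET_PARTITION_INFO]
ORDER BY partition_number DESC;
GO
```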
Now that we filled the gap left by the dynamic management view, we can start working on designing and coding of the sliding window.
Add New Partition
The first action to perform during the sliding window process is to add a new partition at the end of the table. This task is more difficult to execute on premises since we have to calculate which file group to set as the NEXT USED in the partition scheme. Since we are using only the PRIMARY file group in the cloud offering, no calculation is necessary. The SPLIT RANGE clause of the ALTER PARTITION FUNCTION statement accomplishes this task.
The diagram below shows our end hospital partition being split to hold new values for the first quarter of 2018.
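Stripped of dynamic SQL and logging, the core of this add step boils down to two statements like the sketch below; the 201801 boundary is the illustrative hash key for the first quarter of 2018.

```sql
-- Point the scheme at the only file group available in Azure SQL Database.
ALTER PARTITION SCHEME PS_HASH_BY_VALUE NEXT USED [PRIMARY];
GO

-- Split the end hospital partition so a dedicated partition exists for 2018 Q1 (hash key 201801).
ALTER PARTITION FUNCTION PF_HASH_BY_VALUE() SPLIT RANGE (201801);
GO
```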
The simplest way to keep track of partition boundaries and values is to create a table. The following T-SQL defines this table for us.
```sql
--
-- Create hash key table for sliding window
--

-- Create new table
CREATE TABLE [ACTIVE].[PARTITIONS]
(
    part_id int not null identity (1, 1),
    part_hash_key int not null
);
GO

-- Alter the table (primary key)
ALTER TABLE [ACTIVE].[PARTITIONS]
    ADD CONSTRAINT PK_PARTITIONS_ID PRIMARY KEY CLUSTERED (part_id);
GO

-- Alter the table (unique key)
ALTER TABLE [ACTIVE].[PARTITIONS]
    ADD CONSTRAINT UNQ_PARTITIONS_HASH UNIQUE (part_hash_key);
GO
```
The stored procedure named USP_ADD_STOCKS_PARTITION implements the two-step process to add a new partition to the end of our STOCKS table. Please see the completed sliding window script for details.
The following code executes the stored procedure.
```sql
-- Add new partition to end of table
EXEC [ACTIVE].[USP_ADD_STOCKS_PARTITION]
```
I left debugging information in the script to show the exact dynamic Transact-SQL statements that are being executed.
The query below uses the UFN_GET_PARTITION_INFO utility view. We can see the new and hospital partitions are in locations 22 and 23.
In review, the algorithm for adding a new partition to the end of the table is quite simple. I leave it up to you to examine and learn the code.
Delete Old Partition
The second action to perform during the sliding window process is to delete the oldest partition from the front of the table. This task is more difficult since it involves seven distinct steps. Again, coding this stored procedure for Azure is simpler since we do not need to worry about files and file groups. All objects exist on one file group named PRIMARY.
The table below lists all the steps to accomplish this task.
| Step | Description |
| --- | --- |
| 1 | Create temp table for removing data. |
| 2 | Switch partition #2 with temp table. |
| 3 | Merge hospital partition #1. |
| 4 | Merge empty partition #2. |
| 5 | Drop temp table. |
| 6 | Set next used file group. |
| 7 | Split range for hospital partition. |
There are two new partition operations that we need to talk about. The SWITCH PARTITION clause logically moves data from the ACTIVE.STOCKS table to the TEMP.STOCKS table. This command is part of the ALTER TABLE syntax. If we were coding this for an on-premises database, we would want to define the target table on the same file group as the partition that we want to remove. The diagram below shows how the meta data is changed after execution.
It is very important to execute tasks one and two first. Otherwise, the MERGE RANGE operations in tasks three and four would actually cause data movement between partitions. Also, check and remediate any data in the hospital partition before merging it; the same rules apply there.
The above image shows how two partitions are merged into one. The largest boundary condition (value) is retained.
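Again stripping away the dynamic SQL, the switch and merge steps reduce to statements along these lines; the partition number comes from the step table above, while the leading hospital boundary value (201300) is only an assumption.

```sql
-- Step 2: switch the oldest data partition out to the temp table (a metadata-only operation).
ALTER TABLE [ACTIVE].[STOCKS] SWITCH PARTITION 2 TO [TEMP].[STOCKS];
GO

-- Steps 3 and 4: merge away the leading hospital boundary (assumed value) and the
-- now empty 2013 Q1 boundary.
ALTER PARTITION FUNCTION PF_HASH_BY_VALUE() MERGE RANGE (201300);
ALTER PARTITION FUNCTION PF_HASH_BY_VALUE() MERGE RANGE (201301);
GO
```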
Since creating a temporary table is different from manipulating partitions, I created a separate stored procedure for this task. The stored procedure named USP_ADD_STOCKS_TEMP_TABLE uses dynamic Transact-SQL to create the new table.
Last but not least, the stored procedure named USP_DEL_STOCKS_PARTITION combines the steps into one logical unit of work. The following code executes the new stored procedure.
```sql
-- Delete oldest partition
EXEC [ACTIVE].[USP_DEL_STOCKS_PARTITION]
```
I left debugging information in the script to show the exact dynamic Transact-SQL statements that are being executed.
The query below uses the utility view to look at the first three partitions. We can see that any data from 201301 has been eliminated from the table.
Again, the algorithm for deleting an existing partition from the front of the table is not that complex. I leave it up to you to examine and learn the code.
Summary
Today, we talked about how to implement a sliding window to efficiently remove historical data from a partitioned table. Our boss at Big Jon Investments asked us to work with the investments department to create a stocks data mart using a sliding window for the partitioned table deployed in an Azure SQL database.
The investments department will need five years of historical data to make informed choices when picking stocks. A PowerShell program was created to combine existing files broken down by stock symbol and year into five files organized by year.
The only requirement from the investments department is that any extract, transform and/or load processes be efficient and logged. The BCP utility was used with a format file to quickly load data into a staging table in Azure. The SQLCMD utility was used to keep an audit trail of this ETL process. Last but not least, a simple INSERT statement was used to move data from the STAGE to the ACTIVE schema. The final table, named ACTIVE.STOCKS, partitions the data by year and quarter.
Now that we have a data mart, are we storing the data in the most efficient manner? Azure SQL Database supports both ROW and PAGE compression. A simple technique was used to determine which compression setting, if any, is the best. It was noted that PAGE compression is the best setting for this data.
A sliding window is a simple concept. At the beginning of a new time period, we remove the oldest partition from the front of the table and add a new partition to the end of the table. This type of maintenance job can be scheduled to execute automatically.
Before coding the sliding window, a utility view was created to give us better insight into the partition boundary values. Also, a table was created to hold past and future partition hash keys. Two new stored procedures were created to support the sliding window. The first adds a new partition to the end of the table. The second deletes the oldest partition from the front of the table.
Of course, the devil is in the details. One thing is for certain, coding a sliding window for an Azure SQL database table is simpler since we do not need to worry about secondary data files and file groups. If you want to read more on this subject, I suggest you read this TechNet article published in 2006 for SQL Server 2005. The basic mechanism behind table partitioning has not really changed much since then.
Next Steps
- Leveraging Azure Automation and PowerShell to schedule Transact SQL statements