By: Aaron Bertrand | Updated: 2022-04-08 | Related: > Database Design
Problem
We often build logging or other insert-only tables where we store large strings like URLs, host names, or error messages. It's usually not until the table has terabytes of data that we realize there might have been a better way. If we are logging traffic or exceptions for our own application, it's likely that we record the same URL, host name, or error message on millions of rows. What if we only had to write that URL or host name or message text once, the first time we saw it? In this tip, I want to share one idea for abstracting away recurring values, reducing storage, and making search queries faster (especially those with wildcards) without requiring immediate changes in the application layer.
Solution
I mentioned dynamic dimension tables in the title because, typically, a dimension table is one we generate in advance with a known (or at least slowly changing) data domain. These are data sets that remain relatively static over time, but we don't always know all possible values up front. Examples include order status codes, lists of cities and states, an enumeration of major SQL Server versions, or – one of my favorites – a calendar table.
When there are new values coming into the system regularly, we can generate such a dimension on the fly, populating it with only new dimension values as we encounter them. Logging tables that can grow exponentially and unpredictably are exactly where it is worth investing in this kind of optimization.
Let's say we start with a logging table like this:
CREATE TABLE dbo.ExceptionLog
(
  LogID bigint IDENTITY(1,1),
  Hostname nvarchar(64),
  URI nvarchar(255),
  ErrorMessage nvarchar(255),
  EventTime datetime2(0) NOT NULL DEFAULT sysutcdatetime(),
  CONSTRAINT PK_ExceptionLog PRIMARY KEY (LogID)
);
After some time, if the application is busy, we can imagine there will be a lot of rows in this table. To simulate this, without writing a low-quality app that generates years' worth of exceptions quickly, we can generate some realistic data manually. I may have mentioned in previous tips that data generation is often my favorite part of these exercises.
First, we need a variety of host names (let's say a bunch of machines acting as specific types of application servers). One way to do that is to pull some rows from sys.all_objects and use something that varies a little, but not much; one attribute with a small range of variety is the length of object names:
SELECT TOP (5)
  HostName = CONCAT(CASE LEN(o.name) % 2
               WHEN 1 THEN N'WebApp' ELSE N'FileLoad' END,
               N'Server', LEN(o.name))
FROM sys.all_objects AS o
ORDER BY NEWID();
Results (yours will absolutely vary):
HostName
----------------
WebAppServer25
FileLoadServer16
FileLoadServer20
WebAppServer19
WebAppServer21
If we expand that, we can add fake URI and EventTime values:
SELECT TOP (5)
  HostName = CONCAT(CASE LEN(o.name) % 2
               WHEN 1 THEN N'WebApp' ELSE N'FileLoad' END,
               N'Server', LEN(o.name)),
  URI = N'/' + REPLACE(o.name, N'_', N'/'),
  EventTime = DATEADD(MINUTE, ABS(o.object_id) % 10000, '20220101')
FROM sys.all_objects AS o
ORDER BY NEWID();
Results:
HostName          URI                         EventTime
----------------  --------------------------  ----------------
WebAppServer25    /fn/GetRowsetIdFromRowDump  2022-01-02 08:16
FileLoadServer16  /sp/MSdeleteretry           2022-01-01 06:44
FileLoadServer20  /trace/event/bindings       2022-01-05 22:52
WebAppServer19    /default/constraints        2022-01-01 06:48
WebAppServer21    /sp/addtabletocontents      2022-01-02 01:36
Now, all that's left is the error messages. Again, we want to have some variety here, but not too much. It is also nice to have realistic text without having to use our imagination. We can cross join to sys.messages:
SELECT TOP (5)
  HostName = CONCAT(CASE LEN(o.name) % 2
               WHEN 1 THEN N'WebApp' ELSE N'FileLoad' END,
               N'Server', LEN(o.name)),
  URI = N'/' + REPLACE(o.name, N'_', N'/'),
  MsgText = LEFT(msg.[text], 255),
  EventTime = DATEADD(MINUTE, ABS(o.object_id) % 10000, '20220101')
FROM sys.all_objects AS o
CROSS JOIN
(
  SELECT TOP (5) [text]
  FROM sys.messages
  WHERE language_id = 1033
  ORDER BY NEWID()
) AS msg
ORDER BY NEWID();
Results (with some truncation for page width):
HostName          URI           MsgText              EventTime
----------------  ------------  -------------------  ----------------
WebAppServer25    /fn/GetRo...  A corrupted mess...  2022-01-02 08:16
FileLoadServer16  /sp/MSdel...  Invalid property...  2022-01-01 06:44
FileLoadServer20  /trace/ev...  A corrupted mess...  2022-01-05 22:52
WebAppServer19    /default/...  %.*ls: Page %d:%...  2022-01-01 06:48
WebAppServer21    /sp/addta...  Warning: the flo...  2022-01-02 01:36
Next, we can go for scale. Increase the objects to 10,000 rows, the messages to 100 rows, and then repeat 100 times (producing a million rows into our logging table):
INSERT dbo.ExceptionLog(Hostname, URI, ErrorMessage, EventTime)
SELECT TOP (10000)
  HostName = CONCAT(CASE LEN(o.name) % 2
               WHEN 1 THEN N'WebApp' ELSE N'FileLoad' END,
               N'Server', LEN(o.name)),
  URI = N'/' + REPLACE(o.name, N'_', N'/'),
  MsgText = LEFT(msg.[text], 255),
  EventTime = DATEADD(MINUTE, ABS(o.object_id) % 10000, '20220101')
FROM sys.all_objects AS o
CROSS JOIN
(
  SELECT TOP (100) [text]
  FROM sys.messages
  WHERE language_id = 1033
  ORDER BY NEWID()
) AS msg
ORDER BY NEWID();
GO 100
On my machine, this took a little over a minute, and we have our million rows.
Partial output of sp_spaceused:
rows     reserved   data
-------  ---------  ---------
1000000  351368 KB  335920 KB
And if we look at the breakdown of distinct values in our three important columns:
SELECT Hosts     = COUNT(DISTINCT HostName),
       HostSize  = SUM(DATALENGTH(HostName))/1024,     -- KB
       URIs      = COUNT(DISTINCT URI),
       URISize   = SUM(DATALENGTH(URI))/1024,          -- KB
       Errors    = COUNT(DISTINCT ErrorMessage),
       ErrorSize = SUM(DATALENGTH(ErrorMessage))/1024, -- KB
       AllThree  = SUM(DATALENGTH(HostName) + DATALENGTH(URI)
                     + DATALENGTH(ErrorMessage))/1024  -- KB
FROM dbo.ExceptionLog;
Results:
Hosts  HostSize  URIs  URISize  Errors  ErrorSize  AllThree
-----  --------  ----  -------  ------  ---------  --------
48     29244     2507  46334    7288    207840     283418
That shows us the not-so-shocking news that these wide columns account for roughly 85% of the table's 330 or so MB.
Now what?
We shouldn't need to keep roughly 20,000 copies of each host name, 400 copies of each URI, and close to 140 copies of each error message, especially on a very write-heavy table. As I mentioned in the opening, this is wasteful in terms of storage (and, by extension, memory and I/O), and makes for expensive searches for specific values (never mind substring and wildcard searches).
To address this, we can create separate dimension tables that contain just one copy of each value, along with a corresponding integer ("surrogate key") that can be stored in the logging table instead.
CREATE TABLE dbo.ExceptionHostnames
(
  HostnameID bigint IDENTITY(1,1),
  Hostname nvarchar(64),
  CONSTRAINT PK_ExceptionHostnames PRIMARY KEY (HostnameID)
);

CREATE TABLE dbo.ExceptionURIs
(
  URIID bigint IDENTITY(1,1),
  URI nvarchar(255),
  CONSTRAINT PK_ExceptionURIs PRIMARY KEY (URIID)
);

CREATE TABLE dbo.ExceptionErrorMessages
(
  ErrorMessageID bigint IDENTITY(1,1),
  ErrorMessage nvarchar(255),
  CONSTRAINT PK_ExceptionErrorMessages PRIMARY KEY (ErrorMessageID)
);
Why bigint? I have made it a personal policy to always use bigint for identity columns, having seen int exhaustion one time too many.
Then we can add nullable foreign key columns to the fact table:
ALTER TABLE dbo.ExceptionLog ADD
  HostnameID bigint NULL
    FOREIGN KEY REFERENCES dbo.ExceptionHostnames(HostnameID),
  URIID bigint NULL
    FOREIGN KEY REFERENCES dbo.ExceptionURIs(URIID),
  ErrorMessageID bigint NULL
    FOREIGN KEY REFERENCES dbo.ExceptionErrorMessages(ErrorMessageID);
Next, we add an AFTER INSERT trigger so that we don't miss any new hosts, URIs, or errors (and it's important we do this before back-filling the old rows). It needs to do two things: (1) generate new IDs for new values we haven't seen before, and (2) update the table with those new IDs.
CREATE TRIGGER dbo.ExceptionLog_DivertDimensions
ON dbo.ExceptionLog
AFTER INSERT
AS
BEGIN
  SET NOCOUNT ON;

  -- generate new IDs
  INSERT dbo.ExceptionHostnames(Hostname)
    SELECT Hostname FROM inserted AS i
    WHERE NOT EXISTS
    (
      SELECT 1 FROM dbo.ExceptionHostnames
      WHERE Hostname = i.Hostname
    )
    GROUP BY Hostname;

  INSERT dbo.ExceptionURIs(URI)
    SELECT URI FROM inserted AS i
    WHERE NOT EXISTS
    (
      SELECT 1 FROM dbo.ExceptionURIs
      WHERE URI = i.URI
    )
    GROUP BY URI;

  INSERT dbo.ExceptionErrorMessages(ErrorMessage)
    SELECT ErrorMessage FROM inserted AS i
    WHERE NOT EXISTS
    (
      SELECT 1 FROM dbo.ExceptionErrorMessages
      WHERE ErrorMessage = i.ErrorMessage
    )
    GROUP BY ErrorMessage;

  -- update IDs
  UPDATE l SET l.HostnameID = h.HostnameID
    FROM dbo.ExceptionLog AS l
    INNER JOIN dbo.ExceptionHostnames AS h
      ON l.Hostname = h.Hostname
    INNER JOIN inserted AS i
      ON i.LogID = l.LogID
    WHERE l.HostnameID IS NULL;

  UPDATE l SET l.URIID = u.URIID
    FROM dbo.ExceptionLog AS l
    INNER JOIN dbo.ExceptionURIs AS u
      ON l.URI = u.URI
    INNER JOIN inserted AS i
      ON i.LogID = l.LogID
    WHERE l.URIID IS NULL;

  UPDATE l SET l.ErrorMessageID = m.ErrorMessageID
    FROM dbo.ExceptionLog AS l
    INNER JOIN dbo.ExceptionErrorMessages AS m
      ON l.ErrorMessage = m.ErrorMessage
    INNER JOIN inserted AS i
      ON i.LogID = l.LogID
    WHERE l.ErrorMessageID IS NULL;
END
GO
Once the trigger is in place, we can populate the dimension tables with all the values we've seen so far. We can take our time with this, and for larger tables we could use log-friendly, non-disruptive batching, since the application has no idea there's been a change yet. For now, we'll continue populating both sets of columns, and clear out the wider values later.
INSERT dbo.ExceptionHostnames(Hostname)
  SELECT Hostname FROM dbo.ExceptionLog AS src
  WHERE NOT EXISTS
  (
    SELECT 1 FROM dbo.ExceptionHostnames
    WHERE Hostname = src.Hostname
  )
  GROUP BY Hostname;

INSERT dbo.ExceptionURIs(URI)
  SELECT URI FROM dbo.ExceptionLog AS src
  WHERE NOT EXISTS
  (
    SELECT 1 FROM dbo.ExceptionURIs
    WHERE URI = src.URI
  )
  GROUP BY URI;

INSERT dbo.ExceptionErrorMessages(ErrorMessage)
  SELECT ErrorMessage FROM dbo.ExceptionLog AS src
  WHERE NOT EXISTS
  (
    SELECT 1 FROM dbo.ExceptionErrorMessages
    WHERE ErrorMessage = src.ErrorMessage
  )
  GROUP BY ErrorMessage;
After that, we can start backfilling the ID columns in the fact table with the keys we generated above. This looks remarkably like the second half of the trigger:
UPDATE l SET l.HostnameID = h.HostnameID
  FROM dbo.ExceptionLog AS l
  INNER JOIN dbo.ExceptionHostnames AS h
    ON l.Hostname = h.Hostname
  WHERE l.HostnameID IS NULL;

UPDATE l SET l.URIID = u.URIID
  FROM dbo.ExceptionLog AS l
  INNER JOIN dbo.ExceptionURIs AS u
    ON l.URI = u.URI
  WHERE l.URIID IS NULL;

UPDATE l SET l.ErrorMessageID = m.ErrorMessageID
  FROM dbo.ExceptionLog AS l
  INNER JOIN dbo.ExceptionErrorMessages AS m
    ON l.ErrorMessage = m.ErrorMessage
  WHERE l.ErrorMessageID IS NULL;
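For very large tables, each of those updates can be broken into small batches so that no single transaction holds locks for long or bloats the log. A minimal sketch of the pattern for the Hostname column (the 5,000-row batch size is an arbitrary assumption to tune for your environment; the other two columns would follow the same shape):

```sql
DECLARE @rc int = 1;
WHILE @rc > 0
BEGIN
  -- touch only a small slice of the remaining un-backfilled rows
  UPDATE TOP (5000) l
    SET l.HostnameID = h.HostnameID
  FROM dbo.ExceptionLog AS l
  INNER JOIN dbo.ExceptionHostnames AS h
    ON l.Hostname = h.Hostname
  WHERE l.HostnameID IS NULL;

  SET @rc = @@ROWCOUNT;
  -- optionally take a log backup / add a delay here,
  -- depending on recovery model and system load
END
```

Because the trigger is already assigning IDs to new rows, this loop only has to catch up on the historical rows, and it can be stopped and restarted at any time.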
Note: If the error messages that come in from the app contain tokens, like you see in sys.messages, this is probably fine. If the error messages are materialized with instance-specific inline values before logging happens, it might not be such a good idea, because you'll probably never have the same error message twice. A different approach in that case could be to get the pre-tokenized error string and the individual token values from the application at logging time, and store the individual values separately (it's always much easier to assemble a string from parts than to extract parts of a string).
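One hypothetical shape for that parts-based approach (table and column names here are illustrative only): store the tokenized template once via dbo.ExceptionErrorMessages, and the per-row token values in a narrow child table keyed by log row and token position:

```sql
CREATE TABLE dbo.ExceptionErrorTokens
(
  LogID bigint NOT NULL
    FOREIGN KEY REFERENCES dbo.ExceptionLog(LogID),
  TokenOrdinal tinyint NOT NULL,    -- position of the token in the template
  TokenValue nvarchar(255) NOT NULL,
  CONSTRAINT PK_ExceptionErrorTokens PRIMARY KEY (LogID, TokenOrdinal)
);
```

Reassembling the full message is then a join and a string substitution at read time, while the expensive, high-cardinality values stay out of the dimension table.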
The types of queries your users run against the logging table will help drive your indexing strategy for these columns, so I'll leave that part out for now. Being integers, though, they certainly open up a lot more doors than wide Unicode strings.
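As one illustration only (not a recommendation; your own query patterns should decide), a workload that frequently searches for a specific error over a time range could be served by something like:

```sql
CREATE INDEX IX_ExceptionLog_ErrorMessageID_EventTime
  ON dbo.ExceptionLog(ErrorMessageID, EventTime);
```

An equality seek on an 8-byte key is a far better fit for an index than a seek (or worse, a wildcard scan) on a wide nvarchar column.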
But an after trigger is not ideal
This will work in the short term: the application can keep inserting rows into ExceptionLog, and the trigger will keep everything in sync. Since we're still populating the wider columns, any readers will still be able to see those values without needing to know anything about the new dimension tables we're populating.
But we don't want to keep this logic in the trigger for long, because it's hiding some of the data logic, and the nature of the movement is relatively expensive. Eventually we'll want to update the application (or the stored procedures it calls), remove the trigger, and empty the wider columns in the log table. How we do that depends on how quickly we can change the application.
If we can change the app right now
For any portions of the application code that read from ExceptionLog, we can create a view that handles the joins for them and produces the same output columns as before, so the change we need to make to the app is minimal:
CREATE VIEW dbo.vExceptionLog
AS
  SELECT l.LogID, h.Hostname, u.URI, m.ErrorMessage, l.EventTime
  FROM dbo.ExceptionLog AS l
  INNER JOIN dbo.ExceptionHostnames AS h
    ON l.HostnameID = h.HostnameID
  INNER JOIN dbo.ExceptionURIs AS u
    ON l.URIID = u.URIID
  INNER JOIN dbo.ExceptionErrorMessages AS m
    ON l.ErrorMessageID = m.ErrorMessageID;
Once that view is in place, we can change the applications so that, when they write exceptions here, they are calling a stored procedure that deals with our new dimension tables without the use of a trigger (and without needlessly populating the wider columns). This is a slightly more invasive change to the application code if it's currently performing singleton inline inserts but, with any luck, we're just changing the stored procedure the application is already calling. This procedure will generate the IDs first so there is only one insert (leaving out error handling and proper transaction handling for brevity):
CREATE PROCEDURE dbo.LogException
  @Hostname nvarchar(64),
  @URI nvarchar(255),
  @ErrorMessage nvarchar(255)
AS
BEGIN
  SET NOCOUNT ON;

  -- bigint, to match the dimension key columns
  DECLARE @h table(HostnameID bigint);
  DECLARE @u table(URIID bigint);
  DECLARE @m table(ErrorMessageID bigint);

  INSERT @h(HostnameID)
    SELECT HostnameID FROM dbo.ExceptionHostnames
    WHERE Hostname = @Hostname;

  IF @@ROWCOUNT = 0
  BEGIN
    INSERT dbo.ExceptionHostnames(Hostname)
      OUTPUT inserted.HostnameID INTO @h(HostnameID)
      VALUES(@Hostname);
  END

  INSERT @u(URIID)
    SELECT URIID FROM dbo.ExceptionURIs
    WHERE URI = @URI;

  IF @@ROWCOUNT = 0
  BEGIN
    INSERT dbo.ExceptionURIs(URI)
      OUTPUT inserted.URIID INTO @u(URIID)
      VALUES(@URI);
  END

  INSERT @m(ErrorMessageID)
    SELECT ErrorMessageID FROM dbo.ExceptionErrorMessages
    WHERE ErrorMessage = @ErrorMessage;

  IF @@ROWCOUNT = 0
  BEGIN
    INSERT dbo.ExceptionErrorMessages(ErrorMessage)
      OUTPUT inserted.ErrorMessageID INTO @m(ErrorMessageID)
      VALUES(@ErrorMessage);
  END

  INSERT dbo.ExceptionLog(HostnameID, URIID, ErrorMessageID)
    SELECT (SELECT HostnameID FROM @h),
           (SELECT URIID FROM @u),
           (SELECT ErrorMessageID FROM @m);
END
GO
(If any of these dimension columns are non-NULLable in the ExceptionLog table, you'll have to fix that first, or continue populating them until you can change the table.)
We can test this with the following:
EXEC dbo.LogException
  @Hostname = N'My Laptop',
  @URI = N'/fake/path/',
  @ErrorMessage = N'404 for fake path';

SELECT TOP (1) * FROM dbo.ExceptionLog  ORDER BY LogID DESC;
SELECT TOP (1) * FROM dbo.vExceptionLog ORDER BY LogID DESC;
Output (leaving out EventTime):
LogID    Hostname   URI          ErrorMessage       HostnameID  URIID  ErrorMessageID
-------  ---------  -----------  -----------------  ----------  -----  --------------
1000002  NULL       NULL         NULL               50          2509   7290

LogID    Hostname   URI          ErrorMessage
-------  ---------  -----------  -----------------
1000002  My Laptop  /fake/path/  404 for fake path
Once this is the common path for inserts, we can disable or drop the trigger (it no longer affects any logic, but it still performs additional, wasteful reads to determine that the dimension values have already been handled).
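Disabling first is the reversible option; dropping can come later, once we're confident nothing still relies on the old insert path:

```sql
DISABLE TRIGGER dbo.ExceptionLog_DivertDimensions ON dbo.ExceptionLog;

-- and eventually, once we're sure:
-- DROP TRIGGER dbo.ExceptionLog_DivertDimensions;
```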
And once the application's reads and writes no longer rely on the old columns, we can drop them. Of course, we'll first need to handle any dependent indexes or constraints. If you want to play it a little safer, you could set them all to NULL or empty string (in batches if you need to be less intrusive), but I'd be cautious here. I'd rather get an invalid column error from some obscure part of the app, and know about the problem immediately, than have sudden blank strings go unnoticed.
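When that time comes, dropping the columns is quick (after any dependent indexes or constraints are removed), but note that it is a metadata-only change; actually reclaiming the space the variable-length columns occupied typically requires a rebuild afterward:

```sql
ALTER TABLE dbo.ExceptionLog
  DROP COLUMN Hostname, URI, ErrorMessage;

-- reclaim the space previously occupied by the dropped columns
ALTER TABLE dbo.ExceptionLog REBUILD;
```

On a large table, the rebuild can be scheduled for a maintenance window (or run ONLINE on editions that support it).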
If we can't change the app(s) right now
Quite commonly we have development cycles for the database side that are not in sync with the application side, or there are multiple applications that can't all be changed together, or one or more apps aren't even under our control. In these cases, we won't necessarily have the freedom to use our new dimension tables exclusively, but in part 2 I'll talk about ways we can minimize the write impact and shield users from the underlying change as we wait (and wait, and wait) to change the application(s).
Next Steps
- Look around your systems and see if there are any places where you're writing the same strings thousands and thousands of times.
- See these tips and other resources:
- Storing E-mail addresses more efficiently in SQL Server
- Storing E-mail addresses more efficiently in SQL Server - Part 2
- Make your SQL Server database changes backward compatible when adding a new column
- Make your SQL Server database changes backward compatible when dropping a column
- Make your SQL Server database changes backward compatible when renaming an entity
- Make your SQL Server database changes backward compatible when changing a relationship
- Making SQL Server Function Changes Backwards Compatible
About the author
This author pledges the content of this article is based on professional experience and not AI generated.