By: Aaron Bertrand | Updated: 2020-06-17 | Comments | Related: > Extended Events
Problem
I was recently trying to troubleshoot a SQL Server replication-related deadlock that our monitoring tool didn't capture, and tried to find information about it in the system_health Extended Events session. With the default retention settings and the amount of noise contributed by security ring buffer events, I quickly discovered that the session only had data going back less than two hours. Meaning unless I started investigating an event immediately after it happened, all evidence had rolled out forever.
Solution
There are a few things you can do to make your system_health data last longer. An elaborate way would be to archive the .xel files themselves (or periodically take snapshots into your own storage), but I have two simpler solutions that might work for you.
Change the retention settings for system_health Extended Events session
By default, the system_health session retains 4 rollover files of a maximum of 5 MB each, allowing you to retain 20 MB of data. If you have a lot of events, especially if they're noisy, this squanders your ability to look at events in the past.
Luckily, unlike with the default trace, you can change the retention settings. For example, if I wanted to keep 40 rollover files of 10 MB each, for 400 MB of retention, I could run the following:
USE [master];
GO ALTER EVENT SESSION [system_health] ON SERVER
DROP TARGET package0.event_file;
ALTER EVENT SESSION [system_health] ON SERVER
ADD TARGET package0.event_file
(
SET filename =N'system_health.xel',
max_file_size = (10), -- MB
max_rollover_files = (40)
);
In SQL Server 2016, 2017, and 2019, the defaults – if you haven't altered them already – increase to 10 files of 100 MB each, allowing you to retain 1 GB of system_health data (see KB #4541132 for more details). You can still increase or decrease these settings after applying the relevant cumulative update, but what is the right number will be a balancing act between manageability and XML query performance.
Stop collecting unactionable noise for system_health Extended Events session
Increasing the retention is nice, but what if you're collecting 1 GB (or more) of garbage? On every system I looked at, 99% of the events we were collecting were security ring buffer errors we can't do anything about (they involve how apps connect, authenticate, and validate authentication). I ran the following query (I always dump XEvent session data to a #temp table before any further processing):
;WITH cte AS
(
SELECT ed = CONVERT(xml, event_data)
FROM sys.fn_xe_file_target_read_file(N'system_health*.xel', NULL, NULL, NULL)
)
SELECT event_data = x.ed.query('.')
INTO #t FROM cte CROSS APPLY cte.ed.nodes(N'.') AS x(ed); SELECT
EventName,
EventCount = COUNT(*),
EarliestEvent_MinutesAgo = DATEDIFF(MINUTE, MIN(EventTime), SYSUTCDATETIME())
FROM
(
SELECT
event_data.value(N'(event/@timestamp)[1]', N'datetime'),
event_data.value(N'(event/@name)[1]', N'nvarchar(255)')
FROM #t
) AS t(EventTime, EventName)
GROUP BY EventName
ORDER BY EventCount DESC;
Results:
Opening one of the XML values yielded this data for the event:
There is not a whole lot I can do, after the fact, about a harmless authentication validation for a session_id that is almost certainly no longer connected anyway. I have it on good authority that Microsoft is at least considering either augmenting this information with more details (so you can filter and maybe so it can be actionable), or removing it from system_health altogether.
Until then, I plan on dropping the event from system_health, in addition to increasing the retention, across our entire environment:
ALTER EVENT SESSION [system_health] ON SERVER
DROP EVENT sqlserver.security_error_ring_buffer_recorded;
The risk here is very small, since I didn't find a single event with useful information, and the rollback action is simple (just add the event back):
ALTER EVENT SESSION [system_health] ON SERVER
ADD EVENT sqlserver.security_error_ring_buffer_recorded
(SET collect_call_stack=(0)); -- maybe you don't need the call stack!
Alternatively, you could add this event to your own XEvent session (I talk about moving the useful events from the default trace to your own session here, here, and here).
Summary
The system_health XEvent session can contain some very useful troubleshooting information, but that data can easily get pushed out by short retention settings or too much noise. You can make some minor changes to this session to make the data you keep more useful and last longer.
Next Steps
See these related tips and other resources:
- Using Extended Events to troubleshoot SQL Server issues
- Query Data from Extended Events in SQL Server
- Using Powershell to Monitor the SQL Server system_health Extended Event Session
- Steps to Recover the system_health Extended Events Session in SQL Server
- Monitor Deadlocks in SQL Server with system_health Extended Events
- Built in SQL Server Extended Events Sessions
- Monitor Deadlocks in SQL Server with system_health Extended Events
- Removing the Default Trace : Part 1 | Part 2 | Part 3
- All SQL Server Extended Events Tips
About the author
This author pledges the content of this article is based on professional experience and not AI generated.
View all my tips
Article Last Updated: 2020-06-17