By: Aaron Bertrand | Updated: 2020-02-17
Problem
I was recently troubleshooting a SQL Server 2016 instance with long checkpoints, which surfaced as FlushCache entries in the error log. Essentially, this message tells you that checkpoint operations were taking longer than the configured recovery interval, which can happen for a variety of reasons. There is more background in this archived post.
While SQL Server 2016 should have been using indirect checkpoints, this instance was actually upgraded from an earlier version, so all of the databases were still set to use automatic checkpoints (evidenced by sys.databases having target_recovery_time_in_seconds = 0). For a detailed background on why indirect checkpoints are better, see this Microsoft post or this prior tip. And while you may be quick to blame I/O, the older method can be much slower – even on the fastest disk systems – if you have a lot of memory, have many busy databases processing a lot of checkpoints, or both. For background, this system has XtremIO, but with 3TB of memory and hundreds of very active databases, this is not a great combination for checkpoint performance.
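To see which databases on an instance are still in that state, a quick check against sys.databases is enough (this particular query is mine, not from the original post):

-- Databases reporting 0 here are still using the legacy automatic checkpoint behavior.
SELECT name, target_recovery_time_in_seconds
FROM sys.databases
WHERE target_recovery_time_in_seconds = 0;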
I wanted to fix this scenario, but before just turning on indirect checkpoints on all 300+ databases on the instance, I wanted to check some metrics. For example, how often were checkpoints happening, and how long were they taking? This way, I could compare any single database before and after the change, and I could monitor the instance using all of my other tools while I flip the setting on for each database incrementally.
Solution
One way you can track checkpoints is to enable trace flag 3502 or 3504, depending on your version. But, aside from being undocumented (at least per this official list), these really blow up the error log, which worked against my purpose of reducing noise there. This would be particularly noisy if most checkpoints were not exceeding the target recovery time, because they normally complete silently and without issue every few minutes or even seconds.
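For completeness, enabling them globally would look something like the following (again, they are undocumented, and this is the approach I decided against):

-- Undocumented trace flags that write checkpoint activity to the error log.
DBCC TRACEON (3502, -1); -- -1 = enable globally
-- or, depending on your version:
DBCC TRACEON (3504, -1);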
Also, there is something to be said for output being readable and parseable. A FlushCache message appears in the error log like this:
FlushCache: cleaned up 394031 bufs with 282252 writes in 65544 ms (avoided 21 new dirty bufs) for db 19:0
average writes per second: 4306.30 writes/sec
average throughput: 46.96 MB/sec, I/O saturation: 53644, context switches 101117
last target outstanding: 639360, avgWriteLatency 1
Or it may appear like this:
DirtyPageMgr::ForceCatchupOrFlushCache: cleaned up 166 dirty pages in 4 ms for db 19
In either case, that’s an annoying blob of text to parse, after having to clumsily extract the info from the error log in the first place. All just to determine which database was affected, and how badly.
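If you do want to pull those entries out of the error log anyway, one (still clumsy) option is the undocumented xp_readerrorlog procedure with a filter on the message text, something like:

-- 0 = current log file, 1 = SQL Server error log, then the search string
EXEC sys.xp_readerrorlog 0, 1, N'FlushCache';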
Extended Events to the Rescue
Extended Events have two checkpoint-related events you can monitor, checkpoint_begin and checkpoint_end. Sadly, the latter has no duration action, and while there is an action called sqlos.task_time, this apparently doesn’t correlate with clock time in any meaningful way. This means that in order to determine how long a single checkpoint operation ran, you have to track both the begin and end events, and correlate them after the fact. For this I can use causality tracking (even though it would be highly unlikely to get events out of sequence).
I created the following Extended Events session to track checkpoints as they happen, for just a few databases at a time (starting with the ones that appeared most frequently in the error log):
CREATE EVENT SESSION CheckpointTracking ON SERVER
ADD EVENT sqlserver.checkpoint_begin
(
WHERE
( sqlserver.database_id = 19 -- dbA
OR sqlserver.database_id = 128 -- dbB
OR sqlserver.database_id = 274 -- dbC
)
),
ADD EVENT sqlserver.checkpoint_end
(
WHERE
( sqlserver.database_id = 19 -- dbA
OR sqlserver.database_id = 128 -- dbB
OR sqlserver.database_id = 274 -- dbC
)
)
ADD TARGET package0.event_file
(
SET filename = N'L:\XE_Out\CheckPointTracking.xel',
max_file_size = 128 /* MB */, max_rollover_files = 50
)
WITH
(
TRACK_CAUSALITY = ON, STARTUP_STATE = ON
);
GO
ALTER EVENT SESSION CheckpointTracking ON SERVER STATE = START;
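As a quick sanity check (not part of the original tip), you can confirm the session is actually running before walking away:

-- A row here means the CheckpointTracking session is started.
SELECT name, create_time
FROM sys.dm_xe_sessions
WHERE name = N'CheckpointTracking';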
Then, I went for a coffee, picked up the dry cleaning, and did a bunch of other things while I waited for the system to capture a bunch of checkpoint metrics. Once I returned, I ran this query just to verify events were getting captured:
SELECT
event = SUBSTRING([object_name],12,5),
event_data = CONVERT(xml, event_data),
rows = COUNT(*) OVER()
FROM sys.fn_xe_file_target_read_file('L:\XE_Out\CheckPoint*.xel', null, null, null);
Sure enough, I gave the process plenty of time. Too much, in fact. The query returned 17,496 rows.
XML from a sample begin and end pair looked like this:
<!-- BEGIN Event: -->
<event name="checkpoint_begin" package="sqlserver" timestamp="2020-01-20T00:19:04.167Z">
<data name="database_id">
<value>19</value>
</data>
<action name="attach_activity_id" package="package0">
<value>3A340E12-4103-4E37-A564-4CBC0D6E3F44-57</value>
</action>
</event>
<!-- END Event: -->
<event name="checkpoint_end" package="sqlserver" timestamp="2020-01-20T00:19:33.644Z">
<data name="database_id">
<value>19</value>
</data>
<action name="attach_activity_id" package="package0">
<value>3A340E12-4103-4E37-A564-4CBC0D6E3F44-58</value>
</action>
</event>
For those of you afraid to get on the Extended Events train, try doing this with Profiler! :-)
The query I used to work out the deltas is a little long-winded; it would have been a lot simpler if checkpoint_end included duration.
In any case, the approach is basically:
- Dump the XML event data to a #temp table
- Use a CTE to query strongly-typed values from the XML exactly once
- Use a second CTE to "pair up" begin and end events
- Use a third CTE to calculate the duration
- Aggregate from there (event count, average and max duration)
Here is that code:
DROP TABLE IF EXISTS #xml;
GO

SELECT ev = SUBSTRING([object_name],12,5), ed = CONVERT(xml, event_data)
INTO #xml
FROM sys.fn_xe_file_target_read_file('L:\XE_Out\CheckPoint*.xel', NULL, NULL, NULL);

;WITH Events(ev,ts,db,id) AS
(
SELECT ev,
ed.value(N'(event/@timestamp)[1]', N'datetime'),
ed.value(N'(event/data[@name="database_id"]/value)[1]', N'int'),
ed.value(N'(event/action[@name="attach_activity_id"]/value)[1]', N'uniqueidentifier')
FROM #xml
),
EventPairs AS
(
SELECT db, ev,
checkpoint_ended = ts,
checkpoint_began = LAG(ts, 1) OVER (PARTITION BY id, db ORDER BY ts)
FROM Events
),
Timings AS
(
SELECT
dbname = DB_NAME(db),
checkpoint_began,
checkpoint_ended,
duration_milliseconds = DATEDIFF(MILLISECOND, checkpoint_began, checkpoint_ended)
FROM EventPairs WHERE ev = 'end' AND checkpoint_began IS NOT NULL
)
SELECT
dbname,
checkpoint_count = COUNT(*),
avg_seconds = CONVERT(decimal(18,2),AVG(1.0*duration_milliseconds)/1000),
max_seconds = CONVERT(decimal(18,2),MAX(1.0*duration_milliseconds)/1000),
total_seconds_spent = CONVERT(decimal(18,2),SUM(1.0*duration_milliseconds)/1000)
FROM Timings
GROUP BY dbname
ORDER BY total_seconds_spent DESC;
The CTEs don’t have to be so delineated, but keeping them separate makes it very easy to step through and see the output of each one in turn. The final output listed each database (names obfuscated, of course) along with its checkpoint count, average and maximum duration in seconds, and total seconds spent.
From this output I could determine which database I wanted to attack first, and in this case it was dbA, since it seemed to checkpoint most frequently and also spent the most overall time on checkpoint operations. I set it up for indirect checkpoints using ALTER DATABASE, and recorded the moment the change finished into a second #temp table (using UTC time, because that’s what Extended Events records, and because that’s the time we should all be using to record server events):
ALTER DATABASE dbA SET TARGET_RECOVERY_TIME = 60 SECONDS;

SELECT db = DB_ID(N'dbA'), ts = sysutcdatetime() INTO #db;
Then I killed some time by doing real work for a while (including checking various monitors for any ill effects, which I didn’t find), and came back to the session later. I wanted to see what impact the indirect checkpoint setting had on just this database, before and after the change. I can do this by adjusting the Timings CTE to join to the #db table to distinguish the events that happened before the change from the events that happened after:
Timings AS
(
SELECT
dbname = DB_NAME(ep.db)
+ CASE WHEN ep.checkpoint_began < db.ts THEN ' (before)' ELSE ' (after)' END,
ep.checkpoint_began,
ep.checkpoint_ended,
duration_milliseconds = DATEDIFF(MILLISECOND, ep.checkpoint_began, ep.checkpoint_ended)
FROM EventPairs AS ep
INNER JOIN #db AS db ON db.db = ep.db
WHERE ep.ev = 'end' AND ep.checkpoint_began IS NOT NULL
)
The sample output showed staggering results: after the change, dbA ran less than half as many checkpoints, and completed them in literally a fraction of the time.
As I added more databases to the indirect checkpoint method, I could simply keep adding the data to #db after each ALTER DATABASE, and my join would continue to work.
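For example, when the next database got the same treatment (dbB here stands in for whichever database is next), the pattern was simply:

ALTER DATABASE dbB SET TARGET_RECOVERY_TIME = 60 SECONDS;

-- Record when the change took effect, in UTC, alongside the earlier row for dbA.
INSERT #db (db, ts)
SELECT DB_ID(N'dbB'), sysutcdatetime();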
When I wanted to move on to more databases outside of these first three, I could simply drop the XE session and re-create it with a different set of database_id values. As the week went on, I ended up with several more convincing results, all indicating that I was on the right path; a subset (again, names obfuscated) told a similar story.
Over roughly the same time frame, some databases end up running more checkpoints after the change, which shouldn’t be surprising, because factors like transaction count and dirty page count can have very different trajectories based on the workload. But even with more checkpoints happening, the new algorithm is so much more efficient that the overall time spent running those checkpoints is still a fraction of the time spent before the change.
And once again I can’t stress enough how important it is to continue monitoring other aspects of instance performance while making these changes, and perform them in a test environment (if you can generate suitable workload patterns there) before thinking about winging this in production. If you have high-memory systems using Availability Groups, take note of the section "Always On Availability Groups" in some older documentation.
Conclusion
You may have instances of SQL Server where not all databases are using indirect checkpoints, and performance may be suffering as a result (but not necessarily in any obvious ways). You can get ahead of it, before FlushCache errors start filling up your error log, by checking for any rows in sys.databases where target_recovery_time_in_seconds = 0, monitoring them with Extended Events as I’ve described here, and updating the worst offenders to use indirect checkpoints. Note that I wasn’t interested at this point in time in tracking why checkpoints were slow, which could be an entire project on its own, involving correlating session-level wait stats with the Extended Events data.
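As a starting point, a script like this (not from the original tip; the 60-second target and the system database filter are my assumptions) can generate, but not execute, the ALTER DATABASE statements for any remaining candidates:

-- Generate ALTER statements for databases still using automatic checkpoints;
-- review the list before running any of them.
SELECT N'ALTER DATABASE ' + QUOTENAME(name)
     + N' SET TARGET_RECOVERY_TIME = 60 SECONDS;'
FROM sys.databases
WHERE target_recovery_time_in_seconds = 0
  AND database_id > 4; -- assumption: skip the system databases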
In a future tip, I’ll talk about how to at least partially automate this over a larger environment, especially if it’s not practical to set up an Extended Events session on every single instance. You might miss some long-running checkpoints that don’t hit the threshold, but you can certainly still catch any that have recently added FlushCache entries to the error log.
Next Steps
Read on for related tips and other resources:
- Database Checkpoints
- SQL 2016 - It Just Runs Faster: Indirect Checkpoint Default
- Indirect Checkpoints in SQL Server 2012
- Causality Tracking in Extended Events
- Using Track Causality to Understand Query Execution
- All Extended Events tips
- How It Works: When is the FlushCache message added to SQL Server Error Log?
- SQL Server : large RAM and DB Checkpointing