By: Aaron Bertrand | Updated: 2020-02-17
Problem
I was recently troubleshooting a SQL Server 2016 instance with long checkpoints, which surfaced as FlushCache entries in the error log. Essentially, this message tells you that checkpoint operations were taking longer than the configured recovery interval, which can happen for a variety of reasons. There is more background in this archived post.
While SQL Server 2016 should have been using indirect checkpoints, this instance was actually upgraded from an earlier version, so all of the databases were still set to use automatic checkpoints (evidenced by sys.databases having target_recovery_time_in_seconds = 0). For a detailed background on why indirect checkpoints are better, see this Microsoft post or this prior tip. And while you may be quick to blame I/O, the older method can be much slower – even on the fastest disk systems – if you have a lot of memory, have many busy databases processing a lot of checkpoints, or both. For background, this system has XtremIO, but with 3TB of memory and hundreds of very active databases, this is not a great combination for checkpoint performance.
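To see which databases on an instance are still in that state, a quick check against sys.databases is enough (this particular query is mine, not from the original post):

-- Databases reporting 0 here are still using the legacy automatic checkpoint behavior.
SELECT name, target_recovery_time_in_seconds
FROM sys.databases
WHERE target_recovery_time_in_seconds = 0;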
I wanted to fix this scenario, but before just turning on indirect checkpoints on all 300+ databases on the instance, I wanted to check some metrics. For example, how often were checkpoints happening, and how long were they taking? This way, I could compare any single database before and after the change, and I could monitor the instance using all of my other tools while I flip the setting on for each database incrementally.
Solution
One way you can track checkpoints is to enable trace flag 3502 or 3504, depending on your version. But, aside from being undocumented (at least per this official list), these really blow up the error log, which worked against my purpose of reducing noise there. This would be particularly noisy if most checkpoints were not exceeding the target recovery time, because they normally complete silently and without issue every few minutes or even seconds.
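For completeness, enabling them globally would look something like the following (again, they are undocumented, and this is the approach I decided against):

-- Undocumented trace flags that write checkpoint activity to the error log.
DBCC TRACEON (3502, -1); -- -1 = enable globally
-- or, depending on your version:
DBCC TRACEON (3504, -1);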
Also, there is something to be said for output being readable and parseable. A FlushCache message appears in the error log like this:
FlushCache: cleaned up 394031 bufs with 282252 writes in 65544 ms (avoided 21 new dirty bufs) for db 19:0
average writes per second: 4306.30 writes/sec
average throughput: 46.96 MB/sec, I/O saturation: 53644, context switches 101117
last target outstanding: 639360, avgWriteLatency 1
Or it may appear like this:
DirtyPageMgr::ForceCatchupOrFlushCache: cleaned up 166 dirty pages in 4 ms for db 19
In either case, that’s an annoying blob of text to parse, after having to clumsily extract the info from the error log in the first place. All just to determine which database was affected, and how badly.
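If you do want to pull those entries out of the error log anyway, one (still clumsy) option is the undocumented xp_readerrorlog procedure with a filter on the message text, something like:

-- 0 = current log file, 1 = SQL Server error log, then the search string
EXEC sys.xp_readerrorlog 0, 1, N'FlushCache';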
Extended Events to the Rescue
Extended Events have two checkpoint-related events you can monitor, checkpoint_begin and checkpoint_end. Sadly, the latter has no duration action, and while there is an action called sqlos.task_time, this apparently doesn’t correlate with clock time in any meaningful way. This means that in order to determine how long a single checkpoint operation ran, you have to track both the begin and end events, and correlate them after the fact. For this I can use causality tracking (even though it would be highly unlikely to get events out of sequence).
I created the following Extended Events session to track checkpoints as they happen, for just a few databases at a time (starting with the ones that appeared most frequently in the error log):
CREATE EVENT SESSION CheckpointTracking ON SERVER
ADD EVENT sqlserver.checkpoint_begin
(
WHERE
( sqlserver.database_id = 19 -- dbA
OR sqlserver.database_id = 128 -- dbB
OR sqlserver.database_id = 274 -- dbC
)
),
ADD EVENT sqlserver.checkpoint_end
(
WHERE
( sqlserver.database_id = 19 -- dbA
OR sqlserver.database_id = 128 -- dbB
OR sqlserver.database_id = 274 -- dbC
)
)
ADD TARGET package0.event_file
(
SET filename = N'L:\XE_Out\CheckPointTracking.xel',
max_file_size = 128 /* MB */, max_rollover_files = 50
)
WITH
(
TRACK_CAUSALITY = ON, STARTUP_STATE = ON
);
GO
ALTER EVENT SESSION CheckpointTracking ON SERVER STATE = START;
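As a quick sanity check (not part of the original tip), you can confirm the session is actually running before walking away:

-- A row here means the CheckpointTracking session is started.
SELECT name, create_time
FROM sys.dm_xe_sessions
WHERE name = N'CheckpointTracking';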
Then, I went for a coffee, picked up the dry cleaning, and did a bunch of other things while I waited for the system to capture a bunch of checkpoint metrics. Once I returned, I ran this query just to verify events were getting captured:
SELECT
event = SUBSTRING([object_name],12,5),
event_data = CONVERT(xml, event_data),
rows = COUNT(*) OVER()
FROM sys.fn_xe_file_target_read_file('L:\XE_Out\CheckPoint*.xel', null, null, null);
Sure enough, I gave the process plenty of time. Too much, in fact. The query returned 17,496 rows.
XML from a sample begin and end pair looked like this:
<!-- BEGIN Event: -->
<event name="checkpoint_begin" package="sqlserver" timestamp="2020-01-20T00:19:04.167Z">
<data name="database_id">
<value>19</value>
</data>
<action name="attach_activity_id" package="package0">
<value>3A340E12-4103-4E37-A564-4CBC0D6E3F44-57</value>
</action>
</event>
<!-- END Event: -->
<event name="checkpoint_end" package="sqlserver" timestamp="2020-01-20T00:19:33.644Z">
<data name="database_id">
<value>19</value>
</data>
<action name="attach_activity_id" package="package0">
<value>3A340E12-4103-4E37-A564-4CBC0D6E3F44-58</value>
</action>
</event>
For those of you afraid to get on the Extended Events train, try doing this with Profiler! :-)
The query I used to work out the deltas is a little long-winded; it would have been a lot simpler if checkpoint_end included duration.
In any case, the approach is basically:
- Dump the XML event data to a #temp table
- Use a CTE to query strongly-typed values from the XML exactly once
- Use a second CTE to "pair up" begin and end events
- Use a third CTE to calculate the duration
- Aggregate from there (event count, average and max duration)
Here is that code:
DROP TABLE IF EXISTS #xml;
GO

SELECT ev = SUBSTRING([object_name],12,5), ed = CONVERT(xml, event_data)
INTO #xml
FROM sys.fn_xe_file_target_read_file('L:\XE_Out\CheckPoint*.xel', NULL, NULL, NULL);

;WITH Events(ev,ts,db,id) AS
(
SELECT ev,
ed.value(N'(event/@timestamp)[1]', N'datetime'),
ed.value(N'(event/data[@name="database_id"]/value)[1]', N'int'),
ed.value(N'(event/action[@name="attach_activity_id"]/value)[1]', N'uniqueidentifier')
FROM #xml
),
EventPairs AS
(
SELECT db, ev,
checkpoint_ended = ts,
checkpoint_began = LAG(ts, 1) OVER (PARTITION BY id, db ORDER BY ts)
FROM Events
),
Timings AS
(
SELECT
dbname = DB_NAME(db),
checkpoint_began,
checkpoint_ended,
duration_milliseconds = DATEDIFF(MILLISECOND, checkpoint_began, checkpoint_ended)
FROM EventPairs WHERE ev = 'end' AND checkpoint_began IS NOT NULL
)
SELECT
dbname,
checkpoint_count = COUNT(*),
avg_seconds = CONVERT(decimal(18,2),AVG(1.0*duration_milliseconds)/1000),
max_seconds = CONVERT(decimal(18,2),MAX(1.0*duration_milliseconds)/1000),
total_seconds_spent = CONVERT(decimal(18,2),SUM(1.0*duration_milliseconds)/1000)
FROM Timings
GROUP BY dbname
ORDER BY total_seconds_spent DESC;
The CTEs don’t have to be so delineated, but keeping them separate makes it very easy to step through and see the output of each one in turn. The final output listed each database (names obfuscated, of course) along with its checkpoint count, average and maximum duration in seconds, and total seconds spent.
From this output I could determine which database I wanted to attack first, and in this case it was dbA, since it seemed to checkpoint most frequently and also spent the most overall time on checkpoint operations. I set it up for indirect checkpoints using ALTER DATABASE, and recorded the moment the change finished into a second #temp table (using UTC time, because that’s what Extended Events records, and because that’s the time we should all be using to record server events):
ALTER DATABASE dbA SET TARGET_RECOVERY_TIME = 60 SECONDS;

SELECT db = DB_ID(N'dbA'), ts = sysutcdatetime() INTO #db;
Then I killed some time by doing real work for a while (including checking various monitors for any ill effects, which I didn’t find), and came back to the session later. I wanted to see what impact the indirect checkpoint setting had on just this database, before and after the change. I can do this by adjusting the Timings CTE to join to the #db table to distinguish the events that happened before the change from the events that happened after:
Timings AS
(
SELECT
dbname = DB_NAME(ep.db)
+ CASE WHEN ep.checkpoint_began < db.ts THEN ' (before)' ELSE ' (after)' END,
ep.checkpoint_began,
ep.checkpoint_ended,
duration_milliseconds = DATEDIFF(MILLISECOND, ep.checkpoint_began, ep.checkpoint_ended)
FROM EventPairs AS ep
INNER JOIN #db AS db ON db.db = ep.db
WHERE ep.ev = 'end' AND ep.checkpoint_began IS NOT NULL
)
The sample output showed staggering results: after the change, dbA ran less than half as many checkpoints, and completed them in literally a fraction of the time.
As I added more databases to the indirect checkpoint method, I could simply keep adding the data to #db after each ALTER DATABASE, and my join would continue to work.
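For example, when the next database got the same treatment (dbB here stands in for whichever database is next), the pattern was simply:

ALTER DATABASE dbB SET TARGET_RECOVERY_TIME = 60 SECONDS;

-- Record when the change took effect, in UTC, alongside the earlier row for dbA.
INSERT #db (db, ts)
SELECT DB_ID(N'dbB'), sysutcdatetime();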
When I wanted to move on to more databases outside of these first three, I could simply drop the XE session and re-create it with a different set of database_id values. As the week went on, I ended up with several more convincing results, all indicating that I was on the right path; a subset (again, names obfuscated) told a similar story.
Over roughly the same time frame, some databases end up running more checkpoints after the change, which shouldn’t be surprising, because factors like transaction count and dirty page count can have very different trajectories based on the workload. But even with more checkpoints happening, the new algorithm is so much more efficient that the overall time spent running those checkpoints is still a fraction of the time spent before the change.
And once again I can’t stress enough how important it is to continue monitoring other aspects of instance performance while making these changes, and perform them in a test environment (if you can generate suitable workload patterns there) before thinking about winging this in production. If you have high-memory systems using Availability Groups, take note of the section "Always On Availability Groups" in some older documentation.
Conclusion
You may have instances of SQL Server where not all databases are using indirect checkpoints, and performance may be suffering as a result (but not necessarily in any obvious ways). You can get ahead of it, before FlushCache errors start filling up your error log, by checking for any rows in sys.databases where target_recovery_time_in_seconds = 0, monitoring them with Extended Events as I’ve described here, and updating the worst offenders to use indirect checkpoints. Note that I wasn’t interested at this point in time in tracking why checkpoints were slow, which could be an entire project on its own, involving correlating session-level wait stats with the Extended Events data.
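As a starting point, a script like this (not from the original tip; the 60-second target and the system database filter are my assumptions) can generate, but not execute, the ALTER DATABASE statements for any remaining candidates:

-- Generate ALTER statements for databases still using automatic checkpoints;
-- review the list before running any of them.
SELECT N'ALTER DATABASE ' + QUOTENAME(name)
     + N' SET TARGET_RECOVERY_TIME = 60 SECONDS;'
FROM sys.databases
WHERE target_recovery_time_in_seconds = 0
  AND database_id > 4; -- assumption: skip the system databases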
In a future tip, I’ll talk about how to at least partially automate this over a larger environment, especially if it’s not practical to set up an Extended Events session on every single instance. You might miss some long-running checkpoints that don’t hit the threshold, but you can certainly still catch any that have recently added FlushCache entries to the error log.
Next Steps
Read on for related tips and other resources:
- Database Checkpoints
- SQL 2016 - It Just Runs Faster: Indirect Checkpoint Default
- Indirect Checkpoints in SQL Server 2012
- Causality Tracking in Extended Events
- Using Track Causality to Understand Query Execution
- All Extended Events tips
- How It Works: When is the FlushCache message added to SQL Server Error Log?
- SQL Server : large RAM and DB Checkpointing