By: Aaron Bertrand | Updated: 2019-03-26 | Comments (1) | Related: > SQL Server Agent
Problem
SQL Server exposes two settings for managing the SQL Server Agent log. One setting dictates how many log rows to keep, and this is broken out into options for any single job and for the agent subsystem as a whole. The second is a simple interval that dictates how old history data has to be before it is automatically deleted (measured in days, weeks, or months):
This is not very flexible and, the reality is, you're probably reading this because you have jobs with different history retention requirements. Not only are some jobs more important than others, where you want to keep more history than others, but you also might want some kind of logic that dictates where either age or count is more important.
Consider a very simple case of full and log backups. If you perform a full backup daily or weekly, do you really need the history for the last 100 events? If you back up the log every 5 minutes, will just having the last 100 entries be enough? What happens if you have 10 different backup jobs, or some jobs have a lot of steps? They're going to start trampling all over each other's histories.
Backups are not the best example because this information is also stored in msdb. However, it's an obvious illustration to demonstrate: how we make agent history more customizable.
Solution
Before we get to a solution, let's dive a little deeper into the problem.
In addition to needing flexibility between jobs and across the whole system, you also might need flexibility between the steps of a job. Consider the case where you have a job with five steps, but only one of them is really important (maybe the others are just initialization steps or send e-mail on success or failure). In the current design, there is no way to say keep the history for step 3, but not the others. If you think it's just noise, it's more than that: I set the max history rows to 10, and the max rows per job to 10. Then I created two jobs, one with 1 step, and the other with 5. I ran the first job twice, then the second job twice, and inspected the history.
Note that the job itself (step 0) takes up a history slot, so for the most recent execution of "Job with 5 steps", there are actually 6 history entries. Since we can only keep 10 history rows for any single job, the earlier execution gets truncated. SQL Server holds onto step 0 but deletes the history for steps 1 and 2.
There is an also issue if you only want to keep an 8- or 12-hour rolling window of job history. Maybe you have jobs that run more frequently during business hours, or get called on demand, so they are particularly volatile and/or volume-driven. Limiting to a specific number of rows or to the lowest granularity of a day boundary fail to meet the requirements exactly.
Finally, there may be scenarios where you want to keep all failures around regardless of age or how many times the job or step has succeeded since, and you may want to keep any job or step that has exceeded some specified duration (but not all the other steps that succeeded or ran quickly).
From all of that, we get a simple list of requirements. We want a job history retention policy that allows us to:
- define retention history per job and per step
- keep as many or as few rows as we want
- keep rows for as short or as long as we want
- when rules for both are present, dictate which one "wins"
- keep history around for any step that has failed, even outside of our rules, or that has exceeded some specified duration
Let's start with a table in msdb that can store details about our preferences (basically the structure of the rules around which rows to keep/delete). Most columns are named intuitively:
USE msdb;
GO CREATE TABLE dbo.RetentionRules
(
job_id uniqueidentifier NOT NULL,
step_id int NOT NULL,
NumberOfRowsToKeep int,
NumberOfHoursToKeep int,
PreferRowsOrHours char(1),
KeepStepsThatFailed bit,
DurationOverrideSeconds int, -- keep any step that ran longer than this
CONSTRAINT PK_JobStep PRIMARY KEY (job_id, step_id),
CONSTRAINT CK_RowsOrHours CHECK (PreferRowsOrHours IN ('R','H'))
/*,
CONSTRAINT FK_RetentionRules_JobSteps
FOREIGN KEY(job_id, step_id) REFERENCES dbo.sysjobsteps(job_id, step_id)
*/
);
You could put a foreign key here, which I've commented out, but you might not want to have job changes blocked by rules surrounding the retention table. You could instead put a DML trigger on sysjobsteps to remind you to check on your retention schedule to ensure a new step isn't slipping through the cracks. Not much of a concern if a step is deleted and you still have retention rules for it, but it is a big deal if you add a new step and suddenly its history grows unbounded. Don't worry; we'll create a safety net for cleaning up jobs after a specified amount of time, overriding our rules at a certain point.
Next, if we want to eliminate any caps on jobs or steps (exceeding the max 1,000 row threshold), we need to remove the caps altogether. If you know you only ever want to limit steps to less than this 1,000 row limit, you can leave it in place, but if you want to keep more than 1,000 rows of history for any job, you can remove this cap with the following call:
EXEC msdb.dbo.sp_set_sqlagent_properties @jobhistory_max_rows = -1, @jobhistory_max_rows_per_job = -1;
The msdb database suffers from some unfortunate early database design decisions that have remained in place through current versions. The most notable one is the use of integers to store date and time, separately, in the sysjobhistory table. So, I've created a function and view to remove those calculations from any of my queries:
CREATE FUNCTION dbo.TVF_Agent_Datetime
(
@date int,
@time int,
@duration int
)
RETURNS TABLE WITH SCHEMABINDING
AS
RETURN
(
SELECT
StartDate,
EndDate = DATEADD(SECOND, DurationInSeconds, StartDate),
DurationInSeconds
FROM
(
SELECT
StartDate = CONVERT(datetime,
CONVERT(varchar(4), @date / 10000)
+ CONVERT(varchar(2), RIGHT('0' + RTRIM(@date % 10000 / 100),2))
+ CONVERT(varchar(2), RIGHT('0' + RTRIM(@date % 100),2))
+ ' ' + CONVERT(varchar(2), RIGHT('0' + RTRIM(@time / 10000),2))
+ ':' + CONVERT(varchar(2), RIGHT('0' + RTRIM(@time % 10000 / 100),2))
+ ':' + CONVERT(varchar(2), RIGHT('0' + RTRIM(@time % 100),2))),
DurationInSeconds =
@duration / 1000000 * 24 * 60 * 60
+ @duration / 10000 % 100 * 60 * 60
+ @duration / 100 % 100 * 60
+ @duration % 100
) AS x
);
GO CREATE VIEW dbo.JobStepHistory
AS
SELECT
h.instance_id, h.job_id, h.step_id, h.run_status,
f.StartDate, f.EndDate, f.DurationInSeconds
FROM dbo.sysjobhistory AS h
CROSS APPLY
dbo.TVF_Agent_Datetime(h.run_date, h.run_time, h.run_duration) AS f;
With that function in place, I only need to create a job with a few steps, and insert retention rules into my table, to start testing out the logic I would use to implement those rules. The job, scheduled to run every minute, has three steps: one that succeeds, one that runs longer than 30 seconds, and one that fails:
After the job executes, we can look at how the data is presented in the view I created:
SELECT TOP (4) *
FROM dbo.JobStepHistory
ORDER BY instance_id DESC;
And then in the native sysjobhistory table:
SELECT instance_id, step_id, step_name, sql_message_id, run_status, message
FROM dbo.sysjobhistory
WHERE instance_id IN (3393,3394,3395,3396)
ORDER BY instance_id DESC;
Now the retention rules come into play, because I want to make sure I capture and retain any time step 2 runs longer than 30 seconds, or any time step 3 fails, but I don't need to maintain data about all the times step 1 succeeds. So, I'll put these rows into my RetentionRules table:
DECLARE @job_id uniqueidentifier; SELECT @job_id = job_id
FROM dbo.sysjobs
WHERE name LIKE N'Job A%'; INSERT dbo.RetentionRules
(
job_id,
step_id,
NumberOfRowsToKeep,
NumberOfHoursToKeep,
PreferRowsOrHours,
KeepStepsThatFailed,
DurationOverrideSeconds
)
VALUES
(@job_id, 0, 100, NULL, 'R', 1, 180), -- keep 100 copies of job outcome
(@job_id, 1, 15, NULL, 'R', 1, 90), -- keep 15 rows for step 1, unless fail / > 90 seconds
(@job_id, 2, 80, 72, 'H', 1, 30), -- keep 80 rows / 72 hours for step 2
-- unless fail / > 30 seconds
(@job_id, 3, 75, NULL, 'R', 1, 90); -- keep 75 rows for step 3, unless fail / > 90 seconds
Now we just need a simple query to apply our rules to history, deleting the rows that are just noise, and keeping the rows that we want, depending on the preferences we've defined. Over time, we should see that we always discard history for step 1, and we keep the history for steps 2 and 3. Up to a point, though. I highly recommend enforcing an absolute age cutoff so you're not keeping history for failed or long-running steps for all of eternity. In this case I'll define that as 1,000 hours, but you can choose what's right for you based on your environment.
DECLARE @FallBackAgeInHours int = 1000; ;WITH AllStepZeroes AS
(
-- all instances of step zero (outcome)
-- since no relational way to connect instance/steps
-- LEAD() requires 2012+ SELECT instance_id, job_id, StartDate,
NextStart = LEAD(startDate, 1) OVER (PARTITION BY job_id ORDER BY StartDate)
FROM dbo.JobStepHistory
WHERE step_id = 0
), JoinHistoryToRules AS
(
-- find all the steps and match them to their retention rules SELECT h.*,
RowsToKeep = r.NumberOfRowsToKeep,
HoursToKeep = r.NumberOfHoursToKeep,
r.PreferRowsOrHours,
r.KeepStepsThatFailed,
r.DurationOverrideSeconds,
-- simple counter to track number of instances of any job/step combo:
rn = ROW_NUMBER() OVER (PARTITION BY h.job_id, h.step_id ORDER BY h.StartDate DESC)
FROM dbo.JobStepHistory AS h
INNER JOIN dbo.RetentionRules AS r
ON h.job_id = r.job_id
AND h.step_id = r.step_id
), Step1 AS
(
SELECT *, -- determine if we should keep or discard a row: -- if we have more row_number()s than the number we want to keep: DiscardDueToNumber = CASE
WHEN rn > RowsToKeep
THEN 1 ELSE 0 END, -- if we have rows older than our age threshold: DiscardDueToAge = CASE
WHEN EndDate < DATEADD(HOUR, -HoursToKeep, GETDATE())
THEN 1 ELSE 0 END, -- if we want to keep steps that failed: KeepDueToFailure = CASE
WHEN step_id > 0
AND run_status = 0 – failed
AND KeepStepsThatFailed = 1
THEN 1 ELSE 0 END, -- if we want to keep steps that ran longer than a defined threshold: KeepDueToDuration = CASE
WHEN DurationInSeconds > DurationOverrideSeconds
THEN 1 ELSE 0 END FROM JoinHistoryToRules
), Step2 AS
(
SELECT DeleteMe = CASE
WHEN
( -- discard due to age unless we care about number of rows
-- or vice versa. Captures case where both are true, too. (DiscardDueToNumber = 1 AND PreferRowsOrHours = 'R')
OR
(DiscardDueToAge = 1 AND PreferRowsOrHours = 'H')
)
AND
(
-- if we are keeping due to failure or runtime
-- let's not keep them forever (KeepDueToFailure = 0 AND KeepDueToDuration = 0)
OR StartDate < DATEADD(HOUR, -@FallBackAgeInHours, GETDATE())
)
THEN 1 ELSE 0 END, *
FROM Step1
), Step3 AS
(
SELECT *, -- keep step 0 if we've kept any steps from that instance KeepStep0 = CASE WHEN step_id = 0 AND NOT EXISTS
(
SELECT 1 FROM AllStepZeroes AS w
WHERE w.job_id = Step2.job_id
AND Step2.step_id > 0
AND Step2.DeleteMe = 1
AND Step2.StartDate >= w.StartDate
AND Step2.StartDate < w.NextStart
) THEN 1 ELSE 0 END
FROM Step2
) SELECT *. -- change to DELETE when happy:
--DELETE h
FROM dbo.sysjobhistory AS h
INNER JOIN Step3
ON Step3.instance_id = h.instance_id
AND Step3.DeleteMe = 1
AND Step3.KeepStep0 = 0; -- delete all other instances of jobs more than @FallBackAgeInHours old ;WITH LastFullInstanceBeforeCutoff AS
(
SELECT job_id, instance_id = MAX(instance_id)
FROM dbo.JobStepHistory
WHERE step_id = 0
AND EndDate < DATEADD(HOUR, -@FallBackAgeInHours, GETDATE())
GROUP BY job_id
)
SELECT *
--DELETE h
FROM dbo.sysjobhistory AS h
INNER JOIN LastFullInstanceBeforeCutoff AS l
ON h.job_id = l.job_id
AND h.instance_id <= l.instance_id;
When you are happy that this identifies the rows you want to delete from history, and ignores the ones that should be kept, change the two post-CTE SELECT statements to DELETE, and schedule it to run as frequently as you want to clean up the history table. If you schedule this using SQL Server Agent, make sure you go back and add that job to your rules, otherwise you'll be keeping that history forever.
Future Considerations
You could get even more complex.
Jobs fail for all kinds of reasons, and sometimes further troubleshooting is necessary, and other times it is not. Since sysjobhistory has sql_message_id, indicating the error message, you could keep history for one type of exception, and discard it for another. And perhaps how long you want to keep history changes over the weekend, where you have less volume but want to see on Monday everything that happened since Friday. So, you could add conditional rules for how to handle history on a Friday (keep for 96 hours), Saturday (keep for 72 hours), or Sunday (keep for 48 hours). And of course, you could make it very easy to delete job history once you've reviewed and/or resolved the issue. There are a lot of possibilities here; if you have additional rules you'd like to apply, mention them below, and I'll do what I can to address them.
Summary
Maintaining job history using the native settings can be rather limiting. Creating your own rule system and applying it takes a little bit of thought, but can be very helpful in the long run, keeping only the history you care about and eliminating all the noise.
Next Steps
Read on for related tips and other resources:
- Verbose SQL Server Agent Logging
- Creating Multi-Step and Dynamic SQL Server Agent Jobs
- SQL Server Agent Job Advanced Management
- SQL Server Agent Jobs Monitoring Script
- Auditing for New SQL Server Agent Jobs
- All SQL Server Agent tips
About the author
This author pledges the content of this article is based on professional experience and not AI generated.
View all my tips
Article Last Updated: 2019-03-26