Always On Availability Groups Monitoring and Alerting

By: Alejandro Cobar | Updated: 2018-12-03 | Comments (7) | Related: > Availability Groups

Problem

In the part 1 of this tip series, I showed you how to build the inventory of replicas that form an Availability Group. However, that’s just half of the picture, because you must be able to check the status of the databases within Availability Groups and this has to be a constant check to make sure the databases are “available”.

Solution

I will be presenting a script that gathers basic, but very useful information for each Availability Group database. This information will be obtained from the performance counter objects and with PowerShell we are going to be able to gather this information from all the availability databases from the replicas inventory you already have.

SQL Server Availability Groups Monitoring Script

The Availability Groups inventory is built first, because that way you can run this second script against a smaller subset of SQL Server instances. You can probably do it within one run, but that wouldn’t be as efficient.

Also, there are some performance counter objects that must be obtained from the primary replica and others from the secondary ones.

Information from Primary Replica

Log Bytes Flushed/sec: A log flush occurs when data is written from the log cache to the physical transaction log file on disk, every time a transaction is committed. In order to get the rate, you have to obtain a couple of samples and then subtract the values. That’s the reason why the “WAITFOR DELAY” is there, so you can have an accurate delta. This will give you an idea of the activity going on currently at the databases in the Primary Replica.

SELECT instance_name, cntr_value 
INTO #Logflushes1
FROM sys.dm_os_performance_counters 
WHERE object_name  LIKE '%:Databases%' 
  AND counter_name = 'Log Bytes Flushed/sec';

WAITFOR DELAY '00:00:01';

SELECT instance_name, cntr_value 
INTO #Logflushes2
FROM sys.dm_os_performance_counters 
WHERE object_name  LIKE '%:Databases%' 
  AND counter_name = 'Log Bytes Flushed/sec';

SELECT 
     ag.name AS 'GroupName'
     ,db_name(drs.database_id) AS 'Database'
     ,(SELECT (mf.size * 8) / 1024.0 FROM sys.master_files mf WHERE mf.type = 0 AND mf.database_id = drs.database_id) AS 'MDFSize'
     ,(SELECT (mf.size * 8) / 1024.0 FROM sys.master_files mf WHERE mf.type = 1 AND mf.database_id = drs.database_id) AS 'LDFSize'
     ,drs.synchronization_state_desc AS 'State'
     ,drs.synchronization_health_desc 'Health'
     ,ags.primary_replica AS 'PrimaryReplica'
     ,CONVERT(DECIMAL(10,2), log_flushes / 1024.0 / 1024.0) AS 'LogFlushed'
FROM sys.availability_groups ag
JOIN sys.dm_hadr_availability_group_states ags ON ag.group_id = ags.group_id
JOIN sys.dm_hadr_availability_replica_cluster_states cs ON ags.group_id = cs.group_id 
JOIN sys.dm_hadr_availability_replica_states rs  ON rs.replica_id = cs.replica_id 
JOIN sys.dm_hadr_database_replica_states drs ON rs.replica_id=drs.replica_id
JOIN (SELECT l1.instance_name, l2.cntr_value - l1.cntr_value log_flushes
FROM #Logflushes1 l1
JOIN #Logflushes2 l2 ON l2.instance_name = l1.instance_name
     ) log_flushes ON log_flushes.instance_name = DB_NAME(drs.database_id) 
WHERE rs.role_desc = 'PRIMARY';

DROP TABLE #Logflushes1;
DROP TABLE #Logflushes2;

Information from Secondary Replica

Redone Bytes/sec: Amount of log records redone on the secondary database in the last second.
Log Bytes Received/sec: Amount of log records received by the secondary replica for the database in the last second.
Log Send Queue Size: Amount of log records of the primary database that has not been sent to the secondary databases, in kilobytes (KB).
Send Rate: Average rate at which primary replica instance sent data during last active period, in kilobytes (KB)/second.
Send Latency: Time (in seconds) it will take the Primary Replica to fully send all the log records to the Secondary Replica.
Redo Queue Size: Amount of log records that the Secondary Replica has to process after being received by the Primary Replica.
Redo Rate: Rate at which the Secondary Replica is effectively processing the log records received from the Primary Replica.

SELECT instance_name, cntr_value 
INTO #redo1
FROM sys.dm_os_performance_counters
WHERE object_name  LIKE '%:Database Replica%' 
  AND counter_name = 'Redone Bytes/sec';

SELECT instance_name, cntr_value 
INTO #send1
FROM sys.dm_os_performance_counters 
WHERE object_name  LIKE '%:Database Replica%' 
  AND counter_name = 'Log Bytes Received/sec';

WAITFOR DELAY '00:00:01';

SELECT instance_name, cntr_value 
INTO #redo2
FROM sys.dm_os_performance_counters
WHERE object_name  LIKE '%:Database Replica%' 
  AND counter_name = 'Redone Bytes/sec';

SELECT instance_name, cntr_value 
INTO #send2
FROM sys.dm_os_performance_counters 
WHERE object_name  LIKE '%:Database Replica%' 
  AND counter_name = 'Log Bytes Received/sec';

SELECT
      DB_NAME(rs.database_id) AS 'Database'
      ,r.replica_server_name AS 'SecondaryReplica'
      ,CONVERT(DECIMAL(10,2), rs.log_send_queue_size / 1024.0) AS 'LogSendQueueSize'
      ,CONVERT(DECIMAL(10,2), send_rate / 1024.0 / 1024.0) AS 'LogSendRate'
      ,CONVERT(DECIMAL(10,2), log_send_queue_size / CASE WHEN send_rate = 0 THEN 1 ELSE send_rate / 1024.0 END) AS 'SendLatency'
      ,CONVERT(DECIMAL(10,2), rs.redo_queue_size / 1024.0) AS 'RedoQueueSize'
      ,CONVERT(DECIMAL(10,2), redo_rate.redo_rate / 1024.0 / 1024.0) AS 'RedoRate' 
      ,CONVERT(DECIMAL(10,2), rs.redo_queue_size / CASE WHEN redo_rate.redo_rate = 0 THEN 1 ELSE redo_rate.redo_rate / 1024.0 END) AS 'RedoLatency'
FROM sys.dm_hadr_database_replica_states rs
JOIN sys.availability_replicas r ON r.group_id = rs.group_id AND r.replica_id = rs.replica_id
JOIN (SELECT l1.instance_name, l2.cntr_value - l1.cntr_value redo_rate
      FROM #redo1 l1
      JOIN #redo2 l2 ON l2.instance_name = l1.instance_name
      ) redo_rate ON redo_rate.instance_name = DB_NAME(rs.database_id)
JOIN (SELECT l1.instance_name, l2.cntr_value - l1.cntr_value send_rate
      FROM #send1 l1
      JOIN #send2 l2 ON l2.instance_name = l1.instance_name
      ) send_rate ON send_rate.instance_name = DB_NAME(rs.database_id);

DROP TABLE #send1;
DROP TABLE #send2;
DROP TABLE #redo1;
DROP TABLE #redo2;

PowerShell Script to Automate the Data Collection of SQL Server Availability Group Databases

Below is the PowerShell script that unifies all the code described above. This will let you grab all the information from your environment and store it in a central place for data collection, analysis and monitoring. Note: you will have to update some of the parameters in the PowerShell script below to match your environment settings.

$labServer = "XXX"
$inventoryDB = "XXX"

#Clean the table where the monitoring information will be stored
Invoke-Sqlcmd -Query "TRUNCATE TABLE XXX" -Database $inventoryDB -ServerInstance $server

Write-Host "Fetching all the Primary Replicas from the inventory"

#Fetch all the instances that have a PRIMARY role in an Availability Group
#This list of instances should ideally come from the list we already built in Part 1 of this series
/*
   This is an example of the result set that your query must return
   ############################################################
   # name                     # instance              # group #
   ############################################################
   # server1.domain.net,45000 # server1\MSSQLSERVER1  # AG1   #
   # server2.domain.net,45000 # server2 # server2     # AG1   #
   # server3.domain.net,45000 # server3 # server3     # AG2   #
   # server4.domain.net,45000 # server4\MSSQLSERVER2  # AG2   #
   ############################################################
*/

$PrimaryReplicasLookupQuery = /*Put in here your query that returns the Primary Replicas only*/
$PrimaryReplicas = Invoke-Sqlcmd -ServerInstance $server -Database $inventoryDB -Query $PrimaryReplicasLookupQuery

#Grab the current information from each Primary Replica
foreach ($PrimaryReplica in $PrimaryReplicas){

    #Fetch all the Secondary Replicas per Primary Replica
    Write-Host "Fetching all the Secondary Replicas of Availability Group" $PrimaryReplica.group

    $SecondaryReplicasLookupQuery = /*Pun in here your query that returns the Secondary Replica for the Primary Replica being evaluated*/
    $SecondaryReplicas = Invoke-Sqlcmd -ServerInstance $server -Database $inventoryDB -Query $SecondaryReplicasLookupQuery

    $PrimaryReplicaInformationQuery = "
    SELECT instance_name, cntr_value 
    INTO #Logflushes1
    FROM sys.dm_os_performance_counters
    WHERE object_name  LIKE '%:Databases%' 
      AND counter_name = 'Log Bytes Flushed/sec';

    WAITFOR DELAY '00:00:01';

    SELECT instance_name, cntr_value 
    INTO #Logflushes2
    FROM sys.dm_os_performance_counters
    WHERE object_name  LIKE '%:Databases%' 
      AND counter_name = 'Log Bytes Flushed/sec';

    SELECT 
        ag.name AS 'GroupName'
       ,db_name(drs.database_id) AS 'Database'
        ,(SELECT (mf.size * 8) / 1024.0 FROM sys.master_files mf WHERE mf.type = 0 AND mf.database_id = drs.database_id) AS 'MDFSize'
       ,(SELECT (mf.size * 8) / 1024.0 FROM sys.master_files mf WHERE mf.type = 1 AND mf.database_id = drs.database_id) AS 'LDFSize'
       ,drs.synchronization_state_desc AS 'State'
       ,drs.synchronization_health_desc 'Health'
       ,ags.primary_replica AS 'PrimaryReplica'
        ,CONVERT(DECIMAL(10,2), log_flushes / 1024.0 / 1024.0) AS 'LogFlushed'
    FROM sys.availability_groups ag
    JOIN sys.dm_hadr_availability_group_states ags ON ag.group_id = ags.group_id
    JOIN sys.dm_hadr_availability_replica_cluster_states cs ON ags.group_id = cs.group_id 
    JOIN sys.dm_hadr_availability_replica_states rs  ON rs.replica_id = cs.replica_id 
    JOIN sys.dm_hadr_database_replica_states drs ON rs.replica_id=drs.replica_id
    JOIN (SELECT l1.instance_name, l2.cntr_value - l1.cntr_value log_flushes
         FROM #Logflushes1 l1
         JOIN #Logflushes2 l2 ON l2.instance_name = l1.instance_name
        ) log_flushes ON log_flushes.instance_name = DB_NAME(drs.database_id) 
    WHERE rs.role_desc = 'PRIMARY';

    DROP TABLE #Logflushes1;
    DROP TABLE #Logflushes2;
    "

    Write-Host "Obtaining information from Primary Replica" $PrimaryReplica.instance
    $PrimaryReplicaResults = Invoke-Sqlcmd -Query $PrimaryReplicaInformationQuery -ServerInstance $PrimaryReplica.name -ErrorAction Stop -querytimeout 30

    #Now that we have the information from the Primary Replica, let's complement it with the information captured for the Secondary Replica
    foreach ($SecondaryReplica in $SecondaryReplicas){
        $SecondaryReplicaInformationQuery = "
        SELECT instance_name, cntr_value 
        INTO #redo1
        FROM sys.dm_os_performance_counters
        WHERE object_name  LIKE '%:Database Replica%' 
          AND counter_name = 'Redone Bytes/sec';

        SELECT instance_name, cntr_value 
        INTO #send1
        FROM sys.dm_os_performance_counters 
        WHERE object_name  LIKE '%:Database Replica%' 
          AND counter_name = 'Log Bytes Received/sec';

        WAITFOR DELAY '00:00:01';

        SELECT instance_name, cntr_value 
        INTO #redo2
        FROM sys.dm_os_performance_counters
        WHERE object_name  LIKE '%:Database Replica%' 
          AND counter_name = 'Redone Bytes/sec';

        SELECT instance_name, cntr_value 
        INTO #send2
        FROM sys.dm_os_performance_counters 
        WHERE object_name  LIKE '%:Database Replica%' 
          AND counter_name = 'Log Bytes Received/sec';

        SELECT
             DB_NAME(rs.database_id) AS 'Database'
            ,r.replica_server_name AS 'SecondaryReplica'
            ,CONVERT(DECIMAL(10,2), rs.log_send_queue_size / 1024.0) AS 'LogSendQueueSize'
            ,CONVERT(DECIMAL(10,2), send_rate / 1024.0 / 1024.0) AS 'LogSendRate'
           ,CONVERT(DECIMAL(10,2), log_send_queue_size / CASE WHEN send_rate = 0 THEN 1 ELSE send_rate / 1024.0 END) AS 'SendLatency'
            ,CONVERT(DECIMAL(10,2), rs.redo_queue_size / 1024.0) AS 'RedoQueueSize'
           ,CONVERT(DECIMAL(10,2), redo_rate.redo_rate / 1024.0 / 1024.0) AS 'RedoRate'
           ,CONVERT(DECIMAL(10,2), rs.redo_queue_size / CASE WHEN redo_rate.redo_rate = 0 THEN 1 ELSE redo_rate.redo_rate / 1024.0 END) AS 'RedoLatency'
        FROM sys.dm_hadr_database_replica_states rs
        JOIN sys.availability_replicas r ON r.group_id = rs.group_id AND r.replica_id = rs.replica_id
        JOIN (SELECT l1.instance_name, l2.cntr_value - l1.cntr_value redo_rate
            FROM #redo1 l1
           JOIN #redo2 l2 ON l2.instance_name = l1.instance_name
           ) redo_rate ON redo_rate.instance_name = DB_NAME(rs.database_id)
        JOIN (SELECT l1.instance_name, l2.cntr_value - l1.cntr_value send_rate
           FROM #send1 l1
           JOIN #send2 l2 ON l2.instance_name = l1.instance_name
           ) send_rate ON send_rate.instance_name = DB_NAME(rs.database_id);

        DROP TABLE #send1;
        DROP TABLE #send2;
        DROP TABLE #redo1;
        DROP TABLE #redo2;
        "

        Write-Host "Obtaining information from Secondary Replica" $SecondaryReplica.instance
        $SecondaryReplicaResults = Invoke-Sqlcmd -Query $SecondaryReplicaInformationQuery -ServerInstance $SecondaryReplica.name -ErrorAction Stop -querytimeout 30
    }

    #Fisrt let's insert the information of the Primary Replicas followed by their respective Secondary Replicas
    foreach($PrimaryReplicaResult in $PrimaryReplicaResults){

        #Build the INSERT statement if it returned at least 1 row
        if($PrimaryReplicaResults.Length -ne 0){

          #Build the insert statement
          $insert = "INSERT INTO XXX VALUES"
            $insert += "
          (
          '"+$PrimaryReplicaResult.PrimaryReplica+"',
             '"+$PrimaryReplicaResult.Database+"',
          "+$PrimaryReplicaResult.MDFSize+",
             "+$PrimaryReplicaResult.LDFSize+",
          '"+$PrimaryReplicaResult.State+"',
          '"+$PrimaryReplicaResult.Health+"',
          '"+$PrimaryReplicaResult.PrimaryReplica+"',
             "+$PrimaryReplicaResult.LogFlushed+",
             0,
             0,
             0,
             0,
             0,
             0,
             0      
           ),
         "
          
          #Store the Primary Replicas results in our centralized database
          Invoke-Sqlcmd -Query $insert.Substring(0,$insert.LastIndexOf(',')) -ServerInstance $server -Database $inventoryDB
       }
    }

    #Now let's update the information of each Primary Replica with the information of their respective Secondary Replica
    foreach($SecondaryReplicaResult in $SecondaryReplicaResults){

        #Build the UPDATE statement if it returned at least 1 row
        if($SecondaryReplicaResults.Length -ne 0){

          #Build the update statement
          $update = "UPDATE XXX"
            $update += "
            SET SecondaryReplica   = '"+$SecondaryReplicaResult.SecondaryReplica+"',
                LogSendQueueSize   = "+$SecondaryReplicaResult.LogSendQueueSize+",
                LogSendRate        = "+$SecondaryReplicaResult.LogSendRate+",
                SendLatency        = "+$SecondaryReplicaResult.SendLatency+",
                RedoQueueSize      = "+$SecondaryReplicaResult.RedoQueueSize+",
                RedoRate           = "+$SecondaryReplicaResult.RedoRate+",
                RedoLatency        = "+$SecondaryReplicaResult.RedoLatency+"
            WHERE PrimaryReplica   = '"+$PrimaryReplicaResult.PrimaryReplica+"' AND [Database] = '"+$SecondaryReplicaResult.Database+"';
         "
         
          #Execute the UPDATE statement
          Invoke-Sqlcmd -Query $update -ServerInstance $server -Database $inventoryDB
       }
    }
}

Write-Host "Done!"

After executing the PowerShell script, you should be able to have something that looks lot like the following:

With this final output, you can choose to do what you want:

Build a report, through Reporting Services and automate it as needed.
Build an HTML table and have it sent to you.
- Generate HTML Formatted Emails from SQL Server
- Send email in a tabular format using SQL Server database mail
Build any kind of alerting mechanism you feel comfortable with.

Next Steps

You can add as many performance counters as you like. There’s a very useful list at Microsoft’s website, where you can even see from what replica you have to capture the value for a particular counter.
The scripts presented in this tip are focused towards Availability Groups that have 1 secondary replica per Availability Group. A future version will be able to handle any number of replicas available, although the presentation of the results will have to vary a bit (at least this can be a very good starting point if you have nothing at all).
Download the PowerShell script.

About the author

Alejandro Cobar is an MCP, SQL Server DBA and Developer with 10+ years of experience working in a variety of environments.

This author pledges the content of this article is based on professional experience and not AI generated.

View all my tips

Article Last Updated: 2018-12-03

SQL Server AlwaysOn Availability Groups Monitoring and Alerting - Part 2