SQL Server AlwaysOn Availability Groups Monitoring and Alerting - Part 2

By:   |   Updated: 2018-12-03   |   Comments (7)   |   Related: > Availability Groups


Problem

In the part 1 of this tip series, I showed you how to build the inventory of replicas that form an Availability Group. However, that’s just half of the picture, because you must be able to check the status of the databases within Availability Groups and this has to be a constant check to make sure the databases are “available”.

Solution

I will be presenting a script that gathers basic, but very useful information for each Availability Group database. This information will be obtained from the performance counter objects and with PowerShell we are going to be able to gather this information from all the availability databases from the replicas inventory you already have.

SQL Server Availability Groups Monitoring Script

The Availability Groups inventory is built first, because that way you can run this second script against a smaller subset of SQL Server instances. You can probably do it within one run, but that wouldn’t be as efficient.

Also, there are some performance counter objects that must be obtained from the primary replica and others from the secondary ones.

Information from Primary Replica

  • Log Bytes Flushed/sec: A log flush occurs when data is written from the log cache to the physical transaction log file on disk, every time a transaction is committed.  In order to get the rate, you have to obtain a couple of samples and then subtract the values. That’s the reason why the “WAITFOR DELAY” is there, so you can have an accurate delta. This will give you an idea of the activity going on currently at the databases in the Primary Replica.
SELECT instance_name, cntr_value 
INTO #Logflushes1
FROM sys.dm_os_performance_counters 
WHERE object_name  LIKE '%:Databases%' 
  AND counter_name = 'Log Bytes Flushed/sec';

WAITFOR DELAY '00:00:01';

SELECT instance_name, cntr_value 
INTO #Logflushes2
FROM sys.dm_os_performance_counters 
WHERE object_name  LIKE '%:Databases%' 
  AND counter_name = 'Log Bytes Flushed/sec';

SELECT 
     ag.name AS 'GroupName'
     ,db_name(drs.database_id) AS 'Database'
     ,(SELECT (mf.size * 8) / 1024.0 FROM sys.master_files mf WHERE mf.type = 0 AND mf.database_id = drs.database_id) AS 'MDFSize'
     ,(SELECT (mf.size * 8) / 1024.0 FROM sys.master_files mf WHERE mf.type = 1 AND mf.database_id = drs.database_id) AS 'LDFSize'
     ,drs.synchronization_state_desc AS 'State'
     ,drs.synchronization_health_desc 'Health'
     ,ags.primary_replica AS 'PrimaryReplica'
     ,CONVERT(DECIMAL(10,2), log_flushes / 1024.0 / 1024.0) AS 'LogFlushed'
FROM sys.availability_groups ag
JOIN sys.dm_hadr_availability_group_states ags ON ag.group_id = ags.group_id
JOIN sys.dm_hadr_availability_replica_cluster_states cs ON ags.group_id = cs.group_id 
JOIN sys.dm_hadr_availability_replica_states rs  ON rs.replica_id = cs.replica_id 
JOIN sys.dm_hadr_database_replica_states drs ON rs.replica_id=drs.replica_id
JOIN (SELECT l1.instance_name, l2.cntr_value - l1.cntr_value log_flushes
FROM #Logflushes1 l1
JOIN #Logflushes2 l2 ON l2.instance_name = l1.instance_name
     ) log_flushes ON log_flushes.instance_name = DB_NAME(drs.database_id) 
WHERE rs.role_desc = 'PRIMARY';

DROP TABLE #Logflushes1;
DROP TABLE #Logflushes2;

Information from Secondary Replica

  • Redone Bytes/sec: Amount of log records redone on the secondary database in the last second.
  • Log Bytes Received/sec: Amount of log records received by the secondary replica for the database in the last second.
  • Log Send Queue Size: Amount of log records of the primary database that has not been sent to the secondary databases, in kilobytes (KB).
  • Send Rate: Average rate at which primary replica instance sent data during last active period, in kilobytes (KB)/second.
  • Send Latency: Time (in seconds) it will take the Primary Replica to fully send all the log records to the Secondary Replica.
  • Redo Queue Size: Amount of log records that the Secondary Replica has to process after being received by the Primary Replica.
  • Redo Rate: Rate at which the Secondary Replica is effectively processing the log records received from the Primary Replica.
SELECT instance_name, cntr_value 
INTO #redo1
FROM sys.dm_os_performance_counters
WHERE object_name  LIKE '%:Database Replica%' 
  AND counter_name = 'Redone Bytes/sec';

SELECT instance_name, cntr_value 
INTO #send1
FROM sys.dm_os_performance_counters 
WHERE object_name  LIKE '%:Database Replica%' 
  AND counter_name = 'Log Bytes Received/sec';

WAITFOR DELAY '00:00:01';

SELECT instance_name, cntr_value 
INTO #redo2
FROM sys.dm_os_performance_counters
WHERE object_name  LIKE '%:Database Replica%' 
  AND counter_name = 'Redone Bytes/sec';

SELECT instance_name, cntr_value 
INTO #send2
FROM sys.dm_os_performance_counters 
WHERE object_name  LIKE '%:Database Replica%' 
  AND counter_name = 'Log Bytes Received/sec';

SELECT
      DB_NAME(rs.database_id) AS 'Database'
      ,r.replica_server_name AS 'SecondaryReplica'
      ,CONVERT(DECIMAL(10,2), rs.log_send_queue_size / 1024.0) AS 'LogSendQueueSize'
      ,CONVERT(DECIMAL(10,2), send_rate / 1024.0 / 1024.0) AS 'LogSendRate'
      ,CONVERT(DECIMAL(10,2), log_send_queue_size / CASE WHEN send_rate = 0 THEN 1 ELSE send_rate / 1024.0 END) AS 'SendLatency'
      ,CONVERT(DECIMAL(10,2), rs.redo_queue_size / 1024.0) AS 'RedoQueueSize'
      ,CONVERT(DECIMAL(10,2), redo_rate.redo_rate / 1024.0 / 1024.0) AS 'RedoRate' 
      ,CONVERT(DECIMAL(10,2), rs.redo_queue_size / CASE WHEN redo_rate.redo_rate = 0 THEN 1 ELSE redo_rate.redo_rate / 1024.0 END) AS 'RedoLatency'
FROM sys.dm_hadr_database_replica_states rs
JOIN sys.availability_replicas r ON r.group_id = rs.group_id AND r.replica_id = rs.replica_id
JOIN (SELECT l1.instance_name, l2.cntr_value - l1.cntr_value redo_rate
      FROM #redo1 l1
      JOIN #redo2 l2 ON l2.instance_name = l1.instance_name
      ) redo_rate ON redo_rate.instance_name = DB_NAME(rs.database_id)
JOIN (SELECT l1.instance_name, l2.cntr_value - l1.cntr_value send_rate
      FROM #send1 l1
      JOIN #send2 l2 ON l2.instance_name = l1.instance_name
      ) send_rate ON send_rate.instance_name = DB_NAME(rs.database_id);

DROP TABLE #send1;
DROP TABLE #send2;
DROP TABLE #redo1;
DROP TABLE #redo2;

PowerShell Script to Automate the Data Collection of SQL Server Availability Group Databases

Below is the PowerShell script that unifies all the code described above. This will let you grab all the information from your environment and store it in a central place for data collection, analysis and monitoring. Note: you will have to update some of the parameters in the PowerShell script below to match your environment settings.

$labServer = "XXX"
$inventoryDB = "XXX"

#Clean the table where the monitoring information will be stored
Invoke-Sqlcmd -Query "TRUNCATE TABLE XXX" -Database $inventoryDB -ServerInstance $server

Write-Host "Fetching all the Primary Replicas from the inventory"

#Fetch all the instances that have a PRIMARY role in an Availability Group
#This list of instances should ideally come from the list we already built in Part 1 of this series
/*
   This is an example of the result set that your query must return
   ############################################################
   # name                     # instance              # group #
   ############################################################
   # server1.domain.net,45000 # server1\MSSQLSERVER1  # AG1   #
   # server2.domain.net,45000 # server2 # server2     # AG1   #
   # server3.domain.net,45000 # server3 # server3     # AG2   #
   # server4.domain.net,45000 # server4\MSSQLSERVER2  # AG2   #
   ############################################################
*/

$PrimaryReplicasLookupQuery = /*Put in here your query that returns the Primary Replicas only*/
$PrimaryReplicas = Invoke-Sqlcmd -ServerInstance $server -Database $inventoryDB -Query $PrimaryReplicasLookupQuery

#Grab the current information from each Primary Replica
foreach ($PrimaryReplica in $PrimaryReplicas){

    #Fetch all the Secondary Replicas per Primary Replica
    Write-Host "Fetching all the Secondary Replicas of Availability Group" $PrimaryReplica.group

    $SecondaryReplicasLookupQuery = /*Pun in here your query that returns the Secondary Replica for the Primary Replica being evaluated*/
    $SecondaryReplicas = Invoke-Sqlcmd -ServerInstance $server -Database $inventoryDB -Query $SecondaryReplicasLookupQuery

    $PrimaryReplicaInformationQuery = "
    SELECT instance_name, cntr_value 
    INTO #Logflushes1
    FROM sys.dm_os_performance_counters
    WHERE object_name  LIKE '%:Databases%' 
      AND counter_name = 'Log Bytes Flushed/sec';

    WAITFOR DELAY '00:00:01';

    SELECT instance_name, cntr_value 
    INTO #Logflushes2
    FROM sys.dm_os_performance_counters
    WHERE object_name  LIKE '%:Databases%' 
      AND counter_name = 'Log Bytes Flushed/sec';

    SELECT 
        ag.name AS 'GroupName'
       ,db_name(drs.database_id) AS 'Database'
        ,(SELECT (mf.size * 8) / 1024.0 FROM sys.master_files mf WHERE mf.type = 0 AND mf.database_id = drs.database_id) AS 'MDFSize'
       ,(SELECT (mf.size * 8) / 1024.0 FROM sys.master_files mf WHERE mf.type = 1 AND mf.database_id = drs.database_id) AS 'LDFSize'
       ,drs.synchronization_state_desc AS 'State'
       ,drs.synchronization_health_desc 'Health'
       ,ags.primary_replica AS 'PrimaryReplica'
        ,CONVERT(DECIMAL(10,2), log_flushes / 1024.0 / 1024.0) AS 'LogFlushed'
    FROM sys.availability_groups ag
    JOIN sys.dm_hadr_availability_group_states ags ON ag.group_id = ags.group_id
    JOIN sys.dm_hadr_availability_replica_cluster_states cs ON ags.group_id = cs.group_id 
    JOIN sys.dm_hadr_availability_replica_states rs  ON rs.replica_id = cs.replica_id 
    JOIN sys.dm_hadr_database_replica_states drs ON rs.replica_id=drs.replica_id
    JOIN (SELECT l1.instance_name, l2.cntr_value - l1.cntr_value log_flushes
         FROM #Logflushes1 l1
         JOIN #Logflushes2 l2 ON l2.instance_name = l1.instance_name
        ) log_flushes ON log_flushes.instance_name = DB_NAME(drs.database_id) 
    WHERE rs.role_desc = 'PRIMARY';

    DROP TABLE #Logflushes1;
    DROP TABLE #Logflushes2;
    "

    Write-Host "Obtaining information from Primary Replica" $PrimaryReplica.instance
    $PrimaryReplicaResults = Invoke-Sqlcmd -Query $PrimaryReplicaInformationQuery -ServerInstance $PrimaryReplica.name -ErrorAction Stop -querytimeout 30

    #Now that we have the information from the Primary Replica, let's complement it with the information captured for the Secondary Replica
    foreach ($SecondaryReplica in $SecondaryReplicas){
        $SecondaryReplicaInformationQuery = "
        SELECT instance_name, cntr_value 
        INTO #redo1
        FROM sys.dm_os_performance_counters
        WHERE object_name  LIKE '%:Database Replica%' 
          AND counter_name = 'Redone Bytes/sec';

        SELECT instance_name, cntr_value 
        INTO #send1
        FROM sys.dm_os_performance_counters 
        WHERE object_name  LIKE '%:Database Replica%' 
          AND counter_name = 'Log Bytes Received/sec';

        WAITFOR DELAY '00:00:01';

        SELECT instance_name, cntr_value 
        INTO #redo2
        FROM sys.dm_os_performance_counters
        WHERE object_name  LIKE '%:Database Replica%' 
          AND counter_name = 'Redone Bytes/sec';

        SELECT instance_name, cntr_value 
        INTO #send2
        FROM sys.dm_os_performance_counters 
        WHERE object_name  LIKE '%:Database Replica%' 
          AND counter_name = 'Log Bytes Received/sec';

        SELECT
             DB_NAME(rs.database_id) AS 'Database'
            ,r.replica_server_name AS 'SecondaryReplica'
            ,CONVERT(DECIMAL(10,2), rs.log_send_queue_size / 1024.0) AS 'LogSendQueueSize'
            ,CONVERT(DECIMAL(10,2), send_rate / 1024.0 / 1024.0) AS 'LogSendRate'
           ,CONVERT(DECIMAL(10,2), log_send_queue_size / CASE WHEN send_rate = 0 THEN 1 ELSE send_rate / 1024.0 END) AS 'SendLatency'
            ,CONVERT(DECIMAL(10,2), rs.redo_queue_size / 1024.0) AS 'RedoQueueSize'
           ,CONVERT(DECIMAL(10,2), redo_rate.redo_rate / 1024.0 / 1024.0) AS 'RedoRate'
           ,CONVERT(DECIMAL(10,2), rs.redo_queue_size / CASE WHEN redo_rate.redo_rate = 0 THEN 1 ELSE redo_rate.redo_rate / 1024.0 END) AS 'RedoLatency'
        FROM sys.dm_hadr_database_replica_states rs
        JOIN sys.availability_replicas r ON r.group_id = rs.group_id AND r.replica_id = rs.replica_id
        JOIN (SELECT l1.instance_name, l2.cntr_value - l1.cntr_value redo_rate
            FROM #redo1 l1
           JOIN #redo2 l2 ON l2.instance_name = l1.instance_name
           ) redo_rate ON redo_rate.instance_name = DB_NAME(rs.database_id)
        JOIN (SELECT l1.instance_name, l2.cntr_value - l1.cntr_value send_rate
           FROM #send1 l1
           JOIN #send2 l2 ON l2.instance_name = l1.instance_name
           ) send_rate ON send_rate.instance_name = DB_NAME(rs.database_id);

        DROP TABLE #send1;
        DROP TABLE #send2;
        DROP TABLE #redo1;
        DROP TABLE #redo2;
        "

        Write-Host "Obtaining information from Secondary Replica" $SecondaryReplica.instance
        $SecondaryReplicaResults = Invoke-Sqlcmd -Query $SecondaryReplicaInformationQuery -ServerInstance $SecondaryReplica.name -ErrorAction Stop -querytimeout 30
    }

    #Fisrt let's insert the information of the Primary Replicas followed by their respective Secondary Replicas
    foreach($PrimaryReplicaResult in $PrimaryReplicaResults){

        #Build the INSERT statement if it returned at least 1 row
        if($PrimaryReplicaResults.Length -ne 0){

          #Build the insert statement
          $insert = "INSERT INTO XXX VALUES"
            $insert += "
          (
          '"+$PrimaryReplicaResult.PrimaryReplica+"',
             '"+$PrimaryReplicaResult.Database+"',
          "+$PrimaryReplicaResult.MDFSize+",
             "+$PrimaryReplicaResult.LDFSize+",
          '"+$PrimaryReplicaResult.State+"',
          '"+$PrimaryReplicaResult.Health+"',
          '"+$PrimaryReplicaResult.PrimaryReplica+"',
             "+$PrimaryReplicaResult.LogFlushed+",
             0,
             0,
             0,
             0,
             0,
             0,
             0      
           ),
         "
          
          #Store the Primary Replicas results in our centralized database
          Invoke-Sqlcmd -Query $insert.Substring(0,$insert.LastIndexOf(',')) -ServerInstance $server -Database $inventoryDB
       }
    }

    #Now let's update the information of each Primary Replica with the information of their respective Secondary Replica
    foreach($SecondaryReplicaResult in $SecondaryReplicaResults){

        #Build the UPDATE statement if it returned at least 1 row
        if($SecondaryReplicaResults.Length -ne 0){

          #Build the update statement
          $update = "UPDATE XXX"
            $update += "
            SET SecondaryReplica   = '"+$SecondaryReplicaResult.SecondaryReplica+"',
                LogSendQueueSize   = "+$SecondaryReplicaResult.LogSendQueueSize+",
                LogSendRate        = "+$SecondaryReplicaResult.LogSendRate+",
                SendLatency        = "+$SecondaryReplicaResult.SendLatency+",
                RedoQueueSize      = "+$SecondaryReplicaResult.RedoQueueSize+",
                RedoRate           = "+$SecondaryReplicaResult.RedoRate+",
                RedoLatency        = "+$SecondaryReplicaResult.RedoLatency+"
            WHERE PrimaryReplica   = '"+$PrimaryReplicaResult.PrimaryReplica+"' AND [Database] = '"+$SecondaryReplicaResult.Database+"';
         "
         
          #Execute the UPDATE statement
          Invoke-Sqlcmd -Query $update -ServerInstance $server -Database $inventoryDB
       }
    }
}

Write-Host "Done!"

After executing the PowerShell script, you should be able to have something that looks lot like the following:

availability group status results

With this final output, you can choose to do what you want:

Next Steps
  • You can add as many performance counters as you like. There’s a very useful list at Microsoft’s website, where you can even see from what replica you have to capture the value for a particular counter.
  • The scripts presented in this tip are focused towards Availability Groups that have 1 secondary replica per Availability Group. A future version will be able to handle any number of replicas available, although the presentation of the results will have to vary a bit (at least this can be a very good starting point if you have nothing at all).
  • Download the PowerShell script.


sql server categories

sql server webinars

subscribe to mssqltips

sql server tutorials

sql server white papers

next tip



About the author
MSSQLTips author Alejandro Cobar Alejandro Cobar is an MCP, SQL Server DBA and Developer with 10+ years of experience working in a variety of environments.

This author pledges the content of this article is based on professional experience and not AI generated.

View all my tips


Article Last Updated: 2018-12-03

Comments For This Article




Thursday, August 31, 2023 - 5:35:59 PM - RC Back To Top (91520)
can someone please provide a script to create a table for data collection?

Tuesday, May 7, 2019 - 9:34:18 AM - Sergey Rybak Back To Top (79972)

About comment about error "Subquery returned more than 1 value. This is not permitted when the subquery follows =, !=, <, <= , >, >= or when the subquery is used as an expression." - use only one datafile and one log file))) For other combinations run:

SELECT instance_name, cntr_value 
INTO #Logflushes1
FROM sys.dm_os_performance_counters 
WHERE object_name  LIKE '%:Databases%' 
  AND counter_name = 'Log Bytes Flushed/sec';
 
WAITFOR DELAY '00:00:01';
 
SELECT instance_name, cntr_value 
INTO #Logflushes2
FROM sys.dm_os_performance_counters 
WHERE object_name  LIKE '%:Databases%' 
  AND counter_name = 'Log Bytes Flushed/sec';
 
SELECT 
     ag.name AS 'GroupName'
     ,db_name(drs.database_id) AS 'Database'
     --,(SELECT top 1 (mf.size * 8) / 1024.0 FROM sys.master_files mf WHERE mf.type = 0 AND mf.database_id = drs.database_id) AS 'MDFSize'
     --,(SELECT top 1 (mf.size * 8) / 1024.0 FROM sys.master_files mf WHERE mf.type = 1 AND mf.database_id = drs.database_id) AS 'LDFSize'
     ,drs.synchronization_state_desc AS 'State'
     ,drs.synchronization_health_desc 'Health'
     ,ags.primary_replica AS 'PrimaryReplica'
     ,CONVERT(DECIMAL(10,2), log_flushes / 1024.0 / 1024.0) AS 'LogFlushed/sec'
FROM sys.availability_groups ag
JOIN sys.dm_hadr_availability_group_states ags ON ag.group_id = ags.group_id
JOIN sys.dm_hadr_availability_replica_cluster_states cs ON ags.group_id = cs.group_id 
JOIN sys.dm_hadr_availability_replica_states rs  ON rs.replica_id = cs.replica_id 
JOIN sys.dm_hadr_database_replica_states drs ON rs.replica_id=drs.replica_id
JOIN (SELECT l1.instance_name, l2.cntr_value - l1.cntr_value log_flushes
FROM #Logflushes1 l1
JOIN #Logflushes2 l2 ON l2.instance_name = l1.instance_name
     ) log_flushes ON log_flushes.instance_name = DB_NAME(drs.database_id) 
WHERE rs.role_desc = 'PRIMARY';

Thursday, February 14, 2019 - 6:32:35 PM - Alejandro Cobar Back To Top (79037)

Hi Susan,

I'm glad to hear that my content will help you in your day-to-day tasks.

In regards to the listener, the way I see it, it really depends on the current setup/purpose of your Availability Group:

  • If you have a setup that can perform an automatic failovers, then you can think of the listener as a "guide dog for a blind person". This is because an automatic failover can trigger at any given time and your replicas can change roles without you beign aware of (unless you have a monitoring mechanic in place of course), and connecting through the listener will lead you to the Primary Replica every single time.
    • Of course the listener can do more than just that, for instance in SQL Server 2016 you can configure read-only routing for your AvailabilityGroup, through the listener.
  • If you have a setup that can only perform manual failovers, then you will know which one is the Primary Replica (assuming the failover was triggered by you or you simply have a monitoring mechanism to be aware of such failover event). But even in this case, the listener keeps being as useful as the first case because you know which one is the Primary Replica... but the application won't, and the listener helps in the exact same way. 

If you don't setup/use a listener, then you probably will spend sometime with the application team when the application fails to write to a database. They probably will have their connection string pointing to a particular instance, and after a failover (manual/automatic) things will just crash and the DBA will have to come to the rescrue.

A "listener-less" setup is fine if all the interested parties (application team and DBAs) are aware that a coordinated effort is necessary when the time comes.

I hope you can find this useful.


Thursday, February 14, 2019 - 12:07:11 PM - Susan Brower Back To Top (79035)

Allejandro,

Thank you.  Thank you for providing these very useful tips.  I am sure I'will use them daily.

I have a general question:  One of the options to specify when creating an AO Group is Listeners.   I understand the definition of a listener, but I don't understand how the Listener should be used by the clients.  What is their purpose other than to provide an alternate way for clients to connect to the SQL server?  To me, just look like another way to specify aliases.

Your time and effort in  replying is appreciated.

Best Regards,  Susan Brower


Monday, December 3, 2018 - 1:50:57 PM - Alejandro Cobar Back To Top (78398)

Hi Roger,

Thanks for the feedback!

You are right, the object name was specific for default instances... but I have sent the fix so it can be updated in the tip. This time it will work for named instances as well.

I also tried the primary replica query in an instance with 3 Availability Groups and got no error.

*It would be very helpful if you can tell me the whole scenario where you made it crash. I tried in a SQL 2016 version and got no errors at all.

-Alejandro


Monday, December 3, 2018 - 8:30:10 AM - Roger Hullstrung Back To Top (78395)

 Hello Alejandro,

The primary replica query returns an error when executing against an instance that contains multiple AGs:

Msg 512, Level 16, State 1, Line 15

Subquery returned more than 1 value. This is not permitted when the subquery follows =, !=, <, <= , >, >= or when the subquery is used as an expression.

Plus I had to comment out the object_name counter SQLServer:Databases, as this counter does not exist for named instance installs.

Roger

 


Monday, December 3, 2018 - 4:25:45 AM - Vesa Juvonen Back To Top (78394)

 

 Nice...

Msg 512, Level 16, State 1, Line 15

Subquery returned more than 1 value. This is not permitted when the subquery follows =, !=, <, <= , >, >= or when the subquery is used as an expression.

 

 















get free sql tips
agree to terms