Queries To Debug SQL Server Replication Issues

By:   |   Comments (7)   |   Related: > Replication


Problem

This tip provides some quick queries to get started when a replication issue arises that may help minimize the amount of troubleshooting time. Consider that to reduce the noise, with some of these queries you will want to apply a time-based filter (depends) while with others, you may just want the status at any given moment (doesn't matter).

Solution

I prefer scripts over interfaces because I'm seldom at the mercy of the speed of an interface (slow) and I can aggregate the output I receive from scripts, making it easy to identify if there's an issue by collecting messages from different sources and comparing them. In these queries, I may have commented out the *, but you can include all columns as there may be other columns worth considering:

Replication Status Queries

The below query provides me with an overview of what the replication monitor provides and I've updated the status column to reflect what Microsoft provides in their documentation as the meaning for each of the status' numbers. With this query, I am paying close attention to the StatedStatus and the last_distsync columns. This does not mean there is not an error, however.

SELECT publisher
    , publisher_db
    , publication
    , agent_name
    , last_distsync
    , CASE
        WHEN status = 1 THEN 'Started'
        WHEN status = 2 THEN 'Succeeded'
        WHEN status = 3 THEN 'In progress'
        WHEN status = 4 THEN 'Idle'
        WHEN status = 5 THEN 'Retrying'
        WHEN status = 6 THEN 'Failed'
    END AS StatedStates
    --, *
FROM distribution..MSreplication_monitordata

Query to Find Replication Errors

One of the most important tables involved in replication issues and errors is the below table. I've ordered the results of the table here by the most recent time, and you can apply a time filter here if you are only concerned with the last few days, hours, etc. We may find a wide variety of errors, such as "Query Timeout Error", "The row was not found at the subscriber", etc. The trouble here is that replication's status may appear to be fine, while you see here that, for an example, a particular publication is skipping data. I once worked with a trader who told me, "I don't care about having no data, but I do care about having bad data." There is nothing worse than thinking something is functioning when it's actually not functioning. The key with this table is looking at the time and the error_text.

SELECT 
	error_text
	, [time]
	,--*
FROM distribution..MSrepl_errors
ORDER BY [time] DESC

Query to Read the Error Log for Replication Errors

We can also double verify with the error log about replication; be careful using the below query, as my filter might not apply in your environment, and you may want to avoid using these filter strings:

-- parameter one: 0 means obtain the latest error log
-- parameter two: 1 tells it to include the agent
-- parameter three: find a string with ...
-- parameter four: find a string with what's in parameter three and here
EXEC sys.sp_readerrorlog 0,1,'Replication','-'

Status Information from Replication Distribution Agents

While the above queries provide useful information for errors, I also want to be able to determine what's happening, especially if something appears wrong. The below query is helpful in providing information regarding the distribution agents:

SELECT 
    a.name PublicationName
    , a.publication Publication
    , ditosu.comments AS MessageText
    , ditosu.[time] CommandDate
    , ditosu.xact_seqno xact_seqno
FROM MSdistribution_agents a
    INNER JOIN MSpublications p ON a.publisher_db = p.publisher_db
        AND a.publication = p.publication
    INNER JOIN MSdistribution_history ditosu ON ditosu.agent_id = a.id
-- Apply a filter here can minimize the noise
ORDER BY ditosu.[time] DESC

The key that I look for here are the MessageText, which communicate the historic information for the distribution agents. Depending on the amount of publications, you may also see a lot of messages indicating that nothing is available for replication. A word of caution on this table is that Microsoft notes that this may be subject to change in future releases. For quickly identifying a possible problem, I cannot emphasize good naming convention here enough: with good naming convention, you'll be able to easily apply filters, but poor naming convention will result in noise that may translate into some confusion when something is failing; this will also depend on how many subscribers exist per publication.

PowerShell Script to Compare Article Row Counts

Finally, if I'm interested in a particular publication that may seem delayed, the below PowerShell script will quickly ping the information by reading from the RowCount properties provided in the management objects library:

Function QuickCheck-Replication {
    Param(
        [ValidateLength(4,30)][string]$source_server
        , [ValidateLength(4,30)][string]$source_db
        , [ValidateLength(4,30)][string]$destination_server
        , [ValidateLength(4,30)][string]$destination_db
        , [ValidateLength(4,30)][string]$article
    )
    Process
    {
        Function Loop-ReplicationTables {
            Param(
                [ValidateScript({Test-Path $_})][string]$smolibrary
                , [ValidateLength(4,30)][string]$articleserver
                , [ValidateLength(4,30)][string]$articledatabase
                , [ValidateLength(5,50)][string]$article
            )
            Process
            {
                $nl = [Environment]::NewLine
                Add-Type -Path $smolibrary

                $srv = New-Object Microsoft.SqlServer.Management.SMO.Server($articleserver)

                [int]$reptab_rwcnt = $srv.Databases["$articledatabase"].Tables["$article"].RowCount
        
                if ($reptab_rwcnt -ne $null)
                {
                    return $reptab_rwcnt
                }
            }
        }

        Function LoopSave-TransReplication {
            Param(
                [ValidateLength(4,30)][string]$source_server
                , [ValidateLength(4,30)][string]$source_db
                , [ValidateLength(4,30)][string]$destination_server
                , [ValidateLength(4,30)][string]$destination_db
                , [ValidateLength(4,30)][string]$article          
            )
            Process
            {
				### Note that I've hard-coded the SMO library here for a 2014 Instance
				### 2014: 120
				### 2012: 110
				### 2008R2: 100
				### 2008: 100
				### 2005: Upgrade (90)
                $smo = "C:\Program Files (x86)\Microsoft SQL Server\120\SDK\Assemblies\Microsoft.SqlServer.Smo.dll"

                [int]$src = Loop-ReplicationTables -smolibrary $smo -articleserver $source_server -articledatabase $source_db -article $article
                [int]$dest = Loop-ReplicationTables -smolibrary $smo -articleserver $destination_server -articledatabase $destination_db -article $article
                [int]$smosaveddiff = ($src - $dest)

                $smo_srcob = "$source_server.$article"
                $smo_dstob = "$destination_db.$article"

                Write-Host $smosaveddiff
            }
        }

        LoopSave-TransReplication -source_server $source_server -source_db $source_db -destination_server $destination_server -destination_db $destination_db -article $article
    }
}

QuickCheck-Replication -$source_server "OURSRCSRV\IN" -$source_db "OurSrcDb" -$destination_server "OURDBSRV\IN" -$destination_db "OurDestDb" -$article "OurArticle"

Two cautions here: this script assumes that the source and destination article matches - it may not. It also has the SMO library built in for the SQL Server version 2014 (120). If you're using another version, like 2012, then switch the 120 to 110. This script can quickly help me check if the data are flowing, without interrupting the flow. You can find many internet debates about using the properties as opposed to a direct query, however, in testing this approach I find .NET's library to be fast and effective while generally providing an accurate estimate. Provided that someone understand this property versus a direct count, it is helpful.

Summary

These queries provide some useful tools depending on the particular issue, though, unfortunately, these queries cannot solve:

  1. Distribution architecture that is not configured for an OLTP load.
  2. Processes that have heavy interference with replication, and do little for end users, or that have superior alternatives that don't interfere with replication.

Sometimes the above two items are the fundamental cause and as the old saying goes, "No one raindrop thinks it caused the flood."

Next Steps
  • Test the above queries in your replication environment.
  • With some of these queries, become familiar with the messages, as it can appear noisy at first glance (the fourth one especially).
  • Check out these other Replication tips


sql server categories

sql server webinars

subscribe to mssqltips

sql server tutorials

sql server white papers

next tip



About the author
MSSQLTips author Tim Smith Tim Smith works as a DBA and developer and also teaches Automating ETL on Udemy.

This author pledges the content of this article is based on professional experience and not AI generated.

View all my tips



Comments For This Article




Thursday, October 15, 2020 - 5:18:23 AM - ying Back To Top (86643)
the query was very helpfull
now we can automate the monitoring for replication errors
thanks

Friday, September 2, 2016 - 6:05:17 PM - Rhostam Back To Top (43261)

Very thoughtful writeup. Much appreciated. I particularly liked the Powershell script. Creative way to get a quick summary.

 

Thanks, 

 


Tuesday, May 10, 2016 - 4:51:31 PM - Tim Back To Top (41450)

@Curtis

Does your source table match your destination table?  If not, you might get an error - the above script looks at the same table for the source and destination.  If your environment doesn't, that may be an issue.


Monday, May 9, 2016 - 4:52:22 PM - Curtis Back To Top (41441)

 

Figured it out I think!!.

Had to put and '=' sign in for each parameter.  -$source_server="APPSVR5408"

Not sure why

Now the script ends with an error...

Cannot index into a null array.
At line:25 char:80
+                 [int]$reptab_rwcnt = $srv.Databases["$articledatabase"].Tables[ <<<< "$article"].RowCount
    + CategoryInfo          : InvalidOperation: (-=APPPRoduction_Publication:String) [], RuntimeException
    + FullyQualifiedErrorId : NullArray

which I will try to figure out.


 

 

 


Monday, May 9, 2016 - 4:42:23 PM - Curtis Back To Top (41440)

Here is what I run... (names are changed but are the same length as shown)

 QuickCheck-Replication -$source_server "APPSRV5408" -$source_db "APPProduction" -$destination_server "APPSRV5406" -$destination_db "APPProduction_Replication" -$article "APPPRoduction_Publication"

And I get the error about the parameter 'source_server'. (which is the first parameter)

Messing around.. If I remove the '$' from the source_server parameter and run it,  I now get the same error for the 'source_db' parameter. (the next parameter with a $)

QuickCheck-Replication -source_server "APPSRV5408" -$source_db "APPProduction" -$destination_server "APPSRV5406" -$destination_db "APPProduction_Replication" -$article "APPPRoduction_Publication"

If I remove all '$' from the statement the query runs (for a while) but it doesn't return anything.

 

 

 

 


Monday, May 9, 2016 - 2:07:56 PM - Tim Back To Top (41437)

@Curtis

Yes, there is parameter validation with the script and a person would enter the appropriate values for the parameters:

What are your server names, database names, and article name?
-$source_server ""
-$source_db "" 
-$destination_server ""
-$destination_db ""
-$article ""

In addition, if your names don't follow the restrictions in the validation of "[ValidateLength(4,30)]", then you'll either need to expand or shrink the validation.


Monday, May 9, 2016 - 1:04:12 PM - Curtis Back To Top (41436)

Cannot get the powershell script to work.

"Cannot validate argument on parameter 'source_server'. The number of characters (1) in the argument is too small. Specify an argument whose length is greater than or equal to "4" and then try the command again."

Are the -$source_server "OURSRCSRV\IN" parameters just replaceable items?

Seems that the -$source_server "servername" (and all other parameters) are causing the problem.

I simply replaced "OURSRCSRV\IN" with "myservername"  (and the other parameters also).

Please help.

 















get free sql tips
agree to terms