Auto Fix SQL Server Transactional Replication Error 20598 for UPDATEs

By:   |   Comments (6)   |   Related: > Replication


Problem

When using SQL Server replication, from time to time we will see replication error 20598 which is a pain to fix without having to resort to snapshot synchronization or error skipping. In this tip I show you an easier approach to fix these issues.

Solution

In SQL Server transactional replication, error 20598 refers to a missing row on the subscriber side, and there are two scenarios that can cause this error:

  1. An UPDATE command cannot be replicated because there is no record that matches the update condition on the subscriber side.
  2. A DELETE command cannot be replicated because there is no record that matches the delete condition on the subscriber side.

In this tip, we will address the UPDATE scenario.

Manual Way to Fix SQL Server Error 20598 for an UPDATE

The traditional manual way to fix this is:

  1. Start the replication monitor and in the replication monitor you can see the error message as shown in Fig_1 below

  2. Copy the transaction sequence number, in this case it is:0x0000003E00000041000800000000 and notice Command ID 2 in the message

  3. Open an SSMS window and connect to the distributor and run the following command and you will get results similar to Fig_2 below

    USE Distribution;
    EXEC sp_browsereplcmds '0x0000003E00000041000800000000', '0x0000003E00000041000800000000'
    GO

  4. Focus on the [command] and [command_id] columns. Copy the value from the [command] column where [command_id]=2

  5. From step 4, you can figure out the table name on which the replication command fails. The table name is embedded in the system generated stored procedure name. In my case, the [command] column value (for command_id=2) is as shown below.  The stored procedure is sp_MSupd_dbot2 and from this we can tell that my table is dbo.t2.

    {CALL [sp_MSupd_dbot2] (,,,,,'nice 3',3,'good',2014-08-31 07:49:47.960,{C1479523-6839-4C90-9429-EE31CD2D5831},0x20)}

    Open an SSMS window and connect to subscriber database and exec sp_helptext sp_MSupd_dbot2.  From this you can figure out the missing row based on the values of the parameters of sp_MSUpd_dbot2 (i.e. @pkc1,...,@pkcN)

  6. You can then insert the missing row on the subscriber table using the following format:


  7. insert into <subscriber_server>.<sub_db>.<schema>.<table_name> (<column_list>)
    select (<column_list>)
    from <publisher_server>.<pub_db>.<schema>.<table_name>
    where pk_col_1 = <value1> and pk_col_2 =  <value2> and pk_col_n = <valuen>
    

    or you can export the missing data from the publishing table to a file and then import the data back to the subscription table.
Repl_Error_20598.jpg
Fig_1: Replication Error 20598 in Replication Monitor
sp_browsereplcmds result
Fig_2: SP_BrowseReplCmds result

Technical Challenges to Automate this Process

Basically, we will try to mimic the steps as outlined in the manual steps mentioned above. However, there are a few technical challenges before we can have a qualified solution.

  • We need to extract the table name based on the result of sp_browsereplcmds.
  • We need to find the information about the primary key columns and their matching values for the table, so we can select the missing row from the publication table.
  • We need to move the data from the publisher table to the subscriber table without any changes to the existing environment, such as adding a linked server or creating a specific folder/file etc.

PowerShell Solution to Automate Fixing Error 20598 for UPDATEs

The solution will use PowerShell to:

  • connect to each replication server (Distributor, Publisher, Subscriber)
  • then check dbo.MSRepl_errors for error code 20598 and grab the [xact_seqno] and [command_id] values
  • use this data to connect to the distributor
  • run sp_browsereplcmds to retrieve the table name and the primary key values
  • use the table and the primary key values to retrieve the missing row from the publication table
  • and then insert the row into the subscription table.
<#Assumption: 1. the schema of all the publishing tables remains the same, and by default, it is 'dbo'
            2. The computer on which this script runs has sql server 2012 PowerShell module installed
            3. The computer on which this script runs has PowerShell V3 or above and .net 3.5 or above installed
Usage:   1. modify $schema / $PubSvr / $PubDB / $SubSvr / $SubDB / $Distributor to be your environment settings
         2. Test in the test environment by initially set $debug = $true, recommended to run in the ISE first to get familiar with the script
         3. When there are potential many rows missing (for update), please set $duration to a proper time (like 10 min), 
            and it will loop to fix all the missing rows one by one.
#>
#Requires -Version 3.0
import-module sqlps -DisableNameChecking;

[int] $duration = 3 # in minutes, how long this script will run, this is to handle multiple missing rows
[boolean]$debug = $true; # when $true, it will print out a query statement that can be executed on the subscriber server to insert the missing row
                         # it will not do any real fix work as such it will only print out the first fixing query.
[string]$schema='dbo';# the schema of the all publishing tables, 
[string]$PubSvr = 'tp_w520\sql2k8r2'; #publisher, in my example, this is sql server 2008R2
[string]$PubDB = 'testdb'; #publication database
[string]$SubSvr = 'tp_w520';#subscriber, in my example, this is sql server 2012
[string]$SubDB = 'testdb'; #subscription database
[string]$Distributor = 'tp_w520';#replication distributor

[datetime]$end_time = (get-date).AddMinutes($duration);

#first let's see what's the latest error 20598
[string]$maxid = '0'; 
[string]$xact_seq='0x00';
[string]$command_id='0'

$qry = "select id from dbo.MSPublisher_Databases where publisher_db='$pubdb';"
$result2 = invoke-sqlcmd -ServerInstance $Distributor -database Distribution -Query $qry;

while ((get-date) -le $end_time) {
    $qry = @"
    declare @id int;
    select @id = max(id) from dbo.MSRepl_Errors where id > $maxid and error_code=20598 and time > dateadd(hour, -1, getdate())
    and  (xact_seqno<>$xact_seq or command_id <> $command_id);
    select top 1 id=cast(id as varchar(8)), xact_seqno=convert(varchar(42), xact_seqno,1), command_id=cast(Command_id as varchar(8)) from dbo.MSRepl_Errors where id = @id
"@
    $result = $null;
    $result=invoke-sqlcmd -ServerInstance $Distributor -Database Distribution -Query $qry;

    if ($result -eq $null) {
        start-sleep -Seconds 15;
        continue;
    }
    else {
        $maxid = $result.ID;
        $xact_seq=$result.xact_seqno;
        $command_id = $result.command_id;
    }

    $qry = @"
    exec sp_browsereplcmds @xact_seqno_start='$($result.xact_seqno)', @xact_seqno_end='$($result.xact_seqno)'
    , @command_id=$($result.command_id), @publisher_database_id=$($result2.id);
"@;

    $result3 = invoke-sqlcmd -ServerInstance $Distributor -database Distribution -Query $qry; 
    [string]$odbc_qry = $result3.command;

    [string]$reg_pattern ="sp_msupd_$($schema)(.+?)\]"
    if ($odbc_qry -match $reg_pattern) {
        $TableName ="$($schema).$($matches[1])"
    }
    else {
       write-error 'The current error cannot be solved by this solution, which aims only at missing rows when updating.'}

    #need to get the parameter values and put into $pm
    if ($odbc_qry -match "(\{CALL \[sp_MSUpd_.+)\]") {
       if ($odbc_qry -match "\((.+?)\)") {
        [string]$pm = $matches[1];
       }
       else {
        write-error "cannot find parameters";
        return;
       }
    }
    else {
        write-error 'The current error cannot be solved by this solution, which aims only at missing rows when updating.'
        return;
    }

    #next we need to script out the columns of the Subscription Table
    $svr = new-object "microsoft.sqlserver.management.smo.server" $SubSvr
    $db = $svr.databases[$subdb];

    $tbl=$db.Tables[$TableName.split('.')[1], $TableName.split('.')[0]];

    [string]$col_list = "";

    $tbl.Columns |  % {$col_list += $_.Name + ',' };
    $col_list = $col_list.remove($col_list.length-1);
 
    #we need to grab the where clause in the replication update stored proc of the subscription table
    $p = $db.StoredProcedures["sp_MSUpd_$($TableName.replace(""."", """"))"];

    if ($p -eq $null) {
        write-error "cannot find the stored proce of sp_MS$($Fix_Type)_$($TableName.replace(""."", """"))"
        return;
    }

    $ar = $pm.split(",")
    [int]$i = 0;
    $where_clause = @{};
    while ($i -lt $p.parameters.count ) {
        if ($p.parameters[$i].name -match '@pkc[0-9]{1,2}') {
            switch ($p.parameters[$i].DataType.Name) {
             'datetime' { $ar[$i] = "'" + $ar[$i] + "'"; break;}
             'uniqueidentifier' {$ar[$i]=$ar[$i].replace("{", "'");$ar[$i]=$ar[$i].replace("}", "'"); break;}
            }
             $where_clause.add($p.parameters[$i].name, $ar[$i]);
            #$where_clause. += New-Object -TypeName PSObject -Property @{ $($p.parameters[$i].name)=$ar[$i]; }
            
        }
        $i += 1;
    }

    $re = new-object "System.Text.RegularExpressions.regex" ("where.+?(?=if)", [System.Text.RegularExpressions.RegexOptions]::Singleline)
    $where=$re.match($p.TextBody) 
    [string]$str = $where.Value;

    foreach ($k in $where_clause.keys) {
        $str=$str.replace($k, $where_clause[$k]);
    }

    [string]$qry = "select $col_list from $TableName $str";

    if ($debug) {
        "insert into $subsvr.$subdb.$tablename ($col_list) `
         select $col_list from $pubsvr.$pubdb.$TableName $str;
        "
    }
    else { # start to auto insert the missing row
        [System.Data.datarow]$r=$null;

        $r = invoke-sqlcmd -ServerInstance $SubSvr -Database $SubDB -Query $qry;

        if ($r -eq $null) {#if subsvr.subdb does not have the row
        #start to insert the missing row from the pubsvr.pubdb.tablename
            [string]$conn =  "Data Source=$SubSvr; trusted_connection=true; initial catalog=$SubDB;"
            $sqlbcp = new-object "system.data.sqlclient.sqlbulkcopy" ($conn, [System.Data.SqlClient.SqlBulkCopyOptions]::KeepIdentity)
            $r=$null
            $r = invoke-sqlcmd -ServerInstance $PubSvr -Database $PubDB -Query $qry;

            $sqlbcp.DestinationTableName = $TableName;

            $tbl.Columns | % {$sqlbcp.ColumnMappings.Add($_.name, $_.name) } | out-null;
            $sqlbcp.WriteToServer($r);
       }#if subsvr.subdb does not have the row
    } #end to auto insert the missing row
    start-sleep -Seconds 15; # repl error will take some time before being written to distribution.dbo.msrepl_errors table.
}

How to Test

To test this fix do the following:

  • Set up replication with an article in a test environment and add some data to the table. 
  • Then delete multiple rows on the subscriber
  • Then update these deleted rows on the publishing table.
  • Then start the replication monitor to see that the replication error 20598 occurs (just as Fig_1 shows).
  • Now run this script in a PowerShell ISE (by providing the correct replication servers information in the script) and the replication errors should disappear.

Note: The replication distribution agent may stop after a bunch of errors. If so, just restart it so the error can be logged into the distribution.dbo.MSRepl_Errors table.

Next Steps

The script can be run from any computer as long as you can connect to the distributor, publisher and subscriber from that computer.

  • Test the script in a test environment
  • Read this tip to learn about Handling Data Consistency Errors in SQL Server Transactional Replication
  • You can customize the script to adapt to your needs. For example, in my environment I created a stored procedure [uspA] with parameters for distributor/publisher/subscriber and I also created a table [tblReplicationConfig] which will be populated by [uspA].  The PowerShell script is customized to retrieve the data from [tblReplicationConfig] and the PowerShell script is embedded in a job step of [jobA]. The [uspA] will first populate [tblReplicationConfig] with the input parameters and then start the Job [jobA].  The PowerShell script in [jobA] will then pick up the information from [tblReplicationConfig] and then try to fix the error. So the scenario is: when I receive replication 20598 errors, I run [uspA] with parameters and then [jobA] will fix the replication error if these 20598 errors are caused by an UPDATE on a missing row.


sql server categories

sql server webinars

subscribe to mssqltips

sql server tutorials

sql server white papers

next tip



About the author
MSSQLTips author Jeffrey Yao Jeffrey Yao is a senior SQL Server consultant, striving to automate DBA work as much as possible to have more time for family, life and more automation.

This author pledges the content of this article is based on professional experience and not AI generated.

View all my tips



Comments For This Article




Wednesday, October 26, 2022 - 8:50:03 AM - Alex Back To Top (90634)
Hi,

We have a "Spanish Environment" and when we use "sp_browsereplcmds" to browse the command, if the table has any real or float column, the query output contains a comma ',' instead of dot '.' as decimal separator for non integer parameter values.

In that case we canīt count then number of commas to determine the parameters values, because the number of commas depends on parameters values.

The question is if we can set any kind of session variable or something like that to get the sp_browsereplcmds's output with dot '.' as decimal separator.

We have try changing the language of the session to English and changin the windows decimal separator of the computer where we run the script.

We not use your powershell solution, we use our own pure transact-sql script, but the idea is the same.

Thanks,

Alex

Friday, March 6, 2020 - 12:48:31 PM - Kevin Sanders Back To Top (84974)

Wow.  Remarkably thorough and informative post.  Thanks much.  Had this exact scenario this morning.  Took the long way around today with a resync, but this will help me automate a fix and a monitoring process to handle this before the users start to scream.  thanks again.  Kevin Sanders


Wednesday, January 7, 2015 - 2:17:49 PM - jeff_yao Back To Top (35860)

Thanks Greg.

Jeffrey


Wednesday, January 7, 2015 - 1:39:27 PM - Greg Robidoux Back To Top (35858)

I updated the script and removed the very last }.

-Greg


Wednesday, January 7, 2015 - 1:25:33 PM - jeff_yao Back To Top (35857)

Thanks Steve, the bottom } should be removed, I guess it is my editing issue (i.e. during deletion, and copy & paste, I somehow did not delete cleanly).

Please let me know if you have any other issue.

 


Wednesday, January 7, 2015 - 11:55:33 AM - Steve Back To Top (35856)

When trying to run your code in PS, I am receiving the following error:

At line:152 char:1

 

+ }

 

+ ~

 

Unexpected token '}' in expression or statement.

 

+ CategoryInfo : ParserError: (:) [], ParentContainsErrorRecordException

 

 

+ FullyQualifiedErrorId : UnexpectedToken















get free sql tips
agree to terms