By: Tim Smith | Updated: 2016-04-13 | Related: Replication
Problem
We've experienced SQL Server replication problems in the past where we thought replication was running normally because our reports showed that source and destination row counts were similar or identical, but when we looked at the data itself, we discovered differences. Can we create an additional check that performs an aggregate function on a numeric column when we suspect the row counts might not be accurate?
Solution
We can use an aggregate function on a numeric column as a second verification, but let's first look at the row count report itself because, depending on how we're running it, we may be creating problems of our own.
In the tip Using PowerShell for SQL Server Replication Tracking, I use .NET's RowCount property for a reason: it's fast, and the source and destination are measured at nearly the same point in time. A few thousand tables can be checked within two minutes, unless network issues make it struggle to connect to either the source or destination server (which would be a red flag for replication too - if that script struggles to connect to a server, replication will as well).
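As a rough sketch of that approach (not the full script from that tip), the snippet below reads the RowCount property from both sides without issuing a query against either table; it assumes the SMO assemblies are installed, and the server, database, and table names are placeholders:

[System.Reflection.Assembly]::LoadWithPartialName("Microsoft.SqlServer.SMO") | Out-Null

### Placeholder servers, database and table - substitute your own:
$sourcesrv = New-Object Microsoft.SqlServer.Management.Smo.Server("OURSOURCESERVER")
$destsrv = New-Object Microsoft.SqlServer.Management.Smo.Server("OURDESTINATIONSERVER")

$sourcecount = $sourcesrv.Databases["OurDatabase"].Tables["OurTable","dbo"].RowCount
$destcount = $destsrv.Databases["OurDatabase"].Tables["OurTable","dbo"].RowCount

### Flag the table if the two metadata row counts differ:
if ($sourcecount -ne $destcount)
{
    Write-Warning "OurTable: source shows $sourcecount rows, destination shows $destcount rows."
}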
Imagine I used a hard count instead, actually performing a COUNT(*) and locking the table: not only could my check delay replication, but one count may finish well before the other, creating the possibility of false alerts. No perfect way exists to monitor replication flow, but we must be aware of the advantages and disadvantages of each approach, since the monitoring itself can become a source of latency (I've seen many replication reports from developers that do this, and when they complain about latency, I point out that bad code guarantees latency).
I strongly advise against using aggregates for regular analysis of data involved in replication, unless you know of specific down times where they will not interrupt any replication or reporting process. Even then, be careful about how you approach it and be willing to accept that you may see some delays in your data. For instance, suppose I sum the same column on the source and the destination and the results differ by 100, while we know replication is still running. Is that a cause for concern? This approach can open a statistical nightmare, because at any given moment with replication in flight, the totals could legitimately be off by 100.
You would not only have to know the average and deviation of the difference, you'd need to know those values for different points of a load. And in a case where I measure a sum after the Hang Seng closes - so, in theory, replication should have finished - I could just as easily measure the RowCount property instead of performing an expensive sum. The bottom line: from a statistical standpoint, this approach can create false alerts, and you'll unintentionally train people to ignore them, which is far worse than relying on row count mismatches or the command and transaction backlog (both of which together are precise and can be used with averages and deviations overall).
For ETL flows, this approach helps, since most ETL loads occur at specific times and should finish at specific times; in those cases, the below script makes sense to use. With those cautions in mind, we can use an aggregate function - like SUM - to perform a check when we suspect there may be an issue with a particular table.
Function Get-CheckSumColumn {
    Param(
        [ValidateLength(7,15)][string]$readserver
        , [ValidateLength(50,4000)][string]$readquery
        , [ValidateRange(0,1)][int]$options
    )
    Process
    {
        $scon = New-Object System.Data.SqlClient.SqlConnection
        $scon.ConnectionString = "Data Source=$readserver;Initial Catalog=master;Integrated Security=true;"

        $cmd = New-Object System.Data.SqlClient.SqlCommand
        $cmd.Connection = $scon
        $cmd.CommandText = $readquery
        $cmd.CommandTimeout = 10

        try
        {
            $scon.Open()
            $sqlread = $cmd.ExecuteReader()
            while ($sqlread.Read())
            {
                ### Option 0 returns a column name; option 1 returns the aggregate value:
                if ($options -eq 0)
                {
                    $checksum_column = $sqlread["COLUMN_NAME"]
                }
                elseif ($options -eq 1)
                {
                    $checksum_column = $sqlread["CheckSumValue"]
                }
            }
        }
        catch [Exception]
        {
            Write-Warning "Get-CheckSumColumn ($options)"
            Write-Warning $_.Exception.Message
        }
        finally
        {
            $cmd.Dispose()
            $scon.Dispose()
        }
        return $checksum_column
    }
}

Function Get-CheckSumQueries {
    Param(
        [ValidateLength(3,75)][string]$readserver
        , [ValidateLength(3,75)][string]$distributiondb
        , [ValidateLength(3,100)][string]$article
        , [ValidateLength(3,100)][string]$source_database
        , [ValidateLength(3,100)][string]$destination_server
    )
    Process
    {
        ### Build the lookup against the distribution database for this article:
        $checksumquery = "
        SELECT DISTINCT
            DENSE_RANK() OVER (ORDER BY (dist.publisher + ' ' + s.publisher_db + ' ' + a.article + ' ' + ss.srvname + ' ' + s.subscriber_db + ' ' + a.destination_object)) ID
            , dist.publisher SourceServer
            , s.publisher_db SourceDatabase
            , ss.srvname DestinationServer
            , s.subscriber_db DestinationDatabase
            , 'SELECT TOP 1 COLUMN_NAME FROM ' + s.publisher_db + '.INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = ''' + a.article + ''' AND DATA_TYPE IN (''numeric'',''int'',''decimal'',''float'')' AS CheckSumQuerySource
            , 'SELECT TOP 1 COLUMN_NAME FROM ' + s.subscriber_db + '.INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = ''' + a.article + ''' AND DATA_TYPE IN (''numeric'',''int'',''decimal'',''float'')' AS CheckSumQueryDestination
        FROM MSArticles a
            INNER JOIN (
                SELECT DISTINCT job_id
                    , publisher
                    , publication_id
                    , publisher_db
                FROM MSreplication_monitordata
            ) AS dist ON dist.publication_id = a.publication_id AND dist.publisher_db = a.publisher_db
            INNER JOIN MSpublications p ON a.publication_id = p.publication_id AND p.publisher_db = a.publisher_db AND a.publisher_id = p.publisher_id
            INNER JOIN MSsubscriptions s ON s.publisher_db = dist.publisher_db
            INNER JOIN master..sysservers ss ON s.subscriber_id = ss.srvid
            INNER JOIN master..sysservers srv ON srv.srvid = s.publisher_id
            INNER JOIN MSdistribution_agents da ON da.publisher_id = s.publisher_id AND da.subscriber_id = s.subscriber_id AND da.publisher_db = dist.publisher_db AND da.job_id = dist.job_id
        WHERE p.publication_type = 0
            AND a.article = '$article'
            AND s.publisher_db = '$source_database'
            AND ss.srvname = '$destination_server'
        "

        $scon = New-Object System.Data.SqlClient.SqlConnection
        $scon.ConnectionString = "Data Source=$readserver;Initial Catalog=$distributiondb;Integrated Security=true;Connection Timeout=0;"

        $cmd = New-Object System.Data.SqlClient.SqlCommand
        $cmd.Connection = $scon
        $cmd.CommandText = $checksumquery
        $cmd.CommandTimeout = 0

        try
        {
            $scon.Open()
            $sqlread = $cmd.ExecuteReader()
            while ($sqlread.Read())
            {
                $checksum_source = $sqlread["CheckSumQuerySource"]
                $checksum_destination = $sqlread["CheckSumQueryDestination"]
                $checksum_sourceserver = $sqlread["SourceServer"]
                $checksum_destinationserver = $sqlread["DestinationServer"]
                $checksum_sourcedatabase = $sqlread["SourceDatabase"]
                $checksum_destinationdatabase = $sqlread["DestinationDatabase"]

                ### Find a numeric column on each side, then sum that column on each side:
                $checksum_sret = Get-CheckSumColumn -readserver $checksum_sourceserver -readquery $checksum_source -options 0
                $checksum_dret = Get-CheckSumColumn -readserver $checksum_destinationserver -readquery $checksum_destination -options 0

                $sourcecheck = "SELECT SUM($checksum_sret) AS CheckSumValue FROM $checksum_sourcedatabase.dbo.$article WITH(NOLOCK)"
                $destinationcheck = "SELECT SUM($checksum_dret) AS CheckSumValue FROM $checksum_destinationdatabase.dbo.$article WITH(NOLOCK)"

                $checksum_svals = Get-CheckSumColumn -readserver $checksum_sourceserver -readquery $sourcecheck -options 1
                $checksum_dvals = Get-CheckSumColumn -readserver $checksum_destinationserver -readquery $destinationcheck -options 1

                Write-Host $checksum_svals
                Write-Host $checksum_dvals
            }
        }
        catch [Exception]
        {
            Write-Warning "Get-CheckSumQueries"
            Write-Warning $_.Exception.Message
        }
        finally
        {
            $cmd.Dispose()
            $scon.Dispose()
        }
    }
}

Get-CheckSumQueries -readserver "" -distributiondb "" -article "" -source_database "" -destination_server ""
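For example, a call against a hypothetical environment - every name below is a placeholder, and the distribution database follows the naming convention discussed in the notes below - might look like the following:

Get-CheckSumQueries -readserver "OURDISTRIBUTOR" -distributiondb "distribution_hangseng" -article "tbHangSeng" -source_database "OurSourceDB" -destination_server "OURREPORTSERVER"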
A few notes on the above script:
- This script assumes that the source and destination tables are structured identically; it also uses a T-SQL query against the distribution database to get the source and destination information from the table name, the destination server, and the source database.
- I limit the timeout for retrieving both the COLUMN_NAME and the CheckSumValue to ten seconds. If a query can't return that quickly, I don't want it to continue; this is only a check, not something important enough to wait on.
- I scale distribution databases by publisher, so I try to avoid assuming a single default distribution database for anything. Every distribution database - in my opinion - should be named distribution_publisher, such as distribution_hangseng, distribution_bitcoin, distribution_shanghai, etc. For this reason, the script asks the developer which distribution database to use.
- The Get-CheckSumColumn function returns either the column name or the checksum value, depending on the $options parameter. Keep this in mind if the script hits the catch block and errors out - the warning includes the option that was passed.
- The script returns one column name from INFORMATION_SCHEMA, looking only at numeric columns of type int, numeric, decimal, or float. If your tables use other numeric data types, such as smallint or tinyint, you may want to include them as well, as shown in the sketch after this list.
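For instance, the column lookup built into CheckSumQuerySource and CheckSumQueryDestination could be widened like this (a sketch only - the database and table names are placeholders, and smallint, tinyint, bigint, and money are only worth adding if your articles actually use them):

### Widened data type filter for the TOP 1 column lookup:
$columnlookup = "SELECT TOP 1 COLUMN_NAME
 FROM OurSourceDatabase.INFORMATION_SCHEMA.COLUMNS
 WHERE TABLE_NAME = 'OurArticle'
  AND DATA_TYPE IN ('numeric','int','decimal','float','smallint','tinyint','bigint','money')"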
Finally, anyone familiar with the law of large numbers knows that at any given moment this check could also report false information for replication, because it's statistically possible for both sums to be equal while the underlying data differ. This is another reason to use it only in specific cases. For ETL applications and timed loads, it will help provided you use columns where you know the data will be different after a new load (you may want to specify which columns to check instead of pulling any one in particular, as in the sketch below).
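If you know which column will change with every load, one option (a sketch only - TradeVolume is a hypothetical column name) is to skip the INFORMATION_SCHEMA lookup inside the while loop of Get-CheckSumQueries and assign the column directly:

### Replace the two Get-CheckSumColumn -options 0 calls with a known column:
$checksum_sret = "TradeVolume"
$checksum_dret = "TradeVolume"
$sourcecheck = "SELECT SUM($checksum_sret) AS CheckSumValue FROM $checksum_sourcedatabase.dbo.$article WITH(NOLOCK)"
$destinationcheck = "SELECT SUM($checksum_dret) AS CheckSumValue FROM $checksum_destinationdatabase.dbo.$article WITH(NOLOCK)"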
Next Steps
- Before using this approach to validate data flow, define the scenario in which you would use it, noting that the above script can be used in ETL processes as well.
- When using this, cross-check your results against what you see for row count mismatches and the transaction and command backlogs.
- Check out all of the SQL Server Replication tips.