By: Jeffrey Yao | Updated: 2016-11-15 | Comments (1) | Related: More > Comparison Data and Objects
Problem
From time to time I need to compare data in two tables, which may be on different SQL Servers. For example, in a distributed environment with SQL Server replication, say, we have one central publisher and several subscribers. At a point of time, I want to know whether products of a specific type are updated to be the same price, warranty, specification, etc. at all the different subscriber sites. In our multi-server environment, there are also times when we need to compare tables on different servers that are not part of SQL Server replication.
Solution
How to compare data in SQL Server tables (or compare table structures) has been a topic for long time. Other than the TableDiff utility, most solutions are based on T-SQL (see links in the Next Steps section), but they are not flexible enough. For example, I can see the following deficiencies:
- We want to compare tables on different servers where linked servers are not allowed (tablediff can do this).
- We consider a bigint 10 is different from a regular int 10, or 1.01 of decimal(5,2) is different from decimal(6,2).
- When a table has a column with a data type such as [text], [image], some T-SQL solutions won't work.
- When comparing big tables (like with a 200+ thousands records), it may take a long time.
- TableDiff utility will stop working if table schemas are different.
Most of the time we only need a Yes or No regarding whether the table records are the same. If Yes, continue our work, if No, we will do some type of reconciliation to fix the issue.
The algorithm of this solution is as follows:
- Dump the data from the two tables to two data files using BCP queryout
- Read the two data files and get a MD5 hash value of each file with .Net MD5 class
- Compare the MD5 hash values and if they are the same, then Yes (i.e. data are the same in the two tables), otherwise No
The advantages of this method is that we can parameterize all the criteria, such as SQL Server instance name, database name, query, etc., and most importantly it addresses all the deficiencies of the T-SQL solutions mentioned above.
Sample Data
We will prepare some sample tables and data.
use mssqltips; -- create two tables, the only difference is [Salary] column data type if object_id('dbo.s', 'U') is not null drop table dbo.s; create table dbo.s (id int identity primary key, [Name] varchar(50), [DOB] datetime, [Address] varchar(200), [Salary] decimal(8,2)); if object_id('dbo.t', 'U') is not null drop table dbo.t; create table dbo.t (id int identity primary key, [Name] varchar(50), [DOB] datetime, [Address] varchar(200), [Salary] decimal(9,2)); GO -- populate the two tables with sample data insert into dbo.s ([Name], [DOB], [Address], [Salary]) values ('Mark', '1980-01-01', '111 ABC street, Seattle, WA', 108888.11) ,('James', '1970-01-01', '222 XYZ street, Vancouver, BC', 122333.22) ,('Jenny', '1982-11-01', '333 DEF Avenue, London, ON', 88999.33) ,('Mary', '1990-02-02', '444 GHI Road, Edmonton, AB', 77777.44) ,('John', '1988-05-11', '555 JKL Blvd, Houston, IL', null); -- copy the exact records from source table to target table insert into dbo.t ([Name], [DOB], [Address], [Salary]) select [Name], [DOB], [Address], [Salary] from dbo.s go
Here is the PowerShell code, you can include this in a module. For easy demonstration purposes, you can just copy and paste the code into a PS ISE window and run it.
<# .Synopsis Compare two sql tables or two result sets from sql queries .DESCRIPTION Compare two sql tables or two result sets from sql queries .EXAMPLE Compare-QueryResult -SourceServer . -SourceDB MSSQLTips -SourceQuery 'select * from dbo.s order by id' -TargetQuery 'select * from dbo.t' .EXAMPLE #we compare value only between the two query resultsets. Compare-QueryResult -SourceServer . -SourceDB MSSQLTips -SourceQuery 'select * from dbo.s order by id' -TargetQuery 'select * from dbo.t' -ValueOnly .INPUTS -SourceServer: A sql server instance name, default to local computer name -SourceDB: A sql database on $SourceServer, default to 'TempDB' -SourceQuery: A query against SourceServer\SourceDB (mandatory) -TargetServer: A sql server instance name, default to $SourceServer -TargetDB: A sql database on $TargetServer, default to $TargetDB -TargetQuery: A query against TargetServer\TargetDB (mandatory) -ValueOnly: compare value only, so if $SourceQuery get a value of 10 from column with datatype [int], and $TargetQuery get a value of 10 from a colum with datatype [bigint], the value are the same. -TempFolder: this is a folder where we need to put temporary files, default to c:\temp\. If you do not have this folder created, an error will occur. .OUTPUTS A string value of either 'Match' or 'UnMatch'; #> #Requires -Version 3.0 function Compare-QueryResult { [CmdletBinding()] [OutputType([string])] Param ( # Param1 help description [Parameter(Mandatory=$false, Position=0)] [string] $SourceServer = $env:ComputerName , [Parameter(Mandatory=$false)] [String] $SourceDB='tempdb', [Parameter(Mandatory=$true)] [String] $SourceQuery, [Parameter(Mandatory=$false)] [String] $TargetServer='', [Parameter(Mandatory=$false)] [String] $TargetDB='', [Parameter(Mandatory=$true)] [String] $TargetQuery, [Parameter(Mandatory=$false)] [switch] $ValueOnly, [Parameter(Mandatory=$false)] [String] $TempFolder='c:\temp\' ) begin { [string]$src_result=''; [string]$dest_result=''; } Process { if (-not (test-path -Path $TempFolder) ) { Write-Error "`$TempFolder=[$TempFolder] does not exist, please check !"; return; } #prepare the necessary variables [string]$source_bcp_file = "$TempFolder\src_bcp_out.dat"; # the output file is hard-coded here, but you can change it to your own needs [string]$target_bcp_file = "$TempFolder\dest_bcp_out.dat"; if ($TargetServer -eq '') { $TargetServer = $SourceServer;} if($TargetDB -eq '') {$TargetDB = $SourceDB;} #bcp data out to files if ($ValueOnly) { bcp "$SourceQuery" queryout "$source_bcp_file" -T -S "$SourceServer" -d $SourceDB -t "|" -c | out-null; bcp "$TargetQuery" queryout "$target_bcp_file" -T -S "$TargetServer" -d $TargetDB -t "|" -c | out-null; } else { bcp "$SourceQuery" queryout "$source_bcp_file" -T -S "$SourceServer" -d $SourceDB -t "|" -n | out-null; bcp "$TargetQuery" queryout "$target_bcp_file" -T -S "$TargetServer" -d $TargetDB -t "|" -n | out-null; } #create MD5 [System.Security.Cryptography.MD5] $md5 = [System.Security.Cryptography.MD5]::Create(); #read source file [System.IO.FileStream] $f = [System.IO.File]::OpenRead("$source_bcp_file"); #hash the src file [byte[]] $hash_src = $md5.ComputeHash($f); $f.Close(); #read the target file [System.IO.FileStream] $f = [System.IO.File]::OpenRead("$target_bcp_file"); #hash the target file [byte[]] $hash_dest = $md5.ComputeHash($f); $src_result=[System.BitConverter]::ToString($hash_src); $dest_result=[System.BitConverter]::ToString($hash_dest); #cleanup $f.Close(); $f.Dispose(); $md5.Dispose(); del -Path $source_bcp_file, $target_bcp_file; #compare the hash value } End { if ($src_result -eq $dest_result) { write-output 'Match'} else { Write-Output 'UnMatch'; } } } #Compare-QueryResult
After running the script in one ISE window, open a new ISE window and run the following code (assuming the target server and database are the same as the source, so they do not need to be explicitly added here).
#After running Compare-QueryResult.ps1, we can run the following test in a new window #we will get "UnMatch" because [Salary] columns have different datatype Compare-QueryResult -SourceServer . -SourceDB mssqltips ` -SourceQuery "select * from dbo.s" -TargetQuery "select * from dbo.t"; #now let's compare VALUE only, we will get 'Match' Compare-QueryResult -SourceServer . -SourceDB mssqltips -ValueOnly ` -SourceQuery "select * from dbo.s" -TargetQuery "select * from dbo.t";
You will see that when you run the first cmdlet function, it will return "UnMatch", this is because [Salary] columns are different in dbo.s and dbo.t, one is of data type decimal(8,2) and the other is decimal(9,2), even though both tables have exactly the same count of rows with the same values.
However, when you run the second cmdlet function with parameter -ValueOnly, (comparing values only), we will get a "Match" as shown below.
Now let's do another example. We will update one record in dbo.t as follows and then re-run the cmdlet with parameter -ValueOnly and we expect an "UnMatch". Let's see what happens.
-- if I update the [Salary] for Mark from 108888.11 to 108888.12 update t set [Salary]=108888.12 -- originally is 108888.11 from dbo.t where [Name]='Mark';
Re-run the second cmdlet in the PS ISE Window.
Compare-QueryResult -SourceServer . -SourceDB mssqltips -ValueOnly ` -SourceQuery "select * from dbo.s" -TargetQuery "select * from dbo.t";
Just as expected, we get an 'UnMatch' as shown below.
Usage
There can be a few usage scenarios of this PowerShell function for parameter -ValueOnly:
- Without using the parameter -ValueOnly, if 'Match' returns, it means the two tables are exactly the same in both data and table schemas (i.e. column numbers and data types). If 'UnMatch' returns, it can be either a data discrepancy or table metadata discrepancy, so we should run the next step.
- If we re-run the cmdlet with parameter -ValueOnly, if 'Match' returns, it just means that the table metadata is different but if 'UnMatch' returns, it means the table data are indeed different, but we still cannot rule out whether the table metadata is different as well.
- Most of the time, we already know the table schemas are the same and we only care about value matching. In such cases, using or not using -ValueOnly are the same.
- In my test, if a column is varchar(N) in table dbo.s, then for the corresponding column in dbo.t, even if the datatype is varchar(N+10), this PS function (without -ValueOnly) still considers that they are the same. This behavior is the same as the TableDiff utility.
Summary
This tip demonstrates a new efficient way to do table comparison by dumping out the table data into files via the BCP utility and then getting MD5 hash values of the files for comparison. The data files format are controlled by -ValueOnly parameter, if using -ValueOnly parameter, we will use -c parameter for BCP.exe (all columns are treated as character data type), otherwise, we use -n parameter (all columns are treated as native data type) for BCP.exe.
In the real world, there may be many different requirements for table comparison. One weird example, if tableX and tableY have the same number of columns with the same data types, but the column names or column sequences are different, the requirement still considers tableX and tableY are the same.
For PowerShell to be more popular among the DBA community, we need PowerShell functions that make some tasks easier than trying to do with T-SQL alone.
Next Steps
You can read the following related articles to see other options:
- Ways to compare and find differences for SQL Server tables and data
- SQL Server tablediff command line utility
- On Comparing Tables in SQL Server
About the author
This author pledges the content of this article is based on professional experience and not AI generated.
View all my tips
Article Last Updated: 2016-11-15