Revisit your use of the SQL Server REMOTE join hint

By:   |   Updated: 2012-09-18   |   Comments (5)   |   Related: > Query Plans


Problem

At one point you may have been advised to use the REMOTE join hint when using queries across SQL Server linked servers. You may have even observed substantial performance gains by specifying this join hint; if the local table had a small number of rows, and the remote table is significantly larger, you would expect the result of the hint (which tells SQL Server to process the join remotely) to lead to a better overall query. But unless those systems are still running the same version of SQL Server, what worked well once may not be working so well today.  Check out this tip to learn more.

Solution

Over the past week I have played with various local table sizes and very large remote data sources (both permanent tables and derived tables). I would expect these to be areas for the REMOTE hint to shine, since they are right in that hint's wheelhouse: small, local tables that make more sense to push to the remote server instead of trying to join over the wire. The result was actually the opposite in all cases: the REMOTE hint caused a substantial performance hit. I built these simple tables locally, on a SQL Server 2012 instance:

CREATE TABLE dbo.blab_key(id INT PRIMARY KEY);
GO
CREATE TABLE dbo.blab_heap(id INT);
GO
-- four rows that are valid object_ids in remote system
INSERT dbo.blab_key  VALUES(-649911850),(-480106018),(-303587480),(-448);
INSERT dbo.blab_heap VALUES(-649911850),(-480106018),(-303587480),(-448);

Then I wrote queries like these, joining the IDs to both fabricated and permanent tables on remote 2005 and 2012 instances:

-- standard inner joins:
SELECT * FROM splunge.dbo.blab_key AS k
  INNER JOIN [remote_server].tempdb.sys.all_objects AS o
  ON k.id = o.[object_id]
  INNER JOIN [remote_server].tempdb.sys.all_columns AS c
  ON o.[object_id] = c.[object_id];
  
SELECT * FROM splunge.dbo.blab_heap AS k
  INNER JOIN [remote_server].tempdb.sys.all_objects AS o
  ON k.id = o.[object_id]
  INNER JOIN [remote_server].tempdb.sys.all_columns AS c
  ON o.[object_id] = c.[object_id];
  
-- REMOTE hint on initial join:
SELECT * FROM splunge.dbo.blab_key AS k
  INNER REMOTE JOIN [remote_server].tempdb.sys.all_objects AS o
  ON k.id = o.[object_id]
  INNER JOIN [remote_server].tempdb.sys.all_columns AS c 
  ON o.[object_id] = c.[object_id];
  
SELECT * FROM splunge.dbo.blab_heap AS k
  INNER REMOTE JOIN [remote_server].tempdb.sys.all_objects AS o
  ON k.id = o.[object_id]
  INNER JOIN [remote_server].tempdb.sys.all_columns AS c 
  ON o.[object_id] = c.[object_id];
  
-- REMOTE hint on both joins:
SELECT * FROM splunge.dbo.blab_key AS k
  INNER REMOTE JOIN [remote_server].tempdb.sys.all_objects AS o
  ON k.id = o.[object_id]
  INNER REMOTE JOIN [remote_server].tempdb.sys.all_columns AS c
  ON o.[object_id] = c.[object_id];
  
SELECT * FROM splunge.dbo.blab_heap AS k
  INNER REMOTE JOIN [remote_server].tempdb.sys.all_objects AS o
  ON k.id = o.[object_id]
  INNER REMOTE JOIN [remote_server].tempdb.sys.all_columns AS c
  ON o.[object_id] = c.[object_id];

Results as shown by SQL Sentry Plan Explorer:

Plan Explorer runtime results from first batch of queries

The results largely speak for themselves - the version with one REMOTE join performed marginally better in terms of duration (~3%), but had higher reads. The query with two REMOTE joins was atrocious.

And then this slight variation, where I try to force a REMOTE join against the product of sys.all_objects and sys.all_columns:

SELECT * FROM splunge.dbo.blab_key AS k
  INNER JOIN ( SELECT o.[object_id]
      FROM [remote_server].tempdb.sys.all_objects AS o
      CROSS JOIN [remote_server].tempdb.sys.all_columns AS c
  ) AS o ON k.id = o.[object_id];
  
SELECT * FROM splunge.dbo.blab_heap AS k
  INNER JOIN ( SELECT o.[object_id]
      FROM [remote_server].tempdb.sys.all_objects AS o
      CROSS JOIN [remote_server].tempdb.sys.all_columns AS c
  ) AS o ON k.id = o.[object_id];
SELECT * FROM splunge.dbo.blab_key AS k
  INNER REMOTE JOIN ( SELECT o.[object_id]
      FROM [remote_server].tempdb.sys.all_objects AS o
      CROSS JOIN [remote_server].tempdb.sys.all_columns AS c
  ) AS o ON k.id = o.[object_id];
  
SELECT * FROM splunge.dbo.blab_heap AS k
  INNER REMOTE JOIN ( SELECT o.[object_id]
      FROM [remote_server].tempdb.sys.all_objects AS o
      CROSS JOIN [remote_server].tempdb.sys.all_columns AS c
  ) AS o ON k.id = o.[object_id];
Plan Explorer runtime results from second batch of queries

And here we see similar results - the REMOTE join performs poorly compared to a join without the hint, and look at the reads incurred for the REMOTE join with the heap, even with a slightly lower duration that has to be a concern, especially if you have an inferior or overloaded I/O subsystem.

You're probably thinking, well, those aren't really tables, and yes, you're mostly right. I listed what I used here to make it much easier to test, but I confirmed very similar behavior against very large tables that I constructed myself. I also tried variations where, instead of SELECT *, I used individual columns from both sides. Results were consistent across all of these tests: INNER JOIN performed better than INNER REMOTE JOIN in almost all tests, and was only beaten marginally in one single test as shown above.

Conclusion

As you can see, the REMOTE join hint does not seem to perform as well as the documentation would lead you to believe - at best it performs the same as leaving the hint out, depending on how you measure. You may have old code that has been upgraded to modern versions that isn't performing as well as it could be, and you may have been lulled - over the years - into believing that this join hint will always help these scenarios. I think join hints in general are very useful, but they can also be very dangerous if they're used out of habit instead of for very narrow scenarios. So I strongly recommend you revisit any code that is currently using this hint, and test thoroughly before implementing it in any new development.

That said, I would, of course, be very interested to hear about any scenarios on modern versions of SQL Server (2008+) where an explicit REMOTE join yields better performance than an equivalent query without the hint (and by "better" I mean less reads *and* a duration different by more than a few percent).

Next Steps


sql server categories

sql server webinars

subscribe to mssqltips

sql server tutorials

sql server white papers

next tip



About the author
MSSQLTips author Aaron Bertrand Aaron Bertrand (@AaronBertrand) is a passionate technologist with industry experience dating back to Classic ASP and SQL Server 6.5. He is editor-in-chief of the performance-related blog, SQLPerformance.com, and also blogs at sqlblog.org.

This author pledges the content of this article is based on professional experience and not AI generated.

View all my tips


Article Last Updated: 2012-09-18

Comments For This Article




Friday, October 10, 2014 - 9:50:50 AM - Paul Back To Top (34920)

I use openquery


Monday, December 24, 2012 - 1:29:09 PM - Dimitrije Back To Top (21126)

This is very intersting.

I currently have a problem with sql2012 using some questionable plans where sql2005 and 2008r2 used prefectly reasonabe plans.

I have a POS terminal that works offline with it's own sql express instance and communicates with central DB trough linked server over VPN only when opening or closing a shift.

On central server we have one of our tables with about 10 milion rows 

On POS we create new rows that are then inserted in central db on shift closing.

it goes throught linked server and is structured something like 

insert into [remote].DB.dbo.table1 (col1, col2..)

select col1, col2,... from table1 a where timestamp>= lastdbts and timestamp < currentdbts

where not exists (select 1 from [remote].DB.dbo.table1 where pk1=a.pk1 and pk2=a.pk2)

And in 2012 optimizer uses nested loops left anti semi join, which practically means we get n calls to linked server wher n is number of local rows that are within timestamp>= lastdbts and timestamp < currentdbts.

It creates a lot of network traffic and, due to latency, it is excruciatingly slow.

Right now I downgraded offline POS to 2008R2 until I fugure out how to deal with this situation. 

 


Tuesday, October 23, 2012 - 1:17:37 PM - Chuck Lathrope Back To Top (20055)

Paul, I haven't heard that adding a join hint would add the implicit force order option. Is it documented somewhere? I have used the force order purposefully in the past and it is really helpful on complicated multitable joins.

Thanks,

Chuck


Wednesday, September 19, 2012 - 3:44:43 AM - Paul White Back To Top (19565)

Hi Aaron,

The REMOTE join hint, like all the other join hints (LOOP | HASH | MERGE) includes an implicit OPTION (FORCE ORDER).  This forces the access order of tables to match the written form of the query.  Your particularly poorly-performing example produces the same plan with OPTION (FORCE ORDER) without the REMOTE join hint.

The FORCE ORDER hint is extremely powerful and widely misunderstood; removing the optimizer's ability to reorder joins (and the other side-effects of this hint) means it should only be used exceedingly rarely, and only by the truly expert.  I could go on about this, but I think you get the idea.

The REMOTE hint can be a very powerful tool for improving DQ plans, but there are traps for the unwary.

Cheers,

Paul


Tuesday, September 18, 2012 - 12:40:22 PM - Lionel Back To Top (19552)

Hi Aaron,

Interesting post.  I'd be interested in the cost from a total response time point of view.  I'm speculating that if it was reasonably low the hint is not optimum for this part of the spectrum as you can imagine that the curve would get much better the row total involved in the join grows.   I assume you updated the statistics too.

cheers,

Lionel















get free sql tips
agree to terms