Understanding why there is still downtime for SQL Server with Microsoft Clustering

By:   |   Updated: 2009-11-17   |   Comments (13)   |   Related: > Clustering


Problem

I want to institute a high availability solution for SQL Server with no downtime. I am planning on using Microsoft clustering, but I have been told that in the event that the cluster moves the SQL Server from one physical server to another, that there is some downtime. Why is this so?

Solution

There is indeed a period of downtime whenever the SQL Server moves from one physical server to another in a traditional Microsoft clustering solution. Let's take a quick look at why that is.

First, let's consider a typical two-node cluster setup (Figure 1):

Figure 1:

virtual node

In a typical Microsoft clustering solution, all the databases reside on shared storage which all the physical servers, called nodes, can access. However, only one node can access the storage at a given time.

In addition, the SQL Server service for a particular instance of SQL Server installed on the cluster resides on each physical node where SQL Server is allowed to run. For instance, in the case of a default instance, with a SQL Server 2008 default instance setup, we would see a service called SQL Server (MSSQLServer) on each node. For a named instance called MyNamedInstance, we would see a service called SQL Server (MyNamedInstance) on each node. Because the service needs to access the databases, only the node which has access to the shared storage can successfully have the SQL Server service running. All other nodes will have the SQL Server service present in the services list, but in a stopped state. And finally, because of the fact that this SQL Server instance will need to move around between the physical nodes, in a clustering setup the SQL Server is configured to run on a virtual node, one that can float between the nodes.

These virtual node will have the following characteristics that are different from any of the physical nodes:

  • Its own IP address
  • Its own network (NetBIOS and DNS) name

So in the example shown in Figure 1 above, the physical nodes NodeA and NodeB can run the SQL Server instance. They both can access the shared storage and they both have the appropriate SQL Server service installed. For someone who wants to access the SQL Server, they must do so through the virtual node, which I've called VirtualNode1. So I wouldn't connect to NodeA or NodeB, I would set my connection to VirtualNode1. Now I'm simplifying a bunch about how clustering works and how SQL Server interacts with the cluster (and vice versa), but hopefully this gives a good overall picture of Microsoft SQL Server on a Microsoft clustering solution.

Now that we have the big picture view, let's look at the specifics of SQL Server running. We'll start by looking at Figure 2, where NodeA has control of the shared storage and where SQL Server is running. You can note this by the thicker lines from NodeA to the various resources.

Figure 2:

shared storage databases

So if we take a look at both NodeA and NodeB, we'll see that NodeA has the storage. You'll see the drive letters which belong to the SQL Server databases on that physical server. You'll be able to access the files on them and read them (provided they aren't already in use, such as the SQL Server databases themselves). While you might see the drive letters on NodeB, you won't be able to access them. The operating system will give you an error telling you the disk isn't available. Also, if we look at the services, we'll see the SQL Server service started and running on NodeA. On NodeB it'll be stopped. And if we check on the networking side, we'll see that NodeA isn't just running with its IP address, but also the IP address for VirtualNode1. NodeB will just have NodeB's IP address.

Now let's look at a situation where an administrator gracefully moves the SQL Server from running on NodeA to NodeB. And when we look at what happens behind the scenes, we'll see why there is downtime.

The cluster receives the request from the administrator and then proceeds to perform the following steps:

  1. Stops the SQL Server Agent service and any services dependent on the SQL Server service.
  2. Stops the SQL Server service (meaning SQL Server isn't available at this point).
  3. Releases NodeA's hold on the IP address and network name for VirtualNode1.
  4. Releases NodeA's hold on the shared storage.
  5. Tells NodeB to take control of the shared storage (which it does).
  6. Tells NodeB to take control of the IP address and network name for VirtualNode1.
  7. Starts the SQL Server service on NodeB (meaning SQL Server is coming back on-line for its clients).
  8. Starts the SQL Server Agent service and any services dependent on the SQL Server service.

Again, I've simplified a lot of things, but this 8 step process is basically what the cluster is doing. Note that from step 2 to step 7 SQL Server is down. NodeA has to release the resources SQL Server is dependent on so that NodeB can take control of them. Until NodeB has control of them, it can't restart the SQL Server service. So even if the shared storage and networking transfer were to happen in milliseconds, you still have the time it takes for the SQL Server service to stop and restart as downtime. There are things that could cause this to draw out, such as a configuration on 32-bit servers where AWE memory is used (along with lock pages in memory). In situations like that, the downtime is extended to take care of these matters. But once the graceful move is over, we'll see our resources on NodeB, such as what is shown in Figure 3.

Figure 3:

sql server service

And when we look at the two physical nodes, we'll see that NodeB has the storage and the IP address for VirtualNode1. We'll also see that the SQL Server service is running on NodeB but is stopped on NodeA. The situation is a reverse of what we saw in Figure 2.

But what about when we have an unintentional "fail-over" such as when NodeA goes down? In this particular case, the cluster will have to realize the node is not available any longer. It does this fairly quickly, but then it still must tell NodeB to bring the resources on-line. So basically we'll be going through steps 5 through 8 from above. Again, we'll see the downtime from whenever the SQL Server became unavailable, such as when NodeA dropped off-line, until NodeB is able to get all the resources under its control and successfully starts its SQL Server service. This is the best case scenario, a perfect fail-over situation, if you will, and this is the quickest scenario for SQL Server to come back online in the event of a real problem. There could be other cases such as when NodeA isn't down but SQL Server is unresponsive (but the service is still running). In this case, the cluster won't note that SQL Server is not available until its "keep alive" check fails. Once it determines SQL Server is not responsive, it will go through the process of trying to gracefully go through all the steps. If there any complications, then the downtime will be extended as the cluster tries to deal with them.

So with a Microsoft clustering solution, there will always be some amount of downtime as the SQL Server moves from one physical node to another, whether this be intentional, such as when it's initiated by a system administrator, or in the event of a failure, such as when the physical node SQL Server was running on suddenly becomes unavailable. Generally, the amount of time its unavailable is relatively small and that's why its considered a satisfactory high availability solution for most scenarios. But it must be understood that in this configuration, downtime is a given.

Next Steps


sql server categories

sql server webinars

subscribe to mssqltips

sql server tutorials

sql server white papers

next tip



About the author
MSSQLTips author K. Brian Kelley K. Brian Kelley is a SQL Server author and columnist focusing primarily on SQL Server security.

This author pledges the content of this article is based on professional experience and not AI generated.

View all my tips


Article Last Updated: 2009-11-17

Comments For This Article




Wednesday, October 29, 2014 - 4:52:37 PM - Niels Back To Top (35118)

Very simple and clear post! Thanks!!!


Wednesday, November 25, 2009 - 10:02:04 AM - K. Brian Kelley Back To Top (4496)

[quote user="originsone"]

 If a transaction is running and a fail over occurs, does the transaction continue once the instance comes up on the other node?

[/quote]

 No, the transaction is rolled back during start-up. Imagine if your SQL Server had a sudden hiccup and went down in the middle of a transaction. When it came back up it would roll what had been completed of that transaction back. Same idea with a cluster fail-over.


Wednesday, November 25, 2009 - 9:46:21 AM - originsone Back To Top (4495)

 If a transaction is running and a fail over occurs, does the transaction continue once the instance comes up on the other node?


Thursday, November 19, 2009 - 12:02:03 PM - Ken Donoghue Back To Top (4470)

I hope you find these useful and informative. Thanks for asking.

http://www.stratus.com/pdf/whitepapers/continuous-processing-for-windows.pdf

http://www.youtube.com/watch?v=eYQ_ib961FE

http://www.stratus.com/news/2009/documents/20091112.pdf


Thursday, November 19, 2009 - 11:45:39 AM - K. Brian Kelley Back To Top (4469)

[quote user="Ken Donoghue"]A fault tolerant server has the equivalent of two physical servers running in lock step as one logical server in a single enclosure. [/quote]

Sounds great. I like that a lot. I do still see the issue of the "kick the plug" bringing down SQL Server completely, though. And if you think about it, in a data center, if a particular circuit gets overloaded, the rack could come down. If your second circuit can't completely handle the load (redundancy in power, right?) then the fault tolerant server would be a victim to this issue. I have seen it, which is why I cite it. However, if you have different racks on different sets of circuits, this is an area where a clustered SQL Server instance can recover.


Thursday, November 19, 2009 - 11:12:49 AM - admin Back To Top (4468)

Ken,

Can you provide a URL or two for these servers to get more information?

Thank you,
The MSSQLTips Team


Thursday, November 19, 2009 - 9:29:38 AM - Ken Donoghue Back To Top (4467)

A fault tolerant server has the equivalent of two physical servers running in lock step as one logical server in a single enclosure. Everything is duplicated, right down to the power cords going into the wall. There is only one OS and one application; they see only one server image, not two. A hardware failure is transparent to the app because the app continues to run unaffected on the companion component or subsystem. The failed part or entire half of the server is automatically taken out of service. Repairs can be affected while the app continues to run and, after repair, replacement or whatever, the "two sides" will automatically resynch. No application downtime, no failover, no data loss. Transient errors also do not bring a true fault-tolerant server down. It will ride through transient errors, again with no impact on database operations. A fundemental difference between a clustered server and a fault tolerant server is that clusters are designed for failure recovery and fault tolerance is falure prevention technology. As far as SQL is concerned, you could run Standard Edition and get top-notch uptime reliability.

With regards to your article, it is a clear and cogent explanation of what happens in the cluster. Well done.


Thursday, November 19, 2009 - 9:08:45 AM - K. Brian Kelley Back To Top (4465)

Because it is a fault tolerant server. That means it could have hardware failures that it can't absorb. Case in point, a few years ago we had a clustered SQL Server instance. One of the physical nodes developed an intermittent motherboard issue. Caused the server to bluescreen. We had hot-swappable drives and redundant, hot-swappable power, but you get the idea. In this case, we simply moved the instance over to the other physical node and continued running.

 Another situation, especially in a busy data center, is the "I kicked the plug problem." You've got technicians swapping out servers and accidentally unplug one server completely or accidentally knock the power down for the rack. If you've got the physical nodes in different racks, you fail-over and keep running.

 


Thursday, November 19, 2009 - 8:42:52 AM - Ken Donoghue Back To Top (4463)

Why deal with cluster issues in the first place when you can run SQL Server on a single Xeon fault-tolerant server with virtually no downtime?


Wednesday, November 18, 2009 - 3:50:45 PM - K. Brian Kelley Back To Top (4456)

[quote user="Ayyappan"]In clustering environment we usually talk about Active/Active and Active/Passive. Can you please relate this topic with that terminology?[/quote]

Those terms are considered dated, and here's why.

  • They were coined when most clustered installs consisted of only two physical servers.
  • They were coined when most installs could only support one or two SQL Server instances running on a cluster.

So if you put those two things in mind, think of a two node cluster. If you had one instance installed, you had an Active/Passive cluster. The physical node running the SQL Server at the time was Active. The other node was called Passive since it was sitting there idling. If you had two instances of SQL Server running, you tended to run one on each physical node. So the physical nodes were both Active, hence the term Active/Active.

Now that it's not unusual to have more than 2 physical nodes in a cluster, and even in cases of 2 node clusters it's not unusual to have more than 2 instances of SQL Server installed (for instance, our production clusters tend to have 4 installed, two 2005 and two 2008), those terms aren't used so much anymore.

 


Wednesday, November 18, 2009 - 12:56:27 PM - admin Back To Top (4455)

Ayyappan,

I would check out these two tips:

SQL Server Clustering Active vs Passive

Getting started with SQL Server clustering

Thank you,
The MSSQLTips Team


Wednesday, November 18, 2009 - 10:37:38 AM - Ayyappan Back To Top (4454)

Hi,

You have explained in very nice way. 

In clustering environment we usually talk about Active/Active and Active/Passive. Can you please relate this topic with that terminology?


Tuesday, November 17, 2009 - 8:26:31 AM - admin Back To Top (4440)

Brian,

I really like how you break down the downtime in your lists as well as the images.  I think it really proves the point.

Thank you,
The MSSQLTips Team















get free sql tips
agree to terms