Automate Bringing SQL Server Cluster Resources Online After Failover is Unsuccessful

By:   |   Updated: 2011-01-12   |   Comments (5)   |   Related: > Clustering


Problem

Have you noticed that when a failover is initiated on a SQL Server cluster it does not always do so successfully? Often nothing is wrong, it's just taking too long to confirm a resource or Windows id. The cause may vary but the solution is the same. You log onto the cluster server and bring the cluster resource group online. So how can you automate this?

Solution

Use the Windows Cluster Command to query the cluster resource group's status. Then when any resources are found offline, execute a subsequent cluster command to bring them online. A remote server is used to execute the commands to insure their execution regardless of the state of the cluster.

The cluster command supports remote executions from a Windows non-clustered server making it very flexible.

A Windows batch command script is used to execute the cluster commands and identify the cluster resource groups needing help. It will be slightly customized for each cluster. Although this is not a SQL Server component, SQL Agent is used to schedule the execution of the scripts. As an alternative, Windows Scheduled Tasks could be used, but that would require additional scripting not covered in this article.

In addition to bringing cluster resource groups online, there are several other requirements:

  • Notification when offline resource groups are detected
  • Keeping a history of output to assist when issues arise
  • Automate this process through scheduling

This article and its attachments address all of these requirements.

Technical Requirements

  • Windows clustered servers
  • Remote server to act as the monitoring server to run the scripts
  • SQL Agent on monitoring server
  • SMTP Mail access from monitoring server
  • Windows cluster command on monitoring server
  • A domain id with local administrator permission on the Windows cluster and remote access from the monitoring server executing the cluster commands.

Monitoring Server

The server you use to schedule and execute the cluster monitoring script needs to be reliable with few external dependencies. It should also:
  • use internal storage
  • stand-alone server (not clustered or virtual)
  • have a reboot schedule different from the servers it is monitoring
It needs to be a lean, not so mean, and reliable machine.

Step by Step Instructions

The SQL Agent job uses multiple steps to meet all the requirements mentioned earlier. The instructions for creating each step are provided in tandem with instructions on customizing the respective scripts. The script changes will involve server names, cluster resources, email addresses, and other items unique to your environment. Nothing complex.

Attached to this article are the scripts used and a sample SQL Agent job. The scripts are required, but the SQL Agent Job is optional. It has been provided as a sample to help get you started and for comparison with the job you create.

SQL Agent Job Outline
  1. Create one-step for each cluster to check its cluster resources and bring them online.
  2. Send e-mail when resources are found offline.
  3. Keep history.
  4. Purge old history.

Step 1 - Create Job

  1. Connect to your server that will be used to monitor the clustered servers.
  2. Create the new job "DBA - Bring Cluster Resources Online". As an alternative, you can use the attached sample Job SQL script.
  3. Confirm the SQL Agent service is assigned a windows domain id with permission to execute the cluster command on the remote cluster.
  4. Create the following folders on the server for the scripts and output
    E:\DBA_Scripts\Cluster
    E:\DBA_Scripts\SendEmail
    E:\Output\Job\Cluster_online

Step 2 - Check Cluster Step(s)

The script BringResourcesOnline_SERVERNAME.cmd was designed to be as simple as possible. It checks all the cluster resources on the Windows cluster. If any are found offline, a cluster command is executed to bring them online. If a resource group happens to be online when the cluster command is executed nothing will happen. Please note the cluster status values used are for Windows 2003, they may vary with different versions of Windows operating systems.

Execution of the cluster command is quick with the exception of when the Windows cluster name resource is down. Then it hangs for awhile. When this happens the job should hang as well. Any cluster problems that require more than bringing a cluster group online is out of scope for this script.

Create Cluster Online Script

  1. Edit the sample script BringResourcesOnline_SERVERNAME.cmd.
  2. Set variable ServerName to Windows cluster name. Do not use the SQL Server cluster name because it might be offline.
  3. Review the cluster group names in the "Bring all Groups Online" section. Change any group name to match the resource groups on your Windows cluster. Add additional lines if you have more than two cluster resource groups.
    Example: "cluster %ServerName% groupname "Group" /ONLINE"
  4. Save the script file in the Cluster folder. Include the name of Windows cluster being monitored in the file name.

Create Cluster Check Step

  1. Create new step
  2. General Pane
    1. Type: Operating system (CmdExec)
    2. Run as: SQL Server Agent Service Account
    3. Command: call E:\DBA_Scripts\Cluster\BringResourcesOnline_SERVERNAME.cmd
      * Specify your drive, folder, and script name used.
  3. Advance Pane
    1. Specify an output file name Cluster_online.txt with the job output directory E:\Output\Job\Cluster_online
    2. Check the box "Append output to existing file"
    3. Check the box "Include step output in history"
      Note: Do not worry about the appending to the file, the output file will be renamed in a later step.
    4. Click OK to save
If you have more Windows clusters, repeat this step for each Windows cluster; creating a new script and job step for each cluster.

Step 3 - Send E-mail

This step checks the output file Cluster_online.txt for any cluster resources that need to be brought online. When found, an email is sent with the output file as an attachment. This is why the output is appended together, so only one email is sent. Two scripts are used; one checks the captured output and the other sends email.

Customize Send Email Script

  1. Edit send_offline_email.cmd
  2. Set variable ClusterOuput to the full location of the file Cluster_online.txt
  3. Set variable EmailGroup to a valid email address. If more than one address is to be used, separate them with semicolons.
  4. Set variable SendmailCommand to the full location of the SendMailCDO.vbs script

Customize SendMailCDO.vbs Script

The sample code used for SendMailCDO.vbs was published by Paul Sadowski. I made some modifications, which are documented in the script.
  1. Edit SendMailCDO.vbs
  2. Change YOUR.STMP.COM to your SMTP server.
  3. Save in SendMail folder. I use a separate folder because it might be used by other jobs/scripts in the future.

Create Email Step

  1. Create new step
  2. Type: Operating system (CmdExec)
  3. Run as: SQL Server Agent Service Account
  4. Command: call E:\DBA_Scripts\Cluster\send_offline_email.cmd
    * Specify your drive and folder used.
  5. On Advance pane, check the box "Include step output in history"
  6. Click OK to save

Step 4 - Output History

Rename the output file Cluster_online.txt by appending the name with the current date and time.

Customize Rename Script

  1. Edit rename.cmd
  2. Set variable ClusterOutput to the full for the the output file Cluster_online.txt.
    Example E:\Output\Job\Cluster_online\
  3. Save file in Cluster folder
Note: If this job runs more than once a minute, the old file with the same date and time is deleted.

Create Rename Step

  1. Create new step
  2. Type: Operating system (CmdExec)
  3. Run as: SQL Server Agent Service Account
  4. Command: call E:\DBA_Scripts\Cluster\rename.cmd
    * Specify your drive and folder used.
  5. On Advance pane, check the box "Include step output in history"
  6. Click OK to save

Step 5 - Delete Old Output Files

This step uses two scripts. The first provides the path and delete settings for the Cluster_online.txt files. The second is a VBScript published by Microsoft used to delete old files.

Customize Delete Script

  1. Edit: delete_Cluster_online_files.cmd
  2. Set variable ScriptFolder to E:\DBA_Scripts\Cluster\ to the location of DeleteFiles.vbs.
    Example: E:\DBA_Scripts\Cluster\
  3. Set variable ClusterOutput to the location of this jobs output files.
    Example E:\Output\Job\Cluster_online\
  4. The script is set to delete any files not accessed in the last 30 days. Change the value 30 to a more appropiate setting if need be.
  5. Save file in Cluster folder
Note: DELETE parameter at the end of command, tells the script DeleteFiles.vbs to delete the files.

Script DeleteFiles.vbs

This script was published by Microsoft. There are no changes for this script.
  1. Save file in Cluster folder

Create Delete Step

  1. Create new step
  2. Type: Operating system (CmdExec)
  3. Run as: SQL Server Agent Service Account
  4. Command: call E:\DBA_Scripts\Cluster\delete_Cluster_online_files.cmd
    * Specify your drive and folder used.
  5. On Advance pane, check the box "Include step output in history"
  6. Click OK to save

Finished

  1. Save the job.
  2. Add a schedule once you have fully tested it.
  3. If you used the sample job included with this article, you will need to enable it when you are ready and add a schedule.

Testing and Scheduling

Testing this job and scripts can be a little tricky if you do not have a cluster to play with. You can run each script manually to verify the execution is as expected and then run the entire job for one cluster. Then add more clusters as you feel more confident.

I schedule this job to run 1 hour after a scheduled reboot. It continues to run every 30 minutes until the start of the normal business day.

Sample Cluster Commands

cluster /List
cluster servername node /status
cluster servername group /status
cluster servername group "SQL Group" /online

Sample Execution Output:

sample execution output

Sample Execution Output for Partially Online Resource Group

Sample Execution Output for Partially Online Resource Group


Conclusion

The goal of this article is to provide you with the commands and scripts needed to bring a Windows cluster resource online after a failed failover. It is nothing fancy and does not address any issues when resources are truly unavailable. It acts as a second attempt to bring the cluster resources online in place of someone manually having to do this. You may never need this script but after a few early Sunday morning wake up calls, you might find it useful.

Next Steps


sql server categories

sql server webinars

subscribe to mssqltips

sql server tutorials

sql server white papers

next tip



About the author
MSSQLTips author David Bird David Bird has years of IT Experience working as a DBA and programmer on Windows, UNIX, and mainframes.

This author pledges the content of this article is based on professional experience and not AI generated.

View all my tips


Article Last Updated: 2011-01-12

Comments For This Article




Monday, January 9, 2012 - 11:44:43 PM - Winter Back To Top (15579)
That insight would have saved us a lot of effrot early on.

Friday, January 14, 2011 - 11:17:19 AM - David Bird Back To Top (12610)

SQL Agent does not need to be used as the scheduler on the monitoring server. Windows Task Scheduler can be used instead in which you specify a different user id for each scheduled job. If each window cluster has a different local admin user id, then a separate scheduled task would be created on the monitoring server for each windows cluster with its own user name.

System wide monitoring tools such as Microsoft System Center Operations Manager (MOM) often require local admin permission on the servers they are monitoring. This is a monitoring script. Originally, I was thinking of implementing it in MOM but was able to find a secured way to schedule and execute it without MOM. Maybe MOM or another system wide monitoring tool could be used to schedule or execute the script in your environment.

I am not against the best practice of least privilege. However, each shop needs to balance security with productivity and production support.


Thursday, January 13, 2011 - 9:18:59 PM - Allan Hirt Back To Top (12605)

PowerShell for clustering stuff is W2K8 R2 only.

Only the service account (or domain group if using that) used for SQL Server Agent needs to be placed in the local Administrators group on every node if you are using the auto restart feature, otherwise nothing needs to be placed there.

So as a best practice, you are going against least privilege. I can tell you that every single company I have worked at will not want to escalate if they do not have to.


Thursday, January 13, 2011 - 12:07:23 PM - David Bird Back To Top (12601)

I was unaware of the depreciation of the Cluster command on Windows Server 2008 R2. My IT shop won't be going to that release for a while because of the expense of updating all the windows server licenses.

To run window scripts from SQL Agent does not require xp_cmdshell to be enabled. Xp_cmdshell is disabled by default but that does not make it dangerous or unsecure. To use it requires CONTROL SERVER permission. It is currently disabled on the monitoring server and clusters I use the script against.

The SQL agent account does require the proper permissions to the remote servers to execute the CLUSTER command. I cannot comment on what percentage of customers use SQL Agent accounts that have with Local Admin permission on other servers. My perception, is it varies from shop to shop. Each IT department must reach their own balance when setting up permissions between SQL Server instances.

I did not consider using powershell but that will change as it becomes more familiar to me.

 

Thanks for your comment, you got me thinking about Powershell

 

 


Wednesday, January 12, 2011 - 3:14:47 PM - Allan Hirt Back To Top (12591)

Somewhat dangerous advice on a few levels.

1. Cluster.exe is deprecated as of Windows Server 2008 R2 with the introduction of PowerShell. Advocating the use of cluster.exe is NOT the way to go.

2. Using a cmdexec SQL Agent job means that you have to enable xp_cmdshell which is disabled by default and is a security problem for many.

3. I assume the Agent account would also have to have admin rights in the cluster to do what you want to do - this is not the case for many customers.















get free sql tips
agree to terms