By: Jeremy Kadlec | Updated: 2008-06-13 | Comments (3) | Related: More > DBA Best Practices
Problem
Over the years I am sure we have all seen and heard about various SQL Server disasters either at our organization, on the web, or at conferences\user group meetings. Friday the thirteenth seems like an appropriate time to outline some of the common disasters we have seen over the years and provide some hind sight into the situation to prevent a future disaster. Have you seen disasters in your environment? I would bet you have at some point. Let's see if our top 13 disasters matches your experiences.
Solution
Problems are going to occur and unexpected issues are going to slip through the cracks during testing. However, see if our top 13 list of disasters are similar to your experiences, if not, let us know about some of your worst cases in the forums.
13. Misunderstood data - Over stating statistics because the data is not clearly understood. Whether credits are missing from calculations, transaction dates posting earlier than the actual transaction in the original system, currency conversion, rounding issues due to mismatched data types or money being stored in dollars in one part of the application and pennies in another, make sure the data being reported on is clearly understood.
Consider these preventive measures:
- Make sure the data is clearly understood. If it is not, ask a million and one questions to avoid any misunderstandings.
- As the reports are being finalized, review them with subject matter experts as a checks and balances.
- Once the data is understood, document it.
- Check out these tips:
12. Rollback plan - With a tight deadline corners get cut to make the date. One item that typically gets cut is the rollback plan. Not having a rollback plan during a major deployment even at the simplest level is a disaster waiting to happen. Over the years, we have seen no database backup taken before starting the deployment. Same with no DDL and/or DML backups before starting the process. To make matters worse, haphazardly making changes to try to fix the issue means rolling back is going to be a disaster.
Consider these preventive measures:
- If you do not have the time for a comprehensive rollback plan, take database and DDL\DML backups.
- Make sure the code review and testing is bullet proof.
- Check out these tips:
11. SQL Slammer - When this virus hit organizations around the world this SQL Server worm made the main stream news with global impacts. It also opened the eyes of DBAs and Network Admins related to just how many SQL Server instances exist in an organization. That was a real eye opener. The reality was that the buffer overflow vulnerability was a known issue that Microsoft had pushed out a hot fix for months earlier.
Consider these preventive measures:
- Do not install SQL Server on the default ports.
- Stay current on all of the SQL Server service packs and patches.
- Make patching SQL Servers a priority.
- Check out these tips:
10. SQL Injection - Although the cartoons that have floated around the Internet get a chuckle or two, the problem is serious. When a SQL Server can be easily compromised all of the data on the machine is potentially at risk. Not a good state of affairs.
Consider these preventive measures:
- Avoid Dynamic SQL as much as possible on publicly facing web sites.
- In your front end code validate all of your input parameters prior to processing to avoid the issue.
- Conduct code reviews with security in mind to ensure this issue or others are not in the realm of possibilities.
- Check out these tips:
9. Key deployments without essential team members - Sometimes the ship should not set sail. If one of your key members are legitimately sick asking them to work from home should be thought about twice. So should moving forward with the deployment. If all of your major systems are getting deployed, upgraded, updated, etc. at once and the person who has been doing the majority of the technical work is out sick because they have been running themselves ragged, let the date slide by a day or two. If you run into even a minor snag they are going to be hard pressed to be anywhere near the top of their game.
Consider these preventive measures:
- If the deadline is so tight that documentation and knowledge transfer could not occur and one of your key people are out of the office from either being sick or having a sick child, wait until tomorrow or the next day.
- If you do not have the luxury of being able to miss a deadline, then be sure to build time into the plan to have multiple team members know everything that is going on such that any one of them can be out of the office and the project can continue.
8. System changes without reporting updates - If you change the fundamental way you do business by a single or series of application changes and do not update the corresponding reports you will have executives\line of business managers at your cube as fast as they can walk across the building. When reports previously had numbers with multiple commas and now are reporting zeros as they see products being shipped out the door, somebody is going to have to do some explaining and fast.
Consider these preventive measures:
- As you make changes, consider the down stream impacts.
- Make sure someone on the team has the bigger picture in mind. They may not know all of the technical details, but they should know the dependencies and can point you to the person that does have the technical details.
7. 24X7 expectations without the infrastructure - I am just not talking about hardware and software solutions. People need to be in the equation. Having an expectation that manufacturing lines or web sites will run 24X7 without the proper people, process and technology are going to just run is a fallacy.
Consider these preventive measures:
- Organizations that truly need to support a 24X7 infrastructure have a significant amount of redundancy with facilities, connectivity, hardware, software and people. Over the years I have seen a few different white papers giving some insight into the challenges and precautionary measures these sorts of companies face. If you are feeling the pressure then do some research and see what other organizations are doing to make this a reality.
6. Breaking the deployment plan - Dropping the triggers in order for a process to complete quickly is probably not a good decision if that is not a portion of your deployment plan. The triggers have been developed and tested for a reason. During the testing of your current process they made have been tested to validate the historical data is captured. In the circumstance I observed the triggers were not backed up, dropped and forgotten about for a week or so until downstream processes seemed to have incomplete data.
Consider these preventive measures:
- Build and follow the deployment plan to ensure the deployment is repeatable.
- Consider a technology tool to assist with the deployment.
- Make sure you have a rollback plan.
- In this circumstance, if the trigger does need to be dropped, disable it rather than dropping it.
- Check out these tips:
5. Overstate your resume - From a personal career perspective this is probably the number one item on the list. With the vast number of tests, social networks and access to free information, overstating your resume will only lead to a personal disaster. It might mean that you are fired. Yes - people have been fired for overstating their resume. Or you can quickly be overwhelmed with a situation that you do not have the experience to manage.
Consider these preventive measures:
- Just be honest with your skills and accomplishments. If you have areas where you need improvement, be sure to fine tune those skills before you move on to your next opportunity.
- If during the interview process you get asked questions in a new or different area, answer the questions honestly.
- Check out these tips:
4. No backups - No - Not just for a day or two on one database. No database backups are being issued. The networking group thought the DBAs were backing up the databases and they were just backing up all of the files in the file system. The DBA group thought the networking group was using a third party product with an agent to backup all of the databases like the file system. In either case, no steps were ever taken to have any sorts of checks and balances between the groups. This meant that the first time a SQL Server failed, no databases were available to restore back to. With a little bit of analysis it became quickly apparent that none of the databases across the company were actually getting backed up.
I have heard another similar story that a multi-million dollar generator was installed and the facilities team was trained on how to use, maintain, test, etc the generator. Three to six months later the team had significant turn-over. One of the items that fell between the cracks was weekly testing and maintaining the generator. Unfortunately, the power failed during an upgrade with most of the IT team on site for the deployment, the UPS's started and the generator was expected to start to run the data center. Unfortunately it did not start and when the UPS's ran out of power the servers shutdown.
Consider these preventive measures:
- Although generators are typically not a portion of a DBAs responsibility, backups should be in their purview if they are not directly responsible for backup and recovery. As such, make sure that their is some sort of checks and balances between the groups.
- If all of the time, energy and money was spent to setup a backup and recovery, disaster recovery or high availability solution, then test it quarterly (at a minimum) to validated everything is working properly with any changes since the last test. The reality is that change is constant, so testing regularly is essential.
- Check out these tips:
3. No security - Having applications authenticate with the sa login is probably not necessary. Unfortunately it still happens even with the big security awareness over the last few years.
Consider these preventive measures:
- Although security has been getting better over the years, we all need to be diligent about ensuring our applications and infrastructure are secure.
- Make sure security is not a one size fits all solution. Look at security from every angle and at each tier in the application. From the public network to the individual column, be sure the overall application is secure. This probably means a variety of groups are involved in the solution, so be sure to have each one of them equally participating in the overall solution.
- Consistency is great, but that may not be the case with passwords or overall security policies. Variety really might be the spice in security.
- Keep an ear out for new vulnerabilities and do not assume your systems are secure and immune from new threats.
- Check out these tips:
2. Ignoring Errors - Nothing is perfect and things break. But ignoring warning signs or blatant errors from systems that have been setup to either work proactively or reactively does not make sense.
Consider these preventive measures:
- Setup a consistent means to capture and report warnings\errors.
- Make sure a group of people receive the errors and warnings.
- Do not assume everything is a false positive and ignore the messages. If they are truly false positives, then tune the warning and error notification system.
- Make sure the warnings and errors are truly warnings and errors. If not, all of the messages quickly become noise and none of them are given the proper attention.
1. Not trusting your instincts - If something does not feel right, it probably isn't. When you hear that little voice in your head, listen to it. Whether it is a manager in another department hounding you for passwords or code being pushed out that is a mess, stop and do not make a hasty decision.
Consider these preventive measures:
- When something does not feel right, take a step back, slow down and think about what is going on, then how to correct it.
- Listen to those little voices inside your head or your upset stomach, it probably means something isn't how it should be.
- Often times preventing a problem before it starts yields numerous benefits.
Next Steps
- Many of these items can be prevented and corrected by DBAs. So be sure to prevent these sorts of disasters and/or correct issues before they become disasters.
- What can really make or break your career is stepping up to the plate when a disaster occurs. The best place to be is in a situation where you are able to prevent all of them before they occur. Unfortunately, that is next to impossible if you are stretched thin. So if disaster does strike, step up to the plate and figure out how to resolve the issue based on the circumstances.
- Have you seen other disasters that top the 13 listed here? Share your experiences with the community in the forums below.
About the author
This author pledges the content of this article is based on professional experience and not AI generated.
View all my tips
Article Last Updated: 2008-06-13