Today I’m flying out to visit one of our customers for a few days. Usually when I go to a customer site I’m looking at solution performance, but this customer requested that we come on site to review their solution and make Disaster Recovery (DR) recommendations. I love going to customer sites for a variety of reasons. I like seeing our software application in action and learning how a business uses it. I like meeting our customers, especially our customer’s DBA, and talking about how they’re managing the database, our best practices, etc. And I like solving problems.
Creating a DR strategy is a new problem to solve every time, as every customer’s implementation is different. They might all have the same pieces (database, application server, etc.), but how they are set up, how they interact, and how important they are to the overall solution vary. Except for the database…you lose that, you’re done.
At any rate, I knew this would be a good challenge when we asked the customer on the initial call, “Do you have an SLA?” An SLA, in case you’re not familiar with the concept, is a Service Level Agreement: an agreement between two parties that clearly defines what level of service must be provided. For SQL Server DBAs who support a database that an application relies on, this often translates to how much uptime a solution needs to have.
The customer responded with something like, “Well, we don’t have anything formal in place, but I’m sure if we asked the business people they would say two hours. No more than four.” Hm, it’s not just DR we’re talking about now. It’s also High Availability (HA). What’s the difference? High availability enables a business to continue to operate in the event of a failure, because a set of technologies has been put in place that allows the business to access data with minimal to no interruption and/or data loss. Disaster recovery is the process by which a business recovers and then makes data available after a systemic failure. (Definition adapted from one of Paul Randal’s Technical Articles, High Availability with SQL Server 2008.)
Then I asked, “What’s the SLA for data?” The customer was initially confused, and said they had just answered the question. I clarified, “You said that the system could be unavailable for two to four hours; that’s how long you can spend recovering. I want to know how much data you’re willing to lose. Is it a day’s worth? Four hours? Five minutes?” There were a few seconds of silence, then, “That’s a great question.”
This is something we will dig into more this week, but it’s a good question for every DBA to consider. What’s your SLA in terms of how long the database can be unavailable and how much data you can stand to lose? Once you have those answers, then you can decide what type of DR plan (and maybe also HA strategy) you need to implement.
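If it helps to see those two answers written down, here is a minimal sketch in Python that treats an SLA as just two numbers (maximum downtime and maximum data loss) and checks measured recovery figures against them. The names RecoverySla and sla_gaps are made up for illustration, and the five-minute data loss target and the measured values are placeholders, not figures from this customer.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoverySla:
    """Hypothetical container for the two answers the business must give."""
    max_downtime: timedelta   # how long the database can be unavailable
    max_data_loss: timedelta  # how much recent data the business can stand to lose


def sla_gaps(sla, measured_downtime, measured_data_loss):
    """Return a list of shortfalls; an empty list means both targets are met."""
    gaps = []
    if measured_downtime > sla.max_downtime:
        over = measured_downtime - sla.max_downtime
        gaps.append(f"recovery takes {over} longer than allowed")
    if measured_data_loss > sla.max_data_loss:
        over = measured_data_loss - sla.max_data_loss
        gaps.append(f"data loss exceeds the target by {over}")
    return gaps


# Placeholder targets: four hours of downtime at most, five minutes of data loss.
sla = RecoverySla(max_downtime=timedelta(hours=4),
                  max_data_loss=timedelta(minutes=5))

# Placeholder measurements, as you might get from a restore test.
for gap in sla_gaps(sla,
                    measured_downtime=timedelta(hours=6),
                    measured_data_loss=timedelta(hours=2)):
    print(gap)
```

The point of keeping the two numbers separate is that a plan can meet one and miss the other, which is exactly the conversation you want to have with the business.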
And if you don’t have an SLA from the business, then take the initiative and provide one to them. I heard Kevin Kline (blog | @kekline) talk about this once in a webinar, and I thought it was a great idea. It’s an excellent opportunity to demonstrate that you are a proactive DBA. If you don’t know how fast you can recover, try it out. Brent Ozar (blog | @BrentO) has a great post on how to put yourself through a good DR test. Once you know that you need six hours to get a database fully operational and you would lose two hours’ worth of data, write that up, take it to the business owner for that database’s application, and see what they say. You will learn a lot from the conversation. Make sure you are clear that you are only recovering the database; there may be additional pieces of the solution that need to be part of that SLA. If what you propose is unacceptable to them, explain what you need to meet the SLA they want. Warning: what you need will probably require money, and they may not want to pay for more hardware. Don’t agree to anything that you cannot execute with confidence, because in the end, if you agree to an SLA that you cannot meet and then fail when tested, SLA could mean Start Looking Around.