Many of the images projected on TV screens and across the Internet late last week were disturbingly familiar, but, thankfully, the cause and the result of the blackout in the Northeast region of North America were far less horrific than the terrorist attacks of two years earlier.
From a disaster recovery standpoint, September 11 brought to light some hard lessons. The coming days will reveal how many of these lessons enterprises took to heart, with August 14 being held up as a report card of sorts.
The signs of whether the IT industry was stepping up to the task were mixed in the weeks preceding the blackout. For instance, 86 percent of respondents to a recent Harris poll said their organizations were at least somewhat more prepared to deal with disasters than they were before 9-11. Another poll, however, wasn't as positive. Forty-five percent of respondents to an AmeriVault survey said they leave their backup tapes on site -- a terrible practice -- and just half were confident of their capability to meet recovery time objectives (RTOs), the industry-standard targets for how quickly systems must be back online.
Now there is no excuse. Not having a good game plan in place shows that the organization doesn't have the cultural or financial wherewithal for disaster preparedness, and a concrete procedure for protecting (and, if necessary, resurrecting) servers and other hardware is the centerpiece of any such plan. Emergency preparedness takes a corporate commitment, and it isn't cheap.
There are two reasons customers, investors, and other interested parties should be very skeptical of companies caught unprepared this time around. First, this emergency happened after 9-11. If the management team didn't get a wake-up call from that, it will likely sleep through anything.
The second reason is that a power failure -- even a biggie like last week's -- is predictable. People know that at some point, whether this year, next year, or a decade from now, the grid is going to go down. This is the third major outage in the Northeast since 1965, a time frame within the memory of many of today's decision makers. There are excuses, but no real reasons, for not being prepared.
For those organizations that need to dust off their current disaster recovery plan and those that have yet to craft one, the following are key issues and questions disaster preparedness plans must take into account.
- If the data center is located in an area where the most likely problem is weather-related flooding, is vital gear on an upper floor of the building? If earthquakes are a threat, is the data center configured in such a way that heavy objects won't hit sensitive equipment if they fall?
- Has the enterprise determined which servers are mission-critical, which are important but not do-or-die, and which are expendable? Has technology been implemented that can prioritize and redirect assets according to that hierarchy?
- Are backup generators on site? Is there access to fuel? Is there a regular program to ensure everything is fueled and working? Is there a stockpile of key spare parts?
- If disaster preparedness requires moving through the streets, do the necessary employees and vehicles have clearance to allow them on the street during emergencies? Are the local authorities informally in the loop so that as little time as possible would be wasted checking those clearances?
- Is there a list of the names of vendors and parts suppliers? Are key contacts at these companies, complete with their work and home numbers, identified? Is that list updated as employees change jobs and leave? Is there a publicly available list of people to call? Do printouts of all important lists exist, and are they available to everyone? (A list, no matter how comprehensive, is useless if the power goes out and it is on the hard drive. It is also useless if hidden in the drawer of a manager not at work when the emergency hits.)
- Are employees asked or required to pass on what they know before they leave the company? Key pieces of information about the network's idiosyncrasies and where items are located must be collected and kept accessible.
- Does everyone understand what to do in an emergency? Are roles delineated in such a way that no one individual -- who may be out of touch -- is irreplaceable?
These are all fairly standard disaster preparedness issues. One data center preparedness issue is new, however. The trend to virtualize machines by using a single piece of hardware to support multiple operating systems raises a challenge and an opportunity.
Since virtualization of a server increases capacity radically -- vendors often claim improvements in the 20 percent to 75 percent range -- it must be assumed that the impact of that machine going down grows proportionately. Thus, pulling the plug on a virtualized piece of hardware can have a bigger impact than doing so would in a non-virtualized environment. CIOs and their planners should seriously consider coupling virtualization with grid, mesh, and other approaches that automatically delegate the tasks of a failed virtualized server to other machines. If this is done, the use of virtualization could actually make the data center more resistant to blackouts and similar catastrophic events than it is today.
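To make the idea of automatic delegation concrete, here is a minimal Python sketch of how the workloads on a failed virtualized host might be handed to surviving machines, most critical first. It is an illustration under stated assumptions, not a description of any vendor's product: the `Workload` and `Host` classes, the `fail_over` function, the tier names, and the abstract "capacity units" are all hypothetical, and the tiers simply mirror the mission-critical, important, and expendable split described earlier.

```python
from dataclasses import dataclass, field

# Hypothetical priority tiers: lower number means recovered first.
TIER_ORDER = {"mission-critical": 0, "important": 1, "expendable": 2}


@dataclass
class Workload:
    name: str
    tier: str  # one of the TIER_ORDER keys
    load: int  # capacity units needed (illustrative, not a real metric)


@dataclass
class Host:
    name: str
    capacity: int
    workloads: list = field(default_factory=list)

    def free(self) -> int:
        """Spare capacity left on this host."""
        return self.capacity - sum(w.load for w in self.workloads)


def fail_over(failed: Host, survivors: list) -> list:
    """Move workloads off a failed host, most critical tier first.

    Each workload goes to the surviving host with the most spare
    capacity. Returns the workloads that could not be placed.
    """
    unplaced = []
    for w in sorted(failed.workloads, key=lambda w: TIER_ORDER[w.tier]):
        target = max(survivors, key=Host.free, default=None)
        if target is not None and target.free() >= w.load:
            target.workloads.append(w)
        else:
            unplaced.append(w)
    failed.workloads = []
    return unplaced


if __name__ == "__main__":
    host_a = Host("host-a", 12, [
        Workload("test", "expendable", 5),
        Workload("billing", "mission-critical", 4),
        Workload("intranet", "important", 3),
    ])
    host_b = Host("host-b", 8, [Workload("mail", "important", 3)])
    host_c = Host("host-c", 6)

    dropped = fail_over(host_a, [host_b, host_c])
    print("not recovered:", [w.name for w in dropped])
```

In this made-up scenario the mission-critical and important workloads find a home on the surviving hosts, while the expendable one is left waiting because no machine has enough spare capacity. That is the trade-off the priority hierarchy is meant to make explicit ahead of time, rather than during the outage.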
Carl Weinschenk writes a weekly server hardware series for ServerWatch.