I recently responded to an emergency at a customer site. An important database there had become corrupted, and the customer proceeded to restore it from backup tapes. The customer has a good backup policy, yet it still took six days to get back online.
Naturally, this raises a number of questions. First, what caused the corruption? Second, why did it take so long to restore the database? Finally, what could be done to improve the restoration time in the future?
These three issues must be addressed as part of any backup/restore strategy, and they drive home the point that backups are only as good as your ability to restore.
I still believe the problem is not backup but restoration, and system designers ought to be architecting for restoration of data, not backup of that data. So let's proceed to answer the three questions.
What Caused the Corruption?
Data can get corrupted in a number of ways. A problem in the database software, the file system, a device driver, or RAID or disk firmware can corrupt data.
On a couple of occasions, I have seen a Fibre Channel switch port and a Fibre Channel cable corrupt files. You would think this should be caught with higher-level protocols, such as SCSI, since the command should be corrupted or the Fibre Channel CRC should require a retransmit, but on both occasions this did not happen as expected.
This has led me down the path of trying to better understand the issues surrounding the undetectable bit error rate (UDBER). An undetectable error basically occurs when multiple errors hit at the same time in such a way that the error-correcting code (such as Reed-Solomon) does not pick up the error.
Here's a simple example from early in my career. The Cray-1A used something called SECDED (Single bit Error Correction Double bit Error Detection). One time, out of the blue, programs started aborting randomly, but the operating system was still up and running. On the half hour, low-level diagnostics were run on each memory location, and lo and behold, the system was having a triple-bit error. The system was never designed to correct or report triple-bit errors. This was my first exposure to undetectable errors, but unfortunately not my last.
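To see how a triple-bit error can slip past SECDED, here is a minimal sketch in Python. It uses a toy extended Hamming code with 4 data bits and 4 check bits, which is an assumption made purely for illustration; the Cray-1A protected much wider memory words, and the function names `encode`, `flip`, and `decode` are hypothetical. The principle is the same: the decoder can only tell "odd number of flips" from "even number of flips," so it treats three flipped bits as if they were one.

```python
def encode(data):
    """Encode 4 data bits into an 8-bit extended Hamming (SECDED) word.

    Index 0 holds the overall parity bit. Indexes 1-7 are the classic
    Hamming(7,4) positions, with check bits at positions 1, 2 and 4.
    """
    word = [0] * 8
    word[3], word[5], word[6], word[7] = data
    word[1] = word[3] ^ word[5] ^ word[7]
    word[2] = word[3] ^ word[6] ^ word[7]
    word[4] = word[5] ^ word[6] ^ word[7]
    word[0] = sum(word[1:]) % 2
    return word


def flip(word, positions):
    """Return a copy of word with the bits at the given positions inverted."""
    out = list(word)
    for pos in positions:
        out[pos] ^= 1
    return out


def decode(word):
    """Return (status, data bits) the way a SECDED decoder would see them."""
    word = list(word)
    # The syndrome is the XOR of the positions of all set bits; it is zero
    # for a valid word and equals the position of a single flipped bit.
    syndrome = 0
    for pos in range(1, 8):
        if word[pos]:
            syndrome ^= pos
    odd_parity = sum(word) % 2 == 1
    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity:
        # Odd number of flips: the decoder assumes exactly one and "fixes" it.
        # A syndrome of 0 here points at the overall parity bit (index 0).
        word[syndrome] ^= 1
        status = "corrected"
    else:
        # Even number of flips with a nonzero syndrome: report, do not fix.
        status = "double-bit error detected"
    return status, [word[3], word[5], word[6], word[7]]


original = [1, 0, 1, 1]
stored = encode(original)
for bad_bits in ([], [5], [3, 5], [3, 5, 6]):
    status, data = decode(flip(stored, bad_bits))
    print(len(bad_bits), "bits flipped:", status, "| data intact:", data == original)
```

In the three-bit case the decoder reports "corrected" while handing back the wrong data, which is exactly the kind of silent corruption described above.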
Good data on the UDBER for tapes is available on the Internet. LTO is documented at 10E-27, while Sun's T10000 is listed at 10E-33. Both of these are very small numbers, and the likelihood of getting an undetectable error on either of these devices is low. On the other hand, the bit error rate for Fibre Channel is 10E-12, for SATA it is 10E-14, and for Fibre Channel disk drives it is 10E-15. I do not know what the UDBER is for any of these devices, since it is not published, documented, or even whispered.
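To put these rates in perspective, the sketch below estimates how many bit errors to expect when moving a given amount of data. It assumes that a figure such as 10E-12 means one error per 10^12 bits, and it uses 6 TB (the size of the restore discussed later in this article) as the data volume. These are ordinary bit errors that the hardware is supposed to detect and retry or correct, not undetectable ones; the function name is hypothetical.

```python
def expected_bit_errors(bytes_moved, bit_error_rate):
    """Expected number of bit errors = bits moved x errors per bit."""
    return bytes_moved * 8 * bit_error_rate


# Bit error rates quoted in the text, read as errors per bit (assumption).
QUOTED_RATES = {
    "Fibre Channel": 1e-12,
    "SATA": 1e-14,
    "Fibre Channel disk drive": 1e-15,
}

DATA_BYTES = 6e12  # 6 TB, decimal

for name, rate in QUOTED_RATES.items():
    errors = expected_bit_errors(DATA_BYTES, rate)
    print(f"{name}: about {errors:.3g} expected bit errors in 6 TB")
```

Under these assumptions, a 6 TB transfer over a channel rated at 10E-12 should expect dozens of bit errors that the protocol must catch, so even a tiny fraction of misses matters.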
It is always easy to point fingers at software for corruption, and I am sure that more often than not it is a good place to start. But channels keep getting faster, while error encoding for disk drives has not changed in a long time. Also, keep in mind that you must add up the whole data path from the CPU to the device to calculate the UDBER. So who knows what happened at this site, but corruptions do happen.
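Adding up the data path can be sketched as follows. Because the UDBER figures for these components are not published, the example uses the ordinary bit error rates quoted earlier as stand-ins, purely to show the arithmetic; the two-hop path and the function name are assumptions. The chance that a bit is hit somewhere along the path is one minus the chance that it survives every hop, which for rates this small is very nearly the plain sum.

```python
import math


def path_error_rate(per_hop_rates):
    """Probability that a bit is corrupted on at least one hop.

    Computed as 1 - product(1 - p) using log1p/expm1 so that very small
    rates are not lost to floating-point rounding.
    """
    log_survival = sum(math.log1p(-p) for p in per_hop_rates)
    return -math.expm1(log_survival)


# Stand-in values only: the quoted (detectable) bit error rates for a
# Fibre Channel link and a Fibre Channel disk drive, not real UDBER figures.
hops = [1e-12, 1e-15]

print("sum of hops:   ", sum(hops))
print("whole-path rate:", path_error_rate(hops))
```

The point of the exercise is that the path is only as good as its weakest component: the hop with the worst rate dominates the total.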
Why Did It Take So Long To Restore?
This was a large environment, and backups were being done via a backup client and server, with the tape drives attached to the backup server. Since the clients were connected by at most GigE, the absolute fastest transfer rate that could be achieved was about 60 MB/sec. As you might remember from your LTO-3 specifications, the tape drive can run up to 80 MB/sec uncompressed, and with compression the drive can run almost twice as fast. At this site, with average compression, the tapes were being written at about 140 MB/sec on average. This is far faster than an uncontended GigE link can deliver. To optimize backup performance, the site allowed multiple client streams to be combined and written to the same tape.
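The arithmetic behind that decision is simple, and a small sketch makes it concrete. Using the figures above (about 60 MB/sec per GigE client and about 140 MB/sec to tape), it computes how many client streams must be combined to keep one drive busy. The function name is hypothetical, and the model assumes every client delivers its full rate.

```python
import math


def streams_to_saturate(tape_mb_per_sec, client_mb_per_sec):
    """Smallest number of client streams whose combined rate meets the tape rate."""
    return math.ceil(tape_mb_per_sec / client_mb_per_sec)


CLIENT_RATE = 60   # MB/sec, best case over GigE per the text
TAPE_NATIVE = 80   # MB/sec, LTO-3 uncompressed
TAPE_AT_SITE = 140 # MB/sec, observed with average compression

print("streams for native rate:  ", streams_to_saturate(TAPE_NATIVE, CLIENT_RATE))
print("streams for observed rate:", streams_to_saturate(TAPE_AT_SITE, CLIENT_RATE))
```

Even at the uncompressed rate, a single GigE client cannot feed the drive on its own, which is why combining streams looks so attractive from the backup side.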
Although this certainly improves backup performance, it does quite the opposite for restoration performance. Because each tape holds data combined from multiple machines, many more tapes have to be mounted to restore a given set of files. In fact, at this site more than 140 tape mounts were needed to restore just 6 TB, even though 6 TB of data could have fit on a little more than 15 LTO-3 tapes. Just the mount and position time for all of these extra tapes was more than three hours. You might be able to reduce this time slightly with different tape technology, but this will not make a significant difference. Products such as Copan's MAID device and Imation's Ulysses (disk in a tape cartridge) significantly improve this time.
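A rough back-of-the-envelope check of these numbers, as a sketch: it assumes the LTO-3 native capacity of 400 GB per cartridge and decimal terabytes, and it treats "more than 140 mounts" and "more than three hours" as lower bounds. The names are hypothetical.

```python
LTO3_NATIVE_BYTES = 400 * 10**9  # LTO-3 native (uncompressed) capacity


def min_tapes(data_bytes, tape_bytes=LTO3_NATIVE_BYTES):
    """Fewest tapes that could hold the data, rounding up to whole tapes."""
    return -(-data_bytes // tape_bytes)


restore_bytes = 6 * 10**12      # 6 TB restored
mounts_needed = 140             # lower bound reported for this site
extra_time_sec = 3 * 3600       # lower bound on mount/position overhead

ideal_mounts = min_tapes(restore_bytes)
extra_mounts = mounts_needed - ideal_mounts
sec_per_extra_mount = extra_time_sec / extra_mounts

print("ideal tape mounts:", ideal_mounts)
print("extra tape mounts:", extra_mounts)
print("seconds per extra mount:", round(sec_per_extra_mount, 1))
```

Under these assumptions the restore needed roughly nine times as many mounts as the data volume alone would require, at something like a minute and a half of mount and position time apiece.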
All of this extra time does not help in meeting service-level agreements for restoration, even though combining streams does improve backup time.
What Can Be Done?
Part of understanding restoration is understanding why it is needed in the first place. From what I have seen, restoration is very important in the desktop environment because careless users delete files, and there is an almost constant large restoration problem. You aren't restoring a great deal of data, but you are doing it almost constantly. On the other hand, if a mission-critical database goes bye-bye and you must restore the whole thing, this becomes a critical event, and your business depends on how fast you can restore the database.
So, what could this customer site have done differently?
In my opinion, the organization was using a one-size-fits-all backup/restore policy that was not very good for the desktop environment and even worse for the mission-critical environment because it was trying to optimize the backup problem rather than the restoration problem.
This often happens when the staff explains the problem in terms of how fast backup can be done instead of how long it would take to restore the data. When they do explain the problem and feel the pain of restoration, most often it is for the desktop environment, not the mission-critical data. Unfortunately, management then budgets based on that type of restoration.
So what could the site have done? In the short term, not much, because the problem was an architectural problem, not something intrinsic to its procedures. Although the organization did not have to combine client streams on tape, without doing so the backups would not have completed within the backup window. The site had all kinds of options, but not with the architecture it had developed.
The point is not only that backup is often less important than restoration but, most important, that a single backup architecture is unlikely to solve the multitude of issues with backup and restore. You must consider many things, such as:
- Different levels of storage performance to ensure that tapes run at rate
- Potentially different network topology for different clients to allow for full rate for a single stream to tape
- D2D backup/restore
- D2D2T backup/restore
- Potentially host-based backup/restore.
These are just a few of the issues to consider, and you will likely need different architectures for different types of backup/restore demands.
I have said time and time again that working with storage is just plain hard because storage does not scale with CPU performance, so trouble keeping up with data generation is common. This trend is not going to change, so the key to not being blamed by management when things go bad (and they will, sooner or later) is a clear explanation upfront about what can and cannot be done with the software and hardware. If they want something different, tell them to send money, and you can create a different architecture to meet the requirements. All too often I hear expectations that backup/restore for desktops should work exactly the same as backup/restore for mission-critical applications. This cannot happen without a great deal of work and money because they are not the same.
This article was originally published on Enterprise Storage Forum.