Hard drive failures are a fact of life: That's why there are RAID arrays, backup systems and an entire infrastructure designed to prevent data loss and minimize the impact to an organization when drives stop working.
But that doesn't mean you can't do anything to minimize hard drive failure. Google's landmark research document, "Failure Trends in a Large Disk Drive Population," published in 2007, provided a huge amount of information about consumer-grade SATA and PATA drives used in the company's servers. Much of this remains relevant when formulating disk drive failure reduction strategies today. That's because although many servers use faster enterprise-grade drives, the fundamental spinning disk technology in all hard drive systems is the same.
Here are five things you can do to help ensure the drives in your organization keep running smoothly:
1. Put your drives in a nursery before you use them
Hard drives suffer from high rates of what has been termed "infant mortality." Essentially, this means new drives, particularly ones subject to high utilization, are especially prone to failing in the first few months of usage. This may be because manufacturing defects that are not immediately obvious quickly manifest themselves once the drive is put to work.
Whatever the reason, disks that survive the first few months of use without failing are likely to remain healthy for a number of years.
The implication of this is that you should avoid putting brand new drives into your servers. A far more sensible approach is to break them in by running them in "kindergarten" machines for perhaps three months. This will weed out the sickly disks, ensuring the ones put into production servers are the fitter, healthier ones.
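The burn-in rule above can be expressed as a simple eligibility check. This is a minimal sketch, not a prescribed procedure: the 90-day period is one reading of "perhaps three months," and the function name and parameters are hypothetical.

```python
from datetime import date, timedelta

# Assumed burn-in length; the article only says "perhaps three months".
BURN_IN = timedelta(days=90)


def ready_for_production(burn_in_start: date, today: date, failed: bool = False) -> bool:
    """Return True if a drive has survived the whole burn-in period."""
    if failed:
        # A drive that failed in the nursery never graduates.
        return False
    return today - burn_in_start >= BURN_IN


# Usage: a drive that entered the nursery on 1 January 2024.
print(ready_for_production(date(2024, 1, 1), date(2024, 4, 1)))
```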
2. Spot surface defects before you put a drive to work
Virtually all disk drives suffer from surface defects on the magnetic media that actually stores data inside the drive assembly. Data stored on a sector with a surface defect can be hard to read, and in some cases the data may not be readable at all. When a drive detects a defective sector because it has difficulty reading from it, it moves the data if it can and stops using the sector to prevent data loss in the future.
The problem with this approach is that it detects bad sectors only after they have been used, and only because of difficulties reading back the data they store. A much better approach is to find and mark as bad all sectors with surface defects before the disk starts storing data.
Perhaps the most effective utility for doing this is a program called SpinRite. The program makes the drive write very weak magnetic patterns onto the disk surface and tests to see if it can then read them back. If it can't read the patterns back from any area of the disk, SpinRite interprets this as an unreliable sector with a surface defect, and it prompts the drive to mark it as bad.
Running SpinRite on every disk before it is put into use, a process that can take several hours, will therefore help prevent defective sectors from ever being used to store data, reducing the chance of data becoming unreadable in the future.
3. Choose your disk drives carefully
Google's research and anecdotal evidence indicate that the reliability of hard drives varies enormously according to their make and model. Although a given disk drive model from a particular manufacturer may prove to be very reliable, there is no guarantee that a different model, whether from a different manufacturer or even the same one, won't prove to be highly unreliable.
This means it's worth keeping a close eye on all your disk drives to spot which models are unreliable. If you establish that a particular model is unreliable, it may be prudent to remove from service all the drives of that model that you are running, although there is no guarantee that the drives you replace them with will not also turn out to be unreliable.
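Spotting unreliable models comes down to comparing failure rates per model across your own fleet. The sketch below assumes a hypothetical inventory of (model, has_failed) pairs. The 10 percent threshold and the minimum sample size are illustrative assumptions, not figures from Google's research.

```python
from collections import Counter


def failure_rate_by_model(drives):
    """drives: iterable of (model, has_failed) pairs -> {model: fraction failed}."""
    totals = Counter()
    failures = Counter()
    for model, has_failed in drives:
        totals[model] += 1
        if has_failed:
            failures[model] += 1
    return {model: failures[model] / totals[model] for model in totals}


def unreliable_models(drives, threshold=0.10, min_drives=20):
    """Models whose failure rate exceeds threshold, ignoring tiny samples."""
    drives = list(drives)
    counts = Counter(model for model, _ in drives)
    rates = failure_rate_by_model(drives)
    return sorted(
        model for model, rate in rates.items()
        if counts[model] >= min_drives and rate > threshold
    )
```

The minimum sample size is there because one failure among three drives says very little about the model as a whole.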
An obvious strategy, particularly for RAID systems, is to use a range of different drive models from different manufacturers. Thus, if a particular model proves to be unreliable, it is unlikely to affect more than one drive in any particular array.
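One way to apply this when building arrays is to group drives so that no array contains two drives of the same model. The following is a rough greedy sketch under that assumption. The function name and the (serial, model) inventory format are hypothetical, and it makes no attempt to match capacities or performance.

```python
from collections import defaultdict


def build_arrays(drives, array_size):
    """Group (serial, model) pairs into arrays with no repeated model."""
    by_model = defaultdict(list)
    for serial, model in drives:
        by_model[model].append(serial)

    arrays = []
    while True:
        # Models that still have spare drives, most plentiful first.
        available = sorted(
            (m for m in by_model if by_model[m]),
            key=lambda m: len(by_model[m]),
            reverse=True,
        )
        if len(available) < array_size:
            break  # Not enough distinct models left for another array.
        arrays.append([(by_model[m].pop(), m) for m in available[:array_size]])
    return arrays
```

Drawing from the most plentiful models first tends to leave more distinct models available for later arrays.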
4. Better too hot than too cool
Google's research found that temperature doesn't affect the reliability of hard drives very much, and certainly not in the way many people expect. In the first two years of a drive's life, it is more likely to fail if it is kept running at an average temperature of 35 degrees C or less than if it is running at over 45 degrees C. That's surprising, and it suggests that spending too much time worrying about air conditioning systems may be counterproductive: a warmer environment actually suits disk drives better.
From year three onward, disks running above 40 degrees C had a higher failure rate than cooler-running drives. Could their higher temperatures have been a symptom of their impending failure? It's not clear that this is the case, with Google concluding that "at moderate temperature ranges it is likely that there are other effects which affect failure rates much more strongly than temperatures do."
5. Don't hang on to your drives for too long
It's impossible to predict exactly when an individual hard drive will fail, but once the drive is two or three years old it is significantly more likely to fail than one that is just a year old. If it's one of a type you've already established to be unreliable, it's likely the risk of failure is even more extreme. To reduce the chances of disk failures in general, keep the average age of your disk fleet young.
It's also worth monitoring SMART data or any other systems that monitor a disk's inner workings to spot individual disks at high risk of failing. Once a disk starts misbehaving (for example, reporting scan errors or reallocating one or more sectors and marking the old ones as bad), it is much more likely to fail in a short period of time.
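A monitoring script can flag such drives by looking for non-zero raw counts on the relevant SMART attributes. This sketch assumes the raw values have already been collected into a dictionary keyed by attribute ID. The IDs used (5, 197 and 198) are commonly reported by drives, but which attributes a drive exposes and how it counts them varies by vendor, so treat the list as an assumption to adapt.

```python
# SMART attribute IDs commonly associated with media problems.
# Assumption: vendors differ in which attributes they report and how.
WATCHED = {
    5: "reallocated sectors",
    197: "pending sectors",
    198: "offline uncorrectable sectors",
}


def at_risk_reasons(raw_values):
    """raw_values: {attribute_id: raw count} -> list of warning strings."""
    return [
        f"{name}: {raw_values[attr_id]}"
        for attr_id, name in WATCHED.items()
        if raw_values.get(attr_id, 0) > 0
    ]


# Usage: any non-empty result marks the drive for closer attention.
print(at_risk_reasons({5: 3, 197: 0}))
```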
Ultimately, reducing the incidence of hard drive failure must be an economic balancing act. Replacing drives before they fail has a cost, but so do drives that fail: they necessitate work that could be carried out more efficiently during planned maintenance periods and, in the worst case, lead to service downtime or data loss.
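The balancing act can be roughed out as an expected-cost comparison. This is a deliberately simplified sketch: the probability and cost inputs are hypothetical figures you would have to estimate for your own environment, and it ignores factors such as the remaining useful life of the drive being replaced.

```python
def should_replace_proactively(p_fail, planned_cost, unplanned_cost):
    """True if the expected cost of waiting for a failure exceeds
    the cost of replacing the drive during planned maintenance.

    p_fail:         estimated probability the drive fails in the period
    planned_cost:   cost of a scheduled replacement
    unplanned_cost: cost of an unscheduled failure (labor, downtime, recovery)
    """
    expected_unplanned = p_fail * unplanned_cost
    return expected_unplanned > planned_cost


# Usage with made-up numbers: an 8 percent failure risk, a 150 planned
# replacement and a 3,000 unplanned failure.
print(should_replace_proactively(0.08, 150, 3000))
```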
Paul Rubens is a journalist based in Marlow on Thames, England. He has been programming, tinkering and generally sitting in front of computer screens since his first encounter with a DEC PDP-11 in 1979.