Recovering a hardware RAID from failing disks

Recovering hard drives from a failing RAID is never fun. By the very nature of RAID you can’t simply pipe files off a single disk, so you’re in for some work.

The fear with drive failure is that more drives will fail during recovery. This is why you want to favour systems like RAID 5 with a hot spare, or RAID 6. The good news is that all is not lost. You have to weigh up your priorities: either you prioritise the data and bring down your system, or you prioritise uptime and run everything in serial, one drive at a time.

For this scenario we’ll imagine a RAID 5 with a hot spare, three active drives and one spare. You have had one drive failure, drive 3, and you’re starting to suspect that the others may not make it. Your RAID card supports hot swapping.

If you have the opportunity before starting the data recovery, it’s advisable to replace your hot spare drive, so that the data going to it has a stable platform to land on.

Running Live
This is when you’re told to keep the lights on and recover from backups if the system goes completely down. This is running data recovery live on your disks.

Step one
If you suspect the hot spare drive (4) at all, swap it out now. Your RAID card might complain, but no matter what it tells you, the array can survive on two drives briefly.

Step two
Now that you have a stable hot spare (4), let the RAID rebuild onto it while you take the failed drive (3) to your workstation. Copy drive (3) to a new drive (5) using a utility like dd or GNU ddrescue. Do your best to get all the data off it. If you’re successful you can put the copy (5) back in your RAID as a replacement for the failed drive (3) and run a check and verify.
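
As a rough sketch of what that copy might look like, here’s a small Python helper that builds two GNU ddrescue passes: a quick first pass that skips the slow scraping phase, then a second pass that goes back and retries the bad areas. The device names (/dev/sdX for the failing drive, /dev/sdY for the new one), the mapfile name and the retry count are placeholders I’ve made up, so check them against your own system before running anything. The -f flag will happily overwrite whatever the target points at.

```python
import subprocess
from typing import List


def ddrescue_passes(source: str, target: str, mapfile: str,
                    retries: int = 3) -> List[List[str]]:
    """Return the argument lists for a two-pass GNU ddrescue copy."""
    # Pass 1: copy everything that reads cleanly, skipping the scraping phase (-n).
    first = ["ddrescue", "-f", "-n", source, target, mapfile]
    # Pass 2: revisit the bad areas with direct disc access (-d) and a few retries (-r).
    second = ["ddrescue", "-f", "-d", f"-r{retries}", source, target, mapfile]
    return [first, second]


def run_copy(source: str, target: str, mapfile: str, dry_run: bool = True) -> None:
    """Print the commands, or run them when dry_run is False."""
    for argv in ddrescue_passes(source, target, mapfile):
        if dry_run:
            print(" ".join(argv))
        else:
            # Raises CalledProcessError if ddrescue exits with a non-zero status.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Hypothetical names: failing drive (3) is /dev/sdX, new drive (5) is /dev/sdY.
    run_copy("/dev/sdX", "/dev/sdY", "drive3.map")
```

Both passes share the same mapfile, which is what lets the second pass pick up where the first left off rather than starting from scratch.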

Step three
At this point you’ve either successfully rebuilt your RAID from drives one and two onto the hot spare (4), or you’ve rebuilt the original RAID using a copy of drive three (5). Don’t start this step until you have a stable RAID. If you have a hot spare at this point you may want to disable it to prevent excessive rebuilds.

Take your stable RAID and remove either drive one or two, whichever you believe to be faulty. Your RAID card will scream and be very upset, but remember that it can run on two drives. Repeat step two with the drive you removed (1): copy it to a new drive (6), put the new drive (6) into the RAID array in its place, and rebuild/verify.

Step four
At this point you can repeat step three with the remaining drive and then enable or add a hot spare as appropriate.

Step five
You’ve now successfully replaced all your drives one by one and not had a system failure. This by itself is a good thing and deserves a cup of tea. While you’re having that cup of tea, run a background consistency test to confirm that everything matches up.

The principle here is that it can be less wear on a drive to simply copy it and add the copy back in than to rebuild several times. It’s also faster, and if you’re using it against a failing drive, GNU ddrescue will have a better chance of getting raw data than most RAID utilities. The advantage of RAID is that it’ll take care of any holes in the data you copy from the failing drive(s).
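
If you’re wondering how the RAID fills those holes, here’s a toy illustration of RAID 5 parity in Python. It’s a simplification under my own assumptions: one stripe, two tiny data blocks and one parity block, with none of the striping or parity rotation a real controller does. The idea it shows is the real one though: parity is the XOR of the data blocks, so any single missing block can be recomputed from the others.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe on a three-drive RAID 5: two data blocks plus their parity.
data_a = b"RAID"
data_b = b"five"
parity = xor_blocks(data_a, data_b)

# Pretend the copy of the drive holding data_b has a hole at this stripe.
# XORing the two surviving blocks gives the missing one back.
rebuilt_b = xor_blocks(data_a, parity)
print(rebuilt_b)  # b'five'
```

This is also why the trick only works for one missing block per stripe: lose the same stripe on two drives and there’s nothing left to XOR against.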

Cold rebuild
If you ask me, which by reading this I’m going to presume you are, this is the better way. Its only drawback is that your system is down for the duration.

Step one
Power down and remove all drives.

Step two
Take the old drives, put them into the external dock on your workstation, and copy them to new drives using GNU ddrescue. If you have enough docks you can run the copies in parallel and greatly speed up the recovery. Recover what’s possible: if you don’t get everything from one of the three active drives, it’s OK as long as you do from the other two. The RAID should rebuild anything you’re missing, as long as there are two functional drives for it to build from.
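
One way to sanity-check that “as long as you do from the other two” condition is to compare the ddrescue mapfiles from two drives and see whether any unreadable areas line up. The Python below is a sketch under a couple of assumptions of mine: that the mapfile layout is comment lines starting with #, one status line, then “position size status” rows in hex with + meaning finished, and that the same byte offset on each member drive belongs to the same stripe. The sample mapfiles are invented for the example.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # (start, end) byte offsets, end exclusive


def unrescued_ranges(mapfile_text: str) -> List[Range]:
    """Return the areas a ddrescue mapfile does not mark as finished ('+')."""
    rows = [line.split() for line in mapfile_text.splitlines()
            if line.strip() and not line.lstrip().startswith("#")]
    ranges = []
    for fields in rows[1:]:  # rows[0] is the status line, not a data block
        pos, size, status = int(fields[0], 0), int(fields[1], 0), fields[2]
        if status != "+":
            ranges.append((pos, pos + size))
    return ranges


def overlapping_holes(holes_a: List[Range], holes_b: List[Range]) -> List[Range]:
    """Return the areas that are missing from both drives."""
    overlaps = []
    for a_start, a_end in holes_a:
        for b_start, b_end in holes_b:
            start, end = max(a_start, b_start), min(a_end, b_end)
            if start < end:
                overlaps.append((start, end))
    return overlaps


# Invented sample mapfiles for two member drives.
DRIVE_1 = """\
# current_pos  current_status  current_pass
0x00300000     +               1
#      pos        size  status
0x00000000  0x00100000  +
0x00100000  0x00001000  -
0x00101000  0x001FF000  +
"""

DRIVE_2 = """\
# current_pos  current_status  current_pass
0x00300000     +               1
#      pos        size  status
0x00000000  0x00200000  +
0x00200000  0x00002000  -
0x00202000  0x000FE000  +
"""

if __name__ == "__main__":
    clashes = overlapping_holes(unrescued_ranges(DRIVE_1), unrescued_ranges(DRIVE_2))
    print(clashes)  # [] because the two holes sit at different offsets
```

An empty list is what you’re hoping for. Anything in it is an area where two drives are both missing data, which is exactly the spot a RAID 5 can’t rebuild on its own.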

Step three
Put the new drives back into the RAID controller and boot up. Run a check and verify on any drives which were losing data. Depending on your RAID card, check and verify may negate the need for a background consistency check, but I’d tend towards doing it anyway.
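
If you kept the ddrescue mapfiles from the copy, they can tell you which drives were losing data. Here’s a minimal Python sketch that totals up the bytes in a mapfile by status and flags any drive with areas not marked finished. It assumes the mapfile layout is a status line followed by “position size status” rows in hex, with + meaning finished, and the mapfile names and contents are made up for the example.

```python
from collections import Counter


def bytes_by_status(mapfile_text: str) -> Counter:
    """Total up the bytes in a ddrescue mapfile by status character."""
    rows = [line.split() for line in mapfile_text.splitlines()
            if line.strip() and not line.lstrip().startswith("#")]
    totals = Counter()
    for _pos, size, status in rows[1:]:  # rows[0] is the status line
        totals[status] += int(size, 0)
    return totals


def was_losing_data(mapfile_text: str) -> bool:
    """True if any area of the drive is not marked finished ('+')."""
    totals = bytes_by_status(mapfile_text)
    return any(size for status, size in totals.items() if status != "+")


# Invented sample mapfiles: one clean copy, one with a small bad area.
CLEAN = """\
0x00300000  +  1
0x00000000  0x00300000  +
"""

PATCHY = """\
0x00300000  +  1
0x00000000  0x00100000  +
0x00100000  0x00001000  -
0x00101000  0x001FF000  +
"""

if __name__ == "__main__":
    for name, text in {"drive1.map": CLEAN, "drive2.map": PATCHY}.items():
        verdict = "check and verify" if was_losing_data(text) else "looks complete"
        print(name, verdict)
```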

Notes

So that’s it, people. There’s no magic to it, just good tactics. You can use similar tactics with a Linux software RAID; I may work through the exact steps for that at some point.

The main point is that a drive which already has most of its content on it from a copy (GNU ddrescue) will require far less rebuilding than a drive that’s completely blank (traditional hot spare behaviour). There’s also the aspect that GNU ddrescue gives you more options for recovering the data segments than the average RAID controller and its bundled software. See the GNU ddrescue manual for more information on usage; just remember you’re recovering a RAID disk, not a disk with a standard partition table.

Ladies and gentlemen, you’ve been a great audience.

Regards,
Robert Small.
