Basic bad hard drive recovery

I get asked about this all the time. People who are considering paying for expensive data recovery ask me how I’m able to pull the majority of their data from a drive that sounds like a prop from Short Circuit. A lot of the time you hear something along the lines of ‘I don’t care what happens, I just need all the Word documents and picture files’, and a lot of the time I can provide more than they expected.

So, what’s the deal? Firstly, you should always pull any drive that is making noises, though before you do, confirm that the noises aren’t coming from your CD/DVD drive (yes, that’s happened). Once you have the drive out you must keep it cool: a cheap house fan on its highest setting, pointed at a drive that sits on a solid surface or is otherwise secured against vibration, will do the trick. I’ve even directed the output from an air-conditioner onto a drive to keep it cool. I don’t have an expert analysis of why keeping it cool helps, but I believe the cold helps prevent issues caused by tight internal tolerances stopping the rotors from functioning when they heat up. In summary, do whatever you need to keep the drive cold and ensure it doesn’t vibrate when you start running it.

You’ve pulled the drive as soon as it started clicking, or a friend gave it to you and made strange shapes with their face to explain the noise it was entertaining their dog with. You know something is wrong because the drive doesn’t boot: it starts, but then crashes with a blue screen or a kernel panic. Now we need the data from it, as much as the drive is able to give you. To do this we use GNU ddrescue.

Getting the most data possible with GNU ddrescue

Plug the drive into your computer while keeping it cold and vibration free, and start up a console. Firstly we’ll pull all the data from it in sectors without retrying; this means that each time it hits an error from the drive it’ll carry on. GNU ddrescue has an excellent additional feature here: it logs these sectors when it gets the error. Once you have the sectors with easy-to-get data you can concentrate on the ones with harder-to-get data and pull what you can from them. Reading the damaged sectors sometimes seems to trigger the failure of the hard drive, so it’s sensible to get the data that you can get from the drive as quickly as possible. To a certain extent, with drive recovery you’re always fighting time and the statistical chance of another failure compounding the problem into an irrecoverable mess. To recap: we’ll grab the data we can get easily, then return for the harder data using different trimming and access modes. This is not a quick operation; the data recovery timelines I usually give are measured in days, not hours.

Pull as much data as possible:
ddrescue --no-split /dev/sdb sdb.iso sdb.log

Return for the missing data. This time we go direct and home in on the remaining sectors (everything already read will be ignored):
ddrescue --direct --max-retries=3 /dev/sdb sdb.iso sdb.log

Still missing data? Retry, marking all the failed blocks as non-trimmed to force it to retry full sectors (sometimes this gets data back when the last three passes couldn’t):
ddrescue --direct --retrim --max-retries=3 /dev/sdb sdb.iso sdb.log
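
If you want a number for how much is still missing before deciding what to do next, you can total up the log file. This is only a minimal sketch, and it assumes the log layout as I understand GNU ddrescue writes it: comment lines starting with #, one current-position line, then one line per block giving position, size and a status character, where + means rescued and - means a bad sector. The function name summarise_log is my own and is not part of ddrescue.

```shell
# Hypothetical helper: total up a ddrescue log by block status.
# Assumes block lines look like "0x00000000  0x00001000  +".
summarise_log() {
    rescued=0; bad=0; pending=0; seen_position_line=0
    while read -r pos size status rest; do
        case "$pos" in
            '#'*|'') continue ;;            # skip comments and blank lines
        esac
        # The first non-comment line is the current-position line, not a block.
        if [ "$seen_position_line" -eq 0 ]; then
            seen_position_line=1
            continue
        fi
        case "$status" in
            '+') rescued=$((rescued + $size)) ;;   # hex sizes are fine in shell arithmetic
            '-') bad=$((bad + $size)) ;;
            *)   pending=$((pending + $size)) ;;   # not tried, not trimmed or not scraped yet
        esac
    done < "$1"
    echo "rescued=$rescued bad=$bad pending=$pending"
}
```

Run it as summarise_log sdb.log. The figures are in bytes; if bad and pending are both zero there is nothing left to chase.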

At this point, if you don’t have all the data, you need to start thinking about cool-down periods: the drive has been running for several hours straight and it probably needs some rest. If you’re doing this from a dedicated computer, shut everything down and unplug the power cables. Let all the power drain from the capacitors, call your client (if you have one) to tell them the status, and let everything sit unpowered for several hours; I would recommend a full 12 hour stretch as a minimum. Sleep on it, and try to remember any backups that might hold the missing data.

Now we’re probably on day two at least: your drive has sat cold for several hours and you’re ready to go again. Make sure the drive will stay cool and free of vibration. This is usually where you start getting desperate; the data doesn’t seem to be coming back, but it still might. Let’s try it again, retrimming with direct access:
ddrescue --direct --retrim --max-retries=25 /dev/sdb sdb.iso sdb.log

Hopefully you have all your data now; if you don’t, most people would deem the rest irrecoverable (on a software level).

I’m still missing data!
Really this data should be written off as gone. If you’re in a business situation then hopefully you’ve got some risk transference in place and you can start the procedure for insurance, or for firing random members of someone else’s team for not taking backups. In a home situation you need to be looking at backup solutions; there are plenty, and some are free.

The data you have may be enough to rebuild everything; your file system and OS may cope with very small losses without much more intervention (see the next section).

So now we’re talking last-resort data recovery. Plenty of people have advice, ranging from freezing the drive to tapping it. It’s even possible to enclose your drive in an anti-static plastic bag and put it in the icebox of an empty fridge; just be careful with any frost or ice, because hard drives and water don’t play especially well together. Another thing reported to help is trying different positions, such as turning the drive upside down or on its side.

Whatever last-ditch attempts you choose, you should be running with infinite retries while you make them:
ddrescue --direct --max-retries=-1 /dev/sdb sdb.iso sdb.log

I have all the data I can get, where next?
I’ll discuss recovering data from the image at a later date, but here’s a quick and dirty ‘will it boot’ method that lets people pull their data back off a new, functional drive.

Copy (dd) the image onto a new drive:
dd if=sdb.iso of=/dev/sdc
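
Before trusting the new drive it can be worth comparing it against the image. The sketch below is one way to do that, not the only one: verify_copy is a name I made up, and it assumes a cmp that accepts -n to limit the number of bytes compared (GNU cmp does). The limit matters because the new drive is usually bigger than the image, so a plain cmp would complain when the image runs out.

```shell
# Hypothetical helper: check that the start of the target matches the image.
# $1 is the image file, $2 is the drive (or file) it was written to.
verify_copy() {
    # Arithmetic expansion strips any padding some wc builds put around the count.
    image_bytes=$(( $(wc -c < "$1") ))
    # Compare only the first image_bytes bytes; the target may be larger.
    cmp -n "$image_bytes" "$1" "$2"
}
```

Something like verify_copy sdb.iso /dev/sdc, run once the drive has been unplugged and replugged, returns success only if every byte of the image reads back identically. Expect it to take about as long as the copy did.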

When the copy is done, give the drive time to spin down, make sure nothing is mounted, then unplug it and plug it back in to ensure everything is read correctly. Unmount anything that mounts automatically (or better yet, have auto-mount off). Now list your file systems and go through them with fsck one by one to confirm that they are usable. For example:
fsck.msdos /dev/sdc1
fsck.ext3 -f /dev/sdc2
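
If the drive has more than a couple of partitions, you can let the machine write the list of checks for you. This is a rough sketch under a few assumptions: plan_fsck is my own name, it expects lines of "name fstype" such as lsblk prints with -rno NAME,FSTYPE, and it only knows about FAT and ext file systems. It prints commands rather than running them, and adds -n so the first pass reports problems without changing anything.

```shell
# Hypothetical helper: turn "name fstype" lines into read-only fsck commands.
plan_fsck() {
    while read -r name fstype; do
        [ -n "$fstype" ] || continue       # whole-disk lines carry no file system type
        case "$fstype" in
            vfat)           echo "fsck.vfat -n /dev/$name" ;;
            ext2|ext3|ext4) echo "fsck.$fstype -n -f /dev/$name" ;;
            *)              echo "# /dev/$name: no check planned for $fstype" ;;
        esac
    done
}
```

Feed it with lsblk -rno NAME,FSTYPE /dev/sdc | plan_fsck, read what it prints, and run the lines you agree with. Once the read-only pass looks sane, drop the -n to let fsck make repairs.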

Put the drive in the target system and boot. Actually, if you want to be forensically correct you should pull all the important data off the drive before booting, but that’s a subject for another day.

I’ll be back with more information on what to do with your recovered image and what to do if the system won’t accept it as a valid partition.

Regards, Robert.


The advantage of delayed upgrades.

Having worked in corporate IT and in smaller environments I’ve seen all sorts of scheduled midnight reboots and random bounces. There are lots of policies and official reasons for rebooting at set times and documenting your procedure all the way up to and including the reboot after an upgrade.

I’d like to put in my two pence (or three cents at the current exchange rate) on the whole issue. Having spent many, many hours of my life on troubleshooting calls, trying to drill down to the exact issues and where they come from after large-scale reboots and upgrades, I’ve developed a theory that I use in my own labs and on the networks that I run.

It’s simple: don’t reboot everything at the same time

This may sound like an overly simple solution, but let’s look at the drill-down in a basic example. Say you have twin (load-balanced) IDS systems in the DMZ between your network and the outside world, then two separate IDS systems, one in your infrastructure network and one in your user network. An update to the IDS signatures is released and you want to roll it out to your network. You have two options: push the update to all devices at once and reboot, keeping the disruption to a single window; or push it to the least important device first and then to the other devices as the update proves itself stable.

If you push out to every device at the same time and the update brings down all your IDS systems, you could be without IDS, or even without connectivity, for an extended period while you revert and reboot every device, and that’s only if you go straight to the IDS systems instead of testing everything else first. If you push to one device at a time, the worst case is that one device goes down and you have to revert it from backup; if that device is redundant you don’t even have an outage.

Although it takes longer, it lets you see which component failed when something does fail. It also stops you pushing an update that pulls down every device in your infrastructure.
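
The one-at-a-time approach can be sketched in a few lines of shell. Everything named here is hypothetical: push_update and health_check are stand-ins for whatever actually deploys the update and tests the device in your environment, and the host names in the usage note are invented. The only real point is the loop, which stops at the first device that fails its check and leaves the rest untouched.

```shell
# Hypothetical hooks: replace the bodies with your own deployment and test steps.
push_update()  { echo "pushing to $1"; }
health_check() { true; }

# Update hosts in the order given (least important first), stopping at the
# first failure so the remaining devices keep running the old version.
staged_rollout() {
    for host in "$@"; do
        if ! push_update "$host"; then
            echo "push failed on $host, stopping"
            return 1
        fi
        if ! health_check "$host"; then
            echo "$host unhealthy after update, stopping before the rest"
            return 1
        fi
    done
    echo "all hosts updated"
}
```

For the IDS example you might call staged_rollout ids-user ids-infra ids-dmz-a ids-dmz-b, so the load-balanced DMZ pair is only touched after the update has proved itself on the less critical devices.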

In a bigger view, you might have code pushes coming from your development environment into your testing environment. You have changes to the code used to generate your database, changes to the service bus, changes to the load balancer settings, changes to the web server and changes to the code running within the web server. Let’s say you have a bug in the service bus code and you don’t know about it, because it worked fine in development, which is a slightly different environment. There are two ways this can go: you can push all the updates and then work through each one trying to diagnose whether it’s related to the sudden collapse of your testing environment, or you can push the updates one at a time, testing after each, and hopefully find the service bus issue immediately after pushing it.

It’s not a perfect concept, since sometimes two updates are tied together, but in a world of ever more decentralised APIs and SOAP calls, which should be version independent, you can break down your updates quite a bit.

It’s something small and it seems obvious, but when it’s followed it can save hours of troubleshooting and firefighting calls.

Fair winds and few fires.


Hello world!

Yes, I went with a cliché. Hello World. I doubt the whole of the world is listening, but it seems that you are.

Some years ago I used to run a blog here and provide humour (a little humor too), reviews and some insight into my projects. Now I will resume this tradition and shed a little light on my experience and experiments.

I hope you’re ready; I’ll see you along for the ride.

Kind regards,
Robert Small.
