Cleaning up dead processes

Sometimes programs are written badly, maybe by you, maybe by other people. Sometimes you end up running those programs. Sometimes those programs leave hanging processes, either waiting for user input that will never come or expecting something somewhere to happen that they've already missed in the grand pipeline of life.

For example: a program is waiting for a socket to close, the socket has timed out, and the program is still waiting for a hard close (for one reason or another). You've debugged this far, but the program is closed-source, a black box from a third-party vendor. You could check each day, or even each hour, but whenever it hangs it locks up another of your processes that's waiting for it to finish, and your whole script gets paused.

This, strangely enough, happened to me. A process gets hung up for one reason or another and won't finish; my script copes with the process being killed, but until then it's also hung. I've heard of this happening with Apache and badly written scripts too. Here's my solution.

#!/bin/bash

while true
do
    # Find PIDs of proc_name whose elapsed time is in the 02:xx:xx range.
    PROC=$(ps -eo pid,etime,comm | awk '$2 ~ /02:..:../ && $3 ~ /proc_name/ { print $1 }')
    if [ -z "$PROC" ]; then
        # Nothing to kill; check again in five minutes.
        sleep 5m
    else
        # Log every process with that name, then the one(s) being killed.
        ps -eo pid,etime,comm | grep proc_name
        echo "Killing $PROC"
        kill $PROC
        sleep 1h
    fi
done

Basically this looks for a process with the name you specify whose elapsed time has reached two hours (well beyond its normal run time) and kills it. You could adjust this to one hour, three hours, ten hours or whatever works best for you. Before killing the process it prints all the processes of that name, plus the one about to be killed, so you can review the logs and see what was seen and what was killed.
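For reference, ps prints etime as [[DD-]HH:]MM:SS, so the pattern /02:..:../ matches a process in its third hour of running. You can test the match harmlessly before arming the kill loop by dropping the { print $1 } action and eyeballing the full lines it selects (the process name here is just a placeholder):

ps -eo pid,etime,comm | awk '$2 ~ /02:..:../ && $3 ~ /proc_name/'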

I know it’s very dirty and a bit of a hack, but it saves some administrative time for more important things.

Good luck!


Using remote trigger files to start a file transfer

The Concept

Sometimes you want a file transfer to start exactly when you're ready: you're generating some files and you want the transfer to begin immediately after the generated files are done.

The Server

The script to sit on the server is simple. Run the command then touch the trigger.

user@Server:~$ generate_files.sh; touch ./trigger-file

This will run and then touch the trigger file. The trigger file starts the next half of the process.
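One caveat worth noting: with ; the trigger file is touched even if the generation script fails. If you only want the transfer to fire on success, chain the commands with && instead:

user@Server:~$ generate_files.sh && touch ./trigger-file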

The Workstation (or second server)

Now all we need to do is wait for that trigger to be created and then start to run the transfer.


#!/bin/bash

# Poll for the trigger file; pulling it with --remove-source-files
# also deletes it from the server, consuming the trigger in one step.
while true
do
    if rsync --remove-source-files --times --timeout 180 --partial --progress -e "ssh -p 22" user@server.dom:/home/user/trigger-file /home/user/ ; then
        echo "Starting rsync based on trigger file."
        break
    else
        echo "Backing off, waiting for trigger file."
        sleep 5m
    fi
done

# Transfer the real files, retrying until rsync exits cleanly.
while true
do
    if rsync --recursive --times --timeout 180 --partial --progress -e "ssh -p 22" user@server.dom:/home/user/source-files/ /home/user/destination-files/ ; then
        echo "Done, rsync completed normally."
        break
    else
        echo "Rsync failed. Backing off and retrying..."
        sleep 5m
    fi
done

Basically this loops until the trigger file exists, then continues on to the transfer itself. The whole process uses rsync over ssh, which means you get the wonderful resuming capabilities of rsync along with the confidentiality, integrity and authentication of ssh.

This is the basis for many scripts which I use on a daily basis to automate file transfers and it’s proven most reliable.
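If you want the workstation side to keep polling across logouts, one simple option (the script name and log path here are just examples) is to launch it detached:

nohup ./wait-for-trigger.sh >> /home/user/transfer.log 2>&1 &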

Regards,
Robert.


Finding duplicate files or the wonderful tool called fdupes

The tool

The tool itself is pretty simple: it works on Linux, Unix and Mac OS X, and it's designed to find duplicate files within a set of directories.

Installing

On Linux/Unix your chosen distro's package manager should have a copy of it; just install it as you usually would.
apt-get install fdupes
emerge fdupes
pacman -S fdupes

On Mac OS X you need MacPorts installed, and then you can go ahead and install it.
sudo port install fdupes

Usage

Imagine the scenario: you have daily backups of a file system, compressed and archived in the same format with the same parameters. Sometimes the file system changes every day, and sometimes it doesn't change for weeks on end. The backups are taken whether the system has changed or not. You have a year of backups and you're sure that about 70% of them are complete duplicates. You're archiving these backups, but the archive media won't take the current size of 200GB. It will take the 30% (60GB) you believe to be truly unique backups, but how do you remove the duplicates?

The easiest way is to run fdupes with the delete and recursive options.
fdupes --delete --recurse /path/to/backups/

You can even run it so it keeps the first entry in each duplicate set and deletes all the rest automatically.
fdupes --delete --noprompt --recurse /path/to/backups/

Just remember that automatic deletion can cause data loss, and you can't easily control which copy will be kept. Experiment first.
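A low-risk way to experiment is to list the duplicate sets (and, with --summarize, how much space they waste) without deleting anything:

fdupes --recurse /path/to/backups/ > duplicates.txt
fdupes --recurse --summarize /path/to/backups/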

Regards,
Robert.


Basic bad hard drive recovery

I get asked about this all the time. People who are considering paying for expensive data recovery ask me how I'm able to pull the majority of their data from a drive that sounds like a prop from Short Circuit. A lot of the time you hear something along the lines of 'I don't care what happens, I just need all the Word documents and picture files', and a lot of the time I can provide more than they expected.

So, what's the deal? First, confirm that the noises aren't actually coming from your CD/DVD drive (yes, that's happened); then pull the drive immediately. Once you have the drive you must keep it cool: a cheap house fan on its highest setting, pointed at a drive sitting on a solid surface or otherwise secured to prevent vibration, will do the trick. I've even directed the output from an air conditioner onto a drive to keep it cool. I don't have an expert analysis of why keeping it cool helps, but I believe the cold prevents issues where tight internal tolerances stop the mechanism from working once it heats up. In summary: do whatever you need to keep the drive cold, and make sure it doesn't vibrate when you start running it.

You've pulled the drive as soon as it started clicking, or a friend handed it to you and made strange shapes with their face to explain the noise it was entertaining their dog with. You know something is wrong because the drive doesn't boot; it starts, but then crashes and blue-screens or kernel panics. Now we need the data from it, as much data as the drive is able to give; to do this we use GNU ddrescue.

Getting the most data possible with GNU ddrescue

Plug the drive into your computer, keeping it cold and vibration-free, and start up a console. First we'll pull all the data from it without retrying: each time ddrescue hits a read error it will simply continue on. GNU ddrescue has an excellent additional feature, though: it logs those failed sectors. Once you have the easy data, you can concentrate on the harder sectors and pull what you can from those. Sometimes those sectors seem to trigger the final failure of the drive, so it's sensible to get the easy data out as quickly as possible. To a certain extent, with drive recovery you're always racing time and the statistical chance of another failure compounding the problem into an irrecoverable mess. To recap: we grab the data we can get easily, then return for the harder data using different trimming and access modes. This is not a quick operation; the data recovery timelines I usually give are measured in days, not hours.

Pull as much data as possible:
ddrescue --no-split /dev/sdb sdb.iso sdb.log

Return for the missing data; this time we go direct and home in on the remaining sectors (everything already read will be ignored):
ddrescue --direct --max-retries=3 /dev/sdb sdb.iso sdb.log

Still missing data? Retry marking all the failed blocks as non-trimmed to force it to retry full sectors (sometimes this causes it to get data when it hasn’t for the last three passes):
ddrescue --direct --retrim --max-retries=3 /dev/sdb sdb.iso sdb.log
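Between passes you can see how much is still unread by inspecting the logfile. In the log format of the ddrescue versions I've used, each data line is position, size and a status character, with '-' marking bad sectors, so something like this lists the areas still failing:

awk '$3 == "-"' sdb.log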

At this point, if you don't have all the data, you need to start thinking about cool-down periods; the drive has been running for several hours straight and it probably needs a rest. If you're doing this from a dedicated computer, shut everything down and unplug the power cables. Let all the power drain from the capacitors, call your client (if you have one) to report the status, and let everything sit idle for several hours: I would recommend a full 12-hour stretch minimum. Sleep on it, and try to remember any backups that might cover the missing data.

Now we're probably at least on day two; your drive has sat cold for several hours and you're ready to go again. Make sure the drive will stay cool and vibration-free. This is usually where you start getting desperate: the data doesn't seem to be coming back, but it still might. Let's try it again, retrimming with direct access:
ddrescue --direct --retrim --max-retries=25 /dev/sdb sdb.iso sdb.log

Hopefully you have all your data now; if you don't, most people would deem the rest irrecoverable (at the software level).

I’m still missing data!
Really, this data should be written off as gone. If you're in a business situation then hopefully you have some risk transference, and you can start the procedure for insurance or for firing random members of someone else's team for not taking backups. In a home situation you need to be looking at backup solutions; there are plenty, and some are free.

The data you have may be enough to rebuild everything; your file system and OS may cope with very small losses without much more intervention (see the next section).

So now we're talking last-resort data recovery. Plenty of people have advice, ranging from freezing the drive to tapping it. It's even possible to enclose the drive in an anti-static plastic bag and put it in an icebox in an empty fridge; just be careful with any frost, because hard drives and water don't play especially well together. Another thing reported to help is trying different positions: turning the drive upside down or on its side.

While performing whatever last-ditch recovery attempts you choose, you should be running with infinite retries:
ddrescue --direct --max-retries=-1 /dev/sdb sdb.iso sdb.log

I have all the data I can get, where next?
I'll discuss recovering data from the image at a later date, but here's your quick and dirty 'will it boot' method for pulling the data back onto a new, functional drive.

Copy (dd) the image onto a new drive:
dd if=sdb.iso of=/dev/sdc
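dd defaults to a 512-byte block size, which is painfully slow for a whole-disk copy; a larger block size (4M here is just a sensible choice, not a magic number) speeds it up considerably:

dd if=sdb.iso of=/dev/sdc bs=4M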

When the copy is done, give the drive time to spin down, make sure nothing is mounted, then unplug and replug it to make sure everything reads back correctly. Unmount anything that mounts automatically (or, better yet, turn automounting off). Now list your file systems and go through them with fsck one by one to confirm they're usable. For example:
fsck.msdos /dev/sdc1
fsck.ext3 -f /dev/sdc2
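Before booting from it, you can also spot-check that your important files survived by mounting a file system read-only (the mount point here is just an example):

mount -o ro /dev/sdc2 /mnt/recovered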

Put the drive in the target system and boot. Actually, if you want to be forensically correct you should pull all the important data off the drive before booting, but that’s a subject for another day.

I’ll be back with more information on what to do with your recovered image and what to do if the system won’t accept it as a valid partition.

Regards, Robert.


The advantage of delayed upgrades

Having worked in corporate IT and in smaller environments I’ve seen all sorts of scheduled midnight reboots and random bounces. There are lots of policies and official reasons for rebooting at set times and documenting your procedure all the way up to and including the reboot after an upgrade.

I'd like to put in my two pence (or three cents at the current exchange rate) on the whole issue. Having spent many, many hours of my life on troubleshooting calls, trying to drill down to the exact issues and where they came from after large-scale reboots and upgrades, I've developed a theory that I use in my own labs and on the networks that I run.

It's simple: don't reboot everything at the same time

This may sound like an overly simple solution, but let's look at a basic example. Say you have twin (load-balanced) IDS systems between your network and the outside world in the DMZ, then two separate IDS systems in your infrastructure network and your user network. An update to the IDS signatures is released and you want to roll it out to your network. You have two options: push the update to all devices at once, reboot, and minimise disruption; or push it to the least important device first and then to the other devices as the update proves itself stable.

If you push to every device at the same time and the update brings down all your IDS systems, you could be without IDS, or even without connectivity, for an extended period while you revert and reboot all the devices; and that's only if you work on the IDS systems immediately instead of testing everything else first. If you push to one device at a time, the worst-case scenario is that one device goes down and has to be reverted to backup; if that device is redundant, you don't even have an outage.

Although it takes longer, this approach lets you pin down which component failed, if one does fail. It also stops you pushing an update that pulls down every device in your infrastructure at once. A rough sketch of such a one-at-a-time push is below.
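Here's a minimal sketch of a staged rollout, assuming hypothetical update and health-check commands on each device (push-signature-update and ids-health-check are made-up names; substitute whatever your vendor provides):

#!/bin/bash

# Order hosts from least to most important.
for HOST in ids-users ids-infra ids-dmz-1 ids-dmz-2
do
    echo "Updating $HOST..."
    ssh "$HOST" push-signature-update || { echo "Push failed on $HOST, stopping."; exit 1; }
    # Give the update time to prove itself before touching the next device.
    sleep 30m
    ssh "$HOST" ids-health-check || { echo "$HOST unhealthy, halting rollout."; exit 1; }
done
echo "Rollout complete."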

On a bigger scale you might have code pushes coming from your development environment into your testing environment: changes to the code that generates your database, to the service bus, to the load balancer settings, to the web server and to the code running within the web server. Let's say there's a bug in the service bus code that you don't know about, because it worked fine in development, which is a slightly different environment. There are two ways this can go: you push all the updates at once and then work through each one trying to work out which caused the sudden collapse of your testing environment, or you push the updates one at a time, testing after each, and hopefully find the service bus issue immediately after pushing it.

It's not a perfect concept; sometimes two updates are tied together. But in a world of ever more decentralised APIs and SOAP calls, which should be version-independent, you can break your updates down quite a bit.

It's something small and seemingly obvious, but when it's followed it can save hours of troubleshooting and fire-fighting calls.

Fair winds and few fires.


Hello world!

Yes, I went with a cliché. Hello World. I doubt the whole of the world is listening, but it seems that you are.

Some years ago I used to run a blog here and provide humour (a little humor too), reviews and some insight into my projects. Now I will resume this tradition and shed a little light on my experience and experiments.

I hope you're ready; I'll see you along the ride.

Kind regards,
Robert Small.