Grep one file against another

This came in handy the other day: I was looking to grep a list of things I needed against a list of things I had available. I’d quite forgotten I knew how to do this until it suddenly sprang on me that I needed to, and I didn’t want to write a script to do it. Of course you could build one big either/or expression, but that loses the fun when you have many different things you’re looking for.

I’ve included both methods here; I prefer -a, but not all copies of xargs support it (MacOSX seems to lack it).

xargs -a want.list -I {} grep {} available.list

cat ./want.list | xargs -I {} grep {} available.list
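
For example, with a couple of hypothetical lists (the file contents here are made up), each line of want.list becomes its own grep pattern:

# want.list contains the names you need:
#   libfoo
#   libbar
# available.list contains what you have:
#   libfoo-1.2.tar.gz
#   libbaz-2.0.tar.gz

xargs -a want.list -I {} grep {} available.list
# prints: libfoo-1.2.tar.gz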

If you have patterns which match multiple entries you can use sort to clean that up.

xargs -a want.list -I {} grep {} available.list | sort -u

cat ./want.list | xargs -I {} grep {} available.list | sort -u

You can take this further: if you’re trying to clean up a file system you could do the following.

xargs -a want.list -I {} grep {} available.list | sort -u | xargs rm

cat ./want.list | xargs -I {} grep {} available.list | sort -u | xargs rm
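
One caveat: a bare xargs rm will mangle file names containing spaces. If your copy of xargs supports -0, as GNU and the BSD versions do, a safer sketch of the same clean-up is:

xargs -a want.list -I {} grep {} available.list | sort -u | tr '\n' '\0' | xargs -0 rm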

This has all sorts of applications throughout text manipulation. I’ve found it especially useful for combing through log files where you’re looking for 20-30 different expressions.

Regards,
Robert Small.


Building stronger networks

This post is at least half inspired by the massive force of water that descended on the Chicagoland region in the summer. What I realised very quickly is that the roads and the water drainage systems were not designed to cope with this at all. Some areas were coping extremely well with the traffic and the rain whereas others were failing absolutely, and the end result was complete failure of the system. In fact I ended up wading – shoes and socks in hand – across the car park to work; but that’s another story entirely.

Aside from the obvious parallel, I had to ask myself what went wrong. The answer was obvious: a few localised failures had caused a patchy outage which brought down the roads as a network. Specific spots on the road had been designed terribly. For example, one expressway dipped down low enough that it became a small lake, reducing three lanes of traffic to one lane crawling through about a foot of water. The rest of the road was held up by this single spot. No redundancy and poor planning had resulted in an outage.

Other spots were causing blockages along my route, small blockages which created massive delays. This is not unlike many corporate and office networks. We build our networks around the exciting and memorable things but forget the details. Your firewall throughput of 10Gb/s may be amazing, but the IPS everything flows through afterwards is acting like that small lake on the expressway. Many networks are built over a long period of time, during which standards change and parts get rebuilt. The classic example is the dual-band wireless router which gets held at 802.11g because of a printer which doesn’t support 802.11n. Although I may well be preaching to the choir with a lot of my readers – given my most followed links go to Cisco.com – I’d still remind you that a poorly placed printer or IPS can ruin an otherwise well designed network.

Some ideas:

  • Place older IPS systems to the side and have them direct their actions to newer routers or layer 3 switches.
  • Attach wireless printers to a separate wireless network, or put them on a print server. The separate wireless network lets you change and upgrade security even if the printers don’t support it. If at all possible I’d have them wired to a separate print server for safety’s sake.
  • Put IPSs on the inside of your external firewall and (if you must) an IDS on the outside. As terrible as it is, I recently read advice in a textbook to put IPSs on the outside of your network; this is silly, because you only want the IPS acting on traffic that has already made it to your network. If you want a more SIEM-style view of the outside then use an IDS, as its passive monitoring won’t slow down the flow.
  • Review your network’s structure at least once a year.
  • Confirm how much data you can actually move over your network and where the blockers are, and test this regularly (see the sketch after this list).
  • Redundancy, redundancy, redundancy. At least have a spare – configured – router to the outside world (or whatever your most important point is).
  • If you work with a small budget, as many of us do, it may be worth segregating networks to two standards. Build a modern network based on the best security possible and have a legacy network for the equipment you can’t replace yet but doesn’t support all the security.
  • Confirm that your performance-related settings are saved to the devices; there’s nothing more embarrassing than a configuration that worked really well before the last power cycle wiped the RAM.
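
For the throughput check, a minimal sketch with iperf3 works well, assuming you can put a copy on a machine at each end of the path you care about (server.example.com is a placeholder):

# on the far end, run a listener
iperf3 -s
# on the near end, measure towards it
iperf3 -c server.example.com
# and measure the reverse direction too
iperf3 -c server.example.com -R

Run this between a few pairs of points and the slow segment usually makes itself obvious.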

Those who know me have already heard the redundancy spiel, but it really does save the day when something goes wrong.

Regards,
Robert Small.


Troubleshooting connectivity

One of the things that I’ve discovered over years of troubleshooting vast and various connection problems is that it’s almost impossible to jump to the right conclusion about a problem. There are always things that surprise you, always aspects that are unexpected and always presumptions that you make about the user.

I’ve developed a set of steps based on conditions I’ve come across; hopefully they help.

This writing isn’t designed to be a great volume on how networking works, just a quick primer for the software developer or system administrator who can’t work out why their stuff doesn’t work.

Firstly check the endpoints
Can they reach out in general? Can both ends touch something else (the Internet, another machine, etc.)? If so, you at least have basic connectivity. Now check your service: can you telnet to it from the same server with telnet localhost 80? I’ve seen many cases where the service either wasn’t started, or the user wanted connectivity to a service that hadn’t even been built.

Where are you listening? Make sure you’re actually listening on a public IP and not just 127.0.0.1.

Test that something is listening with netstat -lt or a similar tool. If you need to test connectivity without a server being built then try using netcat or your favourite replacement.
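
A quick sketch of both checks, with 8083 as a stand-in port:

# is anything listening? (ss -lt is the modern Linux equivalent)
netstat -lt
# no service built yet? stand up a throwaway listener with netcat
nc -l 8083        # some netcat builds want: nc -l -p 8083
# then poke it from the same box
telnet localhost 8083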

Open access test

If you have an open access VPN, a central router you can go through, a server deep inside your internal network, or an open port on the switch connected to your endpoint, then it’s worth testing from there too. This eliminates host-based firewalls, port conflicts and other silent killers.

Don’t believe ping
Quite often you’ll see connectivity tests based entirely on ping, and users claiming connection problems because someone once taught them to ping a place on the Internet as the definitive test. Networks often filter pings; take a look at my previous post.

If your network does pass ICMP traffic then run a traceroute to confirm that there’s a route to the endpoint. Looking at the hops on the traceroute you should be able to list the firewalls and routers which may have rules or ACLs preventing the traffic from passing.
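
For example (server.example.com being a placeholder):

traceroute server.example.com
# if ICMP/UDP probes are filtered, some Linux builds can probe with TCP instead
traceroute -T -p 8083 server.example.com    # usually needs root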

Telnet

Test the connection with telnet. If it connects and gives you the escape character prompt then you’re all good.
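
A successful connection looks something like this (output shown as comments; the host and IP are placeholders):

telnet server.example.com 8083
# Trying 192.0.2.10...
# Connected to server.example.com.
# Escape character is '^]'.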

Check the DNS

If you’re hitting a name instead of an IP, check that it resolves to the right place. Sometimes you’ll need a fully qualified domain name to resolve correctly; you may find your machine resolves server to server.example.com instead of server.example.net.
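
A couple of quick checks, again with placeholder names:

# what does the short name actually resolve to?
nslookup server
# compare against the fully qualified name
nslookup server.example.com
# on Linux, getent also honours /etc/hosts
getent hosts server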

Onwards

If you haven’t figured it out by this point, you may want to start looking at firewall logs and routing tables. If you’re completely lost, try working out what does work and finding out why.

Regards,
Robert Small.


Understanding RAID levels

Why is RAID so difficult to understand? Probably because the algorithms are pretty complex and there are so many levels of RAID. Possibly because it sounds complicated; most likely because it’s too easy to get lost in the details and miss the basic logic.

Today I’m going to explain the basic logic used in the most popular forms of RAID. This is not a perfect mathematical representation, but it should give a technical reader an overview of what is happening within the technology.

This is not a technical specification, nor a guide on how to build your own RAID. It’s simply the fuzzy maths that makes RAID understandable to the project manager or network administrator who isn’t directly involved in storage but has a technical background.

XOR

XOR is something we have to touch on briefly here. The basic principle is that if two bits are identical XOR produces a 0, whereas if they differ it produces a 1.

1 XOR 1 is 0, 1 XOR 0 is 1, 0 XOR 1 is 1, and 0 XOR 0 is 0. That’s all you need to know right now.

RAID 0

RAID 0 is simple: it’s multiple disks with information striped across them. The more disks you have, the more information you can store. If any one disk dies you lose the whole array.

Drive A   Drive B
  1         0
  1         1
  1         0
  0         0
  0         1

Space = number of drives * drive space

Two one terabyte drives equate to two terabytes of usable storage; 2TB = 2 * 1TB.

RAID 1

When we get to RAID 1 things become based on redundancy instead of storage space. Your data is directly duplicated in two separate locations. Every write happens twice and every read happens once, and you can see an increase in read performance as data can be read from two places simultaneously.

Drive A   Drive B
  1         1
  1         1
  0         0
  0         0
  1         1

Space = (number of drives * drive space) / 2

Two one terabyte drives equate to one terabyte of usable storage; 1TB = (2 * 1TB)/2.

RAID 5

RAID 5 is a way to ensure data safety without losing out on all your disk space. The drives have distributed parity, which basically means that each drive carries a little of the information needed to restore the others. In a three drive array each drive carries two thirds data and one third parity; effectively, if you lose one drive out of the three there is enough information held on the other two to recover the array.

So, how does parity work? If 1 XOR 0 is 1 and you lose the 0, you know that only a 0 XOR’d with the surviving 1 could have produced the stored 1, so the missing bit must be 0. This is how you can recover a whole drive from the parts remaining in the array. Simple fuzzy maths, but it explains the concept.
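
If you want to convince yourself, the shell can do the fuzzy maths directly. This is a toy illustration of the idea, not how a controller actually works:

# two data bits and the parity written alongside them
a=1; b=0
parity=$(( a ^ b ))      # 1, stored on the parity drive
# drive B fails; rebuild its bit from the survivors
echo $(( a ^ parity ))   # prints 0, the lost bit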

In the example the parity sum is shown and the parity bit in each row is marked with asterisks to make it easier to identify. Notice how the parity rotates across the drives, so the equivalent of one whole drive is spent on parity across the array.

Drive A   Drive B   Drive C   Parity sum
  1         1        *0*      1 XOR 1 = 0
  1        *1*        0       1 XOR 0 = 1
 *0*        1         1       1 XOR 1 = 0
  0         1        *1*      0 XOR 1 = 1
  0        *1*        1       0 XOR 1 = 1
 *0*        0         0       0 XOR 0 = 0

Space = (number of drives – 1) * drive space

Three one terabyte drives equate to two terabytes of usable storage; 2TB = (3 – 1) * 1TB.

RAID 6

RAID 6 expands on RAID 5 to include a second layer of parity, which provides the ability to survive two drives failing in an array.

No need to show the parity sums here; it’s the same idea as above, except each row carries two independently calculated parity bits distributed across the drives. The equivalent of two whole drives is spent on parity across the array.

Drive A   Drive B   Drive C   Drive D
  1         0         1         0
  1         1         1         0
  1         0         1         0
  1         0         1         1
  0         1         1         1
  0         0         0         0

Space = (number of drives – 2) * drive space

Four one terabyte drives equate to two terabytes of usable storage; 2TB = (4 – 2) * 1TB.

Once again, this is not meant to be a full guide to RAID technologies, just a basic understanding of the mathematics behind them.

Regards,
Robert Small.


Why users should never hear of the ping command.

As harsh as it sounds, this is partially because of the ‘give a mouse a cookie and he’ll try to use it as an umbrella, at which point you’ve done him more harm than good’ scenario.

If you give a user ping then they’ll try to use it to test everything connection-related. That by itself is fine, because ping won’t bring down your network. However, the tickets coming in telling you that ‘Server X won’t respond to pings and therefore there must be a firewall rule preventing their service on port 8083 from working with their laptop on the wireless’ probably will make your life more aggravating, especially when the wireless network blocks all ICMP traffic.

After a long time dealing with connectivity errors I’ve found ping to be more an enemy than a friend. A lot of users reach for ping because someone once made them ping Google to test their internet connection and it stuck as the test for everything. The problem is that ping doesn’t check ports, and many corporate networks block it for security purposes. So when ping fails it means ping is blocked, not port 8083, which often turns out not to be running the service in the first place. This doesn’t even cover the tickets stating that whole swathes of the network are ‘down’ because Joe Bloggs cannot hit them with his ping command.

It also becomes a tool for extremely misguided tickets. I’ve seen ones where no ICMP traffic was allowed across the main outward-facing firewall, and whole tickets were troubleshot and eventually discarded because someone got a 404 error and tried to ping the server. To be fair, some of these indicated a distinct lack of networking knowledge on the part of the person troubleshooting.

It’s unlike me to offer a problem without a solution, so here it is: telnet. If telnet makes a successful connection then a TCP session was established and the firewall is fine. Of course it doesn’t protect you from the people who don’t start their web server and can’t connect to a service that doesn’t exist. Another test, for when you know there’s no service on the other end to connect to, is netcat, but that’s a discussion for another day.

Happy troubleshooting,
Robert Small.


Recovering a hardware RAID from failing disks

Recovering hard drives from a failing RAID is never fun. By the very nature of RAID you don’t really get the chance to just pipe files off a disk, so you’re in for some work.

The fear with drive failure is that more drives will fail during recovery, which is why you want to favour systems like RAID 5 with a hot spare, or RAID 6. The good news is that all is not lost. You have to weigh up your priorities: either you prioritise the data and bring down your system, or you prioritise uptime and run everything in serial, one drive at a time.

For this scenario we’ll imagine a RAID 5 with a hot spare: three active drives and one spare. You have had one drive failure, drive 3, and you’re starting to suspect that the others may not make it. Your RAID card supports hot swapping.

If you have the opportunity before working on your data recovery, it’s advisable to replace your hot spare drive first. Make sure that the data about to go to it has a stable platform to land on.

Running Live
This is for when you’re told to keep the lights on, and to recover from backups only if the system goes completely down. It means running data recovery live on your disks.

Step one
If you suspect the hot spare drive (4) at all, swap it out now. Your RAID card might complain, but no matter what it tells you it can survive on two drives briefly.

Step two
Now you have a stable hot spare (4), let the RAID rebuild onto it while you take the failed drive (3) to your workstation. Copy drive (3) to a new drive (5) using a utility like dd or GNU ddrescue, and do your best to get all the data off it. If you’re successful you can put the copy (5) back in your RAID as a replacement for the failed drive (3) and run a check and verify.
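
A minimal sketch of that copy, assuming the failed drive appears on your workstation as /dev/sdc and the new drive as /dev/sdd (triple-check your device names before running anything like this):

# first pass: grab the easy data quickly, skipping bad areas (-n)
ddrescue -f -n /dev/sdc /dev/sdd rescue.map
# second pass: go back and retry the bad areas a few times
ddrescue -f -d -r3 /dev/sdc /dev/sdd rescue.map

The map file lets ddrescue resume where it left off and records which sectors are still missing.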

Step three
At this point you’ve either successfully rebuilt your RAID from drives one and two onto the hot spare (4), or you’ve rebuilt the original RAID using the copy of drive three (5). Don’t move on until you have a stable RAID. If you have a hot spare at this point you may want to disable it to prevent excessive rebuilds.

Take your stable RAID and remove either drive one or two, whichever you believe to be faulty. Your RAID card will scream and be very upset, but remember that it can run on two drives. Repeat step two with the drive you removed (1), copying it to a new drive (6), then put the new drive into the RAID array and rebuild/verify.

Step four
At this point you can repeat step three with the remaining drive and then enable or add a hot spare as appropriate.

Step five
You’ve now successfully replaced all your drives one by one without a system failure. This by itself is a good thing and deserves a cup of tea. While you’re having it, run a background consistency check to confirm that everything matches up. The principle here is that it can be less wear on a drive to simply copy it and add the copy back in than to rebuild several times. It’s also faster, and against a failing drive GNU ddrescue has a better chance of getting the raw data off than most RAID utilities. The advantage of RAID is that it’ll take care of any holes in the data you copy from the failing drive(s).

Cold rebuild
If you ask me, which by reading this I’m going to presume you are, this is the better way. Its only drawback is that your system is down for the duration.

Step one
Power down and remove all drives.

Step two
Take the old drives, put them into your external dock on your workstation and copy them to new drives using GNU ddrescue. If you have enough docks you can run this in parallel and greatly speed up the recovery process. Recover what’s possible; if you don’t get everything from one of the three active drives that’s OK, as long as you do from the other two. The RAID should rebuild anything you’re missing, just as long as there are two functional drives to build from.

Step three
Put the new drives back into the RAID controller and boot up. Run a check and verify on any drives which were missing data. Depending on your RAID card, a check and verify may negate the need for a background consistency check, but I’d tend towards doing it anyway.

Notes

So that’s it, people. There’s no magic to it, just good tactics. You can use similar tactics with a Linux software RAID; I may work through the exact steps for that at some point.

The main point is that a drive which already holds most of the content from a copy (GNU ddrescue) will require far less rebuilding than a drive that’s completely blank (traditional hot spare behaviour). There’s also the fact that GNU ddrescue gives you more options for recovering the data segments than the average RAID controller and its bundled software. Follow the link above for more information on GNU ddrescue usage; just remember you’re recovering a RAID member disk, not a standard partition table.

Ladies and gentlemen, you’ve been a great audience.

Regards,
Robert Small.


Setting up a Windows 7 Virtual Machine for Cisco SDM and CCP

I recently had to build a Windows 7 Virtual Machine for use with Cisco’s SDM and Cisco Configuration Professional. As you may be aware, these require some specific versions of the supporting software, so it seemed helpful to link to them here.

Firefox 3.5 [SDM]:
http://ftp.mozilla.org/pub/mozilla.org/firefox/releases/3.5/win32/en-GB/Firefox%20Setup%203.5.exe

Flash Player [CCP]:
http://get.adobe.com/flashplayer/

Java 1.6.0 update 10 [SDM]:
http://javadl.sun.com/webapps/download/AutoDL?BundleId=24936&/jre-6u10-windows-i586-p.exe

Java 1.6.0 update 11 [SDM] [CCP]:
http://www.filehippo.com/download_jre_32/4962/

SDM:
http://software.cisco.com/download/release.html?mdfid=281795035&softwareid=283768243&release=2.5

CCP (Express only):
http://software.cisco.com/download/release.html?mdfid=281795035&softwareid=282159854&release=2.7

CCP (Full version, one version behind express at the time of writing):
http://software.cisco.com/download/release.html?mdfid=281795035&softwareid=282159854&release=2.6

Notes

  • I noticed that update 10 of Java seemed to work a little better with SDM but is incompatible with CCP, hence the link to update 11.
  • CCP seems to require IE to be set as the default browser for it to detect Adobe Flash.
  • For CCP you will need to add 127.0.0.1 to IE’s Compatibility View list (the sites IE renders in backwards-compatibility mode) or it will only use half the window.

Installation
Simply install the components that you need in the order listed above.

Enjoy your labs,
Robert.


Cleaning up dead processes

Sometimes programs are written badly, maybe by you, maybe by other people. Sometimes you end up running those programs. Sometimes these programs leave hanging processes, either waiting for user input that will never come or expecting something somewhere to happen that they’ve already missed in the grand pipeline of life.

For example: a program is waiting for a socket close, the socket has timed out, and the program is still waiting for a hard close (for one reason or another). You’ve debugged this far, but the program is closed source, a black box from a third party vendor. You could check each day, or even each hour, but whenever it hangs it locks up another of your processes which is waiting for it to finish, and your whole script gets paused.

This, strangely enough, happened to me. A process gets hung up for one reason or another and won’t finish; my script copes with the process being killed, but until then it’s also hung. I’ve heard of this happening with Apache and badly written scripts too. Here’s my solution.

while true
do
    # find any proc_name process whose elapsed time reads 02:MM:SS,
    # i.e. one that is into its third hour of running
    PROC=$(ps -eo pid,etime,comm | awk '$2 ~ /02:..:../ && $3 ~ /proc_name/ { print $1 }')
    if [ -z "$PROC" ]; then
        # nothing stuck; check again in five minutes
        sleep 5m
    else
        # log what we saw and what is about to be killed
        ps -eo pid,etime,comm | grep proc_name
        echo "Killing $PROC"
        kill $PROC
        sleep 1h
    fi
done

Basically this looks for a process with the name you specify that has been running for two hours (beyond its normal run time) and kills it. You could adjust the etime pattern to one hour, three hours, ten hours or whatever works best for you. Before killing the process it prints all the processes of that name plus the one to be killed, so you can review the logs and see what was found and killed.
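
If you want the watcher to survive you logging out, something like this does the job (reaper.sh being whatever name you saved the loop under):

nohup ./reaper.sh >> reaper.log 2>&1 &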

I know it’s very dirty and a bit of a hack, but it saves administrative time for more important things.

Good luck!


Using remote trigger files to start a file transfer

The Concept

Sometimes you want a file transfer to start only when you’re ready: you’re generating some files and you want the transfer to begin the moment generation finishes.

The Server

The script to sit on the server is simple: run the command and, if it succeeds, touch the trigger.

user@Server:~$ generate_files.sh && touch ./trigger-file

This will run the generation and only touch the trigger file if it completes successfully, so a failed run won’t kick off a transfer. The trigger file starts the next half of the process.

The Workstation (or second server)

Now all we need to do is wait for that trigger to be created and then start to run the transfer.


#!/bin/bash

# Poll for the trigger file. rsync exits 0 only once the file exists
# and has transferred cleanly; --remove-source-files consumes it.
while true
do
    rsync --remove-source-files --times --timeout=180 --partial --progress -e "ssh -p 22" user@server.dom:/home/user/trigger-file /home/user/
    if [ "$?" = "0" ]; then
        echo "Starting rsync based on trigger file."
        break
    else
        echo "Backing off, waiting for trigger file."
        sleep 5m
    fi
done

# Main transfer. Loop until rsync completes normally; --partial means
# each retry resumes rather than restarts.
while true
do
    rsync --recursive --times --timeout=180 --partial --progress -e "ssh -p 22" user@server.dom:/home/user/source-files/ /home/user/destination-files/
    if [ "$?" = "0" ]; then
        echo "Done, rsync completed normally"
        break
    else
        echo "Rsync failed. Backing off and retrying..."
        sleep 5m
    fi
done

Basically this loops until the trigger file exists, then continues on to finish the transfer. The whole process uses rsync over ssh, which means you get the wonderful resuming capabilities of rsync along with the confidentiality, integrity and authentication of ssh.

This is the basis for many scripts which I use on a daily basis to automate file transfers and it’s proven most reliable.

Regards,
Robert.


Finding duplicate files or the wonderful tool called fdupes

The tool

The tool itself is a pretty simple one: it works on Linux, Unix and MacOSX, and it’s designed to find duplicate files within a set of directories.

Installing

With Linux/Unix your chosen distro’s package manager should have a copy; just install it as you usually would.
apt-get install fdupes
emerge fdupes
pacman -S fdupes

With MacOSX you need MacPorts installed, and then you can just go ahead and install it.
sudo port install fdupes

Usage

Imagine the scenario: you have daily backups of a file system, compressed and archived in the same format with the same parameters. Sometimes the file system changes every day and sometimes it doesn’t change for weeks on end, but the backups are taken whether the system has changed or not. You have a year of backups and you’re sure that about 70% of them are complete duplicates. You’re archiving these backups, but the archive media won’t take the current size of 200GB. It will take the 30% (60GB) you believe to be truly unique backups, but how do you remove the duplicates?

The easiest way is to run fdupes with the delete and recursive options.
fdupes --delete --recurse /path/to/backups/

You can even run it so it skips the first entry and deletes all the rest automatically.
fdupes --delete --noprompt --recurse /path/to/backups/

Just remember that automatic deleting can cause data loss, and you can’t always predict which copy will be kept. Experiment first.
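
A cautious way to start is to preview before deleting anything:

# list the duplicate sets without touching them
fdupes --recurse /path/to/backups/
# summarise how much space the duplicates are wasting
fdupes -rm /path/to/backups/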

Regards,
Robert.
