Having worked in corporate IT and in smaller environments, I’ve seen all sorts of scheduled midnight reboots and random bounces. There are plenty of policies and official reasons for rebooting at set times and for documenting your procedure, all the way up to and including the reboot after an upgrade.
I’d like to put in my two pence (or three cents at the current exchange rate) on the whole issue. Having spent many, many hours of my life running troubleshooting calls after large-scale reboots and upgrades, trying to drill down to the exact issues and where they came from, I’ve developed a rule that I use in my own labs and on the networks that I run.
It’s simple: don’t reboot everything at the same time.
This may sound like an overly simple solution, but let’s drill down into a basic example. Say you have twin (load-balanced) IDS systems between your network and the outside world in the DMZ, plus two separate IDS systems, one in your infrastructure network and one in your user network. An update to the IDS signatures is released and you want to roll it out. You have two options: push the update to all devices at once, reboot and minimise disruption; or push it to the least important device first, and then to the other devices as the update proves itself stable.
If you push out to every device at the same time and the update brings down all your IDS systems, you could be without IDS, or even without connectivity, for an extended period while you revert and reboot every device, and that’s only if you immediately suspect the IDS systems instead of testing everything else first. If you push to one device at a time, the worst-case scenario is that one device goes down and you have to revert it to backup; if that device is redundant, you don’t even have an outage.
Although it takes longer, a staged rollout tells you exactly which component failed when something does fail. It also stops you from pushing an update that pulls down every device in your infrastructure at once.
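To make the idea concrete, here’s a minimal sketch of that one-device-at-a-time rollout. The push_update, health_check and rollback helpers are hypothetical stand-ins for whatever your IDS vendor or config-management tooling actually provides; the point is the shape of the loop, not the specific calls.

```python
# Minimal sketch of a staged (one-device-at-a-time) rollout.
# push_update/health_check/rollback are hypothetical placeholders for your
# real tooling; they just print here so the script runs on its own.

import time

# Least important device first, most critical last.
DEVICES = ["ids-user", "ids-infra", "ids-dmz-b", "ids-dmz-a"]

def push_update(device: str) -> None:
    print(f"pushing signature update to {device}")

def health_check(device: str) -> bool:
    # In reality: poll the device, pass test traffic through it, check logs.
    print(f"checking {device}")
    return True

def rollback(device: str) -> None:
    print(f"reverting {device} to the previous signature set")

def staged_rollout(devices: list[str], soak_seconds: int = 5) -> None:
    for device in devices:
        push_update(device)
        time.sleep(soak_seconds)      # let the update prove itself stable
        if not health_check(device):
            rollback(device)          # only one device is ever at risk
            print(f"halting rollout: {device} failed after the update")
            return
    print("update rolled out to all devices")

if __name__ == "__main__":
    staged_rollout(DEVICES)
```

The key design choice is that the loop stops at the first failure, so at most one device is ever running the bad update.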
On a bigger scale, you might have code pushes coming from your development environment into your testing environment: changes to the code used to generate your database, changes to the service bus, changes to the load balancer settings, changes to the web server and changes to the code running within the web server. Let’s say there’s a bug in the service bus code that you don’t know about, because it worked fine in development, which is a slightly different environment. There are two ways this can go: you push all the updates at once and then work through each one trying to work out which is responsible for the sudden collapse of your testing environment, or you push the updates one at a time, testing after each, and most likely find the service bus issue immediately after pushing it.
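A sketch of that second approach, under the same caveats as before: deploy() and smoke_test() here are hypothetical wrappers around whatever your pipeline really uses (migrations, config pushes, a CI smoke suite), and the component list is just the one from the example above.

```python
# Sketch of pushing one change at a time into the testing environment.
# deploy() and smoke_test() are hypothetical placeholders for real tooling.

COMPONENTS = [
    "database schema",
    "service bus",
    "load balancer config",
    "web server",
    "application code",
]

def deploy(component: str) -> None:
    print(f"deploying updated {component} to testing")

def smoke_test() -> bool:
    # In reality: run the integration/smoke suite against the test environment.
    return True

def push_one_at_a_time(components: list[str]) -> None:
    for component in components:
        deploy(component)
        if not smoke_test():
            # The last thing deployed is the prime suspect,
            # e.g. the buggy service bus code in the example.
            print(f"testing broke right after the {component} update")
            return
    print("all updates deployed and tested")

if __name__ == "__main__":
    push_one_at_a_time(COMPONENTS)
```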
It’s not a perfect approach; sometimes two updates are tied together. But in a world of increasingly decentralised APIs and SOAP calls, which should be version independent, you can break your updates down quite a bit.
It’s a small thing and it seems obvious, but when it’s followed it can save hours of troubleshooting and firefighting calls.
Fair winds and few fires.