31 March 2015 / Troubleshooting

My thoughts on handling a system outage

If it's going to happen then it'll happen at the worst possible time. It'll happen that Friday evening just before going home/beer o'clock, it'll be a long weekend, and whatever system dies a death will be the very system that you have logged on to exactly once, several months ago, and that was in error.
In short, it's going to be the one system that you know nothing about and the one that normally just works.

And then the phones will ring with people demanding action, because it just so happens that the boss type wants something from that system before he leaves for the day, and no, it cannot wait, because it has to be right now.

So what do you do in these situations?

Believe it or not, the answer is "nothing" - at least at first.

No matter the issue, and no matter how many people are telling you to get it fixed now, the very worst thing you can do is try things out 'to see if it works'.

You might get lucky, but you probably won't, and by trying things out at random you will turn what is probably a simple thing into an epic hunt to track down what it was you changed, just to get the system back to how it was before you 'just tried something'.

Any system that breaks needs to be treated like a crime scene: there is evidence there of what caused it to break, and that evidence needs to be collected. Something as simple as a reboot may well fix the problem, but it may not, and in rebooting you could lose the very clue that is needed to stop the problem from happening again, possibly on multiple systems. So what to do?

At this point it's very easy to bow to pressure and try something, anything, to get it working and get out of there for the weekend. But this potentially puts you in the situation described above, where you'll be fighting just to get it back to a known, broken state!

Firstly, preserve the evidence. If the box has blue-screened, take a screenshot via DRAC or iLO, or a photo on your phone, and only then reboot it.

Once it has rebooted, grab a copy of the dump file. There are some excellent online tools that will analyse dump files for you.

If the box hasn't blue-screened then try to grab a copy of the state of the machine. What services are running?
What applications are running?
What is its IP configuration?
How busy is it?
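
As a rough sketch of how those questions could be answered in one go, the Python below runs a handful of commands and saves each one's output to a text file. The commands in WINDOWS_COMMANDS are my assumption of a sensible Windows starting set, and the function name and folder layout are made up for illustration, so adjust them to suit the box in front of you.

```python
import subprocess
from pathlib import Path

# Assumed starting set for a Windows box: running services, running
# processes, IP configuration and a single CPU load sample.
WINDOWS_COMMANDS = {
    "services": ["net", "start"],
    "processes": ["tasklist"],
    "ip_config": ["ipconfig", "/all"],
    "cpu_load": ["typeperf", r"\Processor(_Total)\% Processor Time", "-sc", "1"],
}


def capture_state(commands, out_dir):
    """Run each command and save its output as <out_dir>/<name>.txt.

    A command that cannot be run is recorded in its file rather than
    aborting the whole capture, because partial evidence beats none.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = {}
    for name, argv in commands.items():
        try:
            result = subprocess.run(
                argv, capture_output=True, text=True, errors="replace", timeout=60
            )
            text = result.stdout + result.stderr
        except (OSError, subprocess.TimeoutExpired) as exc:
            text = f"could not run {argv!r}: {exc}\n"
        path = out_dir / f"{name}.txt"
        path.write_text(text, encoding="utf-8")
        written[name] = path
    return written
```

Calling capture_state(WINDOWS_COMMANDS, "evidence/friday-outage") (the folder name is just an example) leaves one text file per question, which keeps everything together for later.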

Secondly, preserve the logs.
Take copies of the system and application event logs.
If the application has its own logs then copy those.
Ideally, all logs should already be sent to a syslog server. That is fine for Linux, of course, but what about Windows? There are agents for Windows that will perform this task admirably.
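
Here is a minimal sketch of the log-copying step in Python. It copies application log files into an evidence folder and records a SHA-256 hash of each copy, so you can show later that the copies haven't changed. The function names are hypothetical, and the hashing is my own addition rather than anything the steps above require. For the Windows event logs it only builds a wevtutil export command and leaves running it to you.

```python
import hashlib
import shutil
from pathlib import Path


def export_event_log_command(log_name, dest_dir):
    """Build (but do not run) a wevtutil command that exports one event log.

    'wevtutil epl <log> <file>' exports a Windows event log to an .evtx file.
    """
    return ["wevtutil", "epl", log_name, str(Path(dest_dir) / f"{log_name}.evtx")]


def copy_logs(log_files, dest_dir):
    """Copy each log file into dest_dir and return {file name: sha256 digest}."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    manifest = {}
    for src in map(Path, log_files):
        target = dest_dir / src.name
        shutil.copy2(src, target)  # copy2 also tries to keep the timestamps
        manifest[target.name] = hashlib.sha256(target.read_bytes()).hexdigest()
    return manifest
```

The command list from export_event_log_command("System", "evidence") can be handed to subprocess.run on the Windows box itself. Bear in mind that two source files with the same name would overwrite each other in this simple version.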

So, now you've got some basic evidence, what next?
This all depends on the system, but start with how people access it. For example, is it OK from inside the company but broken from outside? If so, the server is probably fine and you may have a connectivity, firewall or load balancer issue.
If it's broken from inside and out then it's probably the server.
No matter the issue, basic connectivity tests are a good place to start.
Can the server contact its default gateway?
Can the server contact a server in another VLAN?
Can the server contact the internet?
Google's DNS servers at 8.8.8.8 and 8.8.4.4 are wonderful for connectivity tests!
Can the server resolve names? nslookup is the best tool here.
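
The checks above can be sketched as a short Python script that runs them in order and reports the first one that fails. Treat it as a starting point, not a finished tool: the function names are made up, you would supply your own gateway and other-VLAN addresses, and it uses the system ping command plus a plain name lookup in place of nslookup.

```python
import platform
import socket
import subprocess


def ping(host, count=2):
    """Return True if the system ping command exits cleanly for this host.

    Caveat: Windows ping can exit 0 on 'Destination host unreachable',
    so read the output yourself if a result looks odd.
    """
    flag = "-n" if platform.system() == "Windows" else "-c"
    try:
        result = subprocess.run(
            ["ping", flag, str(count), host], capture_output=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def resolves(name):
    """Return True if the name resolves to an address."""
    try:
        socket.gethostbyname(name)
        return True
    except OSError:
        return False


def run_checks(checks):
    """Run (label, callable) pairs in order and return (label, passed) pairs."""
    return [(label, bool(check())) for label, check in checks]


def first_failure(results):
    """Return the label of the first failed check, or None if all passed."""
    for label, passed in results:
        if not passed:
            return label
    return None
```

A typical run would pass something like [("gateway", lambda: ping(GATEWAY)), ("other VLAN", lambda: ping(OTHER_VLAN_HOST)), ("internet", lambda: ping("8.8.8.8")), ("DNS", lambda: resolves("example.com"))], where GATEWAY and OTHER_VLAN_HOST are placeholders for your own addresses. Working outwards like this means the first failure tells you roughly where to look.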

I'll expand more on this in a later article, but suffice to say the seven-layer OSI model can be a handy reference for troubleshooting. Working 'down' the model from application to physical is a good, methodical way to troubleshoot.

In summary, both the logs and the state of the machine represent the digital fingerprint of the issue. It's important to preserve them. It shouldn't take more than a few minutes to gather them, and you need to make sure that you keep it all together.

One other important thing to note: once you've found the problem, if it is a bluescreen or crash due to a bug, driver, patch, etc., it's important to check other systems that could be vulnerable to the same issue, as this could save you or a colleague from another nightmare Friday night troubleshooting scenario.
