This week Blackberry has been hit by two outages, both of which appear to have been caused by single points of failure (SPOFs) within the RIM infrastructure.
In the news today Blackberry said _"The messaging and browsing delays... in Europe, the Middle East, Africa, India, Brazil, Chile and Argentina were caused by a core switch failure within RIM's infrastructure"_ (Source: http://www.bbc.co.uk/news/technology-15243892). They also said "
This immediately causes me to ask a few questions:
- Why wasn't the failover triggered manually?
- What was missed in the testing of the switches' failover?
- Was this an existing issue?
- When was the DR plan last tested?
- Had changes been made which invalidated the DR plan?
And this is the major point of DR testing. You can't cover everything, so sometimes you will have to learn from the failures that hit you and incorporate those failure modes into future testing. But you should also be able to fail over manually so that you can recover quickly from a systems problem.
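To make the point about a manual path concrete, here is a minimal sketch in Python. It is a toy model, not a description of how RIM's switches work: the `SwitchPair` class, its field names and the `auto_failover_works` flag are all hypothetical, invented purely to show an operator-driven failover that does not depend on the automatic logic.

```python
from dataclasses import dataclass, field


@dataclass
class SwitchPair:
    """Hypothetical primary/standby pair; every name here is illustrative."""
    active: str = "primary"
    healthy: dict = field(
        default_factory=lambda: {"primary": True, "standby": True}
    )
    auto_failover_works: bool = True  # set False to model a broken automatic path
    log: list = field(default_factory=list)

    def _other(self, name):
        return "standby" if name == "primary" else "primary"

    def _switch_to(self, target, reason):
        # Refuse to move traffic onto a unit that is known to be down.
        if not self.healthy[target]:
            raise RuntimeError(f"cannot fail over: {target} is unhealthy")
        self.log.append(f"{reason}: {self.active} -> {target}")
        self.active = target

    def on_failure(self, name):
        """Record a failure. Returns True if the service is still being served."""
        self.healthy[name] = False
        if name == self.active and self.auto_failover_works:
            self._switch_to(self._other(name), "automatic failover")
            return True
        # If the standby failed, service continues; if the active unit
        # failed and the automatic path is broken, the service is down.
        return name != self.active

    def manual_failover(self, operator):
        """Operator-driven path that bypasses the automatic logic entirely."""
        self._switch_to(self._other(self.active), f"manual failover by {operator}")

    def serving(self):
        return self.healthy[self.active]


if __name__ == "__main__":
    pair = SwitchPair(auto_failover_works=False)
    print(pair.on_failure("primary"))  # automatic path did not fire
    pair.manual_failover("oncall")
    print(pair.serving(), pair.log)
```

The useful property is that `manual_failover` shares nothing with the automatic decision, so a fault in the automatic path does not also take out the manual one. The log entries are the sort of thing you would later copy into the change record.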
You also have to be absolutely aware of changes that could affect your DR plans. This means every change has to be screened to ensure that you aren't creating an SPOF or, if you are, that everyone is aware of it and plans are put forward to plug that gap.
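One rough way to screen a change for new SPOFs is to model the service path as a graph and check which single components, if removed, cut the clients off from the service. The sketch below assumes a simple directed adjacency map; the node names (`edge-a`, `core-1`, `mail` and so on) and the function names are made up for illustration, and a real topology would need far more detail than this.

```python
from collections import deque


def reachable(graph, src, dst, removed=None):
    """Breadth-first search from src to dst, pretending 'removed' is down."""
    if removed in (src, dst):
        return False
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nxt in graph.get(node, ()):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def single_points_of_failure(graph, src, dst):
    """Return every intermediate node whose loss disconnects src from dst."""
    nodes = set(graph) | {n for peers in graph.values() for n in peers}
    return sorted(
        n for n in nodes - {src, dst}
        if not reachable(graph, src, dst, removed=n)
    )


def spofs_introduced(before, after, src, dst):
    """SPOFs present after a change that were not there before it."""
    old = set(single_points_of_failure(before, src, dst))
    new = set(single_points_of_failure(after, src, dst))
    return sorted(new - old)


# Hypothetical topology: two edge devices, two core switches, one service.
BEFORE = {
    "clients": ["edge-a", "edge-b"],
    "edge-a": ["core-1", "core-2"],
    "edge-b": ["core-1", "core-2"],
    "core-1": ["mail"],
    "core-2": ["mail"],
}
# Proposed change: decommission core-2.
AFTER = {
    "clients": ["edge-a", "edge-b"],
    "edge-a": ["core-1"],
    "edge-b": ["core-1"],
    "core-1": ["mail"],
}

if __name__ == "__main__":
    print(spofs_introduced(BEFORE, AFTER, "clients", "mail"))
```

In this made-up example the change leaves `core-1` as the only route to `mail`, so the screen flags it. Whether you then reject the change or accept it with a documented plan is the process decision described above.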
The key in any major outage is to get the system back up, even if it means failing over manually. However, any steps taken to recover the service should be noted in an emergency change request of some description. Once the systems have been recovered, it is vital that the change notice is thoroughly reviewed to find out both what went wrong and what could still go wrong. Recovering from an outage is one thing, but it's all for naught if that recovery leaves behind a potential problem which will bite you later on.
ITIL processes teach a lot of this, and implementing these practices can be a pain, but it's a choice. You either suffer the pain of the paperwork or the pain of the outage.
At least if potential problems are known about, they can be dealt with more easily when they appear and bite you, and they will appear.
The mobile industry is very much a cutthroat industry, and this dual outage will do Blackberry no good at all because others will seize upon it as a sign of Blackberry's weak infrastructure, and they will be right.
To recover from this, Blackberry need to do a thorough review of their systems and DR processes and ensure that if this happens again they can recover from it very rapidly. They are, after all, reliant on their userbase for their income, and they have failed a major test.