3 June 2017 / Outage

A follow up on the British Airways outage

More details have emerged on the BA outage with a detailed article being posted on the register.

As with my previous blog, I'd like to go through the register article and add my own thoughts.

From this comment it sounds like there was some sort of scheduled maintenance occurring on the UPS systems. This isn't unusual as UPS systems are physical and need maintenance, especially on the battery systems. It's also useful to run the UPS on a test load to see how they cope as UPS systems don't get used that often so when they are required to kick into action and suddenly have a load, that's when they fail.
In this case, it sounds like it was scheduled maintenance that caused the problem.

This is something else that has been mentioned on twitter quite a bit, it does seem that BA were having some issues before hand. Of course, that's not unusual, the more servers you have the more chances that something is going to play up. While almost certainly unrelated to the power issue at the DC, it wouldn't have helped things at all.

The rumours doing the rounds are that this power supply powers both DC's but I find that hard to believe as then you don't have two DC's, you just have the one because a vital component is shared. I suspect that no matter how bad BA are, this is something that they know and have setup properly!

What I suspect actually happened here is that the power went off or was shut off with the UPS in bypass. When this happens it doesn't matter how quickly the generators start up, the datacentre is going down and it's going down without anything being shutdown cleanly.

At this point there is a chance that the engineer who shut down the power down may have turned it back on, all the kit that was shutdown would immediately start up again, drawing too much power on the busbar, tripping fuses and causing another outage. At this point there is also a pretty good chance that power supplies would have failed due to the power being removed, coming back, surging and then failing.
It's always best to have a server go to shutdown on power failure as this prevents it attempting to start up when the power goes out. It's always best to assume that when the power comes back, it might go out again so the last thing you want is everything starting up at the same time and tripping fuses, one of the worst power hogs during start up are SAN disk shelves especially when they are full of spinning disk as it takes a lot of power to kick the SAN into action and to spin up the disks.

This is a very interesting comment as it suggests that the fail over DC could never work. A fail over DC is supposed to fire up in the worst possible scenario, the total failure of the primary DC which is exactly what happened to BA and yet the failover failed due to "corrupted data". Depending on the type of data that is synchronised, there should be checksums to ensure that corrupt data is discarded. This is something that BA need to look into as it's clear that the secondary datacentre could never have taken over due to this "corrupted data" issue. This also suggests that BA have never tested any sort of unscheduled failover.

Another question is if the power maintenance works were communicated to BA, and if they were, could BA have commanded a fail over to the secondary DC?

There is still a lot more that I think we'll learn from this. There are also a couple of other outages I'm following such as the recent Atlassian outage for which they still need to provide an outage report.

A follow up on the British Airways outage

Tracking down the cause of high VMWare network traffic using pktcap-uw

You can't have too much monitoring

Subscribe to Ramblings of a Sysadmin

Subscribe to Ramblings of a Sysadmin