This article was originally written by Naomi Eide and published on CIO Dive.
On Monday, Delta had a no good, very bad, rotten day. Early in the morning, the company had a power outage in Atlanta that led to cascading failures throughout its computer systems, affecting the airline’s operations systemwide.
“Following the power loss, some critical systems and network equipment didn’t switch over to Delta’s backup systems,” the company said in a statement. Delta’s investigation into the root cause is still ongoing.
The outage caused widespread delays and led to the cancellation of about 1,000 flights Monday, but Delta was still able to operate 3,340 of its nearly 6,000 scheduled flights, according to a company statement. Another 250 flights were cancelled early Tuesday.
“We were able to bring our systems back online and resume flights within a few hours yesterday but we are still operating in recovery mode,” said Dave Holtz, senior vice president of Operations and Customer Center, in a statement.
Delta is not the first airline to face widespread computer system failures, and it likely won’t be the last. In 2015, Quartz began tracking the tech glitches plaguing airlines and preventing them from operating normally. Since then, it has tracked 24 significant airline system failures.
“Without knowing more about what really happened, you don’t know whether it’s a black swan event or whether this was a piece of carelessness or cost cutting or poor redundancy design,” said John Parkinson, Affiliate Partner at Waterstone Management Group. But at the root of the problem, “there’s a bit of system architecture issue going on here,” he said.
It’s an event that has led to a lot of downtime and angry customers. Against the backdrop of so many similar delays in recent months, it raises the question of how airlines prepare for such issues and how well they can deal with them. And for CIOs increasingly held responsible for a company’s bottom line, there may be a few object lessons.
The outage affected Delta’s entire network because the company’s control points are centralized. Though running route scheduling, ticketing and check-ins over a single network is cost-effective, the system only runs smoothly while the central control points are available.
“As soon as one piece fails, you tend to get cascade failures,” Parkinson said.
Companies could look toward regionally dispersed control centers, even though they’re a bit more expensive, according to Parkinson.
“It’s much harder to take the whole system down if there’s that level of dispersion in its design,” he said.
Redundancy, redundancy, redundancy
It is possible to avoid widespread system outages, but to do so, companies have to ensure redundancy measures are in place and fully operational.
“I know they had redundancy. Every big company has redundancy,” said Zubin Irani, founder and CEO of cPrime. “My guess is the redundancy didn’t work.”