A World with No Backup Plan

Digital systems and AI are taking over many functions before software designers can make them fail-safe. Is there time to fix this?



See the new issue of Briefings magazine, available at newsstands and online.

American Airlines executives (and hundreds of thousands of the carrier’s passengers) got a serious scare late last November. As the Christmas travel season began, the airline found it had no pilots for 15,000 flights scheduled for the last two weeks of 2017. The problem was—as it often is nowadays—a software glitch. The airline’s algorithm for matching planes and flight crews should have granted time off in accordance with schedule needs and seniority. Instead, it gave time off to any pilot who had put in a request. There was no procedure in place to repair the breach.

Fortunately, American was able to come up with a fairly low-tech solution to a modern-age problem: bring humans in to improvise a solution. The company offered pilots 150 percent of their normal pay to work unstaffed flights, and quickly negotiated other details of an emergency plan with the flight crews’ union.

Another day, another glitch. It was a fairly typical digital hiccup, the sort to which we’re all getting accustomed around the developed world. In organizations, such mishaps ground flights, shut down phone systems, mislead medical devices or crash autonomous vehicles. In private lives, the same phenomenon freezes video calls and creates those moments when you try to type “but maybe I’ll come” and find your instant message “corrected” to read “Bugs Mugabe will cope.”


We shouldn’t be too quick to get used to these surprises. All these supposedly minor incidents add up to a major problem that is worsening with every new step toward a completely digitalized society: More and more often, complex systems of software and hardware fail unexpectedly. And when they do, we discover there is no Plan B and no safety net.

The incidents are increasing as we further digitize our lives and connect more of our tools into one infinitely complicated web. As the science writer Fred Guterl has noted, in 21st-century society, all important infrastructure is computer-controlled, waterways as much as airports. Getting a barge down a river, he says, depends on “water-level monitoring, navigation, signaling and locks—all of which are in some way under computer control. Even the rivers are digital.”

So are the people. As the game designer and Georgia Tech professor Ian Bogost points out, we are relentlessly turning all the familiar devices of home and work—the can opener, the toaster, the garden hose, the overhead light—into computers. Today, for example, you can buy smartphone-enabled bike locks, faucets, propane tanks, juicers and baby monitors. The market for these sorts of Internet of Things devices, all capable of gathering and crunching data and of talking to each other, is expected to reach more than $275 billion a year by the end of the decade.

“There’s not much work and play left that computers don’t handle,” Bogost wrote in The Atlantic. “People don’t seek out computers in order to get things done; they do the things that let them use computers.” Why else do you need a blender that reports to your smartphone?

Indeed, it is a transformation that is easy and satisfying—so much so that most of us are on this road without giving it much thought.

Until, that is, something goes wrong.

Many of these surprise failures don’t end nearly as well as the American Airlines incident. On July 8, 2015, trading on the New York Stock Exchange was suspended for four hours because a software update didn’t go as expected. That very likely contributed to a 1.7 percent drop in the S&P 500. Worse, in May that same year, an Airbus A400M Atlas military transport crashed near Seville, Spain, after computerized controllers slowed three of its four engines. The cause was a mistakenly erased file, which caused the digital engine controllers to misread data from the engines. Four crew members were killed. Similarly, Asiana Airlines blamed poor software that “led to the unexpected disabling of airspeed protection without adequate warning to the flight crew,” resulting in a Boeing 777 crash near San Francisco in 2013. Three people were killed.


According to a review of FDA data by a group of thoracic surgeons, surgical robot mishaps were involved in 144 deaths and 1,391 patient injuries from 2000 to 2013. To be sure, as critics point out, that’s an extremely low failure rate amid a total of 1.7 million surgeries, and no one compared it with humans-only surgeries. But the study makes clear that no one should assume that robots will always perform flawlessly.

Nor, apparently, can we count on the US system for emergency calls to firefighters, police or medical help. That system was once a string of locally run exchanges. Years ago, programmers created a national 911 system that included a simple piece of code telling the server to assign 40 million unique ID numbers. It reached that limit on April 10, 2014, leaving 11 million people in the entire state of Washington and parts of California, Florida, North and South Carolina, and Minnesota with no emergency-call service for six hours that night. The calls that met a busy signal or dead silence, according to an FCC report on the incident, included “calls reportedly involving domestic violence, assault, motor vehicle accidents, a heart attack, an overdose and an intruder breaking into a residence.” (That last caller, in Seattle, tried 37 times before finally chasing away the attacker with her kitchen knife.)
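The failure mode described above, a counter that quietly runs into its ceiling, can be made concrete with a small sketch. Everything in it is hypothetical: the class name, the early-warning threshold and the fallback route are illustrative assumptions, not details of the actual 911 software or of the FCC report. It shows one possible Plan B: warn before the limit is reached, and send calls down a fallback path instead of dropping them once it is.

```python
class CallIdAllocator:
    """Hands out unique call IDs up to a fixed limit (hypothetical design)."""

    def __init__(self, limit, warn_percent=90):
        self.limit = limit
        # Integer arithmetic avoids floating-point surprises in the threshold.
        self.warn_at = limit * warn_percent // 100
        self.issued = 0
        self.alerts = []  # stand-in for paging a human operator

    def next_id(self):
        """Return the next ID, or None once the limit has been reached."""
        if self.issued >= self.limit:
            self.alerts.append("exhausted")
            return None
        self.issued += 1
        if self.issued == self.warn_at:
            # Raise the alarm while there is still time to act.
            self.alerts.append("warning")
        return self.issued


def route_call(allocator):
    """Route a call normally, or fall back when no ID is available."""
    call_id = allocator.next_id()
    if call_id is None:
        # Assumed fallback: hand the call to a backup path rather than drop it.
        return "fallback"
    return f"routed:{call_id}"
```

With a limit of 10, the ninth call triggers the warning, the tenth is routed normally, and the eleventh takes the fallback path rather than meeting silence.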

To be sure, one reason our intelligent systems stumble like this is our own overestimation of technology. Many people confuse artificial intelligence with the human version, blinding themselves to technology’s limitations, argues the AI pioneer Alan Bundy, a professor of automated reasoning at the University of Edinburgh in Scotland. “Any machine that can beat all humans at Go must surely be very intelligent, so by analogy with other world-class Go players, it must be pretty smart in other ways too, mustn’t it?” he recently wrote in Communications of the Association for Computing Machinery. “No! Such misconceptions lead to false expectations that such AI systems will work correctly in areas outside their narrow expertise.”

The fact is, Bundy said, even the most advanced computing technology has been designed to master a narrow range of functions. Outside of those, it is no smarter than a person, and often—because of limits on our ability to map and create intelligence—is quite a bit dumber. Failing to remember this can lead to overconfidence that a complex, algorithmically controlled system cannot go wrong.

Such assumptions probably fostered one of the first deadly failures of a high-tech software-controlled system. Over the course of a few months in 1985 and 1986, a radiation-therapy device called the Therac-25 overexposed six patients to its rays, leaving four dead and two seriously injured.

Predecessor versions of the machine had been operated by a human technician; they’d also had mechanical fail-safes that made it impossible for the technician to exceed safe dosage levels. But the Therac-25 passed many of those former human-operated tasks to the computer-controlled system. And the design removed the hardware safety features, relying on the software to detect and respond to trouble. According to a report on the failure by MIT professor Nancy G. Leveson, an expert on software flaws, the Therac-25’s human operator sat at a computer terminal, reading lookalike messages that didn’t distinguish minor problems from life-threatening anomalies. The machine reported “malfunction” all the time, usually for very minor problems. Operators became used to responding to the messages by quickly resuming the treatment.
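The lookalike-message problem lends itself to a short illustration. The sketch below is not the Therac-25's software, and none of its names come from Leveson's report; the severity levels, fault codes and supervisor check are assumptions made for the example. It shows one way an interface could separate a nuisance message from a dangerous one: a minor fault can be dismissed at the terminal, while a critical fault cannot be resumed without a second, independent sign-off.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    MINOR = 1
    CRITICAL = 2


@dataclass(frozen=True)
class Fault:
    code: str           # hypothetical fault identifier
    severity: Severity  # tells the operator how serious the fault is


def can_resume(fault, supervisor_cleared=False):
    """Decide whether treatment may resume after a fault.

    Minor faults can be dismissed by the operator alone. Critical
    faults stay locked until someone else has cleared them.
    """
    if fault.severity is Severity.MINOR:
        return True
    return supervisor_cleared
```

The design choice is that habit alone cannot override a critical fault: however often the operator presses "resume," the lock holds until the separate clearance is given.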

In this tragedy, of course, the digital process did involve human beings, but in a way that made it impossible for them to take meaningful action. In other circumstances, people aren’t blinded by confidence in computers—they’re simply blind. Many of the systems on which we depend can’t be completely grasped by the human mind.

Ultimately, it isn’t only overconfidence, or lack of knowledge about what software does, that keeps us from making our new tech fail-safe. Unlike the mechanical systems of the 20th-century industrial age, software systems aren’t rooted in physical reality. Code doesn’t have any natural connection to the objects and processes it is controlling. A screwdriver will fit into a screw, and a rivet will bind steel pillars, but the zeroes and ones of a program could be supplying recipes or plotting missile trajectories. At the level most coding is done, there is no difference.


This abstraction from the world it controls means that code can be technically flawless and nonetheless go wrong—because its programmers didn’t anticipate the problem it tripped over, or how it would interact with other software, or how people would use it. During last December’s massive fires in Los Angeles, authorities were forced to warn drivers not to follow directions from mapping apps. The reason: The apps, designed to steer users to roads clear of traffic, were sending drivers toward highways that were empty—only because they were engulfed in flames. “Software failures are failures of understanding and of imagination,” wrote author and programmer James Somers in a recent article in The Atlantic.
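The mapping-app example can be reduced to a few lines. This is a minimal sketch under stated assumptions: the road names, travel times and the `closed` flag are invented for illustration, and real navigation software is far more elaborate. The first function is technically correct for the problem its programmer imagined, which was to find the fastest road. The second adds the piece of the real world that the first one leaves out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Road:
    name: str
    travel_minutes: float  # estimated from current traffic
    closed: bool = False   # hypothetical flag fed by emergency services


def fastest_naive(roads):
    """Pick the road with the least traffic, and nothing else."""
    return min(roads, key=lambda r: r.travel_minutes)


def fastest_open(roads):
    """Pick the fastest road that is actually safe to drive on."""
    open_roads = [r for r in roads if not r.closed]
    if not open_roads:
        return None  # no safe route: tell the driver instead of guessing
    return min(open_roads, key=lambda r: r.travel_minutes)
```

Given a jammed boulevard and an empty freeway that is closed because of fire, the naive function chooses the freeway precisely because it is empty; the second function chooses the boulevard.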

To counter this, and offer more protection, most experts say the effects of complexity, speed and abstraction must be ameliorated. Somers, for one, argues that programmers need to be encouraged to think about the real-world problems their algorithms are supposed to solve.

Another possible defense is to pay more attention to the shape of the human minds that must interact with machines. For instance, a growing field of research in autonomous cars concentrates on how to make them able to deal with human beings. It’s an essential task as the sector ramps up. The state of California, which tracks autonomous-car accidents, has found that the vast majority are caused by the fact that robot cars drive like, well, robots. Human drivers don’t expect other cars to halt completely at every stop sign or obey every speed limit and traffic sign; failure to recognize this caused almost all 43 fender-benders involving self-driving cars in the state. One famous fatal accident involving a computerized driving system, the Florida crash that killed Joshua Brown as he drove in his Tesla, occurred after the human ignored warnings (to place his hands on the wheel) that a computer would have attended to.

Most digital development doesn’t take place in such a realm of far-reaching principles. In the main, it is high-pressure, fast-paced work, in which the temptation is usually to just solve the problem at hand with whatever is handy (including ready-made, “off the shelf” code or old legacy code that can be tweaked). Nonetheless, as the disconcerting failures without backup occur more and more often, the stakes are becoming clear. We need to pay more attention to the digital home we are making for ourselves.
