Practicing Safe Software
From the Gemini and Apollo programs to today’s space shuttle missions, computer programmers have learned to live with bugs.
The computer I used to write this article measures 8 inches by 11.5 inches by 2 inches. It weighs six pounds. Although already obsolete (it has only a 286 processor), it is faster and has more memory than the 70-pound computer on board the Apollo spacecraft that carried astronauts to the moon and back. Yet my computer cannot calculate a spacecraft’s speed and heading, then calculate the small rocket burns necessary to change that trajectory; nor could it help the lunar module make a soft landing and then aid its rendezvous with the command and service module for the return to Earth. What does my computer lack? Software—the set of instructions that told the Apollo Guidance Computer how to do its job.
By the numbers, Apollo guidance and navigation software is not very impressive. It totals some 40,000 lines of computer code. A typical word processing program is many times larger. Xyquest’s XyWrite 4.0, for example, runs 400,000 lines. What is impressive about the Apollo software is its reliability; lives depended on it at a time when computer programming was in its infancy. As Margaret Hamilton, who directed the programming of all Apollo onboard software at the Massachusetts Institute of Technology’s Instrumentation Laboratory, put it in one of the almost daily memos she wrote during that time, “One of the main differences between the Apollo software and other software is that the former had to work the first time it was ‘tested’ in its real environment. There was no second chance.”
By the time the space shuttle started flying in 1981, the techniques to keep software reliable had advanced. With the accelerated improvements in computer hardware that followed the introduction of the integrated circuit, programmers could rely more and more on the processor’s memory and speed to automate communication between man and machine. And building on the experiences of Apollo and other computer-intensive projects, programmers themselves grew wiser. Programmers have learned how software breaks, according to Robert Hinson, chief of the Shuttle Data Systems Branch at NASA’s Johnson Space Center in Houston. And yet during a mission as recent as 1992, a space shuttle computer became stymied while executing a program it had run millions of times before. Programmers have also learned that bugs can hide, only to appear at the most inconvenient times.
The story of Apollo software reliability begins years before the first moonshot; one might trace it to a launch almost exactly seven years before the Apollo 11 landing, a launch remembered for one of the most spectacular bugs in space software. John Norton, a guidance software expert with TRW, watched the pre-dawn launch of the Mariner 1 space probe from Cape Canaveral on July 22, 1962, with his fingers crossed. As the guidance control officer for the Atlas booster rocket, he was responsible for the first five minutes or so of the flight, until the Atlas finished its job and separated from the Agena upper stage. At that point, Mariner would be on its way to Venus. But two errors doomed Mariner 1.
First, the guidance software contained a tiny bug. A symbol was missing from the guidance equations, part of the specifications that the programmer used to write the computer code. The missing symbol was a bar, which in mathematical notation signifies taking an average of the variable beneath the bar. The ground-based guidance computer needed averaged data in order to share the data between the two radar systems that guided the rocket. One of these systems failed during launch: the second error. The launch could have succeeded with just the remaining radar—except for the missing bar in the software. As a result of that omission, the computer processed the data incorrectly, saw erratic behavior where there was none, and, in trying to correct the “problem” (with telemetry to the rocket), caused true erratic behavior. And that’s what the range safety officer noticed four and a half minutes into the flight, causing him to destroy the rocket.
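The effect of the missing bar is easy to demonstrate. The sketch below (written in modern Python, with invented numbers; it bears no resemblance to the actual Atlas guidance equations) shows how a guidance loop steering on raw, jittery data from a single radar chases every wiggle, while one steering on the averaged value issues one small, steady command.

```python
# Toy illustration only -- invented numbers, not the Atlas guidance equations.

def correction(rate):
    """Steering command proportional to the apparent rate error (target is zero)."""
    return -0.5 * rate

noisy_rates = [0.2, -1.8, 2.1, -1.9, 2.0]        # jittery track data from one radar
average = sum(noisy_rates) / len(noisy_rates)    # the "bar": about 0.12

print([round(correction(r), 2) for r in noisy_rates])  # chases every wiggle
print(round(correction(average), 2))                   # one small, steady command
```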
The Mariner 1 bug has become the stuff of myth. Computer programming textbooks tell the story in introductory chapters as a cautionary tale. Norton did not write the code, but he had ultimate responsibility for approving it. As a result, he became the subject of a second myth to come out of the Mariner incident. As Hamilton states the myth: “Norton took the crash very hard and devoted his life to finding errors in Apollo.”
As with most myths, there is probably an element of truth to this one. Norton did carry a newspaper report of the accident in his wallet for years, and the incident could have ratcheted up his already legendary vigilance. Barry Boehm, a former chief engineer and colleague of Norton’s at TRW, where Norton is still a senior software engineer, says programmers there coined the term “Nortonize.” “If your design had been ‘Nortonized,’ ” says Boehm, “you had a significantly higher level of confidence that it would work.”
“Mariner was several years before Mercury, and that was the scary part,” says Norton today. “We fully realized we could not guarantee perfection.” But he worked to get the Apollo flight software as nearly perfect as it could be. He was hired by NASA to examine the code for anything wrong or inconsistent or just plain unusual.
For example, the astronauts wanted displays in feet per second, but most calculations used meters per second; Norton checked the conversions. Or when a program was converting angles around a circle to units called radians, the programmers used 22/7 as the value of pi, which, while not wrong, is not as accurate as the decimal approximation, 3.14159.
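The arithmetic behind both checks is simple enough to verify on any computer. In this sketch (modern Python; the velocity figure is invented, the conversion factor is the standard one), the error in 22/7 amounts to about four hundredths of a percent:

```python
import math

# 22/7 versus the decimal value of pi
print(22 / 7)              # 3.142857...
print(math.pi)             # 3.141592...
print(22 / 7 - math.pi)    # about 0.0013, an error of roughly 0.04 percent

# Converting a meters-per-second calculation to a feet-per-second display
METERS_TO_FEET = 3.28084
velocity_mps = 1524.0                  # invented figure
print(velocity_mps * METERS_TO_FEET)   # about 5,000 feet per second
```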
Most significantly, Norton double-checked the program line by line, translating it into the guidance equations the code was directing the computer to solve. The results of this unautomated computing Norton calls “programmed equations.”
The program that Norton annotated was written in assembly language, each line a cryptic, hard-to-read instruction only barely removed from machine code—the 0’s and 1’s computers understand. “It’s very difficult to pick up an assembly language listing—no matter how well annotated—and figure out what was going on,” says John Garman, one of the computer experts at mission control during Apollo. Garman says Norton’s documentation was “almost a handbook for the onboard software,” and the distribution list for the programmed equations grew from 33 to 198 copies.
By writing out the equations, Norton studied what the computer was being asked to do. “Programmed equations,” he says, “was the difference between riding in the car as a passenger and driving for yourself. By driving yourself, you have to pay attention to all the details.”
“Norton found more errors by scanning than all the errors found by testing,” says Hamilton. He was so fast and so thorough that Hamilton and others at MIT and NASA, most of whom had no contact with him beyond his memos, formed a picture of Norton working late, subsisting on TV dinners, and churning out programmed equations overnight: fast, precise, computer-like. Even today’s chief software luminary, Microsoft’s Bill Gates, recalls that as a senior in high school, he idolized Norton. “He was a god!” Gates told the authors of the book Gates. “He would take a piece of source code home, come back and just totally analyze the thing. Just a high-IQ act.”
But even with a secret weapon like Norton, MIT’s instrumentation lab subjected Apollo software to “endless testing,” in the words of Garman, who is still at NASA’s Johnson Space Center. Onboard software went through six levels of testing before it ever left MIT. First, small modules of code that performed just a single algorithm were tested to make sure they were computationally correct. Each subsequent level of testing checked the code at increasing levels of integration to verify that separate modules worked together, passed data back and forth, and shared the computer’s tiny erasable memory correctly.
The resulting Apollo software exhibited a feature that, though common today, was innovative for its time and contributed to its robustness in the face of uncertainty. As an engineer would put it, the software was asynchronous and priority-driven. That means that if it is running one task and another with a higher priority comes along, the computer saves the interim results of the lower-priority job and starts the more important one. When finished with the high-priority task, the computer picks up where it had left off. That contrasts with the then-more-common “boxcar” approach, in which tasks are carried out in a specific order, one after another, with each cycle repeated until finished. The main safeguard of Apollo’s priority-driven system was that the computer could not be prevented from performing a critical function by getting hung up on a potentially unsolvable problem; it would be less likely to get caught in a loop, in other words. The Apollo computer had 20 milliseconds to complete a cycle. At the end of that period, the computer would begin again with the highest priority functions.
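The flavor of the approach can be captured in a few lines of modern Python (the task names, priorities, and timings below are invented, and the real Apollo executive was 1960s assembly code, not Python):

```python
# A minimal sketch of priority-driven scheduling -- not the Apollo executive.
import heapq

CYCLE_MS = 20  # the Apollo computer had 20 milliseconds per cycle

def run_cycle(tasks):
    """Run the highest-priority tasks first; whatever doesn't fit in the
    cycle is bumped to the next one instead of hanging the computer."""
    queue = list(tasks)            # (priority, name, cost); lower number = more urgent
    heapq.heapify(queue)
    elapsed, bumped = 0, []
    while queue:
        priority, name, cost = heapq.heappop(queue)
        if elapsed + cost > CYCLE_MS:
            bumped.append(name)    # deferred to the next cycle, not fatal
            continue
        elapsed += cost
        print(f"ran {name} ({cost} ms)")
    return bumped

# Guidance and steering fit; a low-priority display update gets bumped.
print("bumped:", run_cycle([(1, "guidance", 9), (2, "steering", 8), (5, "display", 6)]))
```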
At the Manned Spacecraft Center (now the Johnson center), the software was put through its final test in “integrated simulations” involving the astronauts and the flight controllers. “In running these simulations, which tied mission control to the crew chamber, people played like it was real, but the failures were faked,” says Garman. He and the simulation instructors cooked up some failures involving computer bugs. On the last integrated simulation, 11 days before the launch of Apollo 11, a program alarm went off during the descent of the lunar module. Steve Bales was the controller in charge of guidance for the LM, and he had no idea what the alarm meant. He called an abort, with the LM 10,000 feet above the lunar surface. “I had a hard time explaining my actions” after the simulation, Bales says. “Something was going on we didn’t understand, so I thought we should abort.” The program alarms were in part debugging aids, useful to programmers as they developed the programs; they were built in to let a programmer know that the computer was overloaded, unable to finish all the tasks in its execution frame. Mission planners never expected them in real time.
After the aborted simulation, flight director Gene Kranz assembled the controllers, Garman remembers, and told them to develop a response for every program alarm. There were about 40 alarms. “Most were innocuous,” Bales says, “but about 10 were in a class requiring judgment.” For these, Garman says, “the notes we wrote were to the effect that if the alarm doesn’t happen too often and nothing else seems wrong, then the best thing is to just proceed.”
As it happens, Bales was the guidance controller on duty for Apollo 11’s landing on the moon. Exactly 316 seconds into the descent, Buzz Aldrin reported a “1202” program alarm, one of those requiring judgment. Forty seconds later the alarm repeated.
“That was a shock to our system,” says Bales. “We had 10 to 15 seconds to decide what to do. I remember Jack [Garman] talking in my ear, saying ‘It’s not coming too fast, it’s the same type we had before.’ ” Bales called “Go” to the flight director. The alarms recurred three more times before the landing. Because of this distraction (and because they had to fly past the landing site, which was strewn with boulders), the astronauts lost track of where they were, and it took mission control a few hours to pinpoint their location.
It took even longer to determine why the alarms occurred, but the source turned out to be extraneous data from the rendezvous radar. The radar had no role to play in the landing but would be used by the LM after takeoff from the moon for return to the command module. Initial mission procedures called for the radar to be shut off during the landing, but at the last minute it was decided to leave the radar on in case the landing was aborted and it was needed. What mission planners didn’t realize was that while the LM computer was busy carrying out the tasks necessary for landing, it was also processing data from the rendezvous radar.
“The computer was interrupting itself hundreds of times a second, adding and subtracting bits from memory,” says Garman. “Just the act of doing that addition and subtraction stole 15 percent of the computer’s available time.” Carrying out the tasks necessary for landing took about 85 percent of the computer’s available time, so the added work sometimes pushed the computer to the end of the cycle before all tasks were completed, triggering the alarms.
“Had the radar noise problem taken 20 percent of the computer’s time, it’s not clear we could have landed,” says Garman.
“Our software saved the mission,” Hamilton says, “because it was asynchronous—it bumped low-priority tasks. Without it, the mission would have aborted or crashed on the moon.”
Software and a quick-thinking programmer also saved the lunar landing of Apollo 14. In the lunar module Antares, Alan Shepard and Edgar Mitchell were on their 13th revolution of the moon, preparing for their powered descent to the surface. Back at mission control, flight controllers monitoring Antares’ instruments received a jolt: intermittent abort signals from the LM. It was as if one of the two abort buttons had been pushed, though of course it hadn’t. Although the buttons had no effect during the lunar orbit phase, as soon as powered descent began, an abort signal would cause the computer to activate the ascent engines and begin other steps to facilitate a rendezvous with the command module. An abort signal, in other words, would end the mission.
Alerted to the difficulty, Mitchell opted for what frustrated homeowners confronted with balky electronics always try first: he tapped the instrument panel with his penlight. The abort light went off. When the light came on several more times, Mitchell again tapped the panel, each time with the same effect, indicating to him “that we had a foreign object, probably a solder ball, floating around in the switch” and causing intermittent short circuits.
Two hundred forty thousand miles away, Don Eyles, the man who had written the program for the lunar landing, was in his office at MIT’s instrumentation lab. It was after midnight, but it was customary for contractor personnel to be on call during missions, and Eyles’ software was on the line. The hardware was at fault, but successful continuation of the mission would depend on software. Notified of the faulty abort signal, Eyles grabbed the program code. “My first reaction was that it wasn’t so serious,” he remembers. “But when the signal repeated, I thought there might be no good way around it. Then I looked at the code and it became an ingenuity thing, a problem to solve. I saw it as my responsibility—it was my code. If anyone was going to see a way around it, it was me.”
By all accounts, Eyles was the right person for the job. Fellow programmers describe him as very bright and creative and, more importantly in this situation, able to think on his feet. The problem he faced was that as soon as the LM began powered descent, the computer would begin monitoring the abort switch several times a second and would stop the landing if it detected the abort signal.
One solution immediately presented itself: turn the monitor off so that the computer would not detect an abort request. The abort monitor is controlled by a single binary digit in a 15-bit flag-word. That bit controls the state of the monitor—1 means the monitor is enabled, 0 means disabled. At the ignition of the descent engines, the bit is set to 1; to disable it would require that Mitchell key in software commands. But Mitchell would have to wait until the monitor was turned on at ignition to key in the workaround. That was deemed unjustifiably risky: If the random abort signal surfaced during the time that Mitchell was punching the keys, the mission would abort.
Eyles had to figure out a way to disable the abort monitor so that it had no period of sensitivity during which the random signal would cause an abort. And he had to work fast. Antares would make an extra orbit of the moon, lasting less than two hours, but further delays would jeopardize the mission.
As Eyles told me this story in a small conference room at the Draper Laboratory (formerly the Instrumentation Laboratory), where he now works on software for the space station, the voices of shuttle astronauts and mission controllers could be heard in the background, piped in so that laboratory personnel can monitor a mission if necessary. Eyles opened the bound volume of the Apollo software listing to the page that contained the abort monitor—code he wrote more than 20 years ago. The entire routine took only 24 lines of assembly language code.
“I saw that the monitor would not function once it saw that an abort had been called for,” Eyles says. “So I designed a procedure to set an indicator—called the mode register—to read as if the abort program were under way, so that the monitor would no longer check the state of the abort switch.” After all, why continue to check for the abort signal after an abort has commenced? In the short time he had, Eyles wrote the workaround, ran it on a simulator at MIT to see if it worked (the first attempt didn’t), and read it to mission control for more tests. Eyles says he did not feel an unusual amount of pressure. “It was one of those adrenaline moments,” he says when pressed.
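A much-simplified paraphrase of the routine and the workaround—written here in modern Python, with an invented bit position, invented program numbers, and invented names, nothing like the 24 lines of Apollo assembly—shows the logic Eyles exploited:

```python
ABORT_ENABLE_BIT = 1 << 10   # hypothetical position of the "monitor enabled" bit
flagword = 0                 # the 15-bit flag word; the bit is set at engine ignition
mode_register = 63           # invented code meaning "landing program running"

def abort_monitor(switch_says_abort):
    """One pass of the monitor, run several times a second during descent."""
    if not (flagword & ABORT_ENABLE_BIT):
        return "monitor disabled"
    if mode_register in (70, 71):            # invented codes for the abort programs
        return "abort already under way; stop watching the switch"
    if switch_says_abort:
        return "ABORT: begin ascent and rendezvous sequence"
    return "continue the landing"

# Eyles's workaround, keyed in by Mitchell before ignition: make the mode
# register read as if an abort were already in progress, so the monitor
# never consults the flaky switch.
mode_register = 71
flagword |= ABORT_ENABLE_BIT                   # ignition enables the monitor as usual
print(abort_monitor(switch_says_abort=True))   # the false signal is now ignored
```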
When the LM came around the moon, with about 15 minutes before the engine burn to begin powered descent, the capsule communicator read the procedure to Mitchell, who keyed it in. The fix worked flawlessly.
Apollo, with its single computer, followed a philosophy of attempting recovery from any failure. The space shuttle borrowed some of Apollo’s mechanisms of fault tolerance but added others. First, there are four identical guidance and navigation computers on the shuttle, to guard against hardware failures. If one computer gives a solution that differs from the rest of the pack, the astronauts assume a failure and turn it off. Second, there is a backup—a fifth computer running independent software capable of managing ascent, abort, and reentry. The backup protects against a software bug affecting the four primary computers.
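The voting idea itself is simple, as the sketch below suggests (modern Python, invented numbers; the shuttle’s actual redundancy management is far more elaborate):

```python
# Majority voting among redundant computers -- an illustration only.
from collections import Counter

def vote(solutions):
    """Return the majority answer and any computer that disagrees with it."""
    majority, _ = Counter(solutions.values()).most_common(1)[0]
    suspects = [name for name, value in solutions.items() if value != majority]
    return majority, suspects

answer, suspects = vote({"computer 1": 101.37, "computer 2": 101.37,
                         "computer 3": 101.37, "computer 4": 99.12})
print(answer, suspects)   # 101.37 ['computer 4'] -- the odd one out is assumed failed
```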
To simplify the task of writing and revising software for the space shuttle, NASA hired Intermetrics, a Boston company, to create a high-order language, HAL/S (only coincidentally similar to the name of the homicidal computer in 2001). Another program, a compiler, translates HAL/S into machine code for the computer to execute.
To inspect the software, engineers no longer scan lists of assembly commands that perform the same function as pushing the buttons on a calculator. Instead they read expressions based on the logic of the English language and can recognize mistakes and inconsistencies more easily. “The code isn’t as tight,” says John Garman. “The programs run slower and take up more space. But the advent of faster computers with more memory made the use of high-order languages possible.
“That’s one of the reasons word processing software is so rich and user-friendly. It runs slow and takes up tons of memory. But if you want to change the heading or the margins on a document, you make one change instead of one for each page.”
Teams of programmers still inspect the software in discrete stages against a checklist carved in stone: first to make sure it is asking the computer to perform the calculations that the programmers want it to perform, then to make sure that data the computer retrieves from other sources for the calculations are current, and so on.
“We realize people are human and humans are going to make mistakes,” says a former IBM manager responsible for shuttle software development and maintenance. Today Loral Corporation has that contract. “You have to design a process that looks for mistakes it assumes are there. You have to put enough eyes and people to prevent single-source failures. The chance of six people looking at the same code and missing an error is much less than one person missing the error.”
Add layers of simulations to the inspections and it’s hard to understand how errors creep through. “Errors of rare occurrence—those are the ones that drive you crazy,” says Dan Lickly, one of the key members of MIT’s instrumentation lab during Apollo days. “You may simulate thousands of times and not hit the error.” A rare one surfaced during Endeavour’s 1992 mission to rescue Intelsat VI.
In preparation for the rendezvous, the shuttle computers were calculating when and how long the rockets of the Orbital Maneuvering System should fire. The procedure is for the computer to calculate the burn several times before the actual firing. As the shuttle gets closer in time and space to the satellite, the calculations become more accurate.
For each targeting calculation, the computer runs 10 iterations of the equations to find the answer that will put the shuttle within the desired distance of its target. Software designers built in a limit to the number of iterations, however, to avoid an infinite loop. If the desired distance isn’t computed within 10 iterations, the computer reports that it “failed to converge,” precisely the message that Commander Dan Brandenstein received before one of the burns in his attempt to rendezvous. NASA took an extra orbit to sort things out and eventually used a solution calculated by a ground computer.
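A guard like that takes only a few lines, as in this sketch (modern Python, invented numbers and logic—not the shuttle’s targeting equations):

```python
# Illustrative only: an iteration cap keeps a targeting loop from running forever.
MAX_ITERATIONS = 10
TOLERANCE = 0.001

def target_burn(miss_distance, burn=0.0):
    """Refine a burn estimate; report failure rather than loop indefinitely."""
    for _ in range(MAX_ITERATIONS):
        miss = miss_distance(burn)
        if abs(miss) < TOLERANCE:
            return burn                       # close enough: converged
        burn += miss                          # crude correction toward the target
    raise RuntimeError("failed to converge")  # the message Brandenstein received

# Invented example: each unit of burn removes one unit of miss distance.
print(target_burn(lambda b: 4.0 - b))         # converges to 4.0 within the limit
```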
NASA’s Robert Hinson says the genesis of this problem dates to the early 1970s, when programmers were writing code for computers with only 60K of memory. (By the time the shuttle flew, the computers’ memory had increased to 106K. It has since been upgraded to 256K.) As a result of this ceiling, the intermediate results of some calculations could be stored with only limited precision—seven significant digits, for example, instead of the 14 of double precision. A hand-held calculator has a similar limit: only so many digits fit in its display. Computer experts agreed that some results would require double precision and that the calculations for rendezvous should be programmed to use some of each.
Although mixed-mode arithmetic had not been a problem on any previous rendezvous—indeed, the routine was thought to be sufficient for all sets of numbers—the specific mix of numbers the computer tried to crunch in this instance made it want to keep trying. Calculations comparing where the shuttle wanted to be with where it could be after a given burn looked equal in one part of the computation but not in another: one part concluded the computer had converged, while the other thought it should keep trying. The numbers were so close together that the algorithm broke down.
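A toy example shows how such a trap can arise (modern Python, arbitrary values chosen only to make the disagreement visible—not the shuttle’s arithmetic): when one copy of a number has been squeezed into single precision, two tests of “are we there yet?” can give different answers.

```python
import struct

def single(x):
    """Round a double-precision value to single precision (about 7 digits)."""
    return struct.unpack("f", struct.pack("f", x))[0]

desired    = 0.1 + 0.2   # where the shuttle wants to be
achievable = 0.3         # where a candidate burn would put it

print(single(desired) == single(achievable))   # True  -- this part says "converged"
print(desired == achievable)                   # False -- this part says "keep trying"
```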
It was such a rare situation that NASA did not require IBM to rush to fix it but waited until the next major computer program release, completed in 1993. For that release, programmers converted the entire set of those calculations to double precision.
Not every error discovered in the software is corrected by changing the code. According to John Garman, it’s safer not to fix certain “benign anomalies” once they’ve been discovered, since “you generally introduce a bug for every few you correct.” For this reason, on every shuttle mission the astronauts fly with a set of footnotes to the software, describing various bugs and how to work around them.
These are the bugs they know about. Since the shuttle resumed operations in 1988 following the Challenger accident, only one error that was the result of a coding deficiency slipped through. The crew didn’t notice it during the flight, but analysts at NASA found it by studying the telemetry afterward. It was a benign error; a notice to the crew appeared twice instead of once on their computer screens. But it rattled the programmers. They knew that any error could be dangerous. That this one was insignificant was a matter of luck.
The exhaustive process of scanning code for errors, testing, and simulating continues as the shuttles are fitted with new altimeters and cockpit instruments, upgraded to automatically integrate navigation information from the Global Positioning System, and adapted to dock with the Russian space station Mir. Almost every hardware change requires a software change, and for every software change there are dozens of ways that the comp0ter1 c0uld s#5dc e41010001ej xuhy2 18&89j4.
While it is customary to accept responsibility for published errors, Billy Goodman prefers to lay the blame on his software.