Guest Essay
It's not Rocket Science
An updated version of this essay appears in the Nothing but the Truth book.
The losses of the Columbia and Challenger Shuttles were not just technical failures - they were also the result of a culture at NASA that devalued safety. As we look at the Shuttle failures, how can we be sure that labs aren't doing the same thing?
Lessons from the Columbia and Challenger Disasters
- The simple cause
- Budgetary and Schedule Pressures - "Faster, Better, Cheaper"
- The Normalization of Deviance - A Broken/Silent Safety Culture
- Is there a Columbia/Challenger in our future?
- References
In a time of crisis and war, the disintegration of the Space Shuttle Columbia on February 1st, 2003, was yet another tragedy in a series of misfortunes that plagued the US. But even this horrible event was quickly overshadowed by the war in Iraq and the subsequent bloody occupation. Since that time, analysis of how and why the Shuttle failed has been lost in the shuffle, drowned out by stories of war, politics, budget shortfalls, and the usual celebrity nonsense that dominates the news. As a result, the names of Rick D. Husband, William C. McCool, Michael P. Anderson, David M. Brown, Kalpana Chawla, Laurel Blair Salton Clark, and Ilan Ramon are not likely to enter the public consciousness.
The Columbia disaster not only took the lives of seven individuals who had dedicated their lives to the space program and science; it may also have brought an end to the aggressive US pursuit of manned space flight. In light of the problems revealed within NASA and the resources needed to fix them, the entire NASA program may wither on the vine for lack of funds and political will.
The Columbia Accident Investigation Board released its report on August 26th, 2003. This report is a document of extraordinary analysis and introspection. The investigatory board refused to confine its report to a single, simple failure. Instead, it conducted a thorough analysis of not only the technical flaws, but also the organizational flaws that allowed the technical errors to occur. As it turns out, the management culture at NASA was as much to blame for the Shuttle loss as the actual foam strike that occurred 81.7 seconds after liftoff. Reading through the report, you can find disturbing similarities between the culture at NASA and the prevailing culture found in the healthcare industry.
“Many accident investigations make the same mistake in defining causes. They identify the widget that broke or malfunctioned, then locate the person most closely connected with the technical failure: the engineer who miscalculated an analysis, the operator who missed signals or pulled the wrong switches, the supervisor who failed to listen, or the manager who made bad decisions. When causal chains are limited to technical flaws and individual failures, the ensuing responses aimed at preventing a similar event in the future are equally limited: they aim to fix the technical problem and replace or retrain the individual responsible. Such corrections lead to a misguided and potentially disastrous belief that the underlying problem has been solved.” [1]
“It's not rocket science” used to be a common phrase in our vernacular, a way to note the amazing accomplishments of our space program while decrying the problems here on earth that have simpler solutions and yet remain unsolved. Now, unfortunately, it seems that even our rocket scientists aren't what they used to be. And when we claim that our industry isn't rocket science, we may be trying to distance ourselves from their failures.
But there are hard lessons to be learned from the Columbia disaster, not only for NASA, but for any industry. We in healthcare would do well to study the reasons for this failure and the implications for patient safety. We ignore them at our peril and at the peril of the patients we serve.
The simple cause
On February 1st, 2003, the Space Shuttle Columbia disintegrated during re-entry, just minutes from its planned landing. The Columbia Accident Investigation Report, released on August 26th of the same year, detailed the specific causes of the accident:
"The physical cause of the loss of Columbia and its crew was a breach in the Thermal Protection System on the leading edge of the left wing, caused by a piece of insulating foam which separated from the left bipod ramp section of the External Tank at 81.7 seconds after launch, and struck the wing in the vicinity of the lower half of Reinforced Carbon-Carbon panel number 8. During re-entry this breach in the Thermal Protection System allowed superheated air to penetrate through the leading edge insulation and progressively melt the aluminum structure of the left wing, resulting in a weakening of the structure until increasing aerodynamic forces caused loss of control, failure of the wing, and break-up of the Orbiter. This breakup occurred in a flight regime in which, given the current design of the Orbiter, there was no possibility for the crew to survive."[2]
Seventeen years earlier, the Space Shuttle Challenger blew up during launch due to the failure of O-ring seals on the joints of the booster rockets.
From a simple technical perspective, the foam strike and the O-ring failure share few features. However, the contexts of the failures share much in common. Most strikingly, NASA was aware of both of these flaws well before the fatal flights of Challenger and Columbia. Before the Challenger explosion, there had been nine previous O-ring failures. Before the Columbia disaster, there had been at least seven previous foam strikes. [3] Yet despite knowledge of these flaws, in both cases management continued to let the Shuttles fly. In fact, even after concern was raised about the foam strike on the last flight of Columbia, management continued to believe that there was no threat to the safety of the mission.
Budgetary and Schedule Pressures
From 1992 to 2000, “Faster, Better, Cheaper” was the mantra espoused by NASA and its head administrator, Dan Goldin. While politicians hailed the success of the space program, they cut its budget by 40%. [4] Without a clear mandate, NASA had to rob from the Shuttle program to fund other projects like the International Space Station, even as the station required more Shuttle flights to build it.
Under funding pressure, NASA began outsourcing much of its work to contractors and simultaneously began to cut its safety program. It was assumed that safety oversight could be reduced because the contractors would assume responsibility for safety. Multiple job titles in the safety program were assigned to the same person. The remaining safety program employees found their salaries dependent upon the very programs they were supposed to oversee, leading to an inevitable conflict of interest.
In public, NASA officials declared over and over again the importance of safety. However, the board found that “personnel cutbacks sent other signals. Streamlining and downsizing, which scarcely go unnoticed by employees, convey a message that efficiency is an important goal….When paired with the ‘faster, better, cheaper’ NASA motto of the 1990s and cuts that dramatically decreased safety personnel, efficiency becomes a strong signal and safety a weak one.” [5]
Sally Ride, former astronaut and the first American woman in space, participated in the Challenger and Columbia accident investigations. About the budget pressures at NASA, she had this to say: “‘Faster, better, cheaper,’ when applied to the human space program, was not a productive concept. It was a false economy. It's very difficult to have all three simultaneously. Pick your favorite two. With human space flight, you'd better add the word ‘safety’ in there, too…” [6]
Because of funding problems and the political pressure associated with funding, NASA had to continually demonstrate that it was delivering value for the investment. Launches became a concrete way of showing Congress that the billions spent on the space program were worthwhile. “NASA was transformed from a research and development agency to more of a business, with schedules, production pressures, deadlines, and cost efficiency goals elevated to the level of technical innovation and safety goals.” [7]
The top levels of NASA soon began to be occupied by business managers instead of technical engineers. The Shuttle had once been termed a “developmental” vehicle, which meant that it could fly but that it wasn't quite ready for prime time. Under the new management culture, the Shuttle became an “operational” vehicle, which meant the goal was to squeeze out as much operation (i.e., launches) as possible. As a result, small flaws like O-ring erosion and foam hits were not seen as serious dangers to Shuttle flight and were tolerated as part of routine operations. Since these weren't “serious” flaws, and fixing them would delay flights, the Shuttle flew with them. “Scarce resources went to problems that were defined as more serious, rather than to foam strikes or O-ring erosion.” [8] Management fully intended to fix the minor flaws eventually, but only after the launch schedules were met.
Again, Sally Ride provides insight: “….if upper management is going ‘faster, better, cheaper,’ that percolates down, and it puts the emphasis on meeting schedules and improving the way that you do things and on cost. And over the years, it provides the impression that budget and schedule are the most important things.” [9]
It’s not rocket science…
- Has healthcare entered an era where efficiency and cost are valued more than quality?
- Has management traded technical expertise for business training, rather than adding business training to technical training, in an effort to improve efficiency and cost?
- As laboratories outsource method validation responsibilities and other tasks, do we assume that our vendors are providing us with the quality we need? If our vendors don’t, do we maintain the skills to tell the difference and perform the necessary services ourselves?
- Is the approach for “Doing more with Less” in the laboratory any different from the “faster, cheaper, better” mantra? Do the physicians want tests faster or better? Does the healthcare organization want the laboratory to operate cheaper or better?
- Do we know how the cost savings from "faster and cheaper" compare to the cost of rework due to poor quality test results?
- Have medical laboratory scientists been replaced by less skilled technical personnel? Do the upper levels of management understand the quality control issues in the lab, or just know enough to speak the words?
The Normalization of Deviance
Both the Challenger and Columbia accident investigation boards asked similar questions: Why did NASA continue to fly the Shuttle with known foam debris problems that dated back years before the fatal Columbia launch? And why did NASA continue to fly the Shuttle with known O-ring erosion problems that dated back years before the Challenger launch?
The answer is that these errors had been “normalized” over many occurrences, until managers and even the engineers themselves began to believe that these flaws were routine and acceptable. Diane Vaughan, in her exhaustive book, The Challenger Launch Decision (University of Chicago Press, 1996), coined a telling phrase for this behavior: the “Normalization of Deviance.”
When the Shuttle was originally designed, no allowance was made for the possibility that foam debris could fall off the main tank and strike the wing. Nor was any allowance made for the possibility that in cold temperatures, the O-rings on the booster rockets would shrink and erode. When these events were first experienced, the design principles were therefore violated, “but in both cases after the first incident the engineering analysis concluded that the design could tolerate the damage. These engineers decided to implement a temporary fix and/or accept the risk, and fly. For both O-rings and foam, that first decision was a turning point. It established a precedent for accepting, rather than eliminating, these technical deviations.” [10]
As further foam strikes occurred, engineers came to accept them as expected behavior of the Shuttle. The fact that the Shuttle kept flying was seen as further evidence that these errors were acceptable. If a foam strike occurred during a flight, that just proved it wasn't a serious danger, since it didn't bring down the Shuttle. So the errors were no longer even seen as errors. They had become “normalized” - a foam strike was now considered a normal part of a Shuttle lift-off. Over time, larger and larger foam strikes were tolerated, since previous strikes hadn't caused a problem. So when the fatal piece of foam struck the left wing of the Columbia, it was dismissed as a minor issue that would be repaired once the Shuttle landed, despite the fact that it was one of the largest pieces yet to strike a Shuttle.
In effect, the normalization of deviance broke the safety culture at NASA. The agency slid down the slippery slope, tolerating more and more errors, accepting more and more risk. If everything was tolerable, how could anyone object? Management began to demand proof that errors would bring down a Shuttle, instead of making the proper reverse demand: show proof that the Shuttle had NOT been harmed. Without the resources to test and prove that the Shuttle had indeed been harmed by the last foam strike, the remaining safety engineers at NASA were effectively silenced.
During the Challenger investigation, Richard Feynman, the Nobel laureate, famously compared the launching of a Shuttle to a game of Russian roulette. While that overstated the case, it was not far off the mark. Managers at NASA deliberately took a risk. They believed the risk was quite low or zero, but they had not even done the calculations to know how big the risk was. They pushed and pushed the limits without understanding what or where the limits really were. By relentlessly pushing the envelope, they made tragedy almost inevitable.
It’s not rocket science…
- Are we tolerating more errors in the laboratory because we know test results can’t be perfect?
- Has repeating the controls become the common response to any out-of-control event?
- Are control limits being artificially “widened” by use of bottle values or peer-group SDs? (A brief numerical sketch of this effect follows this list.)
- Have we entered an era where analytical errors are no longer considered important? Has the emphasis on pre-analytical and post-analytical errors made us assume that we no longer have to worry about analytical errors?
- Does the ISO move to change error terminology to “uncertainty” obscure the concern and importance of laboratory errors?
- Isn't it easier to accept being "uncertain" than being in error?
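On that question of widened control limits, here is a minimal sketch, using entirely hypothetical numbers, of how a control result reflecting a real shift can be flagged by limits built from a laboratory's own SD yet pass unnoticed when the limits are built from a wider peer-group or bottle-value SD:

def within_limits(result, mean, sd, multiplier=2.0):
    # A result is treated as "in control" if it falls inside mean +/- multiplier * SD.
    return abs(result - mean) <= multiplier * sd

# Hypothetical values, for illustration only
mean = 100.0
within_lab_sd = 2.0     # SD estimated from this laboratory's own control data
peer_group_sd = 4.0     # wider SD taken from a peer group or bottle insert

shifted_result = 105.0  # a control result reflecting a 2.5 within-lab-SD shift

print(within_limits(shifted_result, mean, within_lab_sd))  # False - flagged as out of control
print(within_limits(shifted_result, mean, peer_group_sd))  # True  - accepted as "in control"

The error is the same in both cases; only the yardstick has changed - which is exactly how deviance gets normalized.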
Lessons for the Laboratory: Is there a Columbia/Challenger in our future?
“It is our view that complex systems almost always fail in complex ways, and we believe it would be wrong to reduce the complexities and weaknesses to some simple explanation. Too often, accident investigations blame a failure only on the last step in a complex process, when a more comprehensive understanding of that process could reveal that earlier steps might be equally or even more culpable. In this board’s opinion, unless the technical, organizational and cultural recommendations made in this report are implemented, little will have been accomplished to lessen the chance that another accident will follow.” [11]
We in healthcare can pretend that what happened at NASA can't possibly happen to us. But the healthcare industry shares one core characteristic with NASA: safety. Although it often goes unstated, the primary concern of both healthcare and manned space flight is safety. The very root of medicine, as codified in the ancient Hippocratic Oath, is to Do No Harm. Likewise, when President Kennedy set the supreme goal for NASA, it wasn't just to send a man to the moon; it was to send a man to the moon and return him home safely.
This core concern for safety makes both NASA and healthcare unique among industries. Budgets and deadlines dominate every business, inescapably so, but the failure of most businesses is financial, not fatal. As we have forced the space program and healthcare into the usual business model, we have squeezed out safety. Other businesses can “push the envelope” and fail without serious consequence - a product or service doesn't sell, the company goes out of business, employees and sometimes CEOs lose their jobs. Pushing the envelope in our field can maim or kill people.
The fact still remains that laboratory medicine is not “rocket science,” but the technical sophistication and complexity of the instrumentation and testing processes are ever-increasing. And mounting pressure has been applied to laboratory medicine to produce cheaper and faster results in the guise of satisfying physician demands and patient needs. Shouldn’t correct results be a higher priority?
Reading through the accident report, one finds startling echoes of the cultural failures at NASA in the current trends and attitudes in healthcare. Surely the stakes for healthcare are just as high as for NASA. The Space Shuttle puts dozens of people into space in a year, while millions of people go through the healthcare system every day. As the IOM report warned, somewhere between 44,000 and 98,000 deaths each year can be attributed to the failures of the healthcare system.
The Space Shuttle, as of 2003, had flown 112 missions. Two of those missions ended catastrophically. A gross calculation based on just those two numbers reveals a failure rate of roughly 1.8%, or a Sigma metric of about 3.6 on the Six Sigma scale. We in the laboratory already know that some of our processes have Sigma metrics well below that level. That fact alone should give us pause.
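For readers who want to check that arithmetic, here is a minimal sketch of the calculation, assuming the conventional 1.5-sigma shift allowance used in standard Six Sigma tables (the result moves slightly with rounding and with the convention chosen):

from statistics import NormalDist

# The Shuttle record cited above: 2 catastrophic failures in 112 missions.
failures, missions = 2, 112
defect_rate = failures / missions            # about 0.018, i.e. roughly 1.8%
dpmo = defect_rate * 1_000_000               # about 17,857 defects per million opportunities

# Short-term Sigma metric with the conventional 1.5-sigma shift allowance.
sigma_metric = NormalDist().inv_cdf(1 - defect_rate) + 1.5

print(f"Failure rate: {defect_rate:.1%}")    # 1.8%
print(f"DPMO: {dpmo:,.0f}")                  # 17,857
print(f"Sigma metric: {sigma_metric:.2f}")   # about 3.60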
In healthcare, our failures are not as spectacular as Shuttle explosions, but they are far more prevalent and frequent. They occur over time, in circumstances that obscure their root causes, distribute the harm over large patient populations, and spread the responsibility across multiple healthcare professions. But undeniably, our failures affect many, many people. We would do well to learn from the failures of our rocket scientists and make sure that our own practices don't repeat their mistakes.
References
- Columbia Accident Investigation Board (hereafter CAIB) Report, Volume I, August 2003, p. 177. Available online at http://caib.nasa.gov
- CAIB Report, Executive Summary.
- General Donald Kutyna, quoted in News Analysis, New York Times, 8/27/03, p. 16.
- News Analysis, David E. Sanger, New York Times, 8/27/03, p. 1.
- CAIB Report, p. 199.
- Interview, New York Times, 8/26/03, p. F2.
- Howard E. McCurdy, “The Decay of NASA's Technical Culture,” Space Policy, November 1989, pp. 301-10, referenced on p. 198 of the CAIB Report.
- CAIB Report, p. 200.
- Interview, New York Times, 8/26/03, p. F2.
- CAIB Report, p. 196.
- Ibid., Board Statement.