Why Do Systems Work?
There is no shortage of literature about how systems fail. There are crash reports, and there are detailed forensic analyses of famous disasters involving complex systems such as nuclear reactors, ships, rockets and aircraft. There are millions of words written about why failures happen and how to prevent them. There seems to be a science of failure. Of course I didn’t want to be left out of the party, so I did my bit in that book of mine, where I wrote[1] that failure is never atomic and that it requires a path towards system-level calamity, using the famous analogy of the aligning holes of Gruyère cheese. Succinctly: barring some external, catastrophic event, it always takes more than one single, indivisible step for something complex to fail wholly.
There isn’t much written, though, on why systems work. And yes, I am aware of the initial absurdity of the question. It is like asking medical science to write about “why are people healthy”. Granted, it makes more sense to write about diseases—the off-nominal situations—than about the normal scenarios. In the same way, my wife does not remember when I have done things right but vividly recalls every time I have screwed up.
Unlike living organisms—including wives with good memory—and medicine, the technical artifacts we make are not the result of selfless evolution and ethics; on the contrary, they are subject to market forces, production imperfections and other unholy factors moved by selfish principles from the ideation stage onwards. Asking why artificial things work is not so absurd, as we shall see.
When I got my first job back in 2002 as an “all rounder” in a now-extinct company which made automation and access control systems, I was fortunate to witness the full life cycle of an electronic board in charge of controlling the access of personnel to those old phone service boards which used to be spread around cities. As a young engineer wannabe, I got to see this device go from a scribble on a whiteboard all the way to block diagrams, schematic drawings, embedded software, prototypes and—finally—the actual thing sitting on a table. After the team fixed some initial mistakes here and there, the board software was finally flashed and an LED started blinking to indicate “self check good”. At that point, a strange feeling invaded me, just as if something magical had happened. Sure, the board was not a Boeing 737 in terms of complexity, but it had a fair amount of density, with parts and components coming from different vendors across the world, from the USA to Asia, many different interfaces and protocols, and a sequence of mysterious, human-defined instructions—in short, software—dictating the behavior of the whole thing. I had sat through all the heated meetings where different engineers argued about how they would have designed or implemented some part this way or another. I had seen the budget oscillate, expanding and shrinking. And yet it worked. And it didn’t only work on the “well-behaved” lab benchtop; it also worked in the field: the company managed to install this new controller in hundreds, if not thousands, of phone service boards across the city.
That simple question stayed in my head, and it has been chasing me for, alas, 20 years now: how the hell did that work? The same question assaults me every time I am put in front of any complex system involving a sizable number of components. How on earth do systems with hundreds, thousands or millions of parts manage to turn on and run? I feel that, as engineers, we get too used to the fact that things work, and we do not stop to appreciate how the probabilities are mysteriously playing on our side. What is more, we tend to feed our egos with what is ultimately a fair dose of luck. I somehow refuse to get unemotionally accustomed to seeing complex things work, although I can nervously enjoy it, as I just won’t stop thinking about the underlying probabilities.
I mean, sure, I know why systems are supposed to work in the design domain, where everything is ideal and perfect, where I control most of the variables and where everything abides by the laws of deterministic physics and circuit theory. But complex systems working in production is a different story. Think about it. Myriads of components, each with its own life span, its own variances, its own bathtub curve, its own intrinsic design imperfections, tolerances, bugs, stochastic noises. A Boeing 737—with over 10,000 units produced worldwide—has an average of 600,000 (six hundred thousand) parts[2][3]; the Airbus A380, 4 million[4]. If we can assume no perfect system exists in the world, we can say the same for any subsystem, any sub-subsystem, and so on all the way down to the most elementary part. Nothing is perfect. So, how can function emerge from a collection of imperfect elements hooked together?
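To appreciate why the question is not as silly as it sounds, here is a minimal back-of-the-envelope sketch (a toy model of mine, not taken from any of the sources cited here): if a system strictly required every one of its parts to be flawless at the same time, a naive series-reliability estimate says it would practically never work once the part count reaches airliner territory. The fact that such systems do work tells you that margins, tolerances and redundancy are quietly doing a lot of heavy lifting.

```python
# Toy sketch, not from any cited source: if a system strictly required all
# N parts to be fault-free, its naive series reliability would be r**N,
# which collapses quickly as N grows (independent, identical parts assumed).
def naive_series_reliability(r_part: float, n_parts: int) -> float:
    """Probability that every single part is fault-free."""
    return r_part ** n_parts

for n in (100, 10_000, 600_000):
    print(f"{n:>7,} parts at 99.99% each -> {naive_series_reliability(0.9999, n):.3g}")

# Output (approximate):
#     100 parts at 99.99% each -> 0.99
#  10,000 parts at 99.99% each -> 0.368
# 600,000 parts at 99.99% each -> 8.73e-27   (essentially never)
```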
I am not the only one asking this—a priori—silly question. In this famous blog post, Richard I. Cook (MD), a systems researcher from the University of Chicago, tackles the issue of system failure from more or less the usual perspective. But interestingly, in this video he asks the same question I am asking here, although he unfortunately leaves it somewhat unanswered. Still, he points out a few relevant things:
The surprise is not that there are so many accidents. The surprise is that there are so few.
The real world often offers surprises: the world is not well-behaved.
Even so, a lot of operational settings achieve success. Because of or in spite of designs?
Dr. Cook then goes on to define a divide between what he calls the system-as-imagined and the system-as-found: in short, the schism between how we imagine—design—things and how the actual instances of our designs evolve out in real settings.
When we talk about why systems do or do not work, we can’t really leave human error out of the equation. Although we live in an “age of autonomy” of sorts, we humans remain a critical piece of the puzzle, so human error is very relevant when observing complex system failures. For example, in the early days of flight, approximately 80 percent of accidents were caused by the machine and 20 percent by human error. Today that statistic has reversed: approximately 80 percent of airplane accidents are due to human error (pilots, air traffic controllers, mechanics, etc.) and 20 percent are due to machine failures[5]. These statistics open a few more questions:
Have we humans become sloppier operators in time?
Have systems become more complicated to operate?
Have systems become more intrinsically reliable?
Nothing really indicates that we have become worse operators over time, but systems have indeed become more complicated to operate, and there has certainly been an intrinsic increase in reliability thanks to better materials, better tools and better software.
As I said, no system in the history of humanity has ever been made perfect. There is always a non-zero number of flaws in every system ever made, or yet to be made. Every single plane you and I have boarded or will ever board has a non-zero number of software bugs. The same goes for any of the 441 nuclear reactors currently operating in the world. The question is how relevant those unavoidable flaws are when it comes to bringing the whole system down at once. All the systems we use, fly on and operate show errors at almost any given time. Every vehicle out there is, at this precise moment, experiencing checksum mismatches, data corruption, bit flips, material cracks, bolts getting infinitesimally loose. Be it planes, cars, nuclear reactors or hair dryers. Some of those events may never be reported, perhaps not even noticed or logged, and may be silently corrected by routine maintenance. As for the external factors which can in fact affect the system as a whole and bring it to shambles, we luckily tend to learn: pilots realized at some point that flying through storms is a no-no; transatlantic ship captains noted that icebergs are worth keeping an eye on.
I have criticized complexity ad nauseam. “Keep it simple, stupid”, and other platitudes. Sure, but complexity happens. It is not good to inflate it just because, but complexity is something you can’t fully avoid. You read everywhere that complexity is the enemy of reliability. But is it[6]? The Bleriot XI was systemically very simple compared to a Boeing 767, but is it more reliable? You may think we are comparing apples and oranges, correctly pointing out that the difference in stakes is abysmal: a crash of a Bleriot can kill far fewer people than a crash of a 767. Ok, let me fix that a bit, at least in terms of potential casualties: let’s take a cargo version of the B767, which carries a crew of 2, and compare it with the 2-seat version of the Bleriot. Again, which one is more complex? That’s an easy one. But which one is more reliable? Which one would you choose for a flight on a foggy night?
Coming back to Dr. Cook’s valid question of why we don’t see more accidents, I would phrase it a bit differently. Systems are full of accidents, albeit very small ones. Systems coexist with imperfections and errors. They continuously sustain an organic amount of small internal mishaps and “benign” off-nominal behaviors they manage to live with. As we equip the artifacts we make with more parts and components, the path to systemic, global failure gets longer, more intricate and less combinatorially likely. And we also incorporate the lessons from the big, loud failures of others so they don’t happen to us: every time a complex system fails badly, we all take notes. This loop has allowed for fewer Titanics and Chernobyls.
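To put some toy numbers on that intuition (numbers of mine, purely for illustration, not Dr. Cook’s): if reaching a system-level calamity requires a chain of several independent small faults to line up (the aligning holes of the cheese), the chance of the full alignment shrinks geometrically with the length of the chain.

```python
# Toy illustration with made-up numbers: a disaster needs k independent
# "holes" (small faults / breached defenses) to line up simultaneously,
# each present with probability p.
def disaster_probability(p_hole: float, k_layers: int) -> float:
    """Chance that all k independent layers fail at the same time."""
    return p_hole ** k_layers

for k in range(1, 6):
    print(f"{k} aligned hole(s): {disaster_probability(0.01, k):.0e}")

# 1 aligned hole(s): 1e-02
# 2 aligned hole(s): 1e-04
# ...
# 5 aligned hole(s): 1e-10  -> the longer the path to calamity, the rarer it is
```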
Is it that things work because they are complex, then? Is complexity a valid protection against failure? No. Here comes the third dimension, which should give pause to anyone thinking that over-engineering is the key to protecting themselves from a cataclysm. More complexity exerts more pressure on whoever has to deal with the extra states and transitions it brings, be it a human operator or an algorithm. Any potential reliability increase brought by complexity is offset by an increased risk of operational error. You could get to fly the Bleriot XI or the Wright Flyer after a few tries, but good luck trying to fly an airliner. Again: which one is more reliable? The answer greatly depends on who drives.
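As a purely illustrative sketch of that tension (the curves below are assumptions of mine, not measured data): if the machine-side risk falls as we add sophistication while the operator-side risk grows with the number of states to handle, the combined risk is lowest somewhere in between, not at either extreme.

```python
# Hypothetical curves, only to illustrate the trade-off described above.
def total_failure_risk(complexity: int) -> float:
    machine_risk = 0.10 / (1.0 + complexity)   # assumed: drops as complexity adds margin/redundancy
    operator_risk = 0.002 * complexity         # assumed: grows with the states and transitions to handle
    return machine_risk + operator_risk

best = min(range(1, 51), key=total_failure_risk)
print(f"lowest combined risk at complexity level {best}")  # lands at an intermediate value (~6 here)
```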
Ok, so why do systems work, then? It largely depends on the zoom level at which you look. To an external observer, systems work because their internal faults are not cooperating to “win” the disaster lottery. Under the magnifier, systems are just failing all the time[7].
1. See chapter 2, section 2.10.
2. https://www.boeing.com/farnborough2014/pdf/BCA/fct%20-737%20Family%20Facts.pdf
3. https://investors.boeing.com/investors/fact-sheets/default.aspx
4. https://www.airbus.com/sites/g/files/jlcbta136/files/2021-12/EN-Airbus-A380-Facts-and-Figures-December-2021_0.pdf
5. https://www.boeing.com/commercial/aeromagazine/articles/qtr_2_07/article_03_2.html
6. https://ieeexplore.ieee.org/document/4273892
7. See our world as perhaps the biggest “human-made” (bear with me) system we can think of. For a hypothetical observer sitting on the Moon, the world, more or less, works. Meanwhile, shit—even very serious shit—is happening all the time.