The effects of technical debt are sometimes difficult to demonstrate. Code that is not just buggy, but systemically bad, is a problem on a different level. When code has to operate under serious constraints, good engineering practices need to be applied seriously and systematically.
Modern cars are filled with electronics that keep them running, including systems that coordinate core vehicle functions like acceleration and braking. In the mid-2000s, some Toyota drivers began reporting that their cars were accelerating without them touching the gas pedal, and in some cases even while they were stepping on the brake pedal.
Some incidents were attributed to drivers mistakenly pressing the gas pedal, or to minor physical issues in the car, such as floor mats that slid and pressed the gas pedal. Eventually, some of the issues were traced back to the software in the car.
There was a big problem with the software in the car. Actually, there were big problems. Two software experts, Philip Koopman and Michael Barr, analyzed the source code and found errors that ranged from memory corruption and processes that died and disabled safety systems to single points of failure, bad memory handling, and thousands of global variables.
That last one is particularly egregious in my opinion. The other problems they found were bad, but they were the type of problems that are difficult and require specialized knowledge to avoid. I’m not suggesting that software engineers who design critical systems shouldn’t have that specialized knowledge; I’m saying that at least those problems were difficult ones.
Global variables are a subject that gets discussed in nearly every freshman-level software engineering course. Modern software engineers widely regard them as a risky practice; the term we use for them is “code smell”.
Put another way, global variables may not have robbed the bank, but they were spotted with large duffle bags, speeding away from the scene (presumably in a Toyota). It’s fair to ask what they’re doing there.
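To make the smell concrete, here is a minimal sketch in Python. I chose Python only for brevity: this is not the language or the code from the car, and every name in it (`throttle_percent`, `read_pedal`, `cruise_control_tick`, `Throttle`) is hypothetical. The first version shares one global between two writers; the second gives the value a single owner.

```python
# Hypothetical illustration only. None of these names come from the car's code.

# Version 1: shared global state. Any function, in any file, can change it.
throttle_percent = 0.0


def read_pedal(pedal_position):
    """Set the throttle from a pedal position between 0.0 and 1.0."""
    global throttle_percent
    throttle_percent = pedal_position * 100.0
    return throttle_percent


def cruise_control_tick(adjustment):
    """A second writer that the pedal code knows nothing about."""
    global throttle_percent
    throttle_percent = throttle_percent + adjustment
    return throttle_percent


# Version 2: the state has one owner, and every write goes through one method.
class Throttle:
    def __init__(self):
        self._percent = 0.0

    def request(self, percent, brake_pressed):
        """The single place where limits and the brake rule are enforced."""
        if brake_pressed:
            self._percent = 0.0
        else:
            # Clamp the request to the valid 0..100 range.
            self._percent = max(0.0, min(100.0, percent))
        return self._percent

    @property
    def percent(self):
        return self._percent
```

In the first version, nothing stops the two writers from pushing the value past 100, and finding out who did it means reading every function that can see the variable. In the second, there is exactly one place to look. Now imagine the first version multiplied by thousands of variables.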
The use of thousands of global variables in the car’s software implicates not only the software, but also the process used to make it. My sense is that the system was so bad that it was actually impossible to fix in that environment. Producing good work requires having standards, both within the engineering team and in the organization’s expectations for how it works with the engineering team.
This type of technical debt is unmanaged. It is not the sort of debt that is accrued by an organization that understands why it is there. It is systemic, and because the system was designed inappropriately, it was not possible to make meaningful improvements. One assessment said that the code was unmaintainable, and that because of this, the fact that “your car has the latest version of the [software] doesn’t mean it is safer necessarily than the older one” [1].
To me, the situation evokes the Challenger disaster. The root cause ended up being O-rings that would fail under certain circumstances. In retrospect, the cause of the problem was shockingly simple — Richard Feynman (the noted physicist) was able to reproduce the problem with a glass of ice water. There are signs that engineers in both organizations knew that there were problems, but that the organization as a whole failed to respond in an appropriate way.
Improving a large codebase for an embedded system presents many challenges of its own. How would you approach improving a system like this?
For more reading: