Duplicate code

Duplicate code is one form of technical debt that causes problems where you might not expect them.

As a simple example, let’s take email address validation. An email address is the type of thing that seems simple, but is actually devilishly complex. Here is a validator that conforms very strictly to the actual RFC: it is almost 7k of code, including explanations so you can figure out what it is doing.

I happen to know about this regex because it was written by a friend and coworker, Sandeep Tamhankar. He wrote it years ago just to prove a point. In the real world, almost no one ever uses Sandeep’s regex. They’ll invent their own version to validate email addresses, or they’ll avoid regular expressions entirely. The point is that there are a huge number of ways to do something very basic, like deciding if a piece of text is an email address or not.
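To make that concrete, here is a sketch of the pragmatic end of the spectrum. This is not Sandeep’s RFC-strict regex (which runs to thousands of characters); it is a deliberately simple, home-grown check of the kind most people actually write, and the function name is my own invention.

```python
import re

# A deliberately simple, pragmatic email check -- one of countless
# home-grown variants, nowhere near RFC-strict: something before the
# "@", something after it, and at least one dot in the domain.
SIMPLE_EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def looks_like_email(text: str) -> bool:
    """Return True if the text superficially resembles an email address."""
    return SIMPLE_EMAIL.match(text) is not None

print(looks_like_email("alice@example.com"))  # True
print(looks_like_email("not-an-email"))       # False
```

A check like this accepts plenty of invalid addresses and rejects some valid ones, which is exactly why so many incompatible variants of it end up scattered across codebases.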

This brings me to something that happened the other day. I was doing a push of some new code, and it required a migration of some data into a new model. Everything was going well, until the script stopped. The script was idempotent, so I knew it was safe to run again. As a first shot at solving the problem, I just re-ran it. No luck.
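For readers unfamiliar with the term: an idempotent migration is one that can be run any number of times with the same end result. A minimal sketch of the idea, with hypothetical names (this is not the actual script):

```python
# Hypothetical sketch of an idempotent migration step: each record is
# migrated only if it isn't already in the new store, so re-running
# after a failure is always safe.
def transform(record):
    # Normalize the record into the new model's shape.
    return {"email": record["email"].strip().lower()}

def migrate(records, new_store):
    for record in records:
        if record["id"] in new_store:
            continue  # already migrated on a previous run; skip it
        new_store[record["id"]] = transform(record)
    return new_store
```

Because already-migrated records are skipped, a second run is a no-op, which is what made “just re-run it” a safe first attempt.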

Now, these things happen. No problem. I dug into the migration, looked around at the code, and after a while, I found out that the new code to store the data was rejecting the record because it said that a field that should have been an email wasn’t.

This was particularly strange, because the data was coming from our own system, not an import from a third party. We already knew these were email addresses. There should have been no problem. But sure enough, there was an error staring me in the face saying “nope, it’s not an email.”

I took a look at the record, and there definitely was something unusual. Over the past few years, there has been a proliferation of TLDs (top-level domains) in use. We used to have good ol’ .com, .net, .org, .gov, .edu, and a bunch of 2-letter country codes. Now we have a lot more.

It turned out that in our system, there was more than one way to validate an email address. Some email addresses had made it into our system past one type of check, but were causing trouble as we tried to move them to another part of the system, where a different type of check was being done.
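The failure mode is easy to reproduce. Here is a hypothetical pair of validators of the kind that drift apart inside one codebase (these are illustrative, not our actual checks): the first caps the TLD at four characters, which was fine for .com or .info but rejects newer TLDs.

```python
import re

# Two hypothetical validators that drifted apart in one codebase.
# strict_tld assumes TLDs are 2-4 letters; lenient_tld does not.
strict_tld = re.compile(r"^[^@\s]+@[^@\s]+\.[a-z]{2,4}$")
lenient_tld = re.compile(r"^[^@\s]+@[^@\s]+\.[a-z]{2,}$")

addr = "ansel@example.photography"
print(bool(lenient_tld.match(addr)))  # True: sails past the first check
print(bool(strict_tld.match(addr)))   # False: rejected by the second
```

The same address passes one check on the way into the system and fails the other on the way out, which is exactly the shape of the bug that stopped our migration.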

This highlights a risk from technical debt caused by code duplication. Nearly every software engineer has heard the expression “don’t repeat yourself” or DRY. Obviously our code failed that test (although this instance was quickly fixed so that we could finish the migration).

This is the impact of non-DRY code, and it captures the link to technical debt. Something as simple as validating an email address stopped us from deploying new features and bugfixes, because we had to investigate and resolve a failure that existed because the second engineer wasn’t familiar enough with the code to know that someone else had already solved the problem, and that they could reuse the first solution.

What would have happened if this weren’t as simple as an email address? What if it were a complex system process?

The lesson? Software engineers should always be mindful of duplicated functionality, and when it is found, refactor it into common modules whenever possible. The code ends up cleaner and more maintainable, and uptime improves.
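A minimal sketch of that refactoring, with hypothetical module and function names: all the duplicated checks collapse into one shared definition that every part of the system imports.

```python
# validators.py -- a hypothetical shared module. The signup form, the
# import pipeline, and the migration script all import this one
# function, so they can never disagree about what counts as an email.
import re

_EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def is_valid_email(text: str) -> bool:
    """The single, shared definition of 'is this an email address?'"""
    return _EMAIL.match(text) is not None
```

Whether this particular regex is right matters less than the structure: if the rule needs to change for a new TLD, it changes in exactly one place.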

Reddit’s Technical Debt

Reddit is having some serious problems. Freedom of speech vs hate groups vs plain old jerks vs making some money. The (former as of last week) CEO, Ellen Pao, had a tough challenge ahead of her when she joined, and I respect her for trying to do difficult things.

The final straw for many redditors (aka: Reddit users/volunteer moderators) was when a Reddit employee was let go. This particular Reddit employee was well-liked, and many redditors felt it was the breaking point. They shut down some of the most popular parts of the site and ended up grabbing a lot of press.

It’s worth pointing out that there was already an unhealthy relationship between the executives in charge of Reddit (the corporate structure) and the volunteer moderators who ran the site for love, not money.

To be fair, before this strained relationship, Reddit originally provided a good place for a community to form. Facebook is the face of “social media” today, but there are an enormous number of different online communities, and there always have been.

Communications technology has always been about the idea of making content available to others and providing mechanisms for people to discuss that content. Before Facebook and Friendster, there were forums, and before that there were BBSs.

There are a few things needed for a good electronic community. Let’s use Facebook as an example.

One component is the core team — Mark Zuckerberg, Sheryl Sandberg, and the thousands of people at Facebook. Another component is the visitors to the community — you, me, and the more than 1 billion people that show up every month. Finally, there is the platform itself. There is an enormous amount of infrastructure that combines to make the thing we call “Facebook” work.

This infrastructure part (both hardware and software) isn’t always discussed. I think part of it has to do with the fact that it is hugely complicated. The plumbing required to manage the data for a billion people each month is quite literally beyond the scope of any one person’s ability to understand it.

(I think there’s a good argument that the infrastructure of Facebook is as complex as a human body, DNA and psychology included. If you think that’s exaggerated, I suggest you consider that a lot of the added difficulty we have with bodies comes from us having to reverse engineer the blueprints, whereas with Facebook, the blueprints are all there, recently created by people, using language that many of us understand.)

Again, this stuff isn’t frequently discussed. Nine out of ten conversations about Facebook aren’t about its code or its servers, yet the code and servers are an essential part of the story.

Reddit has a big infrastructure too, but much like Facebook, nearly all conversations about Reddit aren’t about their code or their servers. Reddit was started back in 2005, and if we make some reasonable assumptions about coding practices, we can assume that there is a serious chunk of technical debt after a decade of work.

One part of Reddit’s infrastructure that (presumably) contains technical debt is its community moderation tools. These are the features that make it easier for a small number of moderators to manage a large number of visitors. One example is an auto-moderator bot that moderators can train to enforce community policies.
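As a purely illustrative guess at the shape of such a bot (this is not Reddit’s actual AutoModerator, and all names here are invented): moderators declare rules, and the bot applies them to each new post.

```python
import re

# Hypothetical rule-based auto-moderator: moderators supply patterns
# and actions; the bot checks each post against them in order.
rules = [
    {"pattern": re.compile(r"buy (cheap|now)", re.I), "action": "remove"},
    {"pattern": re.compile(r"free money", re.I), "action": "flag"},
]

def moderate(post_text):
    for rule in rules:
        if rule["pattern"].search(post_text):
            return rule["action"]
    return "approve"

print(moderate("Buy now, limited offer!"))  # "remove"
```

Even a toy version makes the failure mode visible: an overly broad pattern silently removes legitimate posts, and to the community it looks like a human decision.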

During the recent dust-up at Reddit, Ellen Pao posted about the relationship with moderators, which I’ve screenshotted below.

Ellen Pao on Reddit's Technical Debt

There were several other comments in other threads about problems with the auto-moderator, where it was enforcing rules at undesirable moments, giving the illusion that the core executive team and moderators were abusing their power to control content on the site. Even though it may have been automatic systems inadvertently changing content with no human intervention — the humans were blamed.

Given Ellen’s comments about a monolithic infrastructure, it is reasonable to assume that there was so much technical debt that they were unable to fix some of these infrastructure problems in a timeframe that would keep the community happy.

Would Reddit be a Garden of Eden if it weren’t for technical debt? Probably not. There were genuine hate groups on the site posting offensive material. There were some awful examples of behavior — people who would drop in on mental health communities and encourage people to kill themselves. There were people who would find high-profile media celebrities and share personal information. The world has jerks, and some parts of Reddit were a home for these jerks.

Still, it’s worth wondering — what if there were better tools for finding hateful material and hiding it? What if there were better tools for banning abusive users from the community? What if the system had been easy enough to change to keep up with the way people were using it?

As they say, the medium is the message. What happens when the medium is a flawed system?

Case Study: Toyota and Unintended Acceleration

The effects of technical debt are sometimes difficult to demonstrate. Code that is not just buggy, but systemically bad, is a different level of problem. When code is required to operate under serious constraints, good engineering practices need to be applied in a serious and systematic way.

Modern cars are filled with electronics to keep them running, including coordinating core vehicle services like acceleration and braking. In the mid-2000s, some Toyota drivers began reporting that their car was accelerating without them touching the gas pedal, and in some cases, even while they were stepping on the brake pedal.

Some incidents were attributed to drivers mistakenly pressing the gas pedal, or some minor physical issues in the car, such as floor mats that slid and pressed the gas pedal. Eventually, some of the issues were traced back to the software in the car.

There was a big problem with the software in the car. Actually, there were big problems. Two software experts, Phillip Koopman and Michael Barr, analyzed the source code and found errors including memory corruption, processes that died and disabled safety systems, single points of failure, bad memory handling, and thousands of global variables.

That last one is particularly egregious in my opinion. Some of the other problems they found were bad, but they were the type of problems that are difficult and require specialized knowledge. I’m not suggesting that software engineers who design critical systems shouldn’t have that specialized knowledge — I’m saying that at least those problems were difficult ones.

Global variables are a subject that gets discussed in nearly any freshman-level software engineering course. They’re widely known among modern software engineers to be a risky practice — the term we use to describe their presence is that they are a “code smell.”

Put another way, global variables may not have robbed the bank, but they were spotted with large duffle bags, speeding away from the scene (presumably in a Toyota). It’s fair to ask what they’re doing there.

The presence of thousands of global variables in the car’s software implicates not just the software, but the process used to make it. My sense is that the system was so bad that it was actually impossible to fix in that environment. Producing good work requires having standards, both within the engineering team and in the organization’s expectations for how it works with the engineering team.

This type of technical debt is unmanaged. It is not the sort of debt that is accrued by an organization that understands why it is there. It is systemic, and because the system was designed inappropriately, it was not possible to make meaningful improvements. One assessment said that the code was unmaintainable, and that because of this, even if “your car has the latest version of the [software],” that “doesn’t mean it is safer necessarily than the older one.” [1]

To me, the situation evokes the Challenger disaster. The root cause ended up being O-rings that would fail under certain circumstances. In retrospect, the cause of the problem was shockingly simple — Richard Feynman (the noted physicist) was able to reproduce the problem with a glass of ice water. There are signs that engineers in both organizations knew that there were problems, but that the organization as a whole failed to respond in an appropriate way.

Improving a large codebase for an embedded system presents many challenges of its own. How would you approach improving a system like this?

For more reading: