Ending technical debt via code reviews

A quick but valuable technique for keeping technical debt low:

Lots of teams incorporate code review into their process. Some use GitHub’s Pull Request model; others use Gerrit. There are many options, and which one you use is a question for your team, as long as you’re doing code reviews. Code reviews are great! They help with QA, and they create teaching opportunities between team members.

Some teams do very straightforward reviews — each engineer looks at the code and gives feedback according to their own sense of what is good and bad.

This is fine, but there is an opportunity to raise the bar on code reviews.

One technique for making code reviews more valuable is to have a discussion as a team about what you should be focusing on. This doesn’t mean each engineer should give up on their particular flavor of how they look at code, but rather that they should include some of the team goals.

Shared goals turn code reviews into a hook for guiding a codebase in a particular direction. Slowly increasing logging coverage, fixing performance issues, or improving scalability are the sorts of things that are hard to accomplish in a single iteration or two, but keeping them in mind during code reviews helps direct the growth of the codebase over time.

Duplicate code

Duplicate code is one form of technical debt that causes problems where you might not expect them.

As a simple example, let’s take email address validation. Email validation is the type of thing that seems simple, but is actually devilishly complex. For instance, here is a validator that conforms very strictly to the actual RFC, and is almost 7k of code, including explanations so you can figure out what it is doing.

I happen to know about this regex because it was written by a friend and coworker, Sandeep Tamhankar. He wrote it years ago just to prove a point. In the real world, almost no one ever uses Sandeep’s regex. They’ll invent their own version to validate email addresses, or they’ll avoid regular expressions entirely. The point is that there are a huge number of ways to do something very basic, like deciding if a piece of text is an email address or not.

This brings me to something that happened the other day. I was doing a push of some new code, and it required a migration of some data into a new model. Everything was going well, until the script stopped. The script was idempotent, so I knew it was safe to run again. As a first shot at solving the problem, I just re-ran it. No luck.
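An idempotent migration is one you can run any number of times and get the same end state, because it skips work that has already been done. A minimal sketch of the idea (the names and record shape here are hypothetical, not the actual script):

```python
# Sketch of an idempotent migration step: records already moved
# are skipped, so the script is safe to re-run after a crash.
def migrate(records, migrated_ids, store_new):
    """Copy each record into the new model exactly once."""
    for record in records:
        if record["id"] in migrated_ids:
            continue  # already done on a previous run; skip it
        store_new(record)
        migrated_ids.add(record["id"])

# Running the migration twice stores each record only once.
stored = []
done = set()
records = [{"id": 1}, {"id": 2}]
migrate(records, done, stored.append)
migrate(records, done, stored.append)  # second run is a no-op
print(len(stored))  # 2
```

The key design choice is tracking completed work somewhere durable, so a partial run leaves the system in a state the next run can pick up from.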

Now, these things happen. No problem. I dug into the migration, looked around at the code, and after a while, I found out that the new code to store the data was rejecting the record because it said that a field that should have been an email wasn’t.

This was particularly strange, because the data was coming from our own system, not an import from a third party. We already knew these were email addresses. There should have been no problem. But sure enough, there was an error staring me in the face saying “nope, it’s not an email”

I took a look at the record, and there definitely was something unusual. Over the past few years, there has been a proliferation in TLD (top-level domains) in use. We used to have good ol’ .com, .net, .org, .gov, .edu, and a bunch of 2-letter country codes. Now we have a lot more.

It turned out that in our system, there was more than one way to validate an email address. Some email addresses had made it into our system past one type of check, but were causing trouble as we tried to move them to another part of the system, where a different type of check was being done.
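To see how two validators can quietly disagree, here is a hypothetical pair of regexes (not our actual code): one written back when top-level domains were 2–4 characters long, and one that accepts any TLD length. An address with a newer, longer TLD passes one check and fails the other.

```python
import re

# Hypothetical validator A: assumes TLDs are 2-4 letters (the old world).
validator_a = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)*\.[A-Za-z]{2,4}$")

# Hypothetical validator B: accepts any TLD of 2 or more letters.
validator_b = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)*\.[A-Za-z]{2,}$")

address = "hello@example.photography"  # a newer, longer TLD
print(bool(validator_a.match(address)))  # False -- rejected by the old check
print(bool(validator_b.match(address)))  # True -- accepted by the new check
```

Data that enters the system through validator B and is later re-checked by validator A fails, even though nothing about the data changed.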

This highlights a risk from technical debt caused by code duplication. Nearly every software engineer has heard the expression “don’t repeat yourself” or DRY. Obviously our code failed that test (although this instance was quickly fixed so that we could finish the migration).

This is the impact of non-DRY code, and it captures the link to technical debt. Something as simple as validating an email address stopped us from deploying new features and bugfixes, because we had to investigate and resolve a failure that happened only because the second person wasn’t familiar enough with the code to know that someone else had already solved the problem, and that the first solution could be reused.

What would have happened if this weren’t as simple as an email address? What if it had been a complex system process?

The lesson? Software engineers should always be mindful of duplicated functionality, and when it is found, refactor it into common modules whenever possible. The code ends up cleaner and more maintainable, and uptime improves.
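The refactoring itself can be as simple as giving the check one home. A sketch (the module and function names here are illustrative, not from any real codebase):

```python
# validators.py -- hypothetical shared module; the single source of
# truth for email validation. Every part of the system imports this
# function instead of writing its own regex, so a rule change (say,
# a new TLD policy) happens in exactly one place.
import re

_EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")

def is_valid_email(address: str) -> bool:
    """Return True if the address looks like a valid email."""
    return bool(_EMAIL_RE.match(address))

print(is_valid_email("user@example.photography"))  # True
print(is_valid_email("not-an-email"))              # False
```

With one shared validator, data that passes the check at the point of entry will also pass it at every later stage, because every stage is running the same code.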


On July 8, 2015, on an otherwise ordinary day, all United Airlines flights in the US were grounded, the NYSE computers crashed, and WSJ.com was down.

The thing is — it was an ordinary day. Some computers crashed. Except they happened to be major computer systems, and it turns out such systems crash fairly often relative to their importance in our lives. One secret of the modern world is that technology lasts a lot longer than we give it credit for. Large chunks of the modern world still run on code that is older than Mark Zuckerberg.

Zeynep Tufekci, who has had hands-on experience with software that has been running for decades, wrote very eloquently about July 8, 2015:

The big problem we face isn’t coordinated cyber-terrorism, it’s that software sucks. Software sucks for many reasons, all of which go deep, are entangled, and expensive to fix

It’s that simple — the modern world is tangled. Have you ever untangled a string of holiday lights? Imagine that, except there are about 100 strands all connected, tangled, and you have to untangle them without any lights going off.

This is why code stays online — a system gets written, reaches a reasonable level of stability, then gets layers built on top of it. Eventually, so many layers are running on top of it that no one is willing to touch the underlying system, because a change could break everything that depends on it. After a long enough time, many of the original authors have not just left the company, but actually passed away.

This is the complex modern world we live in. Don’t unplug those lights. Why the Great Glitch of July 8th Should Scare You by Zeynep Tufekci.

Choose Boring Technology

I lifted the name of this post from another blog post that I think did an excellent job of addressing the subject of tool choice. Think of this as a cover song.

New needs constantly emerge during software development. Taking an app from being installed on a single server, to handling large amounts of traffic or data, to adding data mining functionality — every new business challenge brings technical changes.

Historically, different companies have had similar technology needs at the same time, and tools show up in order to meet these needs. Typically, one team encounters a problem and builds a tool, then releases it to the public where other teams begin using it.

Actually, this process tends to happen multiple times in parallel. Today, jQuery is a de-facto web framework, but Prototype, MooTools, Dojo, YUI, and many others were created around the same time to address the increasing usage of Javascript in web pages. Any new technical challenge is likely to have many tools that address it.

Choosing new tools is an inflection point when it comes to technical debt. When Javascript libraries first showed up, it would have been remarkably hard to know that jQuery would be the winner. Reliably picking the longest-lasting tool from a new set of options should be thought of as being so hard that we should assume it is impractical.

Picking the wrong tool has serious implications for the pace of work an engineering team can sustain. The more people using a piece of technology, the easier it is to learn about, work with, and hire for. The more obscure the tool, the slower you will work with it, the more failures you will have, and the more compromises you will need to make in hiring new engineers.

These are all serious challenges that need to be dealt with, yet I’ve just told you that it is effectively impossible to pick winners out of tools that exist to solve leading-edge problems. Perhaps a different mindset is needed.

Instead of attempting to pick winners, another approach is to assume that some new tools will need to be brought into the technology stack over time, and that some of those choices will not be optimal. In that case, the goal is to ensure that adoption happens in a manageable way.

An engineer by the name of Dan McKinley spent over 6 years at Etsy. Etsy is a company known for having a high-performing engineering department, and this type of performance comes not only from the code, but from the philosophies of the team.

Dan has talked about one philosophy that comes from this culture — the concept of “innovation tokens”. The idea is that as you choose new pieces of technology to incorporate into a project, some pieces of technology cost you an innovation token.

Every team has a natural number of innovation tokens. Dan suggests starting with about three, and earning extra tokens only once the tech stack is sufficiently stable.

The way a token gets spent is when a decision is made to use a new and interesting piece of technology, instead of something that is well-known.

An example of this is database choice. At the moment, there are many databases on the market: PostgreSQL, MongoDB, CouchDB, Redis, Riak, and many others. Many of them are interesting, but if you want a truly reliable piece of technology, you choose MySQL. MySQL was released in 1995 and has been battle-tested. Everyone knows how to use it. It rarely fails, and when it does, there are standard procedures for recovering from those failures. Configurations and performance are widely known and predictable.

MySQL is not perfect, however, and the other databases exist to solve certain problems better than MySQL can. But they lack the battle testing and broad knowledge base that MySQL has. They carry risk, and that risk will eventually show up as poor performance requiring unexpected changes to the code, slowing down feature development.

To choose a non-MySQL database is to spend an innovation token. By spending that token, you acknowledge that the risk exists, recognize that the engineering plan has only so much budget for handling unexpected situations, and put a limit on how much of that budget this choice is allowed to consume.

One interesting social aspect of this is that the faster technology is changing, the more people will want to use new tools, and that’s when it takes holding strong and not chasing a new trend for technology’s sake. Making a budget is easy; sticking to it is hard. But that’s how you stay out of debt.

Reddit’s Technical Debt

Reddit is having some serious problems. Freedom of speech vs hate groups vs plain old jerks vs making some money. The (former as of last week) CEO, Ellen Pao, had a tough challenge ahead of her when she joined, and I respect her for trying to do difficult things.

The final straw for many redditors (aka: Reddit users/volunteer moderators) was when a Reddit employee was let go. This particular Reddit employee was well-liked, and many redditors felt it was the breaking point. They shut down some of the most popular parts of the site and ended up grabbing a lot of press.

It’s worth pointing out that there was already an unhealthy relationship between the executives in charge of Reddit (the corporate structure) and the volunteer moderators who ran the site for love, not money.

To be fair, before this strained relationship, Reddit originally provided a good place for a community to form. Facebook is the face of “social media” today, but there are an enormous number of different online communities, and there always have been.

Communications technology has always been about the idea of making content available to others and providing mechanisms for people to discuss that content. Before Facebook and Friendster, there were forums, and before that there were BBSs.

There are a few things needed for a good electronic community. Let’s use Facebook as an example.

One component is the core team — Mark Zuckerberg, Sheryl Sandberg, and the thousands of people at Facebook. Another is the visitors to the community — you, me, and the 1+ billion people who show up every month. Finally, there is the platform itself: the enormous amount of infrastructure that combines to make the thing we call “Facebook” work.

This infrastructure part (both hardware and software) isn’t always discussed. I think part of it has to do with the fact that it is hugely complicated. The plumbing required to manage the data for a billion people each month is quite literally beyond the scope of any one person’s ability to understand it.

(I think there’s a good argument that the infrastructure of Facebook is as complex as a human body, DNA and psychology included. If you think that’s exaggerated, I suggest you consider that a lot of the added difficulty we have with bodies comes from us having to reverse engineer the blueprints, where with Facebook, the blueprints are all there, recently created by people, using language that many of us understand.)

Again, this stuff isn’t frequently discussed. Nine out of ten conversations about Facebook aren’t about its code or its servers, yet both are an essential part of what makes Facebook work.

Reddit has a big infrastructure too, but much like Facebook, nearly all conversations about Reddit aren’t about their code or their servers. Reddit was started back in 2005, and if we make some reasonable assumptions about coding practices, we can assume that there is a serious chunk of technical debt after a decade of work.

Part of Reddit’s infrastructure that (presumably) contains technical debt are community moderation tools. These features are what make it easier for a small number of moderators to manage a large number of visitors. One example is an auto-moderator bot that moderators can train to enforce community policies.

During the recent dust-up at Reddit, Ellen Pao posted about the relationship with moderators, which I’ve screenshotted below.

Ellen Pao on Reddit's Technical Debt

There were several other comments in other threads about problems with the auto-moderator, which was enforcing rules at undesirable moments, giving the impression that the core executive team and moderators were abusing their power to control content on the site. Even though it may have been automated systems inadvertently changing content with no human intervention, the humans were blamed.

Given Ellen’s comments about a monolithic infrastructure, it is reasonable to assume that there was so much technical debt that they were unable to fix some of these infrastructure problems in a timeframe that would keep the community happy.

Would Reddit be a Garden of Eden if it weren’t for technical debt? Probably not. There were genuine hate groups on the site posting offensive material. There were some awful examples of behavior — people who would drop in on mental health communities and encourage people to kill themselves. There were people who would find high-profile media celebrities and share personal information. The world has jerks, and some parts of Reddit were a home for these jerks.

Still, it’s worth wondering — what if there were better tools for finding hateful material and hiding it? What if there were better tools for banning abusive users from the community? What if the system had been easy enough to change to keep up with the way people were using it?

As they say, the medium is the message. What happens when the medium is a flawed system?

The Solution for Technical Debt

There is a way out.

Having written dozens (if not 100+) different software project proposals, I can tell you that the question people want answered is “how much does it cost to make it work?”

I think that this is a good question, but it’s not the only thing that should be asked. There’s a second question, “how much does it cost to maintain?”

The way out is knowing that maintenance is not optional.

One rule of thumb is that maintenance averages 60% of a software project’s total lifetime cost. If someone says a piece of software will cost $1,000,000 to build, it will probably cost more than double that over its lifetime. This happens on nearly every project.
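The arithmetic is worth making concrete. Using the 60% rule of thumb, and treating the quoted price as the development share of the lifetime cost:

```python
# If maintenance averages ~60% of lifetime cost, a development
# quote covers only the remaining ~40% of what you'll actually spend.
dev_quote = 1_000_000             # the price someone quotes you
lifetime_cost = dev_quote / 0.40  # implied total over the software's life
maintenance_cost = lifetime_cost * 0.60

print(int(lifetime_cost))     # 2500000 -- over double the quote
print(int(maintenance_cost))  # 1500000 in maintenance alone
```

The exact percentage varies by project; the point is that the quote is the down payment, not the price.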

Maintenance is important, and is usually boring. At the moment, the United States is having an issue where our infrastructure is not being maintained as well as it should be. John Oliver did a great piece on this, making fun of the boring, routine, and essential work of maintenance.

Is maintenance sexy? Not to most people. We know from neuroscience that our brains attend to new things by default: we have to analyze them and decide whether they’re going to help or hurt us. Unless you’ve trained yourself to find long-term maintenance appealing, novelty wins.

Your friends are usually more excited by the new car you bought than the fact that your old one is clean and runs smoothly (until it’s old enough to be a classic).

A new software feature is more exciting than the fact that an old one hasn’t broken, unless you’re having problems.

Any senior software engineer knows that maintenance is important. They embody the idea that maintenance is not optional — it is inevitable — so they plan for it. They write software in a way that allows it to be maintained.

Any non-engineering business leader who has led more than a few software projects knows that maintenance is not optional, so they plan for it. They structure teams and budget in a way that allows it to be maintained.

The way out is believing that maintenance is not optional.

Case Study: Toyota and Unintended Acceleration

The effects of technical debt are sometimes difficult to demonstrate. Code that is not just buggy, but systemically bad, is a different level of problem. When code is required to operate under serious constraints, good engineering practices need to be applied in a serious and systematic way.

Modern cars are filled with electronics to keep them running, including coordinating core vehicle services like acceleration and braking. In the mid-2000s, some Toyota drivers began reporting that their car was accelerating without them touching the gas pedal, and in some cases, while they were even stepping on the brake pedal.

Some incidents were attributed to drivers mistakenly pressing the gas pedal, or some minor physical issues in the car, such as floor mats that slid and pressed the gas pedal. Eventually, some of the issues were traced back to the software in the car.

There was a big problem with the software in the car. Actually, there were big problems. Two software experts, Philip Koopman and Michael Barr, analyzed the source code and found problems including memory corruption, tasks that died and disabled safety systems, single points of failure, unsafe memory handling, and thousands of global variables.

That last one is particularly egregious, in my opinion. Some of the other problems they found were bad, but they were the kind of problems that are genuinely difficult and require specialized knowledge. I’m not suggesting that software engineers who design critical systems shouldn’t have that specialized knowledge — I’m saying that at least those problems were hard ones.

Global variables are a subject that gets discussed in nearly any freshman-level software engineering course. They’re widely known to be a risky practice among any modern software engineer — the term we use to describe their existence is that they are a “code smell”.

Put another way, global variables may not have robbed the bank, but they were spotted with large duffle bags, speeding away from the scene (presumably in a Toyota). It’s fair to ask what they’re doing there.
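To illustrate the smell, here is a hypothetical example in Python (the actual system was embedded C, and nothing here is Toyota’s code). The problem with a global is that any function anywhere can change it, so the system’s behavior depends on hidden history rather than explicit inputs.

```python
# The smell: module-level mutable state that anything may change.
throttle_position = 0  # hypothetical global, for illustration only

def bump_throttle(delta):
    global throttle_position
    throttle_position += delta  # no bounds check, no single audit point

# One alternative: state encapsulated behind a narrow interface,
# so every change flows through one place that enforces the rules.
class Throttle:
    def __init__(self):
        self._position = 0  # percent, 0-100

    def update(self, delta):
        # clamp to a legal range on every change
        self._position = max(0, min(100, self._position + delta))

    @property
    def position(self):
        return self._position

t = Throttle()
t.update(150)   # out-of-range input is clamped
print(t.position)  # 100
```

With thousands of globals, there is no such choke point: any of the system’s tasks could corrupt any of that state, which is exactly the kind of failure the analysis described.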

Thousands of global variables in the car’s software implicate not only the software, but the process used to make it. My sense is that the system was so bad that it was effectively impossible to fix within that environment. Producing good work requires having standards, both within the engineering team and in the organization’s expectations for how it works with the engineering team.

This type of technical debt is unmanaged. It is not the sort of debt that is accrued by an organization that understands why it is there. It is systemic, and because the system was designed inappropriately, it was not possible to make meaningful improvements. One assessment said that the code was unmaintainable, and that because of this, even if “your car has the latest version of the [software] doesn’t mean it is safer necessarily than the older one” 1

To me, the situation evokes the Challenger disaster. The root cause ended up being O-rings that would fail under certain circumstances. In retrospect, the cause of the problem was shockingly simple — Richard Feynman (the noted physicist) was able to reproduce the problem with a glass of ice water. There are signs that engineers in both organizations knew that there were problems, but that the organization as a whole failed to respond in an appropriate way.

Improving a large codebase for an embedded system presents many challenges of its own. How would you approach improving a system like this?

For more reading:

In the beginning…

Hi, I’m Tim.

I’m a software engineer who has been writing code for a very long time. I was lucky enough to have a computer before I was 10 years old, and have always been a voracious learner in the subject.

I’ve also been an eager student of business, also from a young age. Dinner table discussions in my family usually revolved around business issues, which helped stoke my interest.

I think this is why I find the subject of technical debt so fascinating. The very term itself is a combination of two things I enjoy, and it also represents the intersection of the two subjects (as well as blending concepts from many other interesting fields).

If time and money are no object, technical debt is hardly worth discussing. When I write code for my own enjoyment, I can spend as much time as I like crafting it, much as an artist can refine their work until it makes them personally happy.

In the business world, however, time and money are far more limited. This is how technical debt becomes interesting. Technical debt tends to emerge under business pressure, and causes problems at the same time.

Todd Sawicki on Digg's technical debt

Technical debt affects day-to-day issues of getting work done, pushing new features, and keeping customers and employees happy, as well as bigger business issues, including acquisitions (who wants to buy a business only to find out that it is a complete mess under the hood?).

I began writing a book on the subject, and I quickly realized that a better starting point would be a blog.

With this blog, I’m going to explore the (sometimes nuanced) causes, effects, and solutions of the subject. I hope you, dear reader, will enjoy reading, and (since I certainly don’t know everything) will share your stories with me as well. I’m going to enjoy it, and I hope you do too.