Unmanageability

On July 8, 2015, an otherwise ordinary day, all United Airlines flights in the US were grounded, the NYSE’s computers crashed, and WSJ.com went down.

The thing is, it was an ordinary day. Some computers crashed. It’s just that they happened to be major computer systems, and it turns out they crash fairly often relative to their importance in our lives. One secret of the modern world is that technology lasts a lot longer than we give it credit for. Large chunks of the modern world still run on code that is older than Mark Zuckerberg.

Zeynep Tufekci, who has hands-on experience working on software that has been running for decades, wrote eloquently about July 8, 2015:

The big problem we face isn’t coordinated cyber-terrorism, it’s that software sucks. Software sucks for many reasons, all of which go deep, are entangled, and expensive to fix.

It’s that simple: the modern world is tangled. Have you ever untangled a string of holiday lights? Imagine that, except there are about 100 strands all connected and tangled together, and you have to untangle them without any lights going off.

This is why old code stays online. A system gets written, reaches a reasonable level of stability, and then gets layers built on top of it. Eventually, so many layers are running on top of it that no one is willing to touch the underlying system, because changing it could break everything that depends on it. After long enough, many of the original authors may have not just left the company but passed away.

This is the complex modern world we live in. Don’t unplug those lights.

(Source: “Why the Great Glitch of July 8th Should Scare You” by Zeynep Tufekci.)

Choose Boring Technology

I lifted the name of this post from another blog post that I think did an excellent job of addressing the subject of tool choice. Think of this as a cover song.

New needs constantly emerge during software development. Whether it’s moving an app from a single server to handling large amounts of traffic or data, or adding data mining functionality, every new business challenge brings technical changes.

Historically, different companies have had similar technology needs at the same time, and tools show up in order to meet these needs. Typically, one team encounters a problem and builds a tool, then releases it to the public where other teams begin using it.

Actually, this process tends to happen multiple times in parallel. Today, jQuery is the de facto standard JavaScript library, but Prototype, MooTools, Dojo, YUI, and many others were created around the same time to address the increasing use of JavaScript in web pages. Any new technical challenge is likely to have many tools that address it.

Choosing new tools is an inflection point when it comes to technical debt. When JavaScript libraries first showed up, it would have been remarkably hard to know that jQuery would be the winner. Reliably picking the longest-lasting tool from a new set of options is hard enough that we should assume it is impractical.

Picking the wrong tool has serious implications for how quickly an engineering team can get work done. The more people use a piece of technology, the easier it is to learn about, to work with, and to hire people who already have experience with it. The harder a tool is to learn, the more slowly your team works with it, the more failures you’ll have, and the more compromises you’ll need to make when hiring new engineers.

These are all serious challenges that need to be dealt with, yet I’ve just told you that it is effectively impossible to pick winners out of tools that exist to solve leading-edge problems. Perhaps a different mindset is needed.

Instead of trying to pick winners, another approach is to assume that new tools will need to be brought into the technology stack over time, and that some of those choices will not be optimal. In that case, the goal is to ensure that this happens in a manageable way.

An engineer named Dan McKinley spent over six years at Etsy, a company known for its high-performing engineering department. That kind of performance comes not only from the code but from the philosophies of the team.

Dan has talked about one philosophy that comes from this culture: the concept of “innovation tokens.” The idea is that as you choose new pieces of technology to incorporate into a project, some of them cost you an innovation token.

Every team has a limited supply of innovation tokens. Dan suggests starting with three tokens and only acquiring extra ones once the tech stack is sufficiently stable.

A token is spent whenever the team decides to use a new and interesting piece of technology instead of something well known.

An example of this is database choice. At the moment, there are many databases on the market: PostgreSQL, MongoDB, CouchDB, Redis, Riak, and many others. Many of them are interesting, but if you want a truly reliable piece of technology, you choose MySQL. MySQL was released in 1995 and has been battle-tested. Everyone knows how to use it. It rarely fails, and when it does, there are standard procedures for recovering from those failures. Configurations and performance are widely known and predictable.

MySQL is not perfect, and the other databases exist to solve certain problems better than MySQL can. However, they lack the battle testing and broad knowledge base that MySQL has. That carries risk, and the risk will eventually show up, for example as poor performance that requires unexpected changes to the code and slows down feature development.

To choose a non-MySQL database is to spend an innovation token. By spending that token, you acknowledge that the risk exists, that the engineering plan has only a limited budget for handling unexpected situations, and that the tokens put a limit on how much of that budget you draw down.
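To make the token idea concrete, here is a minimal Python sketch under our own assumptions. The `InnovationBudget` class, its `choose` and `earn_token` methods, and the `is_boring` flag are hypothetical names for illustration, not anything from Dan’s post or from Etsy. Boring, well-known choices cost nothing; each new, interesting choice costs one token; once the tokens run out, further novel choices are refused.

```python
from dataclasses import dataclass, field


@dataclass
class InnovationBudget:
    """Hypothetical tracker for innovation tokens (illustrative only)."""

    tokens: int = 3  # the suggested starting allowance of three tokens
    spent_on: list = field(default_factory=list)  # tools that consumed a token

    def choose(self, tool: str, is_boring: bool) -> bool:
        """Record a technology choice and report whether it is allowed.

        Boring, well-known tools are free. A new, interesting tool costs
        one token; if no tokens remain, the choice is refused (False).
        """
        if is_boring:
            return True
        if self.tokens == 0:
            return False
        self.tokens -= 1
        self.spent_on.append(tool)
        return True

    def earn_token(self) -> None:
        """Grant one extra token; meant to be called only once the stack is stable."""
        self.tokens += 1


budget = InnovationBudget()
budget.choose("MySQL", is_boring=True)     # free: well-known technology
budget.choose("MongoDB", is_boring=False)  # costs one token
print(budget.tokens, budget.spent_on)
```

The design choice worth noting is that the budget refuses rather than overdraws: when the tokens are gone, the only way forward is to pick the boring option or to stabilize the stack and earn another token.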

One interesting social aspect of this is that the more rapidly technology changes, the more people want to use new tools, and that is exactly when you need to hold firm and not chase a new trend for technology’s sake. Making a budget is easy; sticking to it is hard. But that’s how you stay out of debt.

The Solution for Technical Debt

There is a way out.

Having written dozens (if not more than 100) software project proposals, I can tell you that the question people want answered is: “How much does it cost to make it work?”

I think that this is a good question, but it’s not the only one that should be asked. There’s a second question: “How much does it cost to maintain?”

The way out is knowing that maintenance is not optional.

One rule of thumb is that maintenance averages 60% of a software project’s total cost over its lifetime. If someone says a piece of software will cost $1,000,000 to build, and building it is only the remaining 40%, the full cost will be closer to $2,500,000: well over double. This happens on virtually every project.
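The arithmetic behind that rule of thumb fits in a few lines of Python. This sketch assumes, as the rule implies, that maintenance is a fixed share of the total lifetime cost and that the quoted price covers only the initial build; `lifetime_cost` is a hypothetical helper, not a standard estimation model. Looping over a few shares shows how sensitive the total is to that assumption.

```python
def lifetime_cost(build_cost: float, maintenance_share: float = 0.6) -> float:
    """Estimate total lifetime cost when maintenance is a fixed share of it.

    If maintenance is `maintenance_share` of the total, the build is the
    remaining (1 - maintenance_share), so total = build_cost / (1 - share).
    """
    if not 0 <= maintenance_share < 1:
        raise ValueError("maintenance_share must be in [0, 1)")
    return build_cost / (1 - maintenance_share)


# How the total for a $1,000,000 build changes with the maintenance share.
for share in (0.4, 0.5, 0.6, 0.7):
    total = lifetime_cost(1_000_000, share)
    print(f"maintenance {share:.0%}: total ${total:,.0f} ({total / 1_000_000:.2f}x)")
```

At a 50% share the total is exactly double the build cost; anything above that, including the 60% rule of thumb, pushes it past double.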

Maintenance is important, and it is usually boring. At the moment, the United States has a problem: our infrastructure is not being maintained as well as it should be. John Oliver did a great piece on this, poking fun at how boring, routine, and essential the work of maintenance is.

Is maintenance sexy? Not to most people. Neuroscientists tell us that our brains are drawn to new things by default: we have to analyze anything new and decide whether it will help or hurt us. Unless you’ve trained your brain to find long-term maintenance appealing, your attention will drift toward whatever is new.

Your friends are usually more excited by the new car you bought than the fact that your old one is clean and runs smoothly (until it’s old enough to be a classic).

A new software feature is more exciting than the fact that an old one hasn’t broken, unless you’re having problems.

Any senior software engineer knows that maintenance is important. They embody the idea that maintenance is not optional but inevitable, so they plan for it. They write software in a way that allows it to be maintained.

Any non-engineering business leader who has led more than a few software projects knows that maintenance is not optional, so they plan for it. They structure teams and budget in a way that allows it to be maintained.

The way out is believing that maintenance is not optional.

In the beginning…

Hi, I’m Tim.

I’m a software engineer who has been writing code for a very long time. I was lucky enough to have a computer before I was 10 years old, and I have always been a voracious learner on the subject.

I’ve also been an eager student of business from a young age. Dinner table discussions in my family usually revolved around business issues, which helped stoke my interest.

I think this is why I find the subject of technical debt so fascinating. The very term itself is a combination of two things I enjoy, and it also represents the intersection of the two subjects (as well as blending concepts from many other interesting fields).

If time and money are no object, technical debt is hardly worth discussing. When I write code for my own enjoyment, I can spend as much time as I like crafting it, much as an artist can refine their work until it makes them personally happy.

In the business world, however, time and money are far more limited. This is what makes technical debt interesting: it tends to emerge under business pressure, and it creates business problems in turn.

[Image: Todd Sawicki on Digg’s technical debt]

Technical debt affects the day-to-day work of shipping new features and keeping customers and employees happy, as well as bigger business issues, including acquisitions. (Who wants to buy a business only to find out that it is a complete mess under the hood?)

I began writing a book on the subject, and I quickly realized that a better starting point would be a blog.

With this blog, I’m going to explore the (sometimes nuanced) causes, effects, and solutions of the subject. I hope you, dear reader, will enjoy reading, and (since I certainly don’t know everything) will share your stories with me as well. I’m going to enjoy it, and I hope you do too.