Reddit’s Technical Debt

Reddit is having some serious problems. Freedom of speech vs hate groups vs plain old jerks vs making some money. The (former as of last week) CEO, Ellen Pao, had a tough challenge ahead of her when she joined, and I respect her for trying to do difficult things.

The final straw for many redditors (aka: Reddit users/volunteer moderators) was when a Reddit employee was let go. This particular Reddit employee was well-liked, and many redditors felt it was the breaking point. They shut down some of the most popular parts of the site and ended up grabbing a lot of press.

It’s worth pointing out that there was already an unhealthy relationship between the executives in charge of Reddit (the corporate structure) and the volunteer moderators who ran the site for love, not money.

To be fair, before this strained relationship, Reddit originally provided a good place for a community to form. Facebook is the face of “social media” today, but there are an enormous number of different online communities, and there always have been.

Communications technology has always been about the idea of making content available to others and providing mechanisms for people to discuss that content. Before Facebook and Friendster, there were forums, and before that there were BBSs.

There are a few things needed for a good electronic community. Let’s use Facebook as an example.

One component is the core team — Mark Zuckerberg, Sheryl Sandberg, and the thousands of people at Facebook. Another component is the visitors to the community — you, me, and the +1 billion people that show up every month. Finally, there is the platform itself. There is an enormous amount of infrastructure that combines to make the thing we call “Facebook” work.

This infrastructure part (both hardware and software) isn’t always discussed. I think part of it has to do with the fact that it is hugely complicated. The plumbing required to manage the data for a billion people each month is quite literally beyond the scope of any one person’s ability to understand it.

(I think there’s a good argument that the infrastructure of Facebook is as complex as a human body, DNA and psychology included. If you think that’s exaggerated, I suggest you consider that a lot of the added difficulty we have with bodies comes from us having to reverse engineer the blueprints, where with Facebook, the blueprints are all there, recently created by people, using language that many of us understand.)

Again, this stuff isn’t frequently discussed. 9 out of 10 conversations about Facebook aren’t about their code or their servers, even though the code and servers are an essential part of that conversation.

Reddit has a big infrastructure too, but much like Facebook, nearly all conversations about Reddit aren’t about their code or their servers. Reddit was started back in 2005, and if we make some reasonable assumptions about coding practices, we can assume that there is a serious chunk of technical debt after a decade of work.

One part of Reddit’s infrastructure that (presumably) contains technical debt is the set of community moderation tools. These features are what make it easier for a small number of moderators to manage a large number of visitors. One example is an auto-moderator bot that moderators can train to enforce community policies.

During the recent dust up at Reddit, Ellen Pao posted about the relationship with moderators, which I’ve screenshotted below.

Ellen Pao on Reddit's Technical Debt

There were several other comments in other threads about problems with the auto-moderator, where it was enforcing rules at undesirable moments, giving the illusion that the core executive team and moderators were abusing their power to control content on the site. Even though it may have been automatic systems inadvertently changing content with no human intervention — the humans were blamed.

Given Ellen’s comments about a monolithic infrastructure, it is reasonable to assume that there was so much technical debt that they were unable to fix some of these infrastructure problems in a timeframe that would keep the community happy.

Would Reddit be a Garden of Eden if it weren’t for technical debt? Probably not. There were genuine hate groups on the site posting offensive material. There were some awful examples of behavior — people who would drop in on mental health communities and encourage people to kill themselves. There were people who would find high-profile media celebrities and share personal information. The world has jerks, and some parts of Reddit were a home for these jerks.

Still, it’s worth wondering — what if there were better tools for finding hateful material and hiding it? What if there were better tools for banning abusive users from the community? What if the system had been easy enough to change to keep up with the way people were using it?

As they say, the medium is the message. What happens when the medium is a flawed system?

The Solution for Technical Debt

There is a way out.

Having written dozens (if not 100+) of software project proposals, I can tell you that the question people want answered is “how much does it cost to make it work?”

I think that this is a good question, but it’s not the only thing that should be asked. There’s a second question, “how much does it cost to maintain?”

The way out is knowing that maintenance is not optional.

One rule of thumb is that software maintenance averages 60% of a project’s total cost. If someone says a piece of software will cost $1,000,000 to build, it will probably cost over double that across its lifetime. This happens on every project, without exception.
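Here is the arithmetic behind that rule of thumb as a rough sketch. It assumes the quoted price covers only the initial build and that maintenance makes up 60% of the total lifetime cost. The function name and the 60% default are my own illustration, not an industry standard.

```python
def lifetime_cost(build_cost, maintenance_share=0.60):
    """Estimate total lifetime cost when maintenance is a fixed share of it.

    Assumption: build_cost is everything that is NOT maintenance, so
    build_cost = (1 - maintenance_share) * total.
    """
    if not 0 <= maintenance_share < 1:
        raise ValueError("maintenance_share must be in [0, 1)")
    return build_cost / (1 - maintenance_share)


quoted = 1_000_000
total = lifetime_cost(quoted)
maintenance = total - quoted

print(f"Quoted build cost:       ${quoted:,.0f}")
print(f"Estimated lifetime cost: ${total:,.0f}")
print(f"Of which maintenance:    ${maintenance:,.0f}")
```

Under those assumptions, a $1,000,000 build works out to roughly $2,500,000 over the life of the software, which is where "over double" comes from.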

Maintenance is important, and is usually boring. At the moment, the United States is having an issue where our infrastructure is not being maintained as well as it should be. John Oliver did a great piece on this, making fun of the boring, routine, and essential work of maintenance.

Is maintenance sexy? Not to most people. We know from neuroscientists that our brains are interested in new things by default. We’ve got to analyze new things and decide if they’re going to help or hurt us. Unless you’ve trained your brain to find long-term maintenance appealing, your brain tends to be interested in new things.

Your friends are usually more excited by the new car you bought than the fact that your old one is clean and runs smoothly (until it’s old enough to be a classic).

A new software feature is more exciting than the fact that an old one hasn’t broken, unless you’re having problems.

Any senior software engineer knows that maintenance is important. They embody the idea that maintenance is not optional — it is inevitable — so they plan for it. They write software in a way that allows it to be maintained.

Any non-engineering business leader who has led more than a few software projects knows that maintenance is not optional, so they plan for it. They structure teams and budget in a way that allows it to be maintained.

The way out is believing that maintenance is not optional.

Case Study: Toyota and Unintended Acceleration

The effects of technical debt are sometimes difficult to demonstrate. Code that is not just buggy, but is systemically bad, is a different level. When code is required to operate under serious constraints, good engineering practices need to be applied in a serious and systematic way.

Modern cars are filled with electronics to keep them running, including coordinating core vehicle services like acceleration and braking. In the mid-2000s, some Toyota drivers began reporting that their car was accelerating without them touching the gas pedal, and in some cases, while they were even stepping on the brake pedal.

Some incidents were attributed to drivers mistakenly pressing the gas pedal, or some minor physical issues in the car, such as floor mats that slid and pressed the gas pedal. Eventually, some of the issues were traced back to the software in the car.

There was a big problem with the software in the car. Actually, there were big problems. Two software experts, Philip Koopman and Michael Barr, analyzed the source code and found errors including memory corruption, processes that died and disabled safety systems, systems with single points of failure, bad memory handling, and thousands of global variables.

That last one is particularly egregious in my opinion, because some of the other problems they found were bad, but they were the type of problems that are difficult and require specialized knowledge. I’m not suggesting that software engineers who design critical systems shouldn’t have that specialized knowledge — I’m saying that at least the problems were difficult problems.

Global variables are a subject that gets discussed in nearly any freshman-level software engineering course. Modern software engineers widely regard them as a risky practice. The term we use to describe their existence is that they are a “code smell”.

Put another way, global variables may not have robbed the bank, but they were spotted with large duffle bags, speeding away from the scene (presumably in a Toyota). It’s fair to ask what they’re doing there.
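To make the concern concrete, here is a small, hypothetical sketch in Python. It is not the vehicle’s actual code, and every name in it (throttle_percent, diagnostics_tick, ThrottleState, and so on) is invented for illustration. It contrasts a global that any function can silently overwrite with state owned by one object and changed only through a checked method.

```python
# Style 1: a module-level global that any function can overwrite.
throttle_percent = 0


def cruise_control_tick():
    global throttle_percent
    throttle_percent = 40  # cruise control asks for 40%


def brake_pressed():
    global throttle_percent
    throttle_percent = 0  # braking should cut the throttle


def diagnostics_tick():
    global throttle_percent
    throttle_percent = 40  # unrelated code writes back a stale value


# Style 2: state owned by one object, with a single checked write path.
class ThrottleState:
    def __init__(self):
        self._percent = 0
        self._brake_active = False

    def set_brake(self, active):
        self._brake_active = active
        if active:
            self._percent = 0

    def request(self, percent):
        """Apply a throttle request unless the brake is active."""
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        if not self._brake_active:
            self._percent = percent
        return self._percent


# With globals, the last writer wins, regardless of intent.
cruise_control_tick()
brake_pressed()
diagnostics_tick()
print("global throttle after braking:", throttle_percent)

# With an owning object, the brake rule is enforced in one place.
state = ThrottleState()
state.request(40)
state.set_brake(True)
print("owned throttle after braking:", state.request(40))
```

In the first style, nothing stops an unrelated function from undoing the brake. In the second, there is exactly one place to look when the value is wrong. Multiply the first style by thousands of variables and it becomes very hard to reason about who changed what.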

Thousands of global variables in the car’s software implicate not only the software, but also the process used to make the software. My sense is that the system was so bad that it was actually impossible to fix in that environment. Producing good work requires having standards, both within the engineering team and in the organization’s expectations for how it works with the engineering team.

This type of technical debt is unmanaged. It is not the sort of debt that is accrued by an organization that understands why it is there. It is systemic, and because the system was designed inappropriately, it was not possible to make meaningful improvements. One assessment said that the code was unmaintainable, and that because of this, even if “your car has the latest version of the [software] doesn’t mean it is safer necessarily than the older one” 1

To me, the situation evokes the Challenger disaster. The root cause ended up being O-rings that would fail under certain circumstances. In retrospect, the cause of the problem was shockingly simple — Richard Feynman (the noted physicist) was able to reproduce the problem with a glass of ice water. There are signs that engineers in both organizations knew that there were problems, but that the organization as a whole failed to respond in an appropriate way.

Improving a large codebase for an embedded system offers many of its own challenges. How would you approach improving a system like this?

For more reading: