Duplicate code

Duplicate code is one form of technical debt that adds problems where you might not expect.

As a simple example, let’s take email address validation. An email address is the type of thing that seems simple but is actually devilishly complex. For instance, here is a validator that conforms very strictly to the actual RFC; it is almost 7k of code, including the explanations that help you figure out what it is doing.

I happen to know about this regex because it was written by a friend and coworker, Sandeep Tamhankar. He wrote it years ago just to prove a point. In the real world, almost no one ever uses Sandeep’s regex. They’ll invent their own version to validate email addresses, or they’ll avoid regular expressions entirely. The point is that there are a huge number of ways to do something very basic, like deciding if a piece of text is an email address or not.

This brings me to something that happened the other day. I was doing a push of some new code, and it required a migration of some data into a new model. Everything was going well, until the script stopped. The script was idempotent, so I knew it was safe to run again. As a first shot at solving the problem, I just re-ran it. No luck.

Now, these things happen. No problem. I dug into the migration, looked around at the code, and after a while, I found out that the new code to store the data was rejecting the record because it said that a field that should have been an email wasn’t.

This was particularly strange, because the data was coming from our own system, not an import from a third party. We already knew these were email addresses. There should have been no problem. But sure enough, there was an error staring me in the face saying “nope, it’s not an email.”

I took a look at the record, and there definitely was something unusual. Over the past few years, there has been a proliferation of TLDs (top-level domains) in use. We used to have good ol’ .com, .net, .org, .gov, .edu, and a bunch of 2-letter country codes. Now we have a lot more.

It turned out that in our system, there was more than one way to validate an email address. Some email addresses had made it into our system past one type of check but were causing trouble as we tried to move them to another part of the system, because a different type of check was being done there.
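To make the mismatch concrete, here is a minimal sketch of how two checks can disagree. Neither pattern is the actual code from our system; both regexes, the function names, and the sample address are assumptions made up for illustration. The first check accepts any TLD of two or more letters, while the second was (hypothetically) written when TLDs were short and allows only two to four.

```python
import re

# Hypothetical check at the point of entry: any TLD of two or more letters.
SIGNUP_EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[A-Za-z]{2,}")

# Hypothetical check in the new model: assumes TLDs are two to four letters,
# which was a common assumption before the newer, longer TLDs appeared.
MODEL_EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[A-Za-z]{2,4}")


def signup_accepts(address: str) -> bool:
    """Return True if the entry-point check accepts the address."""
    return SIGNUP_EMAIL_RE.fullmatch(address) is not None


def model_accepts(address: str) -> bool:
    """Return True if the new model's check accepts the address."""
    return MODEL_EMAIL_RE.fullmatch(address) is not None


if __name__ == "__main__":
    for address in ("pat@example.com", "pat@example.photography"):
        print(address, signup_accepts(address), model_accepts(address))
```

An address on a long TLD gets past the first check and is stored, then is rejected later when the second check sees it. Neither check is obviously wrong when read on its own, which is what makes this kind of duplication hard to spot.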

This highlights a risk from technical debt caused by code duplication. Nearly every software engineer has heard the expression “don’t repeat yourself” or DRY. Obviously our code failed that test (although this instance was quickly fixed so that we could finish the migration).

This is the impact of non-DRY code, and it captures the link to technical debt. Something as simple as validating an email address stopped us from deploying new features and bugfixes. We had to investigate and resolve a problem that arose only because the second person wasn’t familiar with enough of the code to know that someone else had already solved it and that the first solution could be reused.

What would have happened if this hadn’t been as simple as an email address? What if it had been a complex system process?

The lesson? Software engineers should always be mindful of duplicated functionality and, when they find it, refactor it into common modules whenever possible. The code ends up cleaner and more maintainable, and uptime improves.
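As a rough sketch of what that refactoring might look like, the example below puts the email check in one function and has every caller use it. All of the names here (is_valid_email, register_user, migrate_record, the record fields) and the deliberately loose pattern are hypothetical, chosen only to illustrate the idea; they are not the code from our system.

```python
import re

# Hypothetical shared module (for example, validators.py): the single place
# that decides what counts as an email address in this codebase.
_EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[A-Za-z]{2,}")


def is_valid_email(address: str) -> bool:
    """Return True if the address passes the one shared email check."""
    return _EMAIL_RE.fullmatch(address) is not None


def register_user(address: str) -> dict:
    """Hypothetical signup path: stores only what the shared check accepts."""
    if not is_valid_email(address):
        raise ValueError(f"not an email address: {address!r}")
    return {"email": address}


def migrate_record(record: dict) -> dict:
    """Hypothetical migration step: applies the same check as signup."""
    if not is_valid_email(record["email"]):
        raise ValueError(f"not an email address: {record['email']!r}")
    return {"contact_email": record["email"]}


if __name__ == "__main__":
    stored = register_user("pat@example.photography")
    print(migrate_record(stored))
```

Because both paths call the same function, anything that was accepted on the way in is accepted again on the way across. If the rule ever needs to change, it changes in one place.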
