To go fast, discard your A/B tests after use

As an engineer, I noticed a clear difference in my own mindset when writing code I knew would be maintained in perpetuity versus temporary code that would only live for a few weeks before being removed. One impactful operational change I made on one of the first growth teams I led was to start the development of every A/B test with the assumption that we would throw away the code written for it. In practice, we didn't throw away 100% of the code for every test, but the extreme framing helped shift our mindset toward behaviors that accelerated the team.

In one instance, a single engineer on this growth team spent about two weeks building an A/B test that effectively just swapped out some visual elements on various pages -- hiding some and replacing them with others using client-side JavaScript injected by the experimentation tool. The test turned out to be a big winner, showing a substantial positive impact.

Once we observed this result, we threw away the test code and started making the necessary changes to the product. We scaled up the number of engineers working on the changes, eventually reaching a peak of eight engineers, all working full-time. Even with the expanded team, it still took us four months to complete the project.

If this test hadn't produced a positive result and we had tried to build the whole feature before testing it, we could have wasted 32 engineer-weeks' worth of effort. By allowing ourselves to build a version of the test that we were willing to throw away, we were able to test our hypothesis in a small fraction of that time.

In other contexts, the ratio is not as favorable -- the cost of building the A/B test is comparable to, or even higher than, the cost of just building the feature. Even then, though, the team will typically bolt the new feature on behind a feature flag rather than fully integrating it into the codebase, so it's still worth evaluating the "throw it away" strategy to see whether the benefits outweigh the costs, given the considerations detailed below.
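As a rough sketch of what "bolted on behind a flag" can look like, here's a minimal Python example. The set-based flag store and the "new_pricing_layout" flag name are hypothetical stand-ins for whatever experimentation SDK a team actually uses.

```python
# Minimal sketch: a throwaway test variant gated behind a feature flag.
# The flag store is a plain set here; in practice it would be your
# experimentation SDK, and "new_pricing_layout" is a hypothetical flag name.

ACTIVE_FLAGS = {"new_pricing_layout"}  # stand-in for a real flag service


def is_enabled(flag_name: str, user_id: int) -> bool:
    # A real SDK would hash user_id into an experiment bucket; this sketch
    # just checks membership in the set above.
    return flag_name in ACTIVE_FLAGS


def pricing_page_context(user_id: int) -> dict:
    context = {"user_id": user_id, "layout": "classic"}
    # The experimental behavior lives in one isolated branch, so deleting
    # the test later means deleting this block and nothing else.
    if is_enabled("new_pricing_layout", user_id):
        context["layout"] = "experimental"
    return context


if __name__ == "__main__":
    print(pricing_page_context(user_id=42))
```

The point of the pattern is that the variant stays in one easily deletable branch rather than being woven through the rest of the code.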

Defers solving scalability problems

One time, our growth team needed data for a test that wasn't exposed to the front-end via any existing API. We looked into adding it, and doing so would have been expensive and difficult. However, the data changed very infrequently, so we ended up just running a SQL query against our data warehouse, dumping the result to JSON, and pasting it into our test code. This worked great for the handful of weeks the test was live! I suspect this non-scalable solution saved weeks of effort that would ultimately have been wasted, since the test didn't end up having a positive impact.
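A minimal sketch of that workflow, assuming a warehouse you can query with a standard driver; the table, the columns, and the in-memory SQLite stand-in below are all illustrative.

```python
# One-off dump script (illustrative). In the real workflow the query ran
# against the data warehouse; an in-memory SQLite table stands in here so
# the sketch is self-contained. Table and column names are hypothetical.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE plan_catalog (plan_name TEXT, monthly_price REAL)")
conn.executemany(
    "INSERT INTO plan_catalog VALUES (?, ?)",
    [("basic", 9.0), ("pro", 29.0)],
)

rows = conn.execute(
    "SELECT plan_name, monthly_price FROM plan_catalog"
).fetchall()
snapshot = [{"plan": name, "price": price} for name, price in rows]

# Paste the printed JSON literal directly into the experiment code.
print(json.dumps(snapshot, indent=2))
```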

For another test, we needed to send batches of emails every day. Our email automation platform at the time didn't have access to the data we needed to identify the audience and build the content of the emails, and getting the data into that system would have been quite difficult. So we wrote a Python script that fetched the data from the data warehouse, rendered the HTML for each email, and called the email marketing platform's "send single email" API to send it. We ran this script every morning at roughly the same time for a few weeks. Clearly this wouldn't work long-term, but it was good enough for the test.
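For illustration, here is the general shape such a script can take. The warehouse query, the template, the endpoint URL, and the payload fields are all hypothetical, not the actual platform's API.

```python
# Daily one-off send script (sketch). The query, template, endpoint, and
# payload fields are hypothetical stand-ins for a real email platform's API.
import json
import sqlite3  # stand-in for the real warehouse driver
import urllib.request

SEND_URL = "https://api.example-email-platform.com/v1/send"  # hypothetical
API_KEY = "REPLACE_ME"  # loaded from a secret store in practice


def fetch_audience(conn):
    # Identify today's recipients and the data needed to build their emails.
    return conn.execute(
        "SELECT email, first_name, expiring_credits FROM daily_email_audience"
    ).fetchall()


def render_html(first_name, expiring_credits):
    # A real script would use a proper template; inline HTML keeps this short.
    return (
        f"<p>Hi {first_name},</p>"
        f"<p>You have {expiring_credits} credits expiring soon.</p>"
    )


def send_one(email, html):
    # Call the platform's single-send endpoint once per recipient.
    payload = json.dumps(
        {"to": email, "subject": "Your credits are expiring", "html": html}
    ).encode()
    request = urllib.request.Request(
        SEND_URL,
        data=payload,
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
    )
    urllib.request.urlopen(request)


if __name__ == "__main__":
    warehouse = sqlite3.connect("warehouse.db")  # stand-in connection
    for email, first_name, credits in fetch_audience(warehouse):
        send_one(email, render_html(first_name, credits))
```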

Defers many edge cases

At times we also ran tests on a subset of the population, which allowed us to ignore certain edge cases and dramatically simplified our code. For example, we might target only English-speaking users to avoid the complexity of translating all of the strings in the test UI. The risk is that the sub-population might not be a representative sample of the full population, which limits how far the results generalize. This is a trade-off the team can manage as long as it is aware of the risks and benefits.
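As a small illustration, restricting the eligible population can be as simple as a locale check at enrollment time; the user records and "locale" field below are hypothetical.

```python
# Sketch: enroll only English-locale users so the test UI can skip
# translation. The user dictionaries and "locale" field are hypothetical.

def eligible_for_test(user: dict) -> bool:
    # Everyone else keeps the control experience and is excluded from
    # the experiment's analysis.
    return user.get("locale", "").startswith("en")


users = [
    {"id": 1, "locale": "en-US"},
    {"id": 2, "locale": "de-DE"},
    {"id": 3, "locale": "en-GB"},
]
print([u["id"] for u in users if eligible_for_test(u)])  # -> [1, 3]
```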

Defers refactoring and architectural work

Throwing away our A/B tests allows us to defer any refactoring and architectural work until we are confident the change has enough value to justify its development and maintenance cost in perpetuity. Without an estimate of the change's value, it's easy to question whether refactoring is "worth it" if it delays shipping to production. With an A/B test result and a value estimate in hand, it's much easier to bite off a reasonable amount of architectural improvement and refactoring as part of the delivery scope and leave the codebase better than we found it.

In practice, I haven't seen this benefit as much as I had hoped. Once we have a positive A/B test result, we usually want to deliver it to production as quickly as possible to start capturing value, and then do a second iteration in which we refactor the solution and integrate it more holistically into the architecture.

On one growth team I worked on, we tried this both ways: in one case we did the architectural work during the initial delivery, and in the other we deferred it to a second iteration. Doing the work during the initial delivery worked much better -- the second iteration never happened, and we were left with a messier codebase, not a cleaner one.

Defers the handling of many baseline requirements and quality controls

Many product delivery organizations have baseline requirements that apply to all changes: automated tests, internationalization, accessibility, and so on. These are very important for software that will be maintained in perpetuity, but if only ~30% of A/B tests produce positive results and go on to be maintained, then most of the time spent meeting these requirements in test code is waste. By throwing away your A/B tests, you give yourself permission to defer these requirements until you've seen the value. Save that effort for the ~30% of changes that will actually have an impact!

Minimizes the volume of experimental code in your system

Every line of code added to your codebase has a carrying cost. Every future change to the software is affected by it: the developer making that change will likely end up running automated tests that exercise the code, even if they aren't changing it at all. It adds complexity that future developers have to grapple with as they continue to improve the system. It's code that has to be refactored when bigger changes are made, and code that has to be updated to accommodate major version upgrades in core dependencies like Django or Rails.

The metaphor I like to use is that a product delivery team is like a hiker trying to climb a mountain as quickly as possible. Every time we add a feature to the software, it's as if the hiker picks up a stone on the path and puts it in her backpack. Even if she only picks up small stones, their weight starts to slow her down. When we A/B test our changes, it's as if she pauses to evaluate each stone before adding it to her backpack, deciding whether it's worth carrying up the mountain. More extensive changes to the system are like bigger rocks, with a higher value threshold to justify carrying their weight all the way to the top.

[AI-generated image: a hiker looking at a stone she picked up]