Designing statistically significant growth experiments

Statistics was one of those required classes in high school or college that many of us dreaded for its dull material. But if you’re a marketer or product manager, it turns out those stats skills can prove genuinely useful.

When working with my consulting clients, many of the campaigns we run are designed as experiments: tests built to drive go/no-go decisions on channel strategies.

As an operator and consultant, I’ve been surprised to see how many companies run campaigns without (1) a clear point of view on what they’re testing, or (2) an eye towards statistical significance. 

Running a statistically significant test is important for many reasons. First, it gives you data you can feel confident about using for important business decisions. If you’re making decisions based on incomplete or incoherent data, you may find yourself throwing good money after bad. 

Setting up your experiments to be statistically significant also flexes your data muscle. It helps you develop a discipline for letting the numbers do the talking. I can’t tell you how often I’ve seen growth orgs make decisions based on “gut” or untested ideas floated within the company. Having a system for setting up and tracking your campaigns with a critical mass of the right data helps teams make smarter, more impactful business decisions over time.

This post is geared towards marketers running paid acquisition experiments, though many of these principles could be applied to product marketing or other tests.

Just like the scientific method: start with a hypothesis

Determine what you’re testing and be as specific as possible. Are you trying to find out whether a certain button color converts better by grabbing more people’s attention? Or are you trying to prove that images of people perform better than generic lifestyle photos in ad creative?

Put a stake in the ground, determine what exactly you’re testing, and then determine what variables you need to test vs. hold constant throughout the experiment. 

Next: isolate your variables

Before you even get started, it’s important to know what exactly you’re testing. The more variables you hold constant, the better clarity you’ll have at the end of the experiment. Be clear about what’s important for you to learn from this test and how you’re going to leverage that insight going forward. 

Begin with the end in mind: what is the most important insight you want to take away from this test? It could be headline copy that best converts on a Facebook ad and associated landing page. Maybe it’s a button color or header image.

It’s a tough discipline, but try as hard as you can to manipulate as few variables as possible to avoid muddying your data.

As an example: I worked with a financial services client on their funnel optimization. Their onboarding funnel involved nearly a dozen steps, and our hypothesis was that one crucial step in onboarding influenced whether users would provide an email address, an important asset for future remarketing.

We drove traffic to the site with Facebook ads, pointing to a landing page where visitors initiated the signup flow.

There are hundreds of variables we could have manipulated for the test—ad copy, imagery, button colors—but we wanted to walk away with a clear understanding of whether changes to this one step of the flow influenced the likelihood of someone providing their email.

Then: determine the right sample size

Once you determine what you’re testing, you need to figure out how many people need to see the test in order for your results to be meaningful. In a traditional A/B test, you’re evaluating the performance of two different variations. That means you need a large enough sample size for each variation. 

There are a few good sample size calculators out there. I like to use this one from Creative Research Systems. 

To back into your sample size, you need to start with two important assumptions: your confidence interval (also known as your margin of error) and your confidence level.

Your confidence interval helps you home in on an expected range of results. If your confidence interval is 4 and you find that 55% of people in your experiment convert with a green button instead of a gray one, you can be reasonably sure that anywhere from 51% to 59% of people in your target population would choose the green button over gray.

Typically, the larger your sample size, the smaller your confidence interval.

Your confidence level is exactly what it sounds like: the percentage that tells you how sure you can be that your results fall within the confidence interval. Most researchers use a confidence level of 95%.

With your confidence level and interval determined, you then need to determine your total addressable market. In my client example above, we looked at the total number of men and women within a certain age bracket and income level in the United States. To find that figure, you can use census data or the audience sizing tools in Facebook or AdWords.

With that information, plug your confidence level, interval, and total population into the sample size calculator, and voila! There you have the number of people you need to reach with each variation (i.e., multiply that sample size by two for your total reach in an A/B test). Be sure the sample you target is a random sampling of people within that total population to ensure the most reliable results.
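If you’d rather compute this yourself, here’s a minimal Python sketch of the standard formula behind calculators like this one. It assumes a worst-case conversion rate of 50% and applies a finite-population correction; the example figures (a 2,000,000-person addressable market, a 4-point interval) are purely illustrative.

```python
import math

def sample_size(confidence_level, margin_of_error, population=None, p=0.5):
    """Estimate the sample size needed per variation.

    confidence_level: e.g. 0.95 for 95%
    margin_of_error: confidence interval as a decimal, e.g. 0.04 for +/- 4 points
    population: total addressable market (optional finite-population correction)
    p: assumed conversion rate; 0.5 is the most conservative choice
    """
    # z-scores for common confidence levels
    z_scores = {0.90: 1.645, 0.95: 1.96, 0.99: 2.576}
    z = z_scores[confidence_level]

    # base sample size for a very large population
    ss = (z ** 2) * p * (1 - p) / (margin_of_error ** 2)

    # finite-population correction
    if population:
        ss = ss / (1 + (ss - 1) / population)

    return math.ceil(ss)

# Example: 95% confidence, +/- 4 points, addressable market of 2,000,000 people
per_variation = sample_size(0.95, 0.04, population=2_000_000)
print(per_variation)        # roughly 600 people per variation
print(per_variation * 2)    # total reach for a two-variation A/B test
```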

Finally: work within your budget

Once you determine your sample size, you may find your budget doesn’t stretch to the number of people you need to reach. How much of a constraint that is also depends on your acquisition channel: if you’re running a test on email, it may be less of an issue than on PPC channels like Facebook, AdWords or LinkedIn.

You can modify your assumptions, such as widening your confidence interval or lowering your confidence level, to accommodate your budget. 
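Reusing the sketch above (with the same illustrative 2,000,000-person market), you can see how loosening either assumption shrinks the number of people you need to reach:

```python
# Widening the margin of error from 4 to 5 points, or dropping the confidence
# level from 95% to 90%, meaningfully reduces the required sample per variation.
print(sample_size(0.95, 0.05, population=2_000_000))  # ~385 per variation
print(sample_size(0.90, 0.04, population=2_000_000))  # ~423 per variation
```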

So there you have it: a model for designing statistically significant experiments. 

Happy testing!