A/B Testing Analysis: A How-To Guide
Learn how to analyze the results of your A/B tests to avoid misinterpretations.
You’ve successfully launched an A/B test, and now you have a pile of data sitting in front of you.
Your boss is demanding answers: which variation won?
While it may be tempting to read off the topline numbers and make a decision, A/B testing requires careful analysis to make sure the data is valid, the difference between variants is truly significant, and you understand why one version outperformed the other.
Hasty analysis has consequences. If your data is inaccurate or misinterpreted, you’ll be sinking time, money, and resources into implementing a change that wasn’t actually received well by customers.
First, double check the data. The last thing you want to do is base decisions off flawed analysis and a false positive (also known as a Type 1 error). Think back to how the principles of simplicity and consistency were applied to your original experiment:
Consistency: Was the same variable tested throughout the duration of the experiment? For example, if you changed the email sign-up button on your website from its original color to purple, did you keep it purple the entire time? Changing the color to green, pink, or blue during the experiment would invalidate the results.
Simplicity: Did you test more than one variable? It’s best practice to test one element in an A/B test (if you want to test multiple, a multivariate test is the better option). So, if the goal was to see what would increase email signups in a campaign, each A/B test should have focused on one thing, whether it was the subject line, CTA placement, etc.
Here are a few common mistakes in data collection and analysis that can undermine the reliability of your results:
Too small a sample size - You need to make sure your experiment ran for long enough and had enough participants to reach a confidence level of 95%. An online sample size calculator can help you figure out how many participants are necessary (the sketch after this list shows the kind of math these tools run).
The timing wasn’t right - A/B tests should run for at least one to two weeks, or until statistical significance is reached.
Too much outlier data - Outliers are data points that fall far away from the median. You can consider removing these, but with caution. For example, an outlier caused by a bot is worth removing from your A/B test.
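To make the sample size point concrete, here's a minimal sketch of the kind of calculation those calculators run, using Python's statsmodels library. The baseline conversion rate, target lift, and 80% power are illustrative assumptions, so plug in your own numbers.

```python
# Rough sample-size estimate for an A/B test on a conversion rate.
# Assumed numbers: 4% baseline conversion, and we want to detect a lift to 5%.
import math

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.04  # current conversion rate (control)
target_rate = 0.05    # smallest improvement worth detecting (variant)

effect_size = proportion_effectsize(baseline_rate, target_rate)  # Cohen's h
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,              # 5% significance level (95% confidence)
    power=0.8,               # 80% chance of detecting a real effect this size
    alternative="two-sided",
)

print(f"Visitors needed per variant: {math.ceil(n_per_variant)}")
```

With these assumed numbers, the requirement comes out to a few thousand visitors per variant, which is why low-traffic pages often can't support a meaningful A/B test.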
A/B testing revolves around two hypotheses, the null and the alternative. The alternative is your prediction (e.g., if I change the color of the sign-up button to red, then more people will click on it). The null hypothesis is the opposite; e.g. changing the color of the email sign-up button will have no effect on how many people click on it.
The p-value (or probability value) measures the statistical significance of each test: the probability of seeing a difference at least as large as the one you observed if the variable actually had no effect (in other words, if the result were due to chance alone). The general benchmark is that when the p-value is 5% or less, the result is considered statistically significant and you can reject the null hypothesis.
Manually calculating the p-value is tedious. Most A/B testing tools, like Optimizely or Google Optimize, have the p-value calculation built into their dashboards.
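If you're curious what those dashboards are doing behind the scenes, here's a minimal sketch of one common approach, a two-proportion z-test, using Python's statsmodels library. The visitor and sign-up counts below are made-up example numbers.

```python
# Two-proportion z-test: did variant B's sign-up rate differ from variant A's?
# Counts are hypothetical example data.
from statsmodels.stats.proportion import proportions_ztest

conversions = [480, 560]      # sign-ups for variant A and variant B
visitors = [10_000, 10_000]   # visitors who saw variant A and variant B

z_stat, p_value = proportions_ztest(conversions, visitors, alternative="two-sided")
print(f"p-value: {p_value:.4f}")

if p_value <= 0.05:
    print("Statistically significant at the 95% confidence level: reject the null hypothesis.")
else:
    print("Not significant: the observed difference could plausibly be chance.")
```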
After you establish the data you collected is valid, it’s time to dive into the actual results. This is where you’ll determine whether there was a meaningful difference between variations, and if implementing the change will drive business outcomes. Here are a few questions to ask yourself when analyzing an A/B test:
Is it statistically significant (e.g., a confidence level of at least 95% and a p-value of 5% or less)?
Was the audience large enough?
What is the lift rate, i.e., the relative difference in conversion rates between variations A and B? (A short example follows this list.)
Was the audience influenced by outside events (e.g., holidays) or the novelty effect (e.g., an interest in a new feature/change that fades over time)?
What can we learn from a losing variation or failed experiment?
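To illustrate the lift question above, here's a tiny sketch that computes relative lift from the same hypothetical counts used earlier.

```python
# Lift rate: the relative difference in conversion rate between variant B and control A.
# Counts are hypothetical example data.
conversions_a, visitors_a = 480, 10_000
conversions_b, visitors_b = 560, 10_000

rate_a = conversions_a / visitors_a   # 4.8% conversion for A
rate_b = conversions_b / visitors_b   # 5.6% conversion for B
lift = (rate_b - rate_a) / rate_a     # relative lift of B over A

print(f"A: {rate_a:.2%}  B: {rate_b:.2%}  Lift: {lift:.1%}")
# -> A: 4.80%  B: 5.60%  Lift: 16.7%
```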
Once you’ve established which version outperformed the other, it’s time to understand why. This will help inform your future marketing strategy, set better KPIs, and give you a strong starting point for launching new campaigns.
Perhaps moving the location of the checkout button made it more visible to desktop and mobile users, and that’s why you had a high click-through rate. Maybe sending the weekly email in the afternoon as opposed to the morning led to higher open rates because it was around the time most people finished working.
Once you see what’s working (and what isn’t), you can iterate on original test ideas to further prove your hypothesis, and continuously optimize campaigns and workflows.
Here are a couple of ways to gain insight into why one variation was the winner.
Further dividing the data into demographic segments can help draw more detailed insights. Here are a few ideas on how to segment data to find new takeaways:
Demographic - Job title, industry, age, etc.
Geographical - Location, region, time zone, etc.
Behavioral - New vs. returning visitors, mobile vs. desktop, high-value shoppers, etc.
It’s recommended to segment data after the initial round of experiments, to help discover these insights in the first place. Once you’ve spotted a pattern in behavior in specific audience segments, then you can start implementing more targeted tests.
(With a customer engagement tool like Twilio Engage, you can easily segment users based on specific events or behaviors, and update these audiences in real time.)
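As one possible starting point, here's a sketch of how you might slice exported test results by segment with pandas. The file name and the column names (variant, device, converted) are assumptions; adjust them to match whatever your testing tool exports.

```python
# Break A/B results down by segment (here: device type) after the test finishes.
# Column names and file name are assumptions about your export format.
import pandas as pd

results = pd.read_csv("ab_test_results.csv")  # one row per visitor

segment_summary = (
    results
    .groupby(["variant", "device"])            # e.g. mobile vs. desktop per variant
    .agg(
        visitors=("converted", "size"),        # how many visitors in each cell
        conversion_rate=("converted", "mean"),  # assumes 'converted' is 0/1
    )
    .round(4)
)
print(segment_summary)
```

If one segment shows a much larger lift than the others, that's a natural candidate for a more targeted follow-up test.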
Factors like holidays, seasonality, or the novelty effect could have had an outsized impact on A/B test results, which should be taken into account during your analysis.
Think about seasonality. Certain holidays or seasons can bring more (or less) traffic to your homepage and affect engagement on social media or via email. Another interesting factor could be virality: did a recent TikTok cause a spike in purchases of your product and an increase in traffic to your site? If you ran A/B testing during that time period, implementing the experimental version of what was tested (if it performed well) may not produce the same high conversion metrics in the future.
Once you’ve identified the factors that contributed to the success (or failure) of an experiment, consider implementing the change, and document what you learned to hone your testing strategy.
While we call them winning and losing variations within the confines of the experiment, there are truly no winners and losers in A/B testing. You can still gain insight into consumer behavior with every test – you simply find what is working, and what isn’t.
You have a few options once you discover a winning variation. You may want to run more tests to confirm the initial outcome. Or, if you’re confident in the results, you can implement this change across the organization.
Sometimes, that involves including technical experts to change things on the website or working with the social media/email marketing team to tweak copy and design.
A phased rollout is usually best to give internal teams and your user base time to adjust and avoid any missteps in the changeover.
When the variation doesn’t perform better than the original, don’t panic. It’s not a failed test. You’ve actually gained valuable information and are one step closer to finding something that does work, because you can rule out that hypothesis.
Keep a log of A/B test failures. That way, when you or your coworkers draft new copy or generate new designs, you know what to avoid.
Many A/B testing tools can automatically calculate the p-value and confidence level for an experiment to ensure that it is statistically sound. (As a reminder, the p-value should be 5% or lower, and the confidence level should be at least 95%.)
Other important factors include ensuring the sample size was large enough, running the test for a long enough period, and keeping the variable consistent throughout the test.
Marketing teams should not use A/B testing if their website isn’t getting consistent visitors or substantial traffic, because you won’t have enough data to gain meaningful, accurate insights into user behavior. An A/B test may also not be the right framework for your experiment; multivariate testing and multi-page testing are two alternatives.
Twilio Engage consolidates data from across an organization. Teams are able to create highly specific audiences, and integrate data from A/B test campaigns to holistically understand consumer behavior.