10 Common A/B Testing Mistakes to Avoid in 2023 (with expert advice)

A/B testing is a method of comparing two versions of a webpage, email, or ad campaign to determine which one performs better. 

It involves creating two versions, showing them to similar audiences, and measuring the difference in conversions, click-throughs, or other metrics.

A/B testing is an essential tool for data-driven decision-making and optimization. It provides objective data on the impact of changes, allowing you to iterate and improve.

However, there are some common A/B testing mistakes that can undermine results. Avoiding these errors is critical for accurately assessing the performance of variations and making the right choices.

In this article, we will outline the top 10 common A/B testing mistakes. Being aware of these pitfalls will help you design rigorous tests, analyze results correctly, and make data-backed decisions to boost conversions and other key metrics. 

Careful attention to your testing methodology is essential for gaining the true benefits of an optimization mindset.

Let’s get started.

Mistake 1: Running Tests Without A Hypothesis

Too many A/B tests launch without clearly articulated hypotheses, undermining the entire learning process. Well-defined hypotheses provide focus and prevent just fishing through data.

Specifically, state how a proposed variation will impact a key metric relative to the control. For example, “Decreasing form fields from 10 to 5 will increase conversion rate by 15%.”

Never conduct aimless tests without firm expectations. Avoid catch-all hypotheses like “Testing new checkout flow.” 

Hear from David Sanchez (Head of Optimization):

‘’A hypothesis is “what you’re trying to prove/disprove”. An execution is “how you’re going to try to prove it”. For each hypothesis, there may be hundreds of executions, and just because an execution failed it doesn’t mean that the hypothesis doesn’t hold value. E.g. you could hypothesize that adding video to PDPs would make people more interested in your products, which you would measure by the number of people watching the video and proceeding to purchase (crude example). Now, if it didn’t work… does it mean that videos don’t work, or could it be that the video was bad?’’

Identify one primary success metric instead of diluted objectives.

David had this to add:

‘’By definition, the Primary metric should only be one, and it’s the one that proves/disproves your hypothesis. 

Guardrail metrics are very important because they help us extract deeper learnings and understand the nuances behind a result, but there should only be one primary metric, and it should never be double-barrelled. Booleans won’t cut it: neither AND nor OR is acceptable.

E.g. “User research tells us that users are concerned about whether they will get their deliveries on time, which could affect their likelihood to buy. We believe that if we provide clearer information about delivery times, we’ll positively address this concern. We will know this if we see an increase in conversion rate AND revenue per visitor AND average order value”.

Now… what if we get an increase in conversion rate and revenue per visitor, but the average order value remains untouched? Did we not address the concern? What if the conversion rate didn’t change, but the other two did? What if only one changed, or some of the others went down? 

Choose a single primary metric: the one that proves/disproves your hypothesis. Use the other ones to refine your decisions and formulate new hypotheses if you must. OR statements are even worse:

E.g. “User research tells us that users are concerned about whether they will get their deliveries on time, which could affect their likelihood to buy. We believe that if we provide clearer information about delivery times, we’ll positively address this concern. We will know this if we see an increase in conversion rate OR revenue per visitor OR average order value” is plainly cheating.

If you’re going to add 3, you may as well add 27 metrics with OR statements, and hey, chances are that one of them will be up so you can claim a winner.’’

Also, beware of HARKing – conjuring hypotheses after results are known. This hindsight bias invalidates the methodology by shaping hypotheses around observed data patterns.

Before launching any tests, take time to research and formulate a single-directional hypothesis for each test. Then rigorously evaluate only that prediction, resisting the urge to mine metrics for other positive results.

Document hypotheses upfront in a registered protocol to prevent HARKing. Let clear hypotheses drive analysis, not the other way around. This approach yields validated learnings rather than just storytelling.
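
To make this concrete, here is a minimal sketch of what registering a protocol upfront could look like, using a plain Python dict. The field names and values are illustrative assumptions, not a prescribed schema or a specific tool’s format.

```python
# Illustrative test protocol, recorded before launch. Field names and values
# are hypothetical examples, not a prescribed schema.
from datetime import date

protocol = {
    "test_name": "checkout_form_fields",
    "hypothesis": ("Decreasing form fields from 10 to 5 will increase "
                   "checkout conversion rate by 15% relative to the control."),
    "primary_metric": "checkout_conversion_rate",   # exactly one primary metric
    "guardrail_metrics": ["revenue_per_visitor", "average_order_value"],
    "minimum_detectable_effect": 0.15,              # relative lift we care about
    "significance_level": 0.05,                     # 95% confidence
    "planned_duration_days": 28,
    "registered_on": date.today().isoformat(),
}
```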

Mistake 2: Ignoring Statistical Significance

Statistical significance means you can be confident, typically at the 95% level or higher, that the observed difference between A and B is not simply due to random chance. It gives you a high degree of certainty that the changes you made in version B have resulted in real, measurable gains (or losses) in key metrics like conversion rate, revenue, and engagement. Statistical significance is crucial for having trustworthy results.
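
For intuition, here is a minimal sketch of the kind of check behind that confidence figure: a two-sided, two-proportion z-test in Python. The conversion counts are made up for illustration; in practice your testing tool performs an equivalent calculation.

```python
# Two-proportion z-test sketch; the counts below are made-up illustration data.
from scipy.stats import norm

def ab_p_value(conversions_a, visitors_a, conversions_b, visitors_b):
    """Two-sided p-value for the difference in conversion rates between A and B."""
    p_a = conversions_a / visitors_a
    p_b = conversions_b / visitors_b
    p_pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    se = (p_pooled * (1 - p_pooled) * (1 / visitors_a + 1 / visitors_b)) ** 0.5
    z = (p_b - p_a) / se
    return 2 * norm.sf(abs(z))   # two-sided p-value

p = ab_p_value(conversions_a=480, visitors_a=10_000, conversions_b=540, visitors_b=10_000)
print(f"p-value: {p:.4f}  (below 0.05 means significant at the 95% level)")
```

With these made-up numbers the variant looks promising, yet the p-value lands just above 0.05, exactly the kind of result that tempts teams to declare a winner too early.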

A major mistake is stopping A/B tests too early and launching unproven changes before reaching statistical significance. This often happens because of limited traffic and test duration. 

However, without significance, you can end up making the wrong decision, concluding that version B outperformed when random fluctuations were the real cause. I’ve seen companies waste thousands re-developing and implementing changes that made no real impact.

The consequences can be dire. You may launch losing variants and miss out on the potential revenue lift of the winning version. Development and marketing teams waste immense effort and resources implementing meaningless changes. False positives from lack of significance lead teams down the wrong path.

Hear from Phil Cave (Senior Optimizer):

‘’If you don’t have enough traffic/conversions going through a test, then you won’t get a statistically significant result, meaning there’s a greater chance of a ‘false positive’ (and you’d implement something that was ultimately a loss to your site). Test for at least 14 days and aim for 90%+ stat sig.’’

To avoid this:

  • Precisely determine sample size needs upfront using power analysis tools based on baseline metrics.
  • Set minimum test durations – often 2-4 weeks for significance (see the sketch after this list). Use sequential testing to reach significance faster.
  • Resist the temptation to get excited by initial promising data, which can swing rapidly.
  • Have the patience and discipline to let the test run fully before taking action.
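
As a rough illustration of setting a minimum duration upfront, here is a small sketch that turns a required per-variant sample size (assumed to come from a power analysis, as sketched under Mistake 4 below) into a planned run time. The traffic and sample figures are hypothetical.

```python
# Rough minimum-duration estimate from an assumed power-analysis result.
import math

def minimum_test_duration_days(required_per_variant, daily_visitors, variants=2):
    """Days needed to fill each variant, rounded up to whole weeks, never under 14."""
    per_variant_per_day = daily_visitors / variants
    days = math.ceil(required_per_variant / per_variant_per_day)
    return max(14, math.ceil(days / 7) * 7)   # full weeks, to smooth weekday effects

# Hypothetical inputs: 6,200 users needed per variant, 800 eligible visitors per day.
print(minimum_test_duration_days(required_per_variant=6_200, daily_visitors=800))  # 21
```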

Mistake 3: Testing Too Many Variables At Once

The number one rule in A/B testing is to isolate variables. Changing multiple elements at once muddles results and wastes testing potential.

Don’t make the mistake of throwing in that headline change and restyling buttons simultaneously. Did the headline lift conversions or the new buttons? You end up unsure of what exactly to implement going forward.

Instead, be methodical. Test the new headline against the old first. Keep everything else exactly the same. Once you confirm the winning variation, roll it out site-wide. Then test the button change separately. 

Gradual optimization through focused, incremental testing generates the clearest data and highest ROI.

Resist impatience to test too many variables together. Prioritize the one or two highest impact changes. Isolate them in clean A/B tests. Through this approach, you will gain visibility into what is truly moving the needle for conversions. And you’ll avoid misinterpreting results due to muddled variables.

Phil Cave added this:

‘’If you make 10 changes on a page, test it and it wins, how do you know which of those 10 changes made the difference? Chances are it was only a few of them that caused the uplift, with a few more doing nothing and some of the others causing a loss. Testing individual things (through multiple variants for speed, rather than multiple tests) is a much surer way of being able to confidently only implement wins.’’

Mistake 4: Neglecting The Importance Of Sample Size

Sample size is one of the most crucial elements in designing statistically valid A/B tests. Too small of a sample leads to highly unreliable results prone to noise and normal random fluctuations.

For example, testing a new homepage design against the current homepage with just 500 users could produce entirely skewed data on conversions, bounce rate, session duration etc. Those metrics could swing wildly in either direction in such a tiny sample.

This mistake often provides false positives, where a poorer performing page appears to beat the original simply due to chance variance. It can also lead to false negatives, where the higher converting version is incorrectly rejected.

To determine an appropriate sample size, you can use calculators. Plug in your baseline conversion rate, minimum desired effect size, confidence level and other factors. This will estimate the minimum sample needed to achieve statistical significance.
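
If you want to see what such a calculator does under the hood, here is a minimal sketch of the standard two-proportion sample size formula in Python. The baseline rate and target lift are hypothetical inputs.

```python
# Standard two-proportion sample size estimate; the inputs below are hypothetical.
import math
from scipy.stats import norm

def sample_size_per_variant(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate visitors needed in each variant to detect the given relative lift."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = norm.ppf(1 - alpha / 2)   # ~1.96 for 95% confidence
    z_beta = norm.ppf(power)            # ~0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# Hypothetical inputs: 3% baseline conversion rate, detecting a 15% relative lift.
print(sample_size_per_variant(baseline_rate=0.03, relative_lift=0.15))
```

With those assumed inputs the estimate lands around 24,000 visitors per variant, which is why the scale described next is typical.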

For most website tests, you’ll need thousands or tens of thousands of users in each variant. Build up traffic through email campaigns, social media, paid ads and other channels. Guide users into your test funnel. Don’t even look at data until you’ve reached the needed sample.

Proper sampling should detect true differences between A and B, not temporary fluctuations. It’s worth taking the time to calculate sample size needs rather than wasting testing time and effort with unreliable data.

Mistake 5: Running A/B Tests On Low-Traffic Sites

A/B testing is only effective if implemented on sections of your website or app with enough traffic volume to produce statistically significant results. Running tests where there are too few visitors is one of the most common rookie mistakes.

Low traffic levels lead to small sample sizes, often just a few hundred users in each variant. With such limited data, any metric differences seen between versions A and B are unlikely to be real. The smaller the samples, the more susceptible results are to normal data fluctuations.

For example, testing two longform sales page layouts with only 50 visitors per week will take many months to reach conclusive sample sizes. In the meantime, the data will swing wildly based on chance. You may falsely conclude a poorer performing page is better.

Only conduct A/B testing on high-traffic areas like your homepage, product pages, shopping cart flows, etc. Use power analysis tools to estimate the minimum visitors needed based on conversion rates. If existing traffic is inadequate, build volume through marketing campaigns before testing.

Attempting to test without sufficient traffic undermines the entire optimization process. Without statistical power from larger samples, you cannot trust the results. Be patient, drive more visitors to key funnels first, and then begin testing armed with the traffic to obtain significant data.

Mistake 6: Overlooking External Factors

A/B tests do not occur in perfect isolated conditions – many external or environmental factors can influence results. Failing to properly account for these overlooked variables is a major testing mistake that can severely skew data and decisions.

For example, running a homepage test over the holidays could see conversions spike across all variations due to seasonal traffic surges, masking poorer performance of Version B. Without considering timing and seasonality, you might launch the worse-performing variation site-wide.

Other examples include:

  • Site outages or technical issues that may suppress conversion rates for part of a test.
  • Marketing campaigns or special promotions that bring in different audience segments.
  • Competitor actions like sales, coupons, or launches that influence shopping behavior.
  • Current events, news, or cultural trends that impact user intent and actions.

To mitigate the influence of external factors, carefully segment test data by parameters like traffic source, geography, and demographics. Look for patterns tied to outside events. Extend test durations and timeframes to smooth out effects. Use statistical tools to analyze variability and anomalies.

Accounting for real-world external factors is crucial for drawing accurate conclusions from A/B tests. Consider running simultaneous identical tests to identify external impacts on conversion. Controlling for confounding variables ensures your test results reflect the true effects of your changes.

Mistake 7: Relying Too Heavily On A/B Testing

A/B testing is a valuable optimization tool, but over-reliance on it can severely limit your success. Effective optimization requires integrating A/B testing into a holistic methodology using many data sources and insights.

Solely looking at A/B test data blinds you to the bigger picture. User surveys can uncover engagement and satisfaction issues that conversions do not reveal. Focus groups provide qualitative insights into problems with messaging and content that tests miss.

An obsessive focus on incremental test results also risks creating a culture that is reactive and metric-driven instead of guided by customer needs and the user experience. True optimization requires a deep, empathetic understanding of your users.

Use A/B testing judiciously to validate ideas and refine specific elements, not decide strategy and direction. Complement tests with heuristics audits, usability studies, expert analysis and other research to gain a well-rounded view. This helps drive transformational innovation, not just incremental gains.

Adopt an optimization methodology involving upfront research, creative ideation, rapid prototyping and experimentation. A/B tests tell you if a button color improved conversions. But holistic analysis of the customer journey should shape how you reinvent the overall registration process.

Find the right balance of data and creativity. Let A/B testing optimize page details while insights from researchers, designers, copywriters and other experts guide the big picture experience. An integrated optimization approach leads to the greatest impact.

Mistake 8: Not Segmenting Test Data

Proper segmentation is essential for accurately interpreting A/B test results. Never rely on aggregate data alone – always break down test data by relevant visitor cohorts.

Slice data by parameters like device type, geography, new vs returning visitor, traffic source, demographics, and more. Analyze the metrics separately for each segment. You’re likely to find significant performance variations.

For example, one homepage variant may lift conversions among mobile users while another works better for desktop. Segmenting by device would expose these differences while aggregate data could mask poor mobile performance.
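
As a sketch of that device-level breakdown, assuming a flat visitor log with hypothetical variant, device, and converted columns, a pandas groupby makes the comparison easy to read:

```python
# Segment-level breakdown sketch; the DataFrame and column names are hypothetical.
import pandas as pd

log = pd.DataFrame({
    "variant":   ["A", "B", "A", "B", "A", "B"],
    "device":    ["mobile", "mobile", "desktop", "desktop", "mobile", "desktop"],
    "converted": [0, 1, 1, 1, 0, 0],
})

overall = log.groupby("variant")["converted"].mean()                        # aggregate view
by_device = log.groupby(["device", "variant"])["converted"].agg(["mean", "count"])
print(overall)
print(by_device)
```

Reading the per-device table alongside the aggregate view is what exposes differences like the mobile/desktop split described above.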

Plan your segmentation strategy upfront before launching tests. Dig into the data to understand variances between segments. Look for optimization opportunities to improve specific underperforming groups.

Beware of misleading aggregate data that can hide segment-level insights. A version may appear to lift revenue overall but actually suppress conversion rates among key user segments. Granular analysis is vital.

Mistake 9: Basing Your Data Only on GA and Other Analytics Tools

Analytics platforms like Google Analytics offer a seductive abundance of data. However, an over-reliance on analytics metrics to guide A/B testing decisions is a recipe for failure. The data lacks context and qualitative insights about the why behind user behaviors.

While analytics provide the what through metrics like bounce rates and conversions, you remain blind to the underlying reasons driving the numbers without deeper research. Surveys, interviews and observational studies reveal motivations, problems and needs that analytics tools cannot uncover.

Analytics data and reports also provide limited strategic context. Experts must interpret changes in key metrics and connect insights to overarching goals. Testing decisions should align to a clear optimization strategy, not just react to incremental data points.

To drive transformational innovation, A/B testing requires a truly holistic data approach. Build a comprehensive optimization data stack combining quantitative and qualitative sources: analytics for metrics, qualitative research for insights, subject matter expertise for strategic context.

Effective A/B testing derives from the complete voice of the customer. It connects present observations with future ambitions. Testing grounded solely in myopic analytics data leads to misguided incremental changes. But holistic inputs drive high-impact optimization guided by wisdom.

Hear from Phil Cave:

‘’GA (and other tools) only tell you where a problem is – not how to fix it. That can only come from understanding your customers. Surveys, interviews, work groups, user testing – these are time-consuming and can be difficult to get right. But that’s where you can unearth the gold.’’

Mistake 10: Trying To Make Sure Every Test Is Pixel-Perfect

The temptation always exists to perfect A/B test variations before launching, but this pursuit of pixel perfection severely impedes optimization efforts. Prioritize speedy experimentation over flawless execution.

Remember, the core goal of early testing is accelerating learnings through rapid prototypes, not showcasing visual design excellence. Launch simplified MVP variation experiences to gain quick directional data.

Obsessively polishing branding, micro-interactions, and style guide alignment may take weeks when that time is better spent testing new ideas. Refine visual details later after data reveals what resonates.

Focus engineering and design bandwidth on nailing core functionality first. Build just enough to test the concept quickly. Pixel-pushing the details can wait for later design iterations.

Maintain reasonable quality standards but avoid needless rounds of feedback and approvals. Ship the core idea faster to start spinning the optimization flywheel. Prioritize major structural changes over micro copy tweaks.

Hear from Phil Cave:

‘’If you try and make every test pixel perfect and 110% on brand, you’ll only run 1 or 2 tests per month and risk losing revenue. Spend time getting the wins looking right… not the losses.’’

Frequently Asked Questions on Common A/B Testing Mistakes

1. Q: What is a common mistake when setting up A/B test variations?

A: Not keeping variants realistic. Designs should represent credible options, not exaggerated versions just to drive a difference.

2. Q: Why is having a single, clear hypothesis important?

A: It focuses the test on a specific expected outcome instead of launching aimless tests or mining data to rationalize results.

3. Q: When should you end an A/B test?

A: After it reaches statistical significance or completes the predetermined duration per your test protocol to avoid stopping early based on normal data fluctuations.

4. Q: How much should you change between the control and variant?

A: Limit to one core change rather than multiple variables. Isolate the impact of each change in separate tests.

5. Q: What’s a mistake with sample sizes?

A: Using samples that are too small. Ensure enough traffic over the test duration to reach statistical confidence in the results.

Final Thoughts

Is your CRO programme delivering the impact you hoped for?

Benchmark your CRO now for an immediate, free report packed with ACTIONABLE insights that you and your team can implement today to increase conversions.

Takes only two minutes

If your CRO programme is not delivering the highest ROI of all of your marketing spend, then we should talk.