A/B testing for designers: an algorithm that produces honest results

A/B testing is the most commonly misused tool in product development. Not because it is complicated, but because it is used to confirm a decision that has already been made rather than to find the truth.

“We ran an A/B test and the red one won.” A month later it turns out that the test ran for 4 days, the sample was 200 people, there was no statistical significance, and the “victory” of red was random noise.

This article is an algorithm that gives honest results. Not fast. Not comfortable. Honest.


What is an A/B test and why is it needed

An A/B test is a controlled experiment in which two versions of a product (or one element) are shown to two random groups of users and the results are compared.

Variant A (control) is the current version. Variant B (treatment) is a new version with a change.

The goal is to find out which option is better by a pre-selected metric. Not "seems better." Not "the team likes it." But with measurable confidence that the difference is not due to chance.

**When you need an A/B test:**

  • When there is a specific hypothesis about what can be improved
  • When the change is small enough to test a single item
  • When there is enough traffic for a statistically significant result
  • When the metric is clearly defined before the test begins

When an A/B test is not required:

  • When the problem is obvious without the test (broken UX, errors)
  • When the change is radical (complete redesign) – here you need usability tests
  • When there is too little traffic (the risk of getting a random result)
  • When you do not have the resources to do the test correctly

Step 1: The hypothesis is the most important step

Most bad A/B tests fail before launch, at the hypothesis stage.

**Bad hypothesis:** Let's try a blue button instead of the green one.

**Good hypothesis:** We think users don’t click on the CTA button because it doesn’t stand out visually. If we make it larger and higher-contrast, the click-through rate will increase by at least 15%.

The structure of a good hypothesis is [observation of the problem] + [presumed cause] + [proposed solution] + [expected effect].

**Why it matters:** A hypothesis that names a cause, not just a change, lets you learn regardless of the outcome. The test didn't work? Then the cause was something else, and that is knowledge too. The blue button test teaches you nothing if it doesn’t work.


Step 2: The metric, and only one primary

Before starting the test, you need to select one primary metric. Just one. The decision will be made on this metric.

The primary metric is the one you want to improve. It should be a metric from the product metric tree that is tied to an actual business outcome.

Examples:

  • Conversion from viewing landing to registration
  • Click-through rate on the CTA button
  • Completion rate of registration form
  • Task success rate on key flow

**Secondary (guardrail) metrics** are the ones you keep an eye on so that option B doesn't "win" by making something else worse.

Example: you are simplifying the registration form. The primary metric is the form completion rate. The guardrail is D7 retention. Otherwise you may get more registrations but lower-quality users.

**Never pick a winner on a secondary metric if the primary metric did not move.** This is a form of “p-hacking”, and it leads to false conclusions.


Step 3: Sample size - count before launch

This is the step that is skipped most often, and it is what determines whether the test is honest.

**Why you can't just "check in a week"**

If the test is stopped at a convenient moment (when one option has “won”), the result is very likely random noise. This is called the “peeking problem” or “optional stopping bias.”

How to calculate the desired sample size:

You need three inputs:

  1. Baseline conversion rate - the current value of the metric (e.g. 5%)
  2. Minimum detectable effect (MDE) - the smallest change you care about (e.g. +20% relative, from 5% to 6%)
  3. Statistical power and significance level - usually 80% power and α = 0.05

Use a sample size calculator, such as Evan's Awesome A/B Tools or the Optimizely Sample Size Calculator.

**Example of calculation:**

  • Current conversion rate: 5%
  • Expected effect: +20% relative (i.e. 6%)
  • Statistical power: 80%, α = 0.05 (two-sided)
  • Sample: ~8,200 users per group (~16,400 total)

If your site gets 200 users a day, you need to run the test for about 82 days. If that is not acceptable, either test something with a larger expected effect or make the decision another way.
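The calculation can also be reproduced by hand. The sketch below assumes the common normal-approximation formula for a two-sided test of two proportions; online calculators use slightly different variants, so expect small differences. The function names are illustrative, not taken from any tool. For the 5% → 6% example it gives a little under 8,200 users per group.

```python
import math
from statistics import NormalDist


def sample_size_per_group(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate users needed in EACH group (two-sided test, normal approximation)."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)          # conversion rate we hope to detect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)


def days_needed(n_per_group, daily_users, groups=2):
    """Days required if all daily traffic is split evenly between the groups."""
    return math.ceil(n_per_group * groups / daily_users)


n = sample_size_per_group(baseline=0.05, relative_mde=0.20)
print(f"per group: {n}, days at 200 users/day: {days_needed(n, daily_users=200)}")
```

Halving the MDE roughly quadruples the required sample, which is why very small expected effects are rarely testable on low traffic.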


Step 4: Separating traffic

The standard split is 50/50. Each user is randomly assigned to Group A or Group B.

**Important conditions:**

  • The division must be random (not "even vs. odd IDs" unless the ID is random)
  • One user always sees the same option (does not change on each visit)
  • Group A users should not interact with Group B in ways that affect the outcome (difficult for social products)

**Splitting tools:**

  • Google Optimize (used to be the standard for the web, but was shut down in 2023, so alternatives are needed)
  • Amplitude Experiment
  • LaunchDarkly
  • GrowthBook (open source)
  • VWO
  • Optimizely

Or your own implementation through feature flags, if the infrastructure exists.
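For a home-grown split, one common approach is to hash the user ID together with the experiment name. This is a minimal sketch under that assumption; `assign_variant` and the experiment name are hypothetical, not part of any tool listed above. Hashing keeps the assignment effectively random across users while guaranteeing that the same user always sees the same option.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, split: float = 0.5) -> str:
    """Deterministically map a user to 'A' or 'B' for a given experiment."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Turn the first 8 bytes of the hash into a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "A" if bucket < split else "B"


# The same user gets the same variant on every visit.
print(assign_variant("user-42", "cta-button"))
print(assign_variant("user-42", "cta-button"))
```

Including the experiment name in the hash means a user who lands in group B of one test is not automatically in group B of every other test.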


Step 5: Test duration

The minimum period is until the calculated sample size is reached. Plus a few rules:

**At least 1-2 full weeks.** User behavior is different on Monday and Sunday. You need to cover a full weekly cycle at least once.

**Don't stop early.** Even if it seems that the winner is already clear. The peeking problem is a real threat to validity.

**Consider seasonality.** A test run during holidays, sales, or significant external events is not representative.

**Don't change the test while it is running.** If you change the design of option B in the middle of the test, the data before and after are not comparable.


Step 6: Analysis of results

Analysis comes after the test is finished. Not before.

Statistical significance

A result is called "significant" at p-value < 0.05. This means that if there were really no difference between the variants, a difference at least this large would appear by chance in fewer than 5% of such experiments.

Most tools calculate this automatically. But you have to understand what it means:

  • p < 0.05 - the result is statistically significant and can be discussed
  • p = 0.07 - not statistically significant; no conclusion about a difference can be drawn
  • p = 0.01 - strong evidence; a difference this large would be very unlikely under pure chance
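To see where the p-value comes from, here is a minimal sketch assuming a pooled two-proportion z-test, which is one standard choice; your testing tool may use a different method. The function name and the example counts are illustrative.

```python
import math
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)   # rate if A and B were identical
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# 5% vs 7% on only 200 users per group: far from significant.
print(two_proportion_p_value(10, 200, 14, 200))
# 5% vs 6% on 8,000 users per group: significant at alpha = 0.05.
print(two_proportion_p_value(400, 8000, 480, 8000))
```

The first call mirrors the "200 people" situation from the introduction: a visible gap in conversion rates that the data cannot distinguish from noise.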

Effect size

Statistical significance says “it’s not accidental.” The effect size says "how much it matters."

The increase in conversion from 5% to 5.1% at p < 0.05 is statistically significant, but practically insignificant. The increase from 5% to 7% is both statistically and practically significant.

Always look at both parameters.

Confidence interval

The test result is not an exact number, but a range. “Conversions increased by 20% ± 8%” means the real effect is likely between 12% and 28%.

A narrow confidence interval means high certainty. A wide one means a lot of uncertainty.
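Effect size and confidence interval can be computed together. The sketch below assumes a simple normal-approximation (Wald) interval for the absolute difference in conversion rates; the function name and the counts are illustrative. Note that it reports the interval in absolute terms (percentage points), alongside the relative lift.

```python
import math
from statistics import NormalDist


def lift_with_ci(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Relative lift of B over A plus a confidence interval for the absolute difference."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return {
        "relative_lift": diff / p_a,
        "diff": diff,
        "ci_low": diff - z * se,
        "ci_high": diff + z * se,
    }


result = lift_with_ci(400, 8000, 480, 8000)
print(result)
```

If the interval excludes zero, the result is significant at the matching level; how far its lower bound sits from zero tells you whether the effect is also practically meaningful.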


Step 7: Making the decision

After the analysis there are three possible outcomes:

Variant B won (p < 0.05, practically significant effect): Ship it. Record the winning parameters and update the baseline for future tests.

Variants A and B do not differ (no significant difference): It's not a failure, it's a result. The hypothesis was not confirmed. You need to understand why and formulate a new hypothesis.

Variant B lost (conversion dropped): Don't ship it. This is a particularly valuable result: it shows that intuition was wrong.


Common mistakes in A/B tests

Mistake 1: Stopping at the peak

You look at the data every day. On Wednesday you see that option B is already winning. You stop the test.

In 50% of cases, this is a random fluctuation. By Friday, the picture could have changed.

Solution: Determine the sample size before starting and don’t act on intermediate results.
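The effect of peeking can be shown with a small simulation. This is an illustrative sketch, not a proof: it runs many A/A tests (both groups are identical, so every "winner" is a false positive) and compares stopping at the first daily peek with p < 0.05 against checking only once at the end. The traffic and conversion numbers are assumptions chosen for the example.

```python
import math
import random
from statistics import NormalDist


def p_value(conv_a, conv_b, n):
    """Two-sided z-test p-value for two equal-sized groups of n users each."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(pooled * (1 - pooled) * 2 / n)
    if se == 0:
        return 1.0  # no variation in the data yet: nothing to compare
    z = (conv_b - conv_a) / n / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def run_aa_test(rng, days=14, users_per_day=200, rate=0.05, alpha=0.05):
    """Simulate one A/A test; return (significant on any daily peek, significant at the end)."""
    conv_a = conv_b = n = 0
    crossed_on_a_peek = False
    p = 1.0
    for _ in range(days):
        n += users_per_day
        conv_a += sum(rng.random() < rate for _ in range(users_per_day))
        conv_b += sum(rng.random() < rate for _ in range(users_per_day))
        p = p_value(conv_a, conv_b, n)
        if p < alpha:
            crossed_on_a_peek = True
    return crossed_on_a_peek, p < alpha


def false_positive_rates(runs=400, seed=1):
    rng = random.Random(seed)
    results = [run_aa_test(rng) for _ in range(runs)]
    peeking = sum(peek for peek, _ in results) / runs
    fixed = sum(final for _, final in results) / runs
    return peeking, fixed


peeking_rate, fixed_rate = false_positive_rates()
print(f"false positives with daily peeking: {peeking_rate:.0%}, with one final look: {fixed_rate:.0%}")
```

With a single final look the false positive rate stays near the nominal 5%, while stopping at the first "significant" peek produces false winners several times more often.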

Mistake 2: Too many variants

An A/B/C/D/E test runs five options at once. The more options there are, the higher the probability of a false-positive result (the multiple comparisons problem).

With five independent comparisons at α = 0.05, the probability of at least one false winner is about 23%.

**Solution:** Test a maximum of 2-3 options. If you need more, use the Bonferroni correction or lower α.
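The 23% figure follows from the probability that none of the comparisons gives a false positive. A short sketch, assuming the comparisons are independent (the function names are illustrative):

```python
def family_wise_error_rate(comparisons, alpha=0.05):
    """Chance of at least one false positive across independent comparisons."""
    return 1 - (1 - alpha) ** comparisons


def bonferroni_alpha(comparisons, alpha=0.05):
    """Per-comparison threshold that keeps the overall error rate at or below alpha."""
    return alpha / comparisons


print(f"{family_wise_error_rate(5):.1%}")   # about 22.6%
print(bonferroni_alpha(5))                  # each comparison must reach p < 0.01
```

The Bonferroni correction is conservative: it protects against false winners at the cost of needing a larger sample to detect real ones.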

Mistake 3: Testing too small a change

You changed the button color from #007AFF to #0070F3. The difference is almost invisible. It would take millions of users to detect it.

**Solution:** Test changes with an expected effect of at least 10-20%. Anything smaller is not worth the time.

Mistake 4: Ignoring segmentation

“Option B wins”, but the entire gain comes from mobile users, and on desktop the result is neutral or worse.

Solution: After completing the test, segment the results: mobile / desktop, new / returning, different pricing plans. Insights are often found in segments rather than in aggregated data.
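A per-segment breakdown can be as simple as the sketch below. The segment names and all numbers are hypothetical, made up purely to illustrate a case where the overall gain hides a neutral-to-negative desktop result.

```python
def segment_report(segments):
    """segments: {name: {"A": (conversions, users), "B": (conversions, users)}}"""
    report = {}
    for name, groups in segments.items():
        conv_a, n_a = groups["A"]
        conv_b, n_b = groups["B"]
        rate_a, rate_b = conv_a / n_a, conv_b / n_b
        report[name] = {
            "rate_a": rate_a,
            "rate_b": rate_b,
            "relative_lift": (rate_b - rate_a) / rate_a,
        }
    return report


# Hypothetical data for illustration only.
data = {
    "mobile": {"A": (200, 4000), "B": (260, 4000)},
    "desktop": {"A": (200, 4000), "B": (196, 4000)},
}
for name, row in segment_report(data).items():
    print(f"{name}: A={row['rate_a']:.1%} B={row['rate_b']:.1%} lift={row['relative_lift']:+.0%}")
```

Each segment has a smaller sample than the whole test, so segment-level differences are best treated as hypotheses for a follow-up test rather than as conclusions.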

Mistake 5: Novelty effect

Users click on option B simply because it is new. The effect fades in 2-3 weeks.

For products with a large share of regular users, this is a real problem. The test must run long enough for the novelty effect to pass.

Mistake 6: No learning from the test

The test is over. The winner is shipped. No one asked why it won.

**Solution:** Each test should end with an entry in the knowledge base: what was tested, the hypothesis, the result, and an explanation of why.


Alternatives to the A/B test

The A/B test is not the only way to test a hypothesis.

**For low traffic:** Five-second test (the user looks at the screen for 5 seconds and says what they remember), tree test (navigation check), first-click test.

**Usability test:** A test with 5 users will reveal about 80% of problems in 2 days. It does not give statistics, but it gives an understanding of why.

**Shadow test:** Show a new design to a small percentage of real users, but without full A/B analysis.

**Fake door test:** Add a button to the interface for a feature that does not exist yet, and see how many people click it. If many do, the feature is needed.


A/B test algorithm: cheat sheet

1. Hypothesis
[Observation] → [Cause] → [Solution] → [Expected effect]

2. Metric
One primary + 1-3 guardrail metrics

3. Sample size
Calculate with a calculator before launch
Baseline CR + MDE + power 80% + α 0.05

4. Launch
50/50 split, random assignment
One user = one option

5. Duration
Not less than the calculated sample requires
At least 2 full weeks
Do not act on intermediate results

6. Analysis
p-value < 0.05?
Practically significant effect?
Guardrail metrics intact?
Segmentation?

7. Decision
B won → ship + record the conclusion
No difference → new hypothesis
B lost → do not ship + understand why

8. Documentation
Hypothesis + result + explanation → in the knowledge base

When the test is "dishonest": red flags

  • The sample size was not calculated in advance
  • The test was stopped "because the winner is already visible"
  • The metric was selected after launch or changed along the way
  • The result was picked from several metrics (the one that showed the desired outcome "won")
  • The test ran for less than 7 days
  • Traffic was redistributed while the test was running
  • There is no breakdown by segment

If any of these apply, the test is dishonest. The result cannot be used as a basis for a decision.


How to Create an A/B Testing Culture in a Team

The A/B test is not a one-off experiment. It's a way of thinking. Teams that have a high rate of experimentation consistently outperform competitors because they learn faster.

Amazon conducts thousands of A/B tests a year. It’s not a dedicated team of “testers”; it’s a culture where every change is an experiment, not an act of faith.

What it looks like in practice

**Velocity as a metric.** Teams with a strong testing culture track not only test results, but also the number of tests. More tests = more opportunities to improve.

**Learning repository.** All tests, including failed ones, are documented. After a year, this turns into an invaluable knowledge base: "We've already tried removing the Phone field. Conversions dropped because our users want a call."

**Democratization.** Not only an analyst can run a test. A designer with a hypothesis comes to the analyst with a ready brief rather than waiting in line. A hypothesis template and a launch checklist lower the barrier.

**No-blame culture.** A test that doesn't work is not a failure. It's knowledge that wasn't there before the test. Teams where "bad test result = someone screwed up" stop experimenting with risky hypotheses.


Prioritization: Which Hypotheses to Test First

There are always more test ideas than resources. You need a prioritization system.

ICE framework

ICE = Impact × Confidence × Ease

Each hypothesis is evaluated by three parameters from 1 to 10:

  • Impact: how large the effect will be if the hypothesis succeeds
  • Confidence: how sure you are that it will work
  • Ease: how easy it is to run (technically and in terms of resources)

Multiply the three numbers to get the priority. Start with the highest scores.

**Example:**

| Hypothesis | Impact | Confidence | Ease | ICE Score |
|---|---|---|---|---|
| Remove the "Phone" field from the form | 7 | 8 | 9 | 504 |
| Add social proof to the landing page | 6 | 7 | 8 | 336 |
| Completely redo onboarding | 9 | 5 | 3 | 135 |
| Change the button color | 3 | 4 | 9 | 108 |

Start with the first two: they have a high ICE score and are realistic to run.
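The same ranking can be kept in a small script or spreadsheet. A minimal sketch using the scores from the example above (the data structure and function name are illustrative):

```python
hypotheses = [
    # (hypothesis, impact, confidence, ease)
    ("Change the button color", 3, 4, 9),
    ("Completely redo onboarding", 9, 5, 3),
    ("Remove the Phone field from the form", 7, 8, 9),
    ("Add social proof to the landing page", 6, 7, 8),
]


def ice_score(impact, confidence, ease):
    """ICE = Impact x Confidence x Ease, each scored from 1 to 10."""
    return impact * confidence * ease


ranked = sorted(hypotheses, key=lambda h: ice_score(*h[1:]), reverse=True)
for name, impact, confidence, ease in ranked:
    print(f"{ice_score(impact, confidence, ease):>4}  {name}")
```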

Additional Filtering: Strategic Importance

ICE is a tactical tool. Sometimes you need to test not what is easier, but what is more strategic.

If the business priority is to lower CAC, even a low-Ease test that could improve the landing page is more important than an easy test elsewhere in the funnel.


Types of Tests: Beyond Classical A/B

Classic A/B (one element, two variants) is not the only format.

Multivariate Test (MVT)

Test several elements at the same time and all their combinations.

**When to use:** When you need to check how elements interact. For example, how the headline and the image affect each other.

**Problem:** You need much more traffic, because the number of variants grows multiplicatively: 3 headline options × 3 image options = 9 combinations.
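The growth in variants is easy to enumerate. A short sketch with placeholder element names (hypothetical), assuming each combination needs the same per-variant sample as a plain A/B group:

```python
import math
from itertools import product

headlines = ["headline_1", "headline_2", "headline_3"]
images = ["image_1", "image_2", "image_3"]

# Every headline paired with every image.
combinations = list(product(headlines, images))


def total_users_needed(n_per_variant, *element_options):
    """Total traffic if every combination needs n_per_variant users."""
    variants = math.prod(len(options) for options in element_options)
    return variants * n_per_variant


print(len(combinations))
print(total_users_needed(8000, headlines, images))
```

Adding a third element with three options would raise the count to 27 combinations, which is why MVT is usually reserved for high-traffic pages.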

Split URL test

Two different pages on different URLs. A more radical change than replacing one element.

**When to use:** To test different landing page concepts, different onboarding flows, or different page structures.

Bandit algorithm

Instead of a fixed 50/50 split, the system automatically redistributes traffic to the winning option.

Plus: It minimizes losses from the worse-performing variant. Minus: The results are harder to interpret statistically. Suitable for quick decisions, not for fundamental conclusions.
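One widely used bandit strategy is Thompson sampling; the text above does not name a specific algorithm, so treat this as one possible illustration. The class name, the simulated conversion rates, and the visitor count are all assumptions made for the example.

```python
import random


class ThompsonBandit:
    """Beta-Bernoulli Thompson sampling over a set of variants."""

    def __init__(self, variants, rng=None):
        # [successes + 1, failures + 1], i.e. a uniform Beta(1, 1) prior.
        self.stats = {v: [1, 1] for v in variants}
        self.rng = rng or random.Random()

    def choose(self):
        # Sample a plausible conversion rate for each variant, show the best one.
        samples = {v: self.rng.betavariate(a, b) for v, (a, b) in self.stats.items()}
        return max(samples, key=samples.get)

    def update(self, variant, converted):
        self.stats[variant][0 if converted else 1] += 1


def simulate(true_rates, visitors=5000, seed=7):
    """Count how often each variant is shown when the true rates are known."""
    rng = random.Random(seed)
    bandit = ThompsonBandit(list(true_rates), rng)
    shown = {v: 0 for v in true_rates}
    for _ in range(visitors):
        variant = bandit.choose()
        shown[variant] += 1
        bandit.update(variant, rng.random() < true_rates[variant])
    return shown


shown = simulate({"A": 0.05, "B": 0.15})
print(shown)
```

Because traffic shifts toward the apparent leader, the weaker variant ends up with a small sample, which is exactly why conclusions about it are harder to defend statistically.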


When to stop the test early (and when it is permissible)

The standard rule is: “Do not stop the test until you get the estimated sample size.” But there are exceptions.

Permissible to stop earlier:

  • Option B is clearly harmful: conversions are down 50%+, users are complaining, the key flow is broken. In this case, continuing the test is unethical and unprofitable.
  • An external event made the data irrelevant: a payment system failure, a viral publication, a technical incident. The data is compromised, so the test must be stopped and restarted.

Not acceptable reasons to stop early:

  • "I think we're seeing results."
  • Management pressure ("we need a decision by Friday")
  • An interim p-value < 0.05 before the calculated sample is reached

If you need a decision by Friday, an A/B test is not the right tool. Use an expert assessment (PURE) or a usability test.


Meta-learning: What to do when most tests show no effect

The standard picture in mature products: 20-30% of tests show a significant effect. The others don't.

This is normal. A more mature product is harder to improve, because all the "obvious" problems have already been fixed.

What to do

  • Work with larger hypotheses. Small changes (button color, wording) are less and less effective. Hypotheses of the level “change the structure of onboarding” or “test another value proposition” are needed.
  • Look for insights in segments. A test with no overall effect can have a strong effect on a particular segment. Break the results down.
  • When A/B tests are silent, it is a signal that deeper research is needed: interviews, usability tests, customer journey analysis.
  • Reconsider the metric. Maybe the metric is wrong. You test the click rate on a CTA, but what really matters is conversion to paying users within 30 days. Improving the click rate without a link to the actual result is a false victory.

AI and A/B Tests: How Claude Helps at Each Stage

An A/B test goes through several stages: hypothesis, calculation, analysis, decision. At each stage, AI speeds things up.

Prompt: formulate the hypothesis correctly

I noticed: [observation from data or research]

Help me formulate an A/B hypothesis using this structure:
[Observation of the problem] → [Presumed cause] → [Proposed solution] → [Expected measurable effect]

Also suggest:
- Primary test metric
- 2-3 guardrail metrics (which must not get worse)
- Potential risks of the hypothesis (why it may not work)

Prompt: Calculate sample size and test duration

I need to calculate the parameters of an A/B test:

Current baseline conversion rate: [%]
Minimum detectable effect (MDE): [%] (e.g., I want to raise conversion from 5% to 6% – MDE = 20%)
Statistical power: 80% (standard)
Significance level: 0.05 (standard)
Current daily traffic in the test: [number of users]

Calculate:
1. Required sample size for each group
2. Minimum duration of the test in days
3. How the numbers change if I want to detect a smaller effect (e.g. +0.5% instead of +1%)
4. Whether a test is worth running at all with this traffic, or whether another method is better

Prompt: Analyze the test results

Once the data is in:

Results of the A/B test:

Option A (control):
- Users: [number]
- Conversions: [number]
- Conversion rate: [%]

Option B (treatment):
- Users: [number]
- Conversions: [number]
- Conversion rate: [%]

Calculate:
1. Statistical significance (p-value)
2. The effect size (relative increase in %)
3. Confidence interval for the difference
4. Is the result statistically significant at α = 0.05?
5. Is the effect size meaningful for our business?

Make a recommendation: implement B, keep A, or collect more data?

Prompt: conduct post-hoc analysis by segment

The test is complete.

Now help me analyze the results by segment.

We have this breakdown:
- Mobile: [data for A and B]
- Desktop: [data for A and B]
- New users: [data]
- Returning users: [data]

For each segment:
1. Is there a statistically significant difference?
2. If the overall result is insignificant, is there a segment where B clearly wins?
3. Is there a heterogeneous treatment effect (B is good for one segment, bad for another)?

What does this mean for the implementation decision?

Prompt: document the test for the knowledge base

Help me write up the results of the A/B test as a card for the team’s knowledge base.

Test data:
- Hypothesis: [formulation]
- What was tested: [description of the change]
- Period: [dates]
- Sample: [numbers]
- Result: [A won / B won / no difference]
- Statistics: [p-value, effect size]

Format it as a card with these sections:
Hypothesis | Change | Metric | Result | Conclusion | Next step

The conclusion should explain not only WHAT happened, but WHY (or the most likely reason why).