The A/B test destroyed my favorite design. And that was the best lesson

I spent three weeks on this screen. I reworked the composition, tuned the tone of the illustrations, and argued with the product manager about the length of the headline. The design was praised at review, and in Figma it looked like the cover of a case study. I honestly thought it was the best thing I had done that quarter.

Then we ran the A/B test. My beautiful, hard-won, "right" variant lost to the boring control by a margin that could not be attributed to noise. Conversion dropped, time to the target action grew, and even in qualitative interviews users did not seem to notice that we had changed anything at all.

It hurt. And it was the best lesson of my career.

Why this failure is more important than ten successful releases

When a test wins, you rarely learn anything new. You confirm your hypothesis, check the box, move on. It feels good, but the muscle doesn't grow.

When a test loses—especially a design you’ve invested emotionally in—your internal model of how a product works breaks down. This model determines the quality of all future decisions. The more often it encounters reality, the more accurate it becomes.

Most designers experience moments like this as trauma. They get defensive: they look for errors in the test setup, blame developers for a sloppy implementation, or blame the product manager for choosing the wrong metric. Sometimes they are right. More often they are not. Then the only way to get anything out of the loss is to get past the defensive stage and start working out exactly what happened.

First mistake: I tested what I liked, not what bothered the user

If I look at my losing variant honestly, I was not solving a user problem. I was polishing visuals. I didn't like the density of the old screen, I didn't like the spacing, I didn't like the contrast of the CTA. So I redesigned it and packaged the result as an "improved perception" hypothesis.

It's a typical trap: "The new design is clearer, cleaner, more modern, so conversion will go up." Between "cleaner" and "conversion will go up" lies a chasm that the designer mentally jumps over without noticing.

Hypothesis anti-patterns that almost always lose or come out flat

  • "Make it modern" - without a specific behavioral problem that it solves
  • "Let's remove visual noise" - when the noise bothered you at review, not the user doing the task
  • "Add whitespace" - whitespace does not convert; clarity about the next step does
  • "Switch the illustrations to more branded ones" - that is work on brand metrics, not on the funnel
  • "Make the headline more inspiring" - inspiration rarely drives the click; clarity does

How to reformulate a hypothesis so that it is honest

A good hypothesis for A/B has three parts: observation, mechanism, expected effect on the metric.

  • Observation: "40% of users drop off at the plan selection step, and session recordings show that people do not understand the difference between the plans."
  • Mechanism: "If we show a comparison table with the key differences at the top, users will find a suitable plan faster."
  • Expected effect: "Conversion from the plan selection step to the payment step will increase."

If any one of the three parts is missing, you are testing taste, not the product.
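
One way to make the three-part format hard to skip is to write the hypothesis down as a structured record. The sketch below is only an illustration: the `Hypothesis` class and its field names are hypothetical, not part of any tool mentioned in this article.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Hypothesis:
    """One A/B hypothesis in the observation / mechanism / effect format."""
    observation: str      # what we saw in data or session recordings
    mechanism: str        # why the change should alter behavior
    expected_effect: str  # which metric should move, and in which direction

    def missing_parts(self) -> list[str]:
        """Return the names of parts that are empty or whitespace-only."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def is_testable(self) -> bool:
        """Worth testing only when all three parts are filled in."""
        return not self.missing_parts()


pricing = Hypothesis(
    observation="40% of users drop off at the plan selection step",
    mechanism="A comparison table at the top helps users find a plan faster",
    expected_effect="Conversion from plan selection to payment increases",
)
taste = Hypothesis(
    observation="",  # nothing observed: the idea came from the designer's taste
    mechanism="The new design is cleaner",
    expected_effect="Conversion increases",
)

print(pricing.is_testable())   # all three parts present
print(taste.missing_parts())   # names the part that was skipped
```

The check is deliberately dumb: it cannot tell a good observation from a bad one, it only refuses to let an empty one through.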

The second mistake: the beautiful version broke the habit

The old screen was ugly, but it was familiar. Returning users got through it on autopilot: the eye knew where the button was, and the hand knew the path.

The new design redrew the hierarchy. The CTA moved, the headline got longer, and I added a new "social proof" block that I was very proud of. For a new user it was perhaps better. For a returning user it was a sudden obstacle where automatism used to be.

Checklist: "Have I broken the habit?"

  • The main CTA stayed in place or moved by less than one screen height
  • The order of steps in the flow has not changed
  • Anchor elements (logo, navigation, primary action) remained in their usual places
  • Changes in copy do not change the meaning of the action, only the wording
  • If I break the habit deliberately, there is a hypothesis for why the gain among new users will outweigh the drop among returning ones

Questions to Ask Yourself Before Launching

  • Which user segment am I optimizing for: new, returning, everyone?
  • How does this segment get through the screen today: on autopilot or deliberately?
  • What in the old version am I breaking unintentionally, as a side effect of the redesign?
  • Am I willing to split the test into segments if the overall result is flat?

In short: the design failed not because it was bad, but because I confused "looks better" with "works better" and ignored the fact that users already had a working mental model of the screen. Next: how to read test results without lying to yourself, and how to extract concrete decisions from a loss.

How to read the result without bending it in your favor

When a test loses, the brain automatically switches into lawyer mode: it looks for something in the setup to find fault with in order to save the design. That is not always unfair; sometimes the setup really is broken. But if you start from the question "how do I prove that my variant actually won?", you are no longer an analyst, you are an advocate. And a defense attorney on a lost case is a poor source of decisions.

It is more useful to go in the opposite direction: first assume that the result is fair, and only then check whether there are real reasons not to believe it.

What to check before arguing with the numbers

  • The test has sufficient power: not "feels like enough", but according to a calculation made before launch
  • Traffic is split evenly between groups, with no skew by platform, source, or geography
  • There is no overlap with another test that ran in parallel on the same screens
  • New and returning users are analyzed separately rather than collapsed into one metric
  • The test metric is the one you declared before the start, not the one that suddenly "looks better"

If every item checks out, the test is fair. The only thing left to argue with is your own hypothesis, not the numbers.
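
The power item in the list above is the one most often replaced by a gut feeling, so here is a rough sketch of the pre-launch calculation. It assumes a conversion-rate metric, a two-sided test, and the standard normal approximation for comparing two proportions; the function name and the 10% and 12% rates are made up for illustration, and an analytics team may well use a different method.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(baseline: float, target: float,
                          alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed per group to detect baseline -> target.

    Normal approximation for a two-sided test of two proportions.
    """
    if not (0 < baseline < 1 and 0 < target < 1) or baseline == target:
        raise ValueError("rates must be in (0, 1) and must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_power = NormalDist().inv_cdf(power)          # critical value for power
    variance = baseline * (1 - baseline) + target * (1 - target)
    effect = target - baseline
    return ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


# Hypothetical numbers: 10% baseline conversion, hoping to detect a lift to 12%.
n = sample_size_per_group(0.10, 0.12)
print(f"about {n} users per group")
```

The useful part for a designer is the shape of the dependency: halving the effect you want to detect roughly quadruples the traffic you need.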

Interpretation anti-patterns that are easy to slip into

  • "Let's wait another week, maybe it will even out" - after statistical significance has already been reached in the negative direction
  • "Let's look at mobile only, my version is better there" - post-hoc slicing of segments until something wins
  • "The top-level metric dropped, but engagement grew" - swapping the test goal for a more convenient one
  • "It's an anomaly because of the weekend / a sale / the weather" - without checking that the control group saw the same effect
  • "The team did not implement it that way" - without a diff between the layout and production

How to break a loss down into useful pieces

A lost test is not a zero. It is data. The task is to pull out of it the hypotheses that you would otherwise have to pay for some other way.

Layer analysis

It is useful to break down the loss into three layers and see which one worked against you.

Layer | What to check | Signal that the problem is here
Perception | first seconds, readability of the screen | drop-off at the very first step of the new flow
Behavior | clicks, scrolls, path to the CTA | people reach the CTA but do not click
Context | segment, device, repeat visit | drop only among returning users

Then, for each layer, you can pull session recordings and heatmaps, or simply walk through the flow by hand, once from a fresh test account and once from an account with history. The difference will be visible.
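
For the context layer, the quickest check is to split conversion by variant and segment instead of looking at one total. The sketch below uses a tiny made-up export; the row format `(variant, segment, converted)` is an assumption and a real export will look different.

```python
from collections import defaultdict

# Hypothetical export rows: (variant, segment, converted)
rows = [
    ("control", "new", True), ("control", "new", False),
    ("control", "returning", True), ("control", "returning", True),
    ("variant", "new", True), ("variant", "new", True),
    ("variant", "returning", False), ("variant", "returning", True),
]


def conversion_by_segment(rows):
    """Return {(variant, segment): conversion rate} from raw rows."""
    totals = defaultdict(int)
    wins = defaultdict(int)
    for variant, segment, converted in rows:
        totals[(variant, segment)] += 1
        wins[(variant, segment)] += int(converted)
    return {key: wins[key] / totals[key] for key in totals}


rates = conversion_by_segment(rows)
for (variant, segment), rate in sorted(rates.items()):
    print(f"{variant:8} {segment:10} {rate:.0%}")
```

In this toy data the totals of the two groups are identical, while the segments move in opposite directions. This is exactly the pattern that a single collapsed metric hides. With real data the segments should be declared before launch, otherwise this turns into the post-hoc slicing described earlier.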

Questions for a review of a losing layout

  • Where does the user spend more time than in the old version?
  • Which element did I add for beauty, and can it be removed without losing meaning?
  • What in the old version worked as an anchor that I accidentally moved?
  • If I could make only one change out of the five, which one would address a real user task?

The answer to the last question is usually the real test that was worth running from the start.

How to apply this to the layout the next day

The main practical conclusion is not "test less" but "isolate the change". A big redesign in an A/B test is always a mixed signal: some changes helped, some hurt, the total came out flat or negative, and you do not know which was which.

A workflow that reduces the chance of such failure

  1. Before the layout, formulate the hypothesis in the observation / mechanism / effect format. If you cannot, it is taste, not a task.
  2. In the layout, keep two versions: "the minimal change for the hypothesis" and "what I would do if I were given free rein". Test the first.
  3. At review, spell out what changes for new users and what changes for returning users.
  4. Before launch, fix the stop criteria: which metric and what size of effect you will accept as the result without bargaining.
  5. After the test, write a short analysis even when you win, so that random luck does not get recorded as "your style".

Designer’s checklist before sending the layout to the test

  • I can say in one sentence what user behavior should change
  • There are no changes in the layout that do not serve this hypothesis
  • The anchor elements for returning users stayed in place, or I am moving them deliberately
  • The test metric is selected before the start and will not change along the way
  • I know in advance what I will do if the test loses: roll back, refine, cut into segments

The bottom line is that a losing test becomes a lesson only when you allow yourself to believe the numbers before you start challenging them. Next is how to build this habit into the team, so that not every designer goes through this pain alone.

When AI and MCP appear in the team – what changes in the test itself

A designer used to make one layout a week and defend it like a child. Now five variants can be assembled in an evening with Figma, an MCP agent, and AI content generation, and the temptation to "test everything" becomes dangerous. The cheaper it is to produce a variant, the more valuable hypothesis discipline becomes.

What AI Really Speeds Up and What Doesn't

  • Speeds up: sketching alternative layouts, rewriting microcopy in different tones, generating illustration variants, assembling a prototype with real data
  • Speeds up: summarizing a lost test from raw data exports and highlighting which step dropped
  • Does not speed up: formulating the hypothesis, which still comes from observing users
  • Does not speed up: choosing metrics and stop criteria, which is a managerial decision, not a generative one

Anti-patterns of "AI-generous" testing

  • Running 4 variants against control because "the MCP agent built them anyway" - each extra arm eats traffic and dilutes significance
  • Asking the model to "come up with a hypothesis for the layout" - that is the reverse order; the hypothesis should come before the layout
  • Letting the AI rewrite all the microcopy and shipping it to a test without walking through the flow by hand
  • Pasting raw personal user data into a chat for the sake of "let's analyze the sessions" - this is not a matter of taste, it is a matter of policy
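
To make the cost of extra arms concrete, here is a back-of-the-envelope sketch. It assumes traffic is split equally between control and every variant, and it uses the Bonferroni correction as one common, conservative way to adjust the significance threshold for several comparisons against control. The function name and the 10,000 daily users are hypothetical.

```python
def arm_budget(daily_users: int, n_variants: int, alpha: float = 0.05):
    """Return (users per arm per day, Bonferroni-adjusted alpha per comparison).

    Assumes an equal split between control and every variant.
    """
    if n_variants < 1:
        raise ValueError("need at least one variant")
    arms = n_variants + 1                       # variants plus control
    users_per_arm = daily_users // arms         # each arm gets a smaller slice
    alpha_per_comparison = alpha / n_variants   # stricter threshold per variant
    return users_per_arm, alpha_per_comparison


for k in (1, 4):
    users, a = arm_budget(daily_users=10_000, n_variants=k)
    print(f"{k} variant(s): {users} users/arm/day, alpha per comparison {a:.4f}")
```

Both effects push in the same direction: every arm collects data more slowly and has to clear a higher bar, so the test runs much longer for the same detectable effect.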

Minimum checklist if AI/MCP is involved

  • I can explain to myself how variants B and C differ in hypothesis, not just in appearance
  • The generated copy has been read aloud and checked for legal and tonal pitfalls
  • The data that went into the model contains nothing that must not leave the company perimeter
  • The MCP agent has no permission to deploy changes to production without human approval
  • If AI helped analyze the result, I double-checked the numbers against the source and did not trust the retelling

How to check the quality of the test itself, not just the layout

Sometimes a loss is not about the design but about a broken experiment. Before burying the variant, go through technical hygiene.

Short QA list for the test

  • The split is really 50/50: verified on actual traffic, not just in the configuration
  • The groups have the same composition at entry: new / returning, platforms, geo
  • The event the metric is computed from fires equally reliably in both variants
  • There is no leak between groups: a user does not see one variant in one session and the other in the next
  • The test ran long enough to cover a weekly cycle, not three weekdays
  • No second experiment touching the same screen was running in parallel

If even one item fails, the result can be questioned honestly, not out of resentment.
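
The first item in the QA list, a real 50/50 split, can be checked with a few lines of arithmetic. The sketch below is a chi-square goodness-of-fit test against an even split, often called a sample ratio mismatch check. The function names are hypothetical, and the 0.001 threshold is a commonly used convention that I am assuming here, not a rule from this article.

```python
from math import erfc, sqrt


def srm_p_value(users_a: int, users_b: int) -> float:
    """P-value of a chi-square test (1 degree of freedom) against a 50/50 split."""
    expected = (users_a + users_b) / 2
    if expected == 0:
        raise ValueError("no users in either group")
    chi2 = ((users_a - expected) ** 2 + (users_b - expected) ** 2) / expected
    # For 1 degree of freedom the chi-square tail equals erfc(sqrt(chi2 / 2)).
    return erfc(sqrt(chi2 / 2))


def split_looks_healthy(users_a: int, users_b: int, threshold: float = 0.001) -> bool:
    """Flag a sample ratio mismatch when the p-value falls below the threshold."""
    return srm_p_value(users_a, users_b) >= threshold


print(split_looks_healthy(5000, 5050))  # small imbalance, plausible by chance
print(split_looks_healthy(5000, 5400))  # imbalance too large to be chance
```

If the check fails, the conversion numbers are not worth interpreting until someone finds out why one group received more traffic than the other.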

How to distinguish “design lost” from “test broken”

  • A loss that is stable across segments and over time - more likely the design
  • A loss in only one traffic channel that coincides with an analytics release - more likely the tooling
  • The control group is not behaving as usual - more likely an external factor
  • The metric dropped, but the event in variant B is technically logged less often - that is not behavior, it is telemetry

How to explain losing to a team without losing credibility

The most unpleasant thing about a lost test is not the numbers themselves but the conversation with the product manager, the developers, and the team. If you go in defensive, trust in design drops. If you go in honest, it grows.

A Conversation Structure That Works

  1. What we tested: one sentence about the hypothesis, without the layout on screen
  2. What we saw in the numbers: metric, direction, in which segment
  3. What it means for the user: translating numbers into behavior
  4. What we do next: rollback, refinement, new hypothesis
  5. What we learned about the process: one line, so as not to repeat it

This is enough for five minutes of stand-up and covers 90% of the “why” questions.

Questions that should be prepared in advance

  • "Can we at least keep the new variant for new users?" - the answer depends on whether that segment was defined in the test in advance
  • "Maybe we should tweak it and rerun?" - fine, but with a clear new hypothesis, otherwise it is the same test
  • "Why did we ship this at all, if this is the result?" - this is where the hypothesis recorded before the start saves you
  • "How much did we lose on the test?" - an honest estimate of the exposure is better than dodging the question

What not to do in the debriefing

  • Blame development without a diff between the layout and the implementation
  • Blame analytics without a specific event that is broken
  • Promise that "next time it will work" - that is not a conclusion, it is an emotion
  • Hide the loss in the general status update in small print

In short: AI and MCP make producing variants almost free, which is why hypothesis discipline, test hygiene, and honest conversation with the team become the designer's main skills. A lost test stops being a personal disaster at the moment the team learns to break it down using the same scheme.

Checklist before the start of the next test

A lost test stops stinging once you have the habit of going through the same list before every launch. It's not bureaucracy, it's insurance against a situation where two weeks later there is nothing to say to the team other than "well, that's how it went".

Before launch

  • The hypothesis is written in the format of "if - then - because" rather than "we want to try"
  • The main metric was chosen before anyone saw the layout
  • There is a guardrail metric that the test must not break (unsubscribes, errors, support requests)
  • It has been decided what counts as success and what counts as statistical noise
  • We know which segment we expect the largest effect in, and why
  • It has been agreed what we do in case of a draw: keep the old version, not "pick the prettier one"
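
The last three items in the list above amount to a decision rule that can be written down before launch, so that nobody negotiates with it afterwards. The sketch below is one possible shape of such a rule, assuming a conversion metric and a pooled two-proportion z-test. The function names, the 5% threshold, and the example counts are all hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-proportion z-test with a pooled standard error."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no conversions (or only conversions) in both groups
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def decide(control, variant, guardrail_ok: bool, alpha: float = 0.05) -> str:
    """Apply the rule agreed before launch; inputs are (conversions, users)."""
    (conv_a, n_a), (conv_b, n_b) = control, variant
    significant = two_sided_p_value(conv_a, n_a, conv_b, n_b) < alpha
    variant_better = conv_b / n_b > conv_a / n_a
    if significant and variant_better and guardrail_ok:
        return "ship variant"
    return "keep control"  # a loss, a draw, or a broken guardrail


print(decide((400, 4000), (480, 4000), guardrail_ok=True))   # clear lift
print(decide((400, 4000), (410, 4000), guardrail_ok=True))   # within noise: a draw
print(decide((400, 4000), (480, 4000), guardrail_ok=False))  # lift, but guardrail broke
```

Note that the rule has no branch for "the variant looks nicer". A draw keeps the control, which is the agreement the checklist asks for.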

During the test

  • No one edits copy, icons, or animations on the fly
  • Marketing does not run a parallel promotion on the same screen
  • Event logs have been checked by hand at least once after the start
  • The test is not stopped early because "the trend is already visible"
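
Why stopping on a visible trend is risky can be shown with a small simulation. The sketch below runs A/A tests, where both groups have the same true conversion rate, and compares two habits: checking significance at every interim look versus checking once at the planned end. All parameters (300 simulations, 10 looks, 200 users per look, a 10% rate) are arbitrary assumptions chosen to keep the run short.

```python
import random
from math import sqrt
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided 5% threshold


def is_significant(conv_a: int, conv_b: int, n: int) -> bool:
    """Pooled two-proportion z-test for two groups of equal size n."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = sqrt(pooled * (1 - pooled) * 2 / n)
    return se > 0 and abs(conv_b - conv_a) / n / se > Z_CRIT


def false_positive_rates(sims=300, looks=10, users_per_look=200, rate=0.10, seed=1):
    """Simulate A/A tests; return (rate with peeking, rate with one final look)."""
    rng = random.Random(seed)
    peeking_hits = final_hits = 0
    for _ in range(sims):
        conv_a = conv_b = n = 0
        ever_significant = False
        for _ in range(looks):
            conv_a += sum(rng.random() < rate for _ in range(users_per_look))
            conv_b += sum(rng.random() < rate for _ in range(users_per_look))
            n += users_per_look
            if is_significant(conv_a, conv_b, n):
                ever_significant = True  # a peeker would have stopped here
        peeking_hits += ever_significant
        final_hits += is_significant(conv_a, conv_b, n)
    return peeking_hits / sims, final_hits / sims


peeking, planned = false_positive_rates()
print(f"false positives with peeking: {peeking:.0%}, at the planned end: {planned:.0%}")
```

Since there is no real difference between the groups, every "significant" result here is a false alarm. The peeking rate can never be lower than the planned-end rate, because the final look is one of the looks, and in practice it tends to be noticeably higher.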

After the test

  • The result is broken down by segments, not just by total number
  • The test is recorded in the shared experiment log, even if it lost
  • A conclusion is made about the process, not just about the layout
  • A rollback or rollout is scheduled as a specific task, not as "we should"

Anti-patterns that are easily overlooked

Most failures are not due to poor design, but to dishonest treatment of the process. These traps are best caught before they catch the product.

Anti-pattern "Test as defense of the layout"

The designer launches the test to prove they were right. The hypothesis is tailored to the solution, not the other way around. The outcome is either a win without conclusions, or resentment.

Anti-pattern "Infinite tuning"

After losing, variant B is renamed B', then B'', and the test is restarted without a new hypothesis. The team loses time and the design loses focus.

Anti-pattern "AI generated, I signed"

Ten model-generated variants go into the test without a clear explanation of how they differ in meaning. At the debrief there is nothing to say except "the button is lower". It kills design credibility faster than any loss.

Anti-pattern "It won, so it is good"

The metric is up, the test is closed, champagne. A month later a drop in retention surfaces, because no one was watching the guardrail metric. A win without checking side effects is a delayed loss.

Anti-pattern "Designer left out of the debrief"

The analyst sends the table, the product manager retells it at stand-up, and the designer learns the result last. What is lost is not only context but also a voice in the next hypotheses.

Questions for the outcome review

This short list is handy to go through aloud at the debrief, alone or with the team. It works for wins and losses alike.

  • What exactly did we learn about the user that we didn't know before the test?
  • Which part of the result is explained by the design and which by the context (season, release, channel)?
  • If we ran this test again, what would we change in the setup itself?
  • Which segment did not behave as we expected, and why?
  • What was done by AI in this test, and where did we double-check it?
  • What decision are we making right now: rollout, rollback, rework, new test?
  • What one line do we add to the shared experiment log so that we can find it in six months?

If half the questions are not answered, it is not a bad test, it is an incomplete analysis. It is worth coming back to it in a day rather than closing the task out of politeness.

Practical outcome

A favorite design losing an A/B test is not a verdict on your taste, and not a reason to massage the metric. It is the cheapest way to learn that the user does not live in the same coordinate system as the designer. Hypothesis discipline before the start, clean telemetry during the test, and an honest debrief afterward are three habits that turn each loss into a team asset rather than a personal injury. AI and MCP are only as useful in this loop as the person next to them is able to ask the right questions and refuse to sign off on anything they cannot explain in words.
