
Email A/B Testing Guide: What to Test and How to Read Results

Email A/B testing is the practice of sending two versions of an email to different portions of your list to determine which performs better on a defined metric. Done correctly, it removes guesswork from email decisions and builds a compounding understanding of what your specific audience responds to. Done poorly, it produces misleading results that lead to worse decisions than intuition alone.

This guide covers what to test, how to set up tests that yield valid data, and how to read results without falling into the most common interpretation errors.

Why Email A/B Testing Requires More Care Than Most Marketers Give It

The concept is simple: send Version A to half your list, Version B to the other half, see which wins. The execution is where most tests fail.

Email A/B testing produces valid results only when one variable changes at a time, the sample is large enough to be statistically meaningful, the test runs long enough to capture a representative audience, and the winning metric matches the actual business goal. Break any of these conditions and the result is noise, not signal.

The most common mistake is declaring a winner after a few hours based on an open rate difference of 1–2 percentage points. This level of difference is within normal statistical variation and could easily reverse over the next 24 hours or with a different segment of your list.

What to Test in Email A/B Testing

Different variables affect different metrics. Match what you test to what you are trying to improve.

Subject lines → test to improve open rate

Subject line is the single highest-impact variable for open rate. Test one element at a time: length, question vs. statement, number vs. no number, personalization vs. generic, urgency vs. curiosity.

Example test:

  • Version A: "5 Google Ads settings worth reviewing this week"

  • Version B: "Are your Google Ads campaigns wasting budget?"

From name → test to improve open rate and trust

"Blakfy" vs. "Sezer at Blakfy" vs. a personal name. For B2B audiences, emails from a named person often outperform brand-name sends. Test once; the winning format typically remains consistent.

Preview text → test to improve open rate

Preview text functions as a second subject line. Test whether extending the subject line, adding complementary information, or using a question format produces higher opens.

CTA copy and placement → test to improve click-through rate

"Book a free audit" vs. "See how it works" vs. "Get started." CTA copy changes often produce the largest click-through rate differences of any variable tested.

Email length → test to improve click-through and conversion rate

Short (under 200 words) vs. long (600+ words). Results vary significantly by audience and email type. B2B nurture emails often perform better when long; promotional emails often perform better when short.

Send time → test to improve open and click rate

Tuesday 9 AM vs. Thursday 11 AM. Send time effects are real but often smaller than expected. Test across multiple sends before drawing conclusions.

Email A/B Testing: How to Set Up a Valid Test

Step 1 — Change only one variable

Every element that differs between Version A and Version B is a variable. If you change both the subject line and the CTA copy, you cannot know which change drove the performance difference. One change per test.

Step 2 — Determine your sample size

As a general rule, each variation needs at least 1,000 recipients to produce statistically meaningful results for open rate. For click-through rate, which occurs less frequently, you need a larger sample — closer to 2,000–3,000 per variation. Lists smaller than 2,000 contacts should test subject lines only and accept lower confidence in results.

Most email platforms calculate statistical significance automatically. Look for 95% confidence before declaring a winner.
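
If you want to sanity-check the numbers your platform gives you, a standard power calculation for comparing two proportions does the job. Below is a minimal sketch in Python using statsmodels; the 25% baseline open rate and the 3 percentage point lift you hope to detect are assumptions to replace with your own figures.

  from statsmodels.stats.proportion import proportion_effectsize
  from statsmodels.stats.power import NormalIndPower

  # Assumed numbers: 25% baseline open rate, hoping to detect a lift to 28%.
  baseline_rate = 0.25
  target_rate = 0.28

  # Cohen's h effect size for the two proportions.
  effect_size = proportion_effectsize(target_rate, baseline_rate)

  # Recipients needed per variation at 95% confidence (alpha=0.05) and 80% power.
  n_per_variation = NormalIndPower().solve_power(
      effect_size=effect_size, alpha=0.05, power=0.8, alternative="two-sided"
  )
  print(f"Recipients needed per variation: {n_per_variation:.0f}")

The smaller the lift you expect to detect, the larger the sample you need, which is why small lists are better off testing bigger, bolder changes than subtle wording tweaks.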

Step 3 — Define the winning metric before you send

Decide in advance what you are measuring — open rate, click-through rate, conversion, or revenue. Testing for open rate and then switching to click-through rate when the open rate result is inconclusive is post-hoc rationalization, not testing.

Step 4 — Set the test duration

Run the test for at least 4–6 hours before checking results. For most lists, 24 hours captures enough variation in send-time behavior to produce representative data. Checking results after 30 minutes and acting on early open rate differences is the most common source of misleading A/B test conclusions.

Step 5 — Send the winner to the remainder

Most platforms allow you to split 20% of your list into the A/B test and send the winner to the remaining 80% automatically. This approach maximizes both the validity of the test and the performance of the final send.
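
If your platform does not automate the split, the mechanics are easy to reproduce. A minimal sketch in Python, assuming your contact list is a plain list of email addresses; the 20% test fraction mirrors the approach described above.

  import random

  def split_for_ab_test(contacts, test_fraction=0.2, seed=42):
      """Randomly carve out a test group and split it evenly into A and B."""
      shuffled = contacts[:]                 # copy so the original order is untouched
      random.Random(seed).shuffle(shuffled)
      test_size = int(len(shuffled) * test_fraction)
      test_group, remainder = shuffled[:test_size], shuffled[test_size:]
      half = test_size // 2
      return test_group[:half], test_group[half:], remainder

  # Example: 10% of the list gets Version A, 10% Version B, 80% waits for the winner.
  contacts = [f"user{i}@example.com" for i in range(10_000)]
  group_a, group_b, remainder = split_for_ab_test(contacts)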

How to Read Email A/B Testing Results

Absolute difference vs. relative difference

A 25% open rate vs. a 22% open rate is a 3 percentage point absolute difference and a 13.6% relative difference. Both can be accurate descriptions of the same result. Relative differences look more dramatic — be aware of which framing your platform uses in its reporting.
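
The arithmetic behind that example, with the 25% and 22% open rates written as a short Python check:

  rate_a, rate_b = 0.25, 0.22

  absolute_pp = (rate_a - rate_b) * 100            # 3.0 percentage points
  relative_pct = (rate_a - rate_b) / rate_b * 100  # roughly 13.6% relative to B

  print(f"Absolute: {absolute_pp:.1f} pp, relative: {relative_pct:.1f}%")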

Statistical significance

A result is statistically significant when the probability that it occurred by random chance falls below a defined threshold (typically 5%). Your platform will show a confidence score or p-value. Results below 95% confidence are not reliable enough to act on — run the test again with a larger sample.
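
Most platforms run this check for you, but it is straightforward to verify independently with a two-proportion z-test. A sketch in Python using statsmodels, with hypothetical counts of 300 opens out of 1,000 for Version A and 250 out of 1,000 for Version B:

  from statsmodels.stats.proportion import proportions_ztest

  # Hypothetical counts: opens and recipients for Versions A and B.
  opens = [300, 250]
  recipients = [1_000, 1_000]

  stat, p_value = proportions_ztest(opens, recipients)
  print(f"p-value: {p_value:.3f}")  # below 0.05 here, so the result clears 95% confidence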

Practical significance

A result can be statistically significant but practically meaningless. If Version A produces a 30.1% open rate and Version B produces a 30.4% open rate with 99% confidence, the test is valid — but the real-world impact of that 0.3 percentage point difference is negligible. Look for differences of at least 2–3 percentage points before acting.
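
Scale alone can manufacture statistical significance. Running the same kind of z-test on the hypothetical 30.1% vs. 30.4% scenario shows the gap only clears a high confidence bar at very large volumes, and even then it is still a 0.3 percentage point difference:

  from statsmodels.stats.proportion import proportions_ztest

  # Hypothetical: 30.1% vs. 30.4% open rate at 400,000 recipients per variation.
  opens = [120_400, 121_600]
  recipients = [400_000, 400_000]

  stat, p_value = proportions_ztest(opens, recipients)
  print(f"p-value: {p_value:.4f}")  # well below 0.01, yet the gap is negligible in practice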

Beware of novelty effects

A new subject line format or send time often performs well initially because it is different from what subscribers expect. Test the same change again after 4–6 weeks to confirm the result holds.

Building a Testing Roadmap

Random A/B tests accumulate data but not understanding. A structured testing roadmap — testing variables in priority order, documenting results, and building on what you learn — produces compounding knowledge about your audience.

A simple roadmap approach:

  1. Month 1–2: Test subject line length and format — establish what opens best

  2. Month 3–4: Test CTA copy and placement — establish what clicks best

  3. Month 5–6: Test email length — establish what converts best

  4. Month 7+: Test more nuanced variables based on what the first cycle revealed

Document every test: hypothesis, variables, sample size, result, confidence level, and what you decided to do. This record becomes a competitive asset — a playbook of what works for your specific audience that no competitor can replicate.
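
How you keep the record matters less than keeping it consistently. A minimal sketch of one way to do it in Python, appending each test to a CSV log; the field names and example values are hypothetical:

  import csv
  import os
  from dataclasses import dataclass, asdict, fields

  @dataclass
  class TestRecord:
      date: str
      hypothesis: str
      variable: str
      metric: str
      sample_per_variation: int
      result_a: float
      result_b: float
      confidence: float
      decision: str

  record = TestRecord(
      date="2025-03-04",
      hypothesis="Question-style subject lines lift open rate",
      variable="subject line",
      metric="open rate",
      sample_per_variation=1500,
      result_a=0.25,
      result_b=0.28,
      confidence=0.97,
      decision="Adopt question format; retest in 6 weeks",
  )

  # Append to a running log; write the header row only when the file is new.
  log_path = "ab_test_log.csv"
  is_new = not os.path.exists(log_path)
  with open(log_path, "a", newline="") as f:
      writer = csv.DictWriter(f, fieldnames=[fld.name for fld in fields(TestRecord)])
      if is_new:
          writer.writeheader()
      writer.writerow(asdict(record))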

Blakfy manages data-driven email programs for businesses that want testing to drive decisions rather than gut instinct alone.

Frequently Asked Questions

How large does my list need to be to run meaningful A/B tests?

For subject line tests targeting open rate, aim for at least 1,000 contacts per variation (2,000 total). For click-through rate tests, aim for 2,000–3,000 per variation. Below these thresholds, results are too variable to be reliable. Smaller lists should still test, but treat results as directional rather than conclusive.

Can I test multiple variables at the same time?

Only with multivariate testing, which requires a significantly larger list and more sophisticated analysis. For most email programs, stick to one variable per test. The simplicity of one-variable testing is a feature, not a limitation — it produces clear, actionable results.

What is the most important thing to A/B test first?

Subject lines, because they affect whether the email gets opened at all. No amount of excellent body copy, CTA optimization, or personalization matters if the email is not opened. Establish your best-performing subject line formula before optimizing anything else.

My winning version only beat the control by 1%. Should I implement it?

Not unless you have very high statistical confidence (99%+) and a very large sample size. A 1 percentage point difference is within normal variation for most list sizes and may not hold on the next send. Replicate the test before making permanent changes based on small differences.
