The right way to test subject lines, openers, and CTAs -- with minimum sample sizes, the metric that actually matters, and what most teams get wrong.
Rees Bayba
Founder, Astra GTM
TL;DR
Most cold email A/B tests produce false confidence rather than real learning. Teams change two things at once, declare a winner at 40 sends, and bake in bad conclusions that hurt performance for months. Done correctly, A/B testing is how you compound reply rate improvements over time -- each valid test gets you closer to a sequence that converts reliably.
If Variant A has a different subject line and a different opener than Variant B and Variant A wins, you do not know why. Was it the subject line? The opener? The combination? You cannot tell. And you cannot apply the learning to future campaigns because you do not know which element to keep.
One variable at a time. This is slower -- it takes more tests to improve a full sequence -- but the learning is real and transferable. Every test that changes two variables simultaneously is a test that teaches you nothing reliable.
Not all variables are equal. Some have a much larger impact on reply rate than others, so test the high-impact elements first -- you get the most learning per test by working down from the variables that most affect whether a prospect reads and replies.
Below 200 sends per variant, the confidence interval on a 3% reply rate is so wide that a difference of 1-2 percentage points is indistinguishable from random variation. Most teams declare winners at 50 sends. This is the single most common testing mistake in cold outbound.
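To make that concrete, here is a minimal sketch of the 95% confidence interval around a 3% reply rate at 200 sends, using the normal approximation (a Wilson interval is more accurate at very low rates, but the conclusion is the same; the counts are illustrative):

```python
# Rough 95% confidence interval around an observed reply rate (normal approximation).
import math

def reply_rate_ci(replies, sends, z=1.96):
    rate = replies / sends
    margin = z * math.sqrt(rate * (1 - rate) / sends)
    return rate - margin, rate + margin

low, high = reply_rate_ci(6, 200)          # 6 replies out of 200 sends = 3%
print(f"{low:.1%} to {high:.1%}")          # roughly 0.6% to 5.4% -- a ~5-point-wide interval
```

A 1-2 point gap between variants sits comfortably inside that interval, which is why small-sample "winners" are so often noise.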
Reply rate is the only metric that matters for cold email A/B testing. Open rate is unreliable because Apple Mail Privacy Protection (MPP), introduced in September 2021, pre-fetches emails and marks them as opened even if the recipient never looks at them. In many B2B lists, 40-60% of open events are machine-generated by Apple's servers, not by humans reading your email.
Click rate is also a weak signal for cold outbound. Most cold emails do not include links (links hurt deliverability and click-through is rare in single-email sequences). Optimize for reply rate.
Secondary metric: positive reply rate. A campaign that generates a 5% reply rate where 10% of replies are interested is worse than a campaign that generates a 3% reply rate where 50% of replies are interested. Track both metrics -- total reply rate (did the email generate a response?) and positive reply rate (did the response indicate any interest in talking further?).
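A quick worked version of that comparison, per 1,000 sends, using the illustrative numbers above:

```python
# Expected interested prospects per 1,000 sends: pipeline comes from positive
# replies, not total replies.
def interested_per_1000(reply_rate, positive_share):
    return round(1000 * reply_rate * positive_share, 1)

print(interested_per_1000(0.05, 0.10))  # 5.0  -- 5% reply rate, 10% of replies interested
print(interested_per_1000(0.03, 0.50))  # 15.0 -- 3% reply rate, 50% of replies interested
```

The "worse-looking" 3% campaign produces three times as many interested prospects.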
To be 95% confident that a difference in reply rates is real (not random variation), use a chi-square test or an online A/B significance calculator. Input the number of sends per variant and the number of replies per variant. The calculator tells you whether the difference is statistically significant at your desired confidence level.
The practical implication: at 200 sends per variant with a 3% vs. 5% reply rate (6 vs. 10 replies), you do not yet have statistical significance. You need more sends. With a standard two-proportion test, a 3% vs. 5% difference does not clear 95% confidence until roughly 750 sends per variant, and a 2% vs. 3% difference is not detectable until closer to 1,900 sends per variant.
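If you would rather script the check than use an online calculator, here is a minimal sketch using scipy's chi-square test on a 2x2 table of replies vs. non-replies (the counts below are illustrative, not real campaign data):

```python
# Significance check for a two-variant cold email test.
from scipy.stats import chi2_contingency

def reply_rate_p_value(sends_a, replies_a, sends_b, replies_b):
    table = [
        [replies_a, sends_a - replies_a],  # Variant A: replies vs. non-replies
        [replies_b, sends_b - replies_b],  # Variant B: replies vs. non-replies
    ]
    # correction=False matches the plain chi-square most online A/B calculators run;
    # the default Yates correction is slightly more conservative on small samples.
    chi2, p_value, dof, expected = chi2_contingency(table, correction=False)
    return p_value

print(reply_rate_p_value(200, 6, 200, 10))   # ~0.31 -- not significant at 95%
print(reply_rate_p_value(750, 22, 750, 38))  # ~0.04 -- significant at 95%
```

A p-value below 0.05 corresponds to 95% confidence that the difference is not random variation.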
For most teams running campaigns of 200-1,000 contacts, you can only reliably detect differences of 2+ percentage points in reply rate. Do not try to optimize sub-1% differences -- they are invisible at typical campaign sizes.
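As a rough guide to what "detectable" means at a given campaign size, here is a sketch of the approximate minimum lift you can reliably detect over a 3% baseline, assuming a two-sided 95% test and 80% power (conventional thresholds, used here as assumptions):

```python
# Approximate minimum detectable lift in reply rate for a given per-variant
# sample size, assuming a two-sided 95% test and 80% power.
import math

def min_detectable_lift(baseline_rate, sends_per_variant, z_alpha=1.96, z_beta=0.84):
    se = math.sqrt(2 * baseline_rate * (1 - baseline_rate) / sends_per_variant)
    return (z_alpha + z_beta) * se

for n in (200, 500, 1000):
    print(f"{n} sends/variant: ~{min_detectable_lift(0.03, n):.1%} minimum detectable lift")
# ~4.8 points at 200, ~3.0 at 500, ~2.1 at 1,000 -- sub-1% differences are invisible
```

The smaller the campaign, the larger the gap has to be before it means anything -- another reason 200 sends per variant is a floor, not a target.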
Here is how a properly structured subject line test looks in practice: Variant A uses a curiosity-gap subject line backed by competitor proof, Variant B asks a direct personalized question, and everything else -- opener, body, CTA, list -- stays identical across 200+ sends per variant. Say Variant A wins.
What you learned: the curiosity-gap + competitor proof format outperformed the direct personalized question format in this ICP. Document this. Run a follow-up test to confirm it holds in the next campaign. Once it holds across two campaigns, you have a reliable insight that improves all future sequences targeting this ICP.
How many people do I need in each variant?
Minimum 200 sends per variant. This is the floor for detecting a meaningful difference in reply rates at typical cold email performance levels (2-6%). Below 200 sends, the result is statistically unreliable -- a few extra replies in either direction can swing the apparent winner. If your campaign size does not allow 200+ sends per variant, you do not have enough data to run a valid test.
Should I test open rate or reply rate?
Reply rate. Open rate has been unreliable since Apple Mail Privacy Protection launched in September 2021. MPP pre-fetches emails on Apple devices and marks them as opened regardless of whether the human recipient ever reads them. Depending on your list, 40-60% of open events may be machine-generated. Reply rate measures whether a real person engaged with your email -- it is the only metric that reflects actual human behavior.
How long should I run the test before declaring a winner?
7-10 days from the send date, and only after reaching 200+ sends per variant. The first condition is about timing -- some people reply 5-7 days after receiving an email. The second is about sample size. If you hit 200 sends per variant and 7 days have passed, you can close the test. If you have not reached 200 sends per variant, keep sending before declaring a winner.
Can I test more than one variable at a time?
No. Testing two variables simultaneously means you cannot attribute the result to either change. If Variant A has a different subject line and a different opener than Variant B and Variant A wins, was it the subject or the opener? You cannot tell. And you cannot apply the learning to future campaigns. One variable per test. Always.
What if both variants perform the same?
A null result is still a result. If Variant A and Variant B produce the same reply rate after 200+ sends each, the variable you tested does not have a meaningful impact on performance at this sample size. Document it, move on to testing a different variable, and return to the original variable later with a more extreme difference between variants (e.g., a more dramatically different opener angle rather than two similar angles).
How do I know when I have found a winning formula?
When the same patterns win consistently across 3-5 campaigns targeting the same ICP, you have reliable signal. A single test win may be luck. Two wins in a row is encouraging. Three or more in the same direction across different campaigns is a real pattern. At that point, bake the winning elements into your default sequence for that ICP and stop testing them -- redirect testing effort toward variables that have not yet been optimized.
We implement these systems end-to-end. First sends within 14 days.