Understanding A/B Test Results
When you have drawn conclusions from an A/B test, you can deploy some or all of the experiences for its test groups to the site.
Typically, you want to deploy the experiences of the winning test group, but sometimes you want to deploy experiences for multiple groups. Deploy these experiences manually by creating a campaign or by adding them to an existing one. Before drawing conclusions, however, determine how long to run a test, when it ends, and how to know that it has run long enough. The key concepts to understand are confidence level and statistical significance.
Statistical Significance
The point at which a test reaches statistical significance depends on multiple parameters. Typically, the more similar the experiences you test, the more metrics you test, and the lower your traffic, the longer it takes to reach statistical significance. But what is statistical significance?
In statistics, a result is considered statistically significant if it's unlikely to have occurred by chance. For example, suppose you're running an A/B test that includes 500 customers, and the test shows that the average order value for September was 30% higher with an autumn-colored banner than with a dark-colored banner. Even if the results are statistically significant, is the difference important? Tests of significance should always be accompanied by effect-size statistics, which approximate the size, and thus the practical importance, of the difference. The amount of evidence required to accept that an event is unlikely to have arisen by chance is known as the confidence level.
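To make the distinction between significance and effect size concrete, here's a minimal Python sketch that computes both a two-sided p-value and Cohen's d for a difference in average order value. The sample figures are hypothetical, and the z approximation and pooled standard deviation are textbook simplifications, not how B2C Commerce computes its results.

```python
import math

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def p_value_and_effect_size(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Two-sided p-value (z approximation) and Cohen's d for a
    difference in means; reasonable for large A/B-test samples."""
    se = math.sqrt(sd_a**2 / n_a + sd_b**2 / n_b)   # std. error of the difference
    z = (mean_b - mean_a) / se
    p_value = 2.0 * (1.0 - normal_cdf(abs(z)))      # two-sided test
    pooled_sd = math.sqrt((sd_a**2 + sd_b**2) / 2.0)
    cohens_d = (mean_b - mean_a) / pooled_sd        # effect size
    return p_value, cohens_d

# Hypothetical figures echoing the banner example: 250 customers per group,
# average order value 30% higher with the autumn-colored banner.
p, d = p_value_and_effect_size(mean_a=100.0, sd_a=60.0, n_a=250,
                               mean_b=130.0, sd_b=70.0, n_b=250)
print(f"p-value: {p:.4f}, Cohen's d: {d:.2f}")
```

A small p-value says the lift is unlikely to be chance; Cohen's d (by common convention, roughly 0.2 is small, 0.5 medium, 0.8 large) says whether the difference is also big enough to matter in practice.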
You can compare the metrics and values of the control and test groups on the A/B Testing page. B2C Commerce calculates the confidence level, which indicates the likelihood that these differences are because of your change in site experience and not random chance. When the confidence level reaches 90%, it's deemed a statistically significant result.
In the banner-color example, if the B2C Commerce-computed confidence level reaches 90%, the test result can be considered statistically significant. The merchandising team can use an autumn-colored banner in September, with a high degree of confidence that it drives better results than a dark-colored banner.
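As an illustration of how a confidence level can be derived, the following sketch runs a pooled two-proportion z-test on hypothetical conversion counts and reports 1 minus the two-sided p-value as the confidence. B2C Commerce's internal calculation isn't documented here, so treat this as an approximation of the idea only.

```python
import math

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def confidence_level(conversions_a, n_a, conversions_b, n_b):
    """Confidence that two conversion rates truly differ,
    via a pooled two-proportion z-test."""
    p_a = conversions_a / n_a
    p_b = conversions_b / n_b
    p_pool = (conversions_a + conversions_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1.0 - p_pool) * (1.0 / n_a + 1.0 / n_b))
    z = (p_b - p_a) / se
    p_value = 2.0 * (1.0 - normal_cdf(abs(z)))
    return 1.0 - p_value

# Hypothetical counts: control converts 40 of 250 visitors, test group 62 of 250.
conf = confidence_level(40, 250, 62, 250)
print(f"confidence level: {conf:.1%}")
```

With these made-up counts the confidence comes out near 98.5%, which clears the 90% bar and would support deploying the test group's experience.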
Test Length
How long to run a test depends on your average number of daily visitors, the percentage of visitors included in the test, and other external factors. In general, run the test until it reaches statistical significance, or until it's clear that it won't. Understand, though, that a B2C Commerce A/B test can run for up to 90 days.
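For planning purposes, a standard sample-size formula can give a rough sense of whether a test is likely to reach significance within that 90-day window. The sketch below assumes a two-proportion test at the 90% confidence threshold with 80% power and an even split between control and one test group; all inputs are hypothetical.

```python
import math

def required_days(baseline_rate, relative_lift, daily_visitors, test_share,
                  z_alpha=1.645, z_beta=0.84):
    """Rough estimate of the days needed to detect a given relative lift.
    z_alpha matches a 90% confidence threshold (two-sided), z_beta 80% power.
    A planning sketch under assumed inputs, not a B2C Commerce feature."""
    p1 = baseline_rate
    p2 = baseline_rate * (1.0 + relative_lift)
    variance = p1 * (1.0 - p1) + p2 * (1.0 - p2)
    n_per_group = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    # Visitors in the test are split evenly across the two groups.
    visitors_per_day_per_group = daily_visitors * test_share / 2.0
    return math.ceil(n_per_group / visitors_per_day_per_group)

# E.g., 4% baseline conversion, detecting a 10% relative lift,
# 5,000 daily visitors with 50% of them allocated to the test: ~25 days.
print(required_days(0.04, 0.10, daily_visitors=5000, test_share=0.5), "days")
```

If the estimate comes out well past 90 days, consider testing a bigger change, allocating more traffic to the test, or accepting a lower confidence threshold.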
An A/B test automatically ends on its end date unless a user disables it earlier. If the test reaches a 95% confidence level for its key metric, an email is sent to the recipients configured on the A/B test. The test continues to run after the email is sent, so it's possible for the confidence level to dip below 95% afterward. The email is sent only once, to avoid duplicate notifications if the confidence level drops from 95% to 94% and then climbs back above 95%. You can deploy segment experiences even if a 95% confidence level isn't reached. For example, you can deploy at a confidence level of 90%, 85%, and so on.