Three things really irritate me about A/B testing. The first is people fooling themselves by drawing conclusions from too little data. The second is the myth that small changes frequently result in large improvements. The third is using A/B tests to predict an actual percentage improvement when the data just isn’t there.
You Need a Lot of Data
We do a lot of A/B testing at WhatClinic.com and we like to think we know a little bit about the topic. We recently ran an A/B test where we put a section of instructional text at the top right-hand side of the page. After 11,000 tests and 400 conversions it clearly showed that the instructions made a 30% difference. It would have been so easy for us to stop there, pop open the champagne and boast about how changing one little thing improved our bottom line by 30%.
But we didn’t. We kept the test running, because experience has taught us not to draw conclusions too quickly. We let the test run for another 90,000 people and 3,000 conversions, and you know what? In the end there was no substantial difference between the two. That’s right, no difference.
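To show how easily an early look can mislead, here is a rough simulation sketch. It is not our actual test: both pages are identical by construction (an A/A test), the 3% conversion rate is an assumption, and the function names, peek interval and run counts are made up for illustration. It counts how often a standard two-proportion z-test declares a “winner” at any peek along the way, compared with checking only once at the end.

```python
import random
from statistics import NormalDist


def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    if se == 0:
        return 1.0  # no conversions yet, nothing to compare
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_aa_tests(runs=200, visitors_per_arm=10_000, check_every=500,
                      rate=0.03, alpha=0.05, seed=1):
    """Run A/A tests (both arms share one true rate) and count false winners.

    Returns (share of runs significant at ANY peek, share significant at the end).
    """
    rng = random.Random(seed)
    hit_at_any_peek = 0
    hit_at_end = 0
    for _ in range(runs):
        conv_a = conv_b = 0
        seen_significant = False
        for n in range(1, visitors_per_arm + 1):
            # Each visitor converts with the same probability on both pages.
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            if n % check_every == 0 and p_value(conv_a, n, conv_b, n) < alpha:
                seen_significant = True
        hit_at_any_peek += seen_significant
        hit_at_end += p_value(conv_a, visitors_per_arm,
                              conv_b, visitors_per_arm) < alpha
    return hit_at_any_peek / runs, hit_at_end / runs


peek_rate, end_rate = simulate_aa_tests()
print(f"declared a winner at some peek: {peek_rate:.0%}")
print(f"declared a winner at the end:   {end_rate:.0%}")
```

If the sketch behaves the way the statistics textbooks say it should, the first number will sit well above the 5% error rate you thought you had signed up for, while the second stays close to it. Stopping the moment the test looks good is exactly how you end up celebrating a difference that isn’t there.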
The whole point of A/B testing is to learn: learn what works and what doesn’t. If you don’t run your tests over a large enough sample, there is a good chance that what you learn will be false. Not only will you fail to move forward, you will actually move backwards and decrease the value of your company.
So what if you don’t have the traffic to do A/B tests? Well, don’t do them. Do user testing. Get people in and ask them to use your product. You’re going to get a lot more information a lot faster and have a higher degree of confidence in the results.
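As a rough guide to what “large enough” means, the sketch below uses the textbook normal-approximation formula for comparing two conversion rates. The 3% base rate, the 5% significance level and the 80% power are assumptions chosen for illustration, not figures from our tests, and `visitors_per_variation` is a made-up helper name.

```python
from math import ceil
from statistics import NormalDist


def visitors_per_variation(base_rate, relative_uplift, alpha=0.05, power=0.80):
    """Approximate visitors needed per variation to detect a relative uplift.

    Uses the usual normal-approximation formula for comparing two proportions
    with a two-sided test at significance `alpha` and the given `power`.
    """
    p1 = base_rate
    p2 = base_rate * (1 + relative_uplift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


# Assumed 3% base conversion rate; smaller uplifts need far more traffic.
for uplift in (0.30, 0.20, 0.10, 0.05):
    n = visitors_per_variation(base_rate=0.03, relative_uplift=uplift)
    print(f"{uplift:.0%} uplift: about {n:,} visitors per variation")
```

The required traffic grows roughly with the inverse square of the difference you want to detect, so halving the uplift you care about roughly quadruples the visitors you need.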
Small Tweaks Rarely Make Substantial Differences
I read about these all the time. You know the type of story – “I changed the colour of a button and increased conversion by 25%”. They read great and play into a pleasant dream that riches and fortunes are just a colour change away. However, in my experience small tweaks have never made a substantial difference to conversion.
It should come as no surprise to you that in order to substantially change user behaviour you need a substantial change to the site. This doesn’t mean that a small tweak never pays off. However, I suspect that it happens rarely, and that most of the time, when it is reported on blogs and forums, it is the result of drawing conclusions from too little data or just plain old link baiting. Unfortunately the truth is normally all too boring.
A/B Tests Don’t Tell You How Much Better One Page Will Be Than Another
A really common misconception is that A/B testing can show you how much better one version of a page will perform than another. IT CANNOT. A/B testing can only give you a confidence level that one page is better than the other, along with the improvement observed during the test.
Highly advanced A/B testing can give you a confidence rating for whether there will be a 5% improvement, a 10% improvement and so on, but it cannot tell you what the actual improvement will be. Too often people are fooled into thinking that just because they observed a 30% improvement during the test there will be a 30% improvement in the future. The actual result of the test is that version A has, say, a 93% chance of being better than version B. Note that there is no prediction of how much better.
Let me know of any examples you have where A/B tests have first shown one thing then the other. I know James Kennedy from voiceover Ireland has one on his blog.
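The gap between “how sure” and “how much” is easy to see with a little arithmetic. The sketch below uses a plain normal approximation, which is only one of several ways to do this, and the counts are hypothetical (5,500 visitors per page, 179 and 215 conversions) rather than our real test data; `compare` is a made-up helper name.

```python
from statistics import NormalDist


def compare(conv_a, n_a, conv_b, n_b, level=0.95):
    """Return (chance B beats A, observed relative uplift, interval for the uplift).

    Normal approximation throughout. The interval divides the interval for the
    absolute difference by A's observed rate, which ignores the uncertainty in
    that rate, so treat it as a rough guide only.
    """
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    diff = rate_b - rate_a
    se = (rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b) ** 0.5
    chance_b_better = NormalDist().cdf(diff / se)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    low, high = (diff - z * se) / rate_a, (diff + z * se) / rate_a
    return chance_b_better, diff / rate_a, (low, high)


# Hypothetical counts: 5,500 visitors per page, 179 vs 215 conversions.
chance, uplift, (low, high) = compare(179, 5500, 215, 5500)
print(f"chance B is better: {chance:.0%}")
print(f"observed uplift:    {uplift:+.0%}")
print(f"plausible uplift:   {low:+.0%} to {high:+.0%}")
```

With these made-up counts the chance that B is better comes out in the high nineties, yet the plausible range for the uplift runs from just below zero to roughly double the observed figure. The test can be fairly sure about the direction while saying very little about the size.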
Correction
It has been pointed out to me that the above example only shows a 20% improvement, not a 30% improvement. Sorry for the mistake.

Good post Caelen.
Couple points:
1. Looking at the GWO screenshot you can clearly see that the intervals overlap:
Original 3.25% (+/-0.3%)
Challenger 3.9% (+/-0.3%)
There was no 30% uplift. The original could have been 3.55% (3.25+0.3) and the challenger 3.6% (3.9-0.3) with a confidence level of 97%. You need a much wider band than that.
2. You’re quite right that small tweaks generally don’t give large variances from the control. But there are many exceptions – I’ve seen cases where small tweaks respond with large variances from the control. It’s rare, but you’d be amazed.
Your third point is absolutely spot on – you can only say that the observed improvement was V% with X% confidence that variation Y will perform better than variation Z. This is lost on most folk who start using Conversion Rate Testing.
Rgds
Richard
Thanks Richard. I’m still dreaming of that small tweak that will make a dramatic difference. I’m sure they happen – however I’m also convinced they happen a lot less often than is reported.
Caelen,
Great post – and I agree with Richard that there are certainly bankable gains to be had from small changes.
Looking through your site for a moment I would suggest that you experiment with the following:
a) Underline all of your CSS links. This is a basic, old school UI test – people may be mistaking links for non-links, getting frustrated and leaving. We’ve seen as much as 10 – 15% bankable gains from that alone.
b) The wording and position of your “Contact Clinic” button and the ads for clinics. We’ve tested across millions and millions of uniques in a variety of industries; try moving your contact buttons & ads to the left side of the page – check this attention heatmap of your site for instance http://bit.ly/aI9AaP – your “Contact Clinic” button is in the “dead zone” in terms of horizontal attention.
Hope those suggestions are of some moderate assistance, if you decide to implement drop me a line and let me know how they go!
best,
Zack
http://www.ConversionVoodoo.com/blog/
Hey Caelen,
Interesting post. I agree, it’s something of a false God. I hear clients ask “can we test X over Y” all the time, and it’s never a great idea. Most new sites simply don’t have the budget to generate multiple designs, the traffic to draw statistically significant conclusions, or most importantly the design insight to work out why one design works over another.
Two interesting posts on A/B testing
1. It’s not just about micro changes
http://blog.performable.com/why-ab-testing-isnt-just-about-small-changes/
2. Bogus conversions only fool yourself
http://www.uie.com/brainsparks/2010/08/16/are-we-measuring-the-wrong-assumptions/
This second one is something that I’ve seen social media agencies rely on. “We’ll get you 2,000 twitter followers, or 300 Facebook fans. You’ll see a 15% increase in site traffic”.
As with any of these things a large quantity of bull is easily beaten by a small number of paying customers.
Des
Spot on with the links, Des, and thanks for stopping by.
Regarding the bogus conversions, this is something we have had an issue with in our testing. While we have found the google add-in for Salesforce very useful in terms of tracking quality leads and their source, it does not send GWO data as well, which is a real shame as this would allow quality and quantity to be measured in one shot.
Partially because of our concerns with quality of lead, we don’t always use Optimiser for our multivariate testing, rather send a session id label for each conversion to salesforce and qualify the leads there. If we are overcomplicating something that has a simpler solution, I’d love to know!
The kick in the nuts for us in this article is the volume of conversions required to make a valid decision. Ouch …
Nice post.
I do have one question though: does your A/B testing track unique visitors across locations? (This is somewhat of a rhetorical question… I am unaware of a way of doing this; if you know one, please do share!) I ask because, what if a user happens upon the site while at location A (e.g. work, or a coffee shop with wifi), is presented the page with instructions and is “sold” because of them, but doesn’t wish to continue with the conversion action at that time and instead elects to return to the site at location B (e.g. home) to continue? What if this time they are presented with the non-instructions page? Since there is no way to link the two visits to the same user and determine which page they were originally shown, you lose the information that users presented with page A or B on their first visit are X% (with Y% confidence) more likely to convert.
I know what I describe above may be a bit contrived and may or may not happen often, but letting a sample or test go on too long could just end up tainting your test. No?
Excellent post! You address some points which are becoming more and more relevant with all the test results being published. Many case studies don’t include the actual numbers, so you can’t verify the statistical significance yourself.
> Let me know of any examples you have
> where A/B test have first shown one thing
> then the other.
We see this all the time. We’ve gotten 95%+ statistical significance that version B performed better than A…. then left the test running and ended up with 95%+ that the opposite is actually the case. This has happened both in Google Website Optimizer and Visual Website Optimizer.
I’ve seen conversion rate optimization companies giving advice like “you should never conclude a test until it has at least 80% statistical confidence”. Huh?!
One thing I’ve always been curious about: CRO companies that charge by results (for example they get paid if they generate at least 20% more revenue), how do they KNOW that the goal was reached? How is it calculated? At what stat. sig.? Because, as you write Caelen, an A/B test doesn’t show HOW MUCH better one version is, just how sure you are that it IS better.
> So what if you don’t have the traffic to do
> A/B tests? Well don’t do them. Do user testing.
> Get people in and ask them to use your product.
I would add that even if you do have the traffic to do tests, do the user tests first. Feed the findings into your first tests.
Hi Jens and thanks for the comment. I totally agree on the point of doing user tests first, then A/B tests to fine-tune the details.
This is a fantastic post. All too often those in the interactive marketing industry swear by A/B testing as one of the most solid ways to test the effectiveness of one landing page over another. I think with the points you bring up here, it is obvious that certain assumptions (particularly the ‘butterfly effect’ assumption) have been popularized as the truth without too much analysis or thought put into them.
Your argument that A/B tests show the % confidence that page A is better than page B and not the absolute % improvement of page A over page B is brilliant.
Thanks for your well-reasoned criticism; I’ll keep it in mind the next time I need to test landing pages.
Hi Ivory
I don’t want to give the impression that I am in any way negative towards A/B tests. They are an incredibly valuable tool that we continue to use on a daily basis. The problem is that they require more data than most people think and they are not a panacea.
Thanks for the comment
First of all I would warn everybody not to use or rely on the graph in GWO.
You’ll get more insight if you set Google Analytics up to record the traffic from your various splits and variations.
The results will usually be a lot different than what GWO is telling you.
Hi Mark,
Are you saying that GWO is generally inaccurate or just that you can see more detail by using Analytics? Do you have any data you can share?