I was also disturbed that the effect size wasn't taken into account in the sample size selection. You need to know this before you do any type of statistical test. Otherwise, you are likely to get "positive" results that just don't mean anything.
OTOH, I wasn't too concerned that the test was a one-tailed test. Honestly, in a website A/B test, all I'm really concerned about is whether my new page is better than the old page. A one-tailed test tells you that. It might be interesting to run two-tailed tests just so you can get an idea of what not to do, but for this use I think a one-tailed test is fine. It's not like you're testing drugs, where finding any effect, either positive or negative, can be valuable.
I should also note that I only really know enough about statistics to not shoot myself in the foot in a big, obvious way. You should get a real stats person to work on this stuff if your livelihood depends on it.
#1 - “Optimizely encourages you to stop the test as soon as it reaches ‘statistical significance.’” - This actually isn’t true. We recommend you calculate your sample size before you start your test using a statistical significance calculator and wait until you reach that sample size before stopping your test. We wrote a detailed article about how long to run a test, here: https://help.optimizely.com/hc/en-us/articles/200133789-How-...
We also have a sample size calculator you can use, here: https://www.optimizely.com/resources/sample-size-calculator
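For a sense of what a sample size calculator is doing under the hood, here is a rough Python sketch of the textbook two-proportion calculation. It is an illustration only, with made-up inputs, and not necessarily the exact formula behind Optimizely's calculator:

    # Textbook two-proportion sample size calculation (illustrative sketch only;
    # the baseline rate, lift, and defaults below are made-up assumptions).
    from scipy.stats import norm

    def sample_size_per_variation(baseline=0.05, relative_mde=0.10,
                                  alpha=0.05, power=0.80):
        p1 = baseline
        p2 = baseline * (1 + relative_mde)      # minimum detectable effect, relative
        z_alpha = norm.ppf(1 - alpha / 2)       # two-sided significance
        z_beta = norm.ppf(power)
        pooled = (p1 + p2) / 2
        n = (z_alpha * (2 * pooled * (1 - pooled)) ** 0.5
             + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2 / (p2 - p1) ** 2
        return int(n) + 1

    # 5% baseline, 10% relative lift, 95% confidence, 80% power:
    # roughly 31,000 visitors per variation before the test should stop.
    print(sample_size_per_variation())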
#2 - Optimizely uses a one-tailed test, rather than a 2-tailed test. - This is a point the article makes and it came up in our customer community a few weeks ago. One of our statisticians wrote a detailed reply, and here’s the TL;DR:
- Optimizely actually uses two 1-tailed tests, not one.
- There is no mathematical difference between a 2-tailed test at 95% confidence and two 1-tailed tests at 97.5% confidence (see the sketch after this list).
- There is a difference in the way you describe error, and we believe we define error in a way that is most natural within the context of A/B testing.
- You can achieve the same result as a 2-tailed test at 95% confidence in Optimizely by requiring the Chance to Beat Baseline to exceed 97.5%.
- We’re working on some exciting enhancements to our methodologies to make results even easier to interpret and more meaningfully actionable for those with no formal Statistics background. Stay tuned!
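To make that second bullet concrete, here is a tiny sketch (my own illustration, not Optimizely's internals) showing that the two cutoffs coincide:

    # A 2-tailed test at 95% confidence and a 1-tailed test at 97.5% confidence
    # share the same critical value, so they make the same accept/reject decisions.
    from scipy.stats import norm

    alpha = 0.05
    two_tailed_cutoff = norm.ppf(1 - alpha / 2)  # 2.5% in each tail -> z ~= 1.96
    one_tailed_cutoff = norm.ppf(0.975)          # all 2.5% in one tail -> z ~= 1.96
    assert abs(two_tailed_cutoff - one_tailed_cutoff) < 1e-12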
Here’s the full response if you’re interested in reading more: http://community.optimizely.com/t5/Strategy-Culture/Let-s-ta...
Overall I think it’s great that we’re having this conversation in a public forum because it draws attention to the fact that statistics matter in interpreting test results accurately. All too often, I see people running A/B tests without thinking about how to ensure their results are statistically valid.
Dan
But, as far as "Optimizely encourages you to stop the test as soon as it reaches 'statistical significance,'" I'm not saying your user documentation or anything encourages people to stop tests early. I'm saying (and this is based only on the article, as I've never used Optimizely) that your platform psychologically encourages users to stop tests early. E.g., from the article:
Most A/B testing tools recommend terminating tests as soon as they show significance, even though that significance may very well be due to short-term bias. A little green indicator will pop up, as it does in Optimizely, and the marketer will turn the test off.
<image with a green check mark saying "Variation 1 is beating Variation 2 by 18.1%">
But most tests should run longer and in many cases it’s likely that the results would be less impressive if they did. Again, this is a great example of the default settings in these platforms being used to increase excitement and keep the users coming back for more.
I am aware of literature in experimental design that talks about criteria for stopping an experiment before its designed conclusion. Such things are useful in, say, medical research, where if you see a very strong positive or negative result early on, you want to have that safety valve to either get the drug/treatment to market more quickly or to avoid hurting people unnecessarily.

Unless you've built that analysis into when you display your "success message" that "Variation 1 is beating Variation 2 by 18.1%," I'd argue that you're doing users a disservice. When I see that message, I want to celebrate, declare victory, and stop the test; and that's not what you should encourage people to do unless it's statistically sound to do so.
The other thing in the article that led me to this position is that you display "conversion rate over time" as a time series graph. Again, if I see that and I notice one variation is outperforming the other, what I want to do is declare victory and stop the test. That might not be mathematically/statistically warranted.
IMO, as a provider of statistical software, you'd do your users a service by not displaying anything about a running experiment by default until it's either finished or you can mathematically say it's safe to stop the trial. Some people will want their pretty graphs and such, so give them a way to see them, but make them expend some effort to do so. Same thing with prematurely ended experiments: don't provide any conclusions based on an incomplete trial. Give users the ability to download the raw data from a prematurely ended experiment, but don't make it easy or the default.
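To put a number on the early-stopping concern above, here is a hedged simulation sketch (my own, with made-up traffic and rates) of how often "peeking" declares a winner when the two variations are actually identical:

    # Simulate A/A experiments (both variations share the same true rate) and stop
    # the first time a naive two-sided z-test dips below alpha. Any "winner" is a
    # false positive. All numbers here are illustrative assumptions.
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(1)

    def peeking_false_positive_rate(true_rate=0.05, visitors=20000,
                                    peeks=20, alpha=0.05, trials=1000):
        peek_points = np.linspace(visitors // peeks, visitors, peeks, dtype=int)
        false_positives = 0
        for _ in range(trials):
            a = rng.random(visitors) < true_rate   # identical variations
            b = rng.random(visitors) < true_rate
            for n in peek_points:
                p_a, p_b = a[:n].mean(), b[:n].mean()
                pooled = (a[:n].sum() + b[:n].sum()) / (2 * n)
                se = np.sqrt(pooled * (1 - pooled) * 2 / n)
                if se > 0 and 2 * (1 - norm.cdf(abs((p_b - p_a) / se))) < alpha:
                    false_positives += 1           # would have "declared a winner"
                    break
        return false_positives / trials

    # With 20 peeks, far more than 5% of identical-variation tests "win" at some point.
    print(peeking_false_positive_rate())

With a pre-computed sample size and a single look at the data, the same test holds its nominal 5% error rate; it's the repeated looks that break it.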
No, it's the other way around. A one-tailed test is only usable for testing whether the new design is worse than the old one, because it being better than the old one doesn't matter as long as it's not worse. If you are testing whether the new design is better, you definitely need to test both tails, or else you may well switch to a design that's worse than the old one.
The key point here is that you aren't choosing a testing procedure, you are choosing a decision procedure.
All users who use SumAll should be wary of their service. We tried them out and we then found out that they used our social media accounts to spam our followers and users with their advertising. We contacted them asking for answers and we never heard from them. Our suggestion: Avoid SumAll.
Best, Jacob
You are also free to toggle that feature off, and should.
It's opt-out, isn't it?
I couldn't imagine a worse target audience to use that line on.
"What threw a wrench into the works was that SumAll isn’t your typical company. We’re a group of incredibly technical people, with many data analysts and statisticians on staff. We have to be, as our company specializes in aggregating and analyzing business data. Flashy, impressive numbers aren’t enough to convince us that the lifts we were seeing were real unless we examined them under the cold, hard light of our key business metrics."
I was expecting some admission of how their business is actually different/unusual, not just "incredibly technical". Secondly, I was expecting to hear that these "technical" people monkeyed with the A/B testing (or simply over-thought it), which got them into trouble... but no, just a statement about how "flashy" numbers don't appeal to them.
I think the article would be much better without some of that background.
Wow. Cool explanation of one-tailed and two-tailed tests. Somehow I have never run across that. Here's a link with more detail (I think it's the one intended in the article, but a different one was used): http://www.ats.ucla.edu/stat/mult_pkg/faq/general/tail_tests...
Here's the thing: stop A/B testing every little thing (and/or "just because") and you'll get more significant results.
Do you think the true success of something is due to A/B testing? A/B testing is optimizing, not architecting.
I think the problem is two-sided: one on the part of the tester and one on the part of the tools. The tools' "statistically significant" winners MUST be taken with a grain of salt.
On the user side, you simply cannot trust the tools. To avoid these pitfalls, I'd recommend a few key things:
- Know your conversion rates. If you're new to a site and don't know its patterns, run A/A tests, run small A/B tests, and dig into your analytics. Before you run a serious A/B test, you'd better know both historical and recent conversion rates. Knowing your variances is even better, but you can probably get a heuristic feel for your rate fluctuations just by looking at analytics and running A/A tests (see the sketch below).
- Keep your tests running long after you get a "winning" result.
- Have the traffic. If you don't have enough traffic, your ability to run A/B tests is greatly reduced, and you become more prone to making mistakes because you're probably an ambitious person and want to keep making improvements! The nice thing here is that if you don't have enough traffic to run tests, you're probably better off doing other stuff anyway.
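As a rough illustration of that A/A exercise (made-up numbers, not from any real site), this is roughly how you'd get a feel for baseline noise:

    # Split identical traffic two ways many times and look at how much the measured
    # relative "lift" bounces around purely by chance. All inputs are assumptions.
    import numpy as np

    rng = np.random.default_rng(2)

    true_rate = 0.04          # assumed historical conversion rate
    visitors_per_side = 5000
    runs = 1000

    lifts = []
    for _ in range(runs):
        a = rng.binomial(visitors_per_side, true_rate) / visitors_per_side
        b = rng.binomial(visitors_per_side, true_rate) / visitors_per_side
        lifts.append((b - a) / a)          # "lift" of an identical variation

    print(f"95% of chance-only lifts fall within +/- {np.percentile(np.abs(lifts), 95):.1%}")

With a 4% baseline and 5,000 visitors per side, double-digit "lifts" show up routinely by chance alone, which is exactly why you want to know your own noise floor before trusting a winner.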
On the tools side (and I speak from using VWO, not Optimizely, so things could be different), VWO's tags are on all my pages. VWO knows what my goals are. Even if I'm not running active tests on pages, why can't they collect data anyway and get a better idea of what my typical conversion rates are? That way, that data can be included and considered before they tell me I have a "winner". Maybe this is nitpicky, but I keep seeing people who are actively involved in A/B testing write articles like this, and I have to think the tools could do a better job of not steering intermediate-level users down the wrong path, let alone novice users.
Optimizely actually has a decent article on it: https://help.optimizely.com/hc/en-us/articles/200040355-Run-...
They said they ran an A/A test (a very good idea), but the numbers seem slightly implausible under the assumption that the two variations are identical (which, again, doesn't immediately imply that they are in fact different).
The important thing to remember is that your exact significances/probabilities are a function of the unknown true rates, your data, and your modeling assumptions. The usual advice is to control the undesirable dependence on modeling assumptions by using only "brand name" tests. I actually prefer using ad-hoc tests, but discussing what is assumed in them (one-sided/two-sided, pooled data for the null, and so on). You definitely can't assume away a thumb on the scale.
Also, this calculation doesn't compensate for any multiple-trial or early-stopping effect. It (rightly or wrongly) assumes this is the only experiment run and that it was stopped without looking at the rates.
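As a rough, hypothetical sketch of the kind of ad-hoc check being described (pooled data under the null, an explicit two-sided choice, no correction for early stopping; the counts are made up):

    # Ad-hoc Monte Carlo test: how often would chance alone, under a pooled null
    # rate, produce a rate gap at least as large as the one observed?
    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical observed (conversions, visitors) for each variation.
    conv_a, n_a = 120, 10000
    conv_b, n_b = 150, 10000

    observed_diff = conv_b / n_b - conv_a / n_a
    pooled_rate = (conv_a + conv_b) / (n_a + n_b)   # null: both share this rate

    sims = 100_000
    sim_diff = (rng.binomial(n_b, pooled_rate, sims) / n_b
                - rng.binomial(n_a, pooled_rate, sims) / n_a)

    # Two-sided p-value; assumes this was the only experiment and the only look.
    p_value = np.mean(np.abs(sim_diff) >= abs(observed_diff))
    print(f"observed difference: {observed_diff:.4f}, two-sided p: {p_value:.3f}")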
This may look like a lot of code, but the code doesn't change over different data.
Of course, if you're a startup, building an A/B testing tool is your last priority, so you would use an existing solution.
Are there much more advanced 'out-of-the-box' tools for testing out there besides the usual suspects, i.e. Optimizely, Monetate, VWO, etc.?
It seems a mod (?) changed it to "Winning A/B results were not translating into improved user acquisition".
I've seen a descriptive title left by the submitter changed back to the less descriptive original by a mod. But I'm curious why a mod would editorialize certain titles and change them away from their originals, yet undo the editorializing of others and change them back to the less descriptive originals.
FWIW the change to this headline seems like the right decision to me.
I don't understand this paragraph. They only look for indications that the drug is better... than what?
"They make it easy to catch the A/B testing bug..."
Optimizely is best at creating exciting graphs and numbers that will impress management, which I guess is a more lucrative business than providing real insight.