Determining A/B test sample size

(37signals.com)

25 points · noahnoahnoah · 14y ago · 11 comments

equark · 14y ago

Somebody really needs to write a Bayesian takedown of all these A/B testing articles. A/B testing is a Bayesian decision problem. There's really no other way to think about it. Determining sample size and frequentist confidence intervals are only relevant insofar as they approximate Bayesian concepts.

The issue is the proper tradeoff between exploration and exploitation. What drives the decision is outstanding uncertainty conditional on the data observed (not conditional on the null hypothesis of zero effect and some non-sequential iid sampling process), the discount rate (which is totally absent in this article), and the reward structure (which is not a Type I and Type II error).

The absurdity of the frequentist approach is clear from the admonition not to look at the results of the tests too often.
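A minimal sketch of the "conditional on the data observed" view described above, assuming a uniform Beta(1, 1) prior on each variant's conversion rate. The function name and the conversion counts are made up for illustration, and this covers only the posterior uncertainty, not the discount rate or the reward structure.

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=100_000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A | data) under Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # With a uniform prior, the posterior for a conversion rate is
        # Beta(conversions + 1, non-conversions + 1).
        rate_a = rng.betavariate(conv_a + 1, n_a - conv_a + 1)
        rate_b = rng.betavariate(conv_b + 1, n_b - conv_b + 1)
        wins += rate_b > rate_a
    return wins / draws

# Hypothetical counts: 40/1000 conversions for A, 55/1000 for B.
print(prob_b_beats_a(conv_a=40, n_a=1000, conv_b=55, n_b=1000))
```

The output is a direct statement about the two variants given what was observed, which is the quantity a decision rule would weigh against the cost of continuing to test.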

mturmon · 14y ago

I think even a Bayesian approach will have to grapple with the issue of looking at the results too often. The problem is that if you make your decision about when to stop testing depend on the test results so far, then the test results can be biased.

I'm sure you're aware of this, but I'm just trying to clarify the idea for other readers.

The idea is not well-illustrated in the article. (Although the article does provide some usable guidance until the whole Bayesian framework gets built and populated with correct parameters, like the reward structure.)

So, to be concrete -- Suppose you're flipping coins and you figure (by some procedure) you need 100 flips to reach significance. By the 70th flip, you observe that p(head) ~= 40/70 ~= 57%, so you decide to stop the test because clearly you're not dealing with a 50/50 coin. That's not OK, because you'll always see favorable and unfavorable excursions in a series of coin flips -- if you choose to stop in the middle of such an excursion, you'll bias the result. You've made the stopping time dependent on the observed values.

In some situations you can do this (it's related to http://en.wikipedia.org/wiki/Optional_stopping_theorem), but the way that I described above is not one of them.
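The coin-flip example can be checked with a small simulation. This is a sketch under stated assumptions: a fair coin, a two-sided z-test at the usual 1.96 cutoff, and a "peek" after every flip from the 10th onward. The function name and all parameter values are illustrative, not taken from the article.

```python
import math
import random

def false_positive_rates(n_flips=100, min_flips=10, z_crit=1.96, trials=5000, seed=0):
    """Compare how often a fair coin looks 'significant' with and without peeking."""
    rng = random.Random(seed)
    peeking_hits = 0  # some peek crossed the threshold during the run
    fixed_hits = 0    # only the final flip count was tested
    for _ in range(trials):
        heads = 0
        ever_significant = False
        for i in range(1, n_flips + 1):
            heads += rng.random() < 0.5
            # z-score of the head count against the fair-coin hypothesis
            z = (heads - i / 2) / math.sqrt(i / 4)
            if i >= min_flips and abs(z) > z_crit:
                ever_significant = True
        peeking_hits += ever_significant
        fixed_hits += abs(z) > z_crit  # z here is the value after the last flip
    return peeking_hits / trials, fixed_hits / trials

peeking, fixed = false_positive_rates()
print(f"stop at first excursion: {peeking:.3f}, test only at flip 100: {fixed:.3f}")
```

The coin is fair in every trial, so any "significant" result is a false alarm; the gap between the two rates is the cost of letting the stopping time depend on the observed values.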

equark · 14y ago

No, this is actually a common misunderstanding, and it gets to the heart of the difference between conditioning on the data and considering the sampling process. At the 70th flip your best guess is that it is 57%, given a uniform prior. It's perfectly fine to stop based on the results you have; that doesn't change the likelihood of seeing what you saw. Imagine looking each time: clearly your best guess is the sample mean unless you have prior knowledge.

What's confusing is thinking about the sampling distribution. But what might have happened in some other world is of no consequence if you condition on the data rather than the parameter.

This is the likelihood principle. http://en.wikipedia.org/wiki/Likelihood_principle. See the example there and how it relates to sequential trials. It's actually rather deep. Other good links are:

http://books.google.com/books?id=_ravDT9e8nMC&lpg=PA17&#...

http://books.google.com/books?id=oY_x7dE15_AC&lpg=PA27&#...

http://projecteuclid.org/DPubS?service=UI&version=1.0...
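A small numeric illustration of the point about sequential trials, assuming two different sampling plans that happen to produce the same 40 heads in 70 flips: a fixed number of flips (binomial) versus flipping until the 40th head (negative binomial). The function names are made up for this sketch.

```python
import math

def binomial_lik(p, heads, flips):
    """Likelihood when the number of flips was fixed in advance."""
    return math.comb(flips, heads) * p**heads * (1 - p) ** (flips - heads)

def neg_binomial_lik(p, heads, flips):
    """Likelihood when flipping continued until the `heads`-th head appeared."""
    # The last flip must be a head, so only the first flips - 1 can be arranged.
    return math.comb(flips - 1, heads - 1) * p**heads * (1 - p) ** (flips - heads)

def likelihood_ratio(lik, p1, p0, heads, flips):
    """How much more strongly the data support p1 than p0 under a sampling plan."""
    return lik(p1, heads, flips) / lik(p0, heads, flips)

for plan in (binomial_lik, neg_binomial_lik):
    print(plan.__name__, likelihood_ratio(plan, p1=0.57, p0=0.5, heads=40, flips=70))
```

The two likelihoods differ only by a constant that does not involve p, so the ratio comes out the same under either plan and any posterior built from it is unchanged.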

medinism · 14y ago

Thank you, mturmon. That was super insightful. So if I read this (and the theorem) correctly, it is OK to stop as long as the stopping is not dependent on any other variable (like the results of the experiment, or time). Correct?

bryanh · 14y ago

I rarely see people take into account the opportunity cost of letting a really close A/B test reach 99.99% confidence when the benefit is by definition very marginal (that's why it's taking so long, right?). I mean, is it really that bad to go on "close enough" results and move on to bigger and better tests?
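To put rough numbers on why a close test takes so long, here is a sketch using the standard normal-approximation sample size formula for comparing two proportions. The conversion rates, the 5% significance level, and the 80% power are assumptions chosen for illustration, not figures from the article.

```python
import math
from statistics import NormalDist

def sample_size_per_variant(p_base, p_variant, alpha=0.05, power=0.8):
    """Approximate visitors needed per variant for a two-sided two-proportion test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    variance = p_base * (1 - p_base) + p_variant * (1 - p_variant)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p_variant - p_base) ** 2)

# Hypothetical lifts on a 5% baseline: a clear one and a marginal one.
print(sample_size_per_variant(0.05, 0.06))
print(sample_size_per_variant(0.05, 0.051))
```

The required sample grows with the inverse square of the difference, so a lift one tenth the size needs on the order of a hundred times the traffic, which is the opportunity cost in question.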

hammock · 14y ago

In academia, we used confidence levels of .95 or greater (that is, p-values of .05 or less). I was taught that in business, though, the rule of thumb for making decisions is more like .8, and that's typically the standard I use as well.

DanielRibeiro · 14y ago

Another way to see this is to use this online calculator: http://visualwebsiteoptimizer.com/ab-split-significance-calc...

Loic · 14y ago

If you are lazy, you can get the functions coded in PHP here: http://abtester.com/calculator/
