It seems to me that if you have even a somewhat popular app, you're gathering enough data that you can afford to use p < 0.001 and avoid a lot of the complexities of statistical analysis that come from p < 0.05. If you don't have enough data to reach p < 0.001, it's probably better to work on increasing traffic than on getting the piddling gains from A/B testing so early.
That said, if a test has been running for a while without a clear answer, it can run for a looong time before it finishes. In my A/B testing tutorial I explored that, starting at http://www.elem.com/~btilly/effective-ab-testing/#slide59 (just use the arrow keys to move forwards and backwards through the slides). I found that, depending on whether random fluctuations take you in the same direction as the underlying bias or the opposite one, there tends to be an order of magnitude difference in how long the test takes to run. Furthermore, whichever version is leading after many observations is usually really better, and in the worst case is overwhelmingly likely to not be much worse. Therefore there are times when it really is better to just declare an answer and move on.
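A rough stdlib-only sketch of that spread (the rates, batch size, and cap here are illustrative assumptions, not the numbers from the slides): run many sequential tests between a fair coin and a slightly biased one, peeking every 50 flips, and record when each run first crosses p < 0.05.

```python
import math
import random

def two_prop_p(x_a, x_b, n):
    """Two-sided pooled two-proportion z-test p-value, n flips per coin."""
    pooled = (x_a + x_b) / (2 * n)
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:
        return 1.0
    z = abs(x_a / n - x_b / n) / se
    return math.erfc(z / math.sqrt(2))

def time_to_significance(p_a, p_b, rng, batch=50, cap=5000):
    """Flip both coins in batches, stopping at the first peek with p < 0.05."""
    x_a = x_b = n = 0
    while n < cap:
        x_a += sum(rng.random() < p_a for _ in range(batch))
        x_b += sum(rng.random() < p_b for _ in range(batch))
        n += batch
        if two_prop_p(x_a, x_b, n) < 0.05:
            break
    return n

rng = random.Random(0)
stops = sorted(time_to_significance(0.50, 0.55, rng) for _ in range(200))
print("10th/50th/90th percentile stopping n:", stops[19], stops[99], stops[179])
```

Even with the same true difference in every run, the lucky runs stop after a handful of batches while the unlucky ones grind on many times longer, which is the order-of-magnitude spread described above.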
If you wish to formalize this, you could use the strategy used by some medical trials, where they decide in advance what confidence levels will cause them to cut off early after 100 trials, 1,000 trials, or 10,000 trials, or else to go to (say) 50,000 trials. They then arrange that the sum of the odds of making a mistake at any early cutoff is below some acceptable threshold.
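One simple (if conservative) way to arrange that sum is a Bonferroni split: divide the overall error budget evenly across the planned looks, so the union bound guarantees the total chance of an early mistake stays under the threshold. A sketch with made-up look sizes:

```python
from statistics import NormalDist

def bonferroni_looks(look_sizes, overall_alpha):
    """Pre-commit a stricter per-look p-value cutoff so that, by the union
    bound, the chance of ANY false positive across looks <= overall_alpha."""
    per_look_alpha = overall_alpha / len(look_sizes)
    # Two-sided z threshold corresponding to the per-look alpha.
    z_cut = NormalDist().inv_cdf(1 - per_look_alpha / 2)
    return per_look_alpha, z_cut

# Looks at 100, 1,000, and 10,000 trials, final analysis at 50,000.
per_look_alpha, z_cut = bonferroni_looks([100, 1_000, 10_000, 50_000], 0.05)
print(per_look_alpha, round(z_cut, 3))  # each look needs roughly z > 2.5
```

Real trials typically use less conservative alpha-spending schemes (e.g. O'Brien-Fleming boundaries) that spend less of the budget at the early looks, but the bookkeeping idea is the same.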
For example, it works to flip two coins 50 times each, and then run a statistical-significance test. It does not work to flip two coins 50 times each, run a test; if no significance yet, continue to 100, then 150, etc. until you either find a significant difference or give up. That greatly increases the chance that you'll get a spurious significance, because your stopping is biased in favor of answering "yes": if you found a difference at 50, you don't go on to 100 (where maybe the difference would disappear again), but if you didn't find a difference at 50, you do go on to 100.
Put differently, it uses a separate p-value for "what is the chance I could've gotten this result in [50|100|150|...] trials with unweighted coins?" to reject the null hypothesis at each checkpoint, as if those tests were independent. But the relevant probability for the entire series is that of the union: "what is the chance I could've seen this result at any of the 50, 100, 150, 200, ... stopping points with unweighted coins?", and that probability is higher. Yet that's exactly how many A/B tests are done: you start collecting data, and let the trials run until you find "significant" differences or give up.
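A quick stdlib-only simulation of exactly this setup (two fair coins, peeks at 50/100/150/200 flips per coin, naive p < 0.05 at each peek) shows how far the "significant at any stopping point" probability exceeds the nominal 5%:

```python
import math
import random

def two_prop_p(x_a, x_b, n):
    """Two-sided pooled two-proportion z-test p-value, n flips per coin."""
    pooled = (x_a + x_b) / (2 * n)
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:
        return 1.0
    z = abs(x_a / n - x_b / n) / se
    return math.erfc(z / math.sqrt(2))

rng = random.Random(42)
sims = 5000
peeking_hits = single_hits = 0
for _ in range(sims):
    x_a = x_b = 0
    rejected = False
    for n in (50, 100, 150, 200):  # peek after every batch of 50 flips
        x_a += sum(rng.random() < 0.5 for _ in range(50))
        x_b += sum(rng.random() < 0.5 for _ in range(50))
        if not rejected and two_prop_p(x_a, x_b, n) < 0.05:
            rejected = True        # naive rule: "significant", declare a winner
    peeking_hits += rejected
    # Honest rule: one pre-planned test at n = 200 only.
    single_hits += two_prop_p(x_a, x_b, 200) < 0.05

peek_rate = peeking_hits / sims
single_rate = single_hits / sims
print(f"false-positive rate with peeking: {peek_rate:.3f}")
print(f"false-positive rate, single look: {single_rate:.3f}")
```

Both coins are fair, so every "significant" result here is spurious; the peeking rule produces them at well over twice the advertised 5% rate.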
(It's possible to set up a series of tests where you choose when to stop based on observed values, but you have to use different statistical machinery from the common significance tests.)
Furthermore, I note with interest that 2 of the 3 statistical techniques he named (Student's t-test and ANOVA) only apply when the observed variables are themselves normally distributed, which is not a good description of binary yes/no outcomes. As for the remaining technique, a chi-square test is appropriate, but statisticians tell us that the G-test is preferable.
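In practice the two tests usually agree closely on binary outcomes. Here's a stdlib-only sketch of both on a made-up 2×2 table (30/100 conversions vs 45/100); the numbers are illustrative, not from the thread:

```python
import math

def chi2_and_g(table):
    """Pearson chi-square and G-test (log-likelihood ratio) statistics for a
    2x2 contingency table, with p-values from the chi-square(1 df) tail."""
    (a, b), (c, d) = table
    n = a + b + c + d
    obs = [[a, b], [c, d]]
    # Expected counts under independence: (row total * column total) / n.
    exp = [[(a + b) * (a + c) / n, (a + b) * (b + d) / n],
           [(c + d) * (a + c) / n, (c + d) * (b + d) / n]]
    chi2 = sum((obs[i][j] - exp[i][j]) ** 2 / exp[i][j]
               for i in range(2) for j in range(2))
    g = 2 * sum(obs[i][j] * math.log(obs[i][j] / exp[i][j])
                for i in range(2) for j in range(2))
    # Survival function of chi-square with 1 df: erfc(sqrt(x / 2)).
    tail = lambda x: math.erfc(math.sqrt(x / 2))
    return chi2, tail(chi2), g, tail(g)

chi2, p_chi2, g, p_g = chi2_and_g([[30, 70], [45, 55]])
print(f"chi-square: stat={chi2:.3f}, p={p_chi2:.4f}")
print(f"G-test:     stat={g:.3f}, p={p_g:.4f}")
```

If scipy is available, `scipy.stats.chi2_contingency(table, correction=False, lambda_="log-likelihood")` computes the G-test directly (`correction=False` matches the uncorrected statistics above).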
Many people think they will become millionaires if they follow the style of person X.
Person X is like a trial in which a coin was tossed 10,000 times and came up heads 6,000 times.
Since there is no information about the other people, the other trials, many fall into the illogical belief that they will succeed in the same way.
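For scale, that record is about 20 standard deviations above a fair coin's mean, which is essentially impossible by luck alone; a quick normal-approximation tail check (stdlib only):

```python
import math

n, k, p = 10_000, 6_000, 0.5
mean = n * p
sd = math.sqrt(n * p * (1 - p))   # 50 for a fair coin
z = (k - mean) / sd               # 20 standard deviations
# Upper tail of the normal approximation; erfc stays accurate this far out.
tail = math.erfc(z / math.sqrt(2)) / 2
print(f"z = {z:.0f}, P(>= {k} heads) ~ {tail:.2e}")
```

The record itself says nothing about how many other trials were discarded to find it, which is exactly the missing information the paragraph above points at.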