Run fewer, better A/B tests

(edoconti.medium.com)

94 points · econti · 5y ago · 22 comments

20 comments · 9 top-level

bruce343434 · 5y ago · 3 in thread

Between the emojis in the headings and the 2009-era memes, this was a bit of a cringey read. Also, the author seems to avoid at all costs going into depth about the actual implementation of OPE, and I still don't quite understand how I would go about implementing it. Machine learning based on past A/B tests that finds similarities between the UI changes???

econti (OP) · 5y ago

Author here! Every method I described in the post is implemented in the pip library I used there.

In case you missed it: https://github.com/banditml/offline-policy-evaluation

xivzgrev · 5y ago

Yea me too.

My biggest question is where do you get user data to run the simulation? Take the simple push example: if to date you’ve only sent pushes on day 1, and you want to explore days 2, 3, 4, 5, etc., where does that user response data come from? It seems like you need to get the data first, and only then can you simulate various permutations of a policy. But then why not just run a multi-armed bandit?

ec109685 · 5y ago

The author discusses that in the post: you need to have allowed every possibility to run with some nonzero probability.
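
To make the "every possibility at some probability" point concrete, here is a minimal sketch of inverse propensity scoring (IPS), one common OPE estimator. It is not the API of the library linked above; the function name, the log format, and the toy rewards are all assumptions made for illustration, and user context is left out to keep it short.

```python
from typing import Callable, List, Tuple

# One logged interaction: (action taken, probability the logging policy gave
# that action, observed reward). This record layout is hypothetical.
LoggedEvent = Tuple[str, float, float]


def ips_estimate(
    logs: List[LoggedEvent],
    target_policy: Callable[[str], float],
) -> float:
    """Estimate the mean reward the target policy would have earned.

    target_policy(action) returns the probability that the new policy would
    choose `action`. Each logged reward is reweighted by
    target probability / logging probability.
    """
    if not logs:
        raise ValueError("need at least one logged event")
    total = 0.0
    for action, logging_prob, reward in logs:
        if logging_prob <= 0.0:
            # An action that was never allowed to run cannot be evaluated.
            raise ValueError("logged actions need a nonzero logging probability")
        total += reward * target_policy(action) / logging_prob
    return total / len(logs)


def always_day_1(action: str) -> float:
    return 1.0 if action == "day_1" else 0.0


def always_day_2(action: str) -> float:
    return 1.0 if action == "day_2" else 0.0


# Made-up logs: the push went out on day 1 with probability 0.8 and on day 2
# with probability 0.2. Rewards are 1.0 for an open, 0.0 otherwise.
logs: List[LoggedEvent] = (
    [("day_1", 0.8, 1.0)] * 2
    + [("day_1", 0.8, 0.0)] * 6
    + [("day_2", 0.2, 1.0)] * 1
    + [("day_2", 0.2, 0.0)] * 1
)

print(ips_estimate(logs, always_day_1))
print(ips_estimate(logs, always_day_2))
```

The day-2 estimate here rests on only two logged events, which is why rarely explored actions give noisy estimates; if day 2 had never been sent at all, there would be nothing to reweight.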

eximius · 5y ago · 3 in thread

I'm going to stick with multi-armed bandit testing.

nxpnsv · 5y ago

This is a much better approach...

gingerlime · 5y ago

What tools/frameworks are you using for running and analysing results?

eximius · 5y ago

At my previous job when it was relevant, I wrote something in house.

A higher-order component would pick the variant at runtime (because we had problems with SSR; otherwise doing it server-side would have been more appropriate). We cached the picks in cookies.

In-house charting and probability calculations determined what P(X>Y) was for each experiment pair. Then we'd just manually prune them occasionally (since the bad ones weren't being displayed, timeliness didn't much matter), and periodically re-introduce old variants by hand if we thought it was worth it.
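
For anyone wondering what a P(X>Y) calculation for one experiment pair might look like, here is a minimal sketch. It assumes binary conversions, a uniform Beta(1, 1) prior on each variant's rate, and Monte Carlo sampling; the function name and the numbers are hypothetical, and this is not a description of the in-house tool above.

```python
import random


def prob_x_beats_y(
    conv_x: int,
    n_x: int,
    conv_y: int,
    n_y: int,
    draws: int = 20000,
    seed: int = 0,
) -> float:
    """Monte Carlo estimate of P(rate_X > rate_Y) under Beta posteriors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Beta(1 + successes, 1 + failures) is the posterior for a conversion
        # rate when the prior is uniform.
        x = rng.betavariate(1 + conv_x, 1 + n_x - conv_x)
        y = rng.betavariate(1 + conv_y, 1 + n_y - conv_y)
        if x > y:
            wins += 1
    return wins / draws


# Made-up counts: variant X converted 120 of 1000 users, variant Y 100 of 1000.
print(prob_x_beats_y(120, 1000, 100, 1000))
```

A variant whose probability of beating every other variant stays low would be a candidate for pruning; the threshold for "low" is a judgment call this sketch does not make.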

btilly · 5y ago · 2 in thread

The notifications examples make me wonder what fundamental mistakes they are making.

People respond to change. If you A/B test, say, a new email headline, the change usually wins. Even if it isn't better. Just because it is different. Then you roll it out in production, look at it a few months later, and it is probably worse.

If you don't understand downsides like this, then A/B testing is going to have a lot of pitfalls that you won't even know that you fell into.

uyt · 5y ago

I think it's known as the "novelty effect" in the industry.

jonathankoren · 5y ago

Yup. A common way to handle this is to throw away the earliest responses, and then run your analysis. This is sometimes called “burn in”.

If the lift goes away, you know it wasn’t real.
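
A minimal sketch of that burn-in check, assuming each response is recorded with the day of the test it arrived on. The record layout, the three-day cutoff, and the synthetic data (where variant B only wins during its first three days) are all assumptions for illustration.

```python
from typing import List, Tuple

# (day of the test the response arrived on, variant name, converted?)
Response = Tuple[int, str, bool]


def conversion_rate(responses: List[Response], variant: str) -> float:
    outcomes = [converted for _, v, converted in responses if v == variant]
    return sum(outcomes) / len(outcomes)


def relative_lift(responses: List[Response], burn_in_days: int = 0) -> float:
    """Lift of B over A after discarding responses from the first days."""
    kept = [r for r in responses if r[0] >= burn_in_days]
    control = conversion_rate(kept, "A")
    treatment = conversion_rate(kept, "B")
    return (treatment - control) / control


# Synthetic two-week test with 100 users per variant per day. A converts 10%
# throughout; B converts 20% on days 0-2 and 10% afterwards.
responses: List[Response] = []
for day in range(14):
    b_conversions = 20 if day < 3 else 10
    for i in range(100):
        responses.append((day, "A", i < 10))
        responses.append((day, "B", i < b_conversions))

print(relative_lift(responses))                  # lift over the whole test
print(relative_lift(responses, burn_in_days=3))  # lift after the burn-in
```

In this made-up data the lift over the whole test is positive and vanishes once the first three days are dropped, which is the pattern a novelty effect would produce.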

austincheney · 5y ago · 2 in thread

When I was the A/B test guy for Travelocity I was fortunate to have an excellent team. The largest bias we discovered is that our tests were executed with amazing precision and durability. My dedicated QA was the shining star that made that happen. Unfortunately, when the resulting feature entered the site in production as a released feature, there was always some defect, or some conflict, or some oversight. The actual business results would then underperform compared to the team’s analyzed prediction.

tartakovsky · 5y ago

What is your advice, or more details on the types of challenges you came across and how you handled this discrepancy? I would imagine the data shifts a bit, and that standard assumptions don’t hold up around how the difference you were measuring between your “A” and your “B” remain fixed after the testing period.

austincheney · 5y ago

The biggest technical challenge we came across is that when I had to hire my replacement we couldn’t find competent developer talent. Everybody NEEDED everything to be jQuery and querySelectors but those were far too slow and buggy (jQuery was buggy cross browser). A good A/B test must not look like a test. It has to feel and look like the real thing. Some of our tests were extremely complex spanning multiple pages performing varieties of interactions. We couldn’t be dicking around with basic code literacy and fumbling through entry level beginner defects.

I was the team developer and not the team analyst so I cannot speak to business assumption variance. The business didn’t seem to care about this since the release cycle is slow and defects were common. They were more concerned with the inverse proportions of cheap tests bringing stellar business wins.

jonathankoren · 5y ago · 1 in thread

I’m pretty skeptical of this. I’ve run a lot of ML based A/B tests over my career. I’ve talked to a lot of people that have also run ML A/B tests over their careers. And the one constant everyone has discovered is that offline evaluation metrics are only somewhat directionally correlated with online metrics.

Seriously. A/B tests are kind of a crap shoot. The systems are constantly changing. The online inference data drifts from the historical training data. User behavior changes.

I’ve seen positive offline models perform flat. I’ve seen negative offline metrics perform positively. There’s just a lot of variance between offline and online performance.

Just run the test. Lower the friction for running the tests, and just run them. It’s the only way to be sure.

zwaps · 5y ago

Maybe I am too cynical, but what we are really talking about here is causal inference for observational data based on more or less structural statistical models.

Any researcher will tell you: this is really hard. It is more than an engineering problem. You need to know not only how to deal with problems, but also what problems may arise and what you can actually identify. Most importantly, you need to figure out what you cannot identify.

There are, at least here in academia, only a limited set of people who are really good at this.

Long story short: even if offline analysis is viable, I doubt every team has the right people for it, making it potentially not worthwhile.

It is infinitely easier to produce a statistical analysis that looks good but isn’t than one that is good. An overwhelming number of useless offline models would, statistically speaking, be expected ;)

tootie · 5y ago

I've seen a lot of really sophisticated data pipelines and testing frameworks at a lot of shops. I've seen precious few who were able to make well-considered product decisions based on the data.

dr_dshiv · 5y ago

The challenge I've seen is to have a combination of good, small-scale Human-Centered Design research (watching people work, for instance) and good, large-scale testing. It can be really hard to learn the "why" from A/B tests otherwise.

sbierwagen · 5y ago

>Now you might be thinking OPE is only useful if you have Facebook-level quantities of data. Luckily that’s not true. If you have enough data to A/B test policies with statistical significance, you probably have more than enough data to evaluate them offline.

Isn't there a multiple comparisons problem here? If you have enough data to do a single A/B test, how can you do a hundred historical comparisons and still have the same p value?
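
The arithmetic behind that worry can be sketched as follows. The family-wise error formula assumes the comparisons are independent, which offline evaluations that share the same logs would not be, so treat the number as an illustration of scale rather than a statement about OPE itself. Function names are hypothetical; the Bonferroni correction shown is one standard, conservative fix.

```python
def family_wise_error_rate(alpha: float, comparisons: int) -> float:
    """Chance of at least one false positive across independent tests."""
    return 1.0 - (1.0 - alpha) ** comparisons


def bonferroni_threshold(alpha: float, comparisons: int) -> float:
    """Per-comparison p-value cutoff that keeps the family-wise rate <= alpha."""
    return alpha / comparisons


alpha = 0.05
comparisons = 100

# With 100 independent comparisons at p < 0.05, a false positive somewhere
# is close to certain (roughly 0.994).
print(family_wise_error_rate(alpha, comparisons))

# Bonferroni asks each comparison to clear a much stricter bar instead.
print(bonferroni_threshold(alpha, comparisons))
```

A stricter per-comparison cutoff needs more data per comparison to reach, which is the tension the question points at.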

varsketiz · 5y ago

Lately I keep hearing booking.com cited as an example of a company that runs a lot of A/B tests. Anyone from Booking reading this? How does it look from the inside? Is it worth it to run hundreds at a time?
