In case you missed it: https://github.com/banditml/offline-policy-evaluation
My biggest question is: where do you get the user data to run the simulation? Take the simple push example. If to date you’ve only sent pushes on day 1, and you want to explore days 2, 3, 4, 5, etc., where does that user response data come from? It seems like you need to collect the data first, and only then can you simulate various permutations of a policy. But then why not just run a multi-armed bandit?
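To make the data requirement concrete, here is a minimal sketch of an inverse propensity scoring (IPS) estimate, one common offline evaluation technique. It is not taken from the linked repo; the names `ips_estimate`, `logs`, and `always_day2` are hypothetical, and the logged probabilities are made up for illustration. The point is that the estimate only carries information about actions the logging policy actually took with nonzero probability.

```python
def ips_estimate(logs, target_prob):
    """Estimate the average reward of a new policy from logged data.

    logs: list of (action, reward, logging_prob) tuples, where logging_prob
          is the probability the *old* policy had of choosing that action.
    target_prob: function mapping an action to the probability that the
          *new* policy would choose it.
    """
    total = 0.0
    for action, reward, logging_prob in logs:
        if logging_prob <= 0:
            raise ValueError("logged actions must have nonzero probability")
        # Reweight each logged reward by how much more (or less) likely
        # the new policy is to take the logged action.
        total += reward * target_prob(action) / logging_prob
    return total / len(logs)


# Hypothetical logs: the old policy randomized 50/50 between day 1 and day 2.
logs = [
    ("day1", 0.0, 0.5),
    ("day2", 1.0, 0.5),
    ("day1", 1.0, 0.5),
    ("day2", 1.0, 0.5),
]


def always_day2(action):
    # New policy under evaluation: always send the push on day 2.
    return 1.0 if action == "day2" else 0.0


estimate = ips_estimate(logs, always_day2)
```

If the logs contained only day 1 pushes, every weight for `always_day2` would be zero and the estimate would say nothing about day 2, which is exactly the concern raised above.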
A higher-order component would pick which variant to show at runtime (because we had problems with SSR; otherwise that would have been the more appropriate place). We cached the picks in cookies.
We used in-house charting and probability calculations to determine what P(X>Y) was for each experiment pair. Then we'd just manually prune the variants occasionally (since the bad ones weren't being displayed, timeliness didn't much matter). We'd periodically re-introduce old variants by hand if we thought it was worth it.
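The comment doesn't say how P(X>Y) was computed. One common way, sketched here as an assumption, is to put a Beta(1, 1) prior on each variant's conversion rate and estimate the probability by sampling from the two posteriors. The function name `prob_x_beats_y` and the example counts are made up.

```python
import random


def prob_x_beats_y(conv_x, n_x, conv_y, n_y, draws=20000, seed=0):
    """Monte Carlo estimate of P(rate_X > rate_Y) under Beta(1, 1) priors.

    conv_*: number of conversions, n_*: number of impressions.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each variant: Beta(1 + conversions, 1 + misses).
        x = rng.betavariate(1 + conv_x, 1 + n_x - conv_x)
        y = rng.betavariate(1 + conv_y, 1 + n_y - conv_y)
        wins += x > y
    return wins / draws


# Hypothetical counts: 120/1000 conversions for X vs 100/1000 for Y.
p = prob_x_beats_y(120, 1000, 100, 1000)
```

Running this over every pair of variants gives the kind of table you could prune from by hand.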
People respond to change. If you A/B test, say, a new email headline, the change usually wins. Even if it isn't better. Just because it is different. Then you roll it out in production, look at it a few months later, and it is probably worse.
If you don't understand downsides like this, then A/B testing is going to have a lot of pitfalls that you won't even know that you fell into.
If the lift goes away, you know it wasn’t real.
I was the team developer and not the team analyst, so I cannot speak to business assumption variance. The business didn’t seem to care about this since the release cycle was slow and defects were common. They were more concerned with the inverse proportions of cheap tests bringing stellar business wins.
Seriously. A/B tests are kind of a crapshoot. The systems are constantly changing. The online inference data drifts from the historical training data. User behavior changes.
I’ve seen models with positive offline metrics perform flat online. I’ve seen models with negative offline metrics perform positively online. There’s just a lot of variance between offline and online performance.
Just run the test. Lower the friction for running the tests, and just run them. It’s the only way to be sure.
Any researcher will tell you: this is really hard. It is more than an engineering problem. You need to know not only how to deal with problems, but also what problems may arise and what you can actually identify. Most importantly, you need to figure out what you cannot identify.
There are, at least here in academia, only a limited set of people who are really good at this.
Long story short: even if offline analysis is viable, I doubt every team has the right people for it, making it potentially not worthwhile.
It is infinitely easier to produce a statistical analysis that looks good but isn’t than one that actually is good. An overwhelming number of useless offline models would, statistically speaking, be expected ;)
Isn't there a multiple comparisons problem here? If you have enough data to do a single A/B test, how can you do a hundred historical comparisons and still have the same p-value?
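To put rough numbers on that worry, here is a small sketch of how the chance of at least one false positive grows with the number of comparisons, plus the Bonferroni-adjusted threshold. It assumes the tests are independent, which historical comparisons on shared data usually are not; the function names are made up.

```python
def family_wise_error_rate(alpha, m):
    """Chance of at least one false positive across m independent tests,
    each run at significance level alpha."""
    return 1 - (1 - alpha) ** m


def bonferroni_threshold(alpha, m):
    """Per-test threshold that keeps the family-wise error rate at or
    below alpha (the Bonferroni correction)."""
    return alpha / m


# One hundred comparisons, each at the usual 0.05 level.
fwer = family_wise_error_rate(0.05, 100)
per_test = bonferroni_threshold(0.05, 100)
```

With 100 comparisons at 0.05 each, `fwer` comes out around 0.99, and the corrected per-test threshold drops to 0.0005, which needs a lot more data per comparison.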