I believe the author's confusion stems from attempting to model every interaction term, e.g. if you think there will be some extra effect to being Male and using Windows and being Swedish that is more than the "Male effect" and "Windows effect" and the "Sweden effect" considered individually.
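For concreteness, here is a tiny sketch of what an interaction column looks like in a design matrix (the feature names are just this thread's running example; the layout is illustrative, not anyone's actual model):

```python
import numpy as np

# Three binary user features plus one three-way interaction column.
# A model with *every* interaction of 3 binary features would need
# 2**3 - 1 = 7 non-intercept columns; this keeps just one.
def design_row(male, windows, swedish):
    return np.array([1.0, male, windows, swedish, male * windows * swedish])

# All 8 possible users as a design matrix.
X = np.array([design_row(m, w, s)
              for m in (0, 1) for w in (0, 1) for s in (0, 1)])
print(X.shape)  # (8, 5)
```

The interaction column is only 1 for the Male+Windows+Swedish user, which is exactly the "extra effect beyond the individual effects" being described.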
I wrote a little article on implementing a simple multivariate A/B test from scratch:
http://www.evanmiller.org/linear-regression-for-fun-and-prof...
All you really need is a function that inverts a matrix and you're good to go. The cool part is that you can actually throw away most of the data that comes in from the firehose, just keep updating the Hessian until you're ready to invert it and bam, you have a model.
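A minimal sketch of that streaming idea (illustrative numbers): accumulate X'X (the "Hessian" for least squares) and X'y from the stream, discard the raw rows, and solve the normal equations whenever you want a model - solving rather than explicitly inverting:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
XtX = np.zeros((d, d))   # running X'X -- O(d^2) memory, independent of stream length
Xty = np.zeros(d)        # running X'y
true_beta = np.array([2.0, -1.0, 0.5])

for _ in range(10_000):              # pretend this is the firehose
    x = rng.normal(size=d)
    y = x @ true_beta + rng.normal(scale=0.1)
    XtX += np.outer(x, x)            # keep only the sufficient statistics,
    Xty += x * y                     # throw the raw observation away

# Solve the normal equations (X'X) beta = X'y.
beta_hat = np.linalg.solve(XtX, Xty)
print(np.round(beta_hat, 2))
```

With enough data the estimate lands on the true coefficients; the raw observations were never stored.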
That's a good blog post.
There are many ways to handle multi-variate data, with different tradeoffs. The linear model you describe in your post is one of the classic machine learning models often used. In the post, we address the shortcoming of such models in the subsection "Trading off exploration vs. Exploitation."
As we point out, such a model ignores the trade-off between exploration and exploitation. That means that you have to gather all your data by randomly serving variants, before you can figure out the right variant to serve to each user.
That's very expensive in terms of serving-the-wrong-variant, if what you are trying to predict is conversion or revenue. By contrast, our method will quickly realise if a particular variant has a very high expected value for the current user (with the current user's particular combination of features), and so serve that user that variant - even if we aren't confident about other users.
So, like any bandit algorithm, we trade off exploration and exploitation; but further, we trade this off not just across the population as a whole, but on a personalised basis. That's very important functionality that the approach you describe doesn't have.
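One way to get this personalised explore/exploit behaviour (a hedged sketch of the general technique, not necessarily what Synference actually runs) is Thompson sampling with a Bayesian linear model per variant: sample a parameter vector from each variant's posterior, and serve whichever variant looks best for this user's features.

```python
import numpy as np

rng = np.random.default_rng(1)

class LinearThompson:
    """Bayesian linear model for one variant's expected reward."""
    def __init__(self, d, alpha=1.0):
        self.A = alpha * np.eye(d)   # posterior precision (ridge prior)
        self.b = np.zeros(d)

    def sample_value(self, x):
        cov = np.linalg.inv(self.A)
        mean = cov @ self.b
        theta = rng.multivariate_normal(mean, cov)  # one posterior draw
        return x @ theta

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x

# Two variants; x is the current user's feature vector.
variants = [LinearThompson(d=3), LinearThompson(d=3)]
x = np.array([1.0, 0.0, 1.0])        # e.g. intercept + two binary features
choice = max(range(2), key=lambda k: variants[k].sample_value(x))
variants[choice].update(x, reward=1.0)  # observed conversion
```

Because the draw is from the posterior, variants the model is uncertain about for *this* user still get explored, while confidently-good variants get exploited.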
Why is this an issue? Because if you have to gather data by randomly sampling, what if a certain sort of user doesn't appear very often? (e.g. let's say for a particular site, that's a 65 year old female from Canada). Do you randomly sample until you see a user like this? If so, you are randomly sampling for a long time. Or do you stop your exploration before you see rare users? That totally neglects your rarer users. What if rare users have unusual preferences, or are particularly valuable? (E.g. maybe your rare users are your biggest spenders - which happens.)
And, if you have high dimensional data, all your users are effectively rare (because you rarely see users with that exact combination of features).
Furthermore, the 'gather data randomly' approach is also impractical if user behaviour is dynamic, as it often is in real-world settings. (i.e. the preferences of your user base as a whole change - maybe they've all recently read a particular article, so different marketing copy now resonates with them.)
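One standard remedy for drifting behaviour (sketched here with made-up data and an assumed forgetting factor) is to exponentially discount the model's sufficient statistics, so recent observations dominate old ones:

```python
import numpy as np

gamma = 0.999            # forgetting factor -- illustrative choice
d = 3
XtX = np.eye(d) * 1e-3   # small ridge term keeps the solve well-posed
Xty = np.zeros(d)

def observe(x, y):
    global XtX, Xty
    XtX = gamma * XtX + np.outer(x, x)   # old evidence decays geometrically
    Xty = gamma * Xty + x * y

rng = np.random.default_rng(2)
for t in range(5_000):
    # User preferences change halfway through the stream.
    beta = np.array([1.0, 0.0, 0.0]) if t < 2_500 else np.array([0.0, 1.0, 0.0])
    x = rng.normal(size=d)
    observe(x, x @ beta + rng.normal(scale=0.1))

beta_hat = np.linalg.solve(XtX, Xty)
print(np.round(beta_hat, 1))  # tracks the *recent* beta, not the old one
```

A plain running average would get stuck halfway between the two regimes; the discounted version follows the shift.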
Finally, sometimes those interactions you mention, between different features, are really important, and shouldn't be neglected.
I could go on - there are a lot of subtle issues that the approach we're using handles and that simple multivariate testing doesn't, basically.
What we're doing is trying to provide an API that wraps all this knowledge up into a service that's easy for people to use.
This is discussed in many places, but here are a few examples:
http://www.johndcook.com/blog/2010/01/19/dont-invert-that-ma... http://en.wikipedia.org/wiki/Linear_least_squares_(mathemati...
It has successfully increased Wikipedia's donation revenue; however, some fundamental assumptions baked into the AB-testing approach are wrong, and this leaves a lot of money on the table.
Instead of trying to find a single best version for everyone, people should realise that different segments of their users are going to have different preferences. Machine learning can find these in an automated way, and there's a lot of value to be gained - but people need to see past the standard 'one size fits all' AB-testing approach to take advantage of it.
A better headline might have been "Hidden assumptions in Wikipedia's A/B testing and how it could be improved". Also, why are we picking on Wikipedia here? Don't a lot of companies claim to do A/B testing and isn't this a problem inherent in all simple A/B testing?
Although it's just a guesstimate, I imagine that the broad approach to A/B testing (or just doing it at all in the first place!) will be the most important and easiest fruit to grab, whilst a multi-armed bandit solution will probably only provide much smaller returns on top. So given a choice between a Wikipedia dev concentrating on improving the overall message, as opposed to trying to improve 5 or more cohorts at once - I'd go for the easier option until it stops giving significant returns.
That's not to say it shouldn't be done eventually. Like you said, it's potential money left on the table!
Should Wikipedia use a contextual bandit system? Assuming the cost is equivalent, I say yes, for the following reasons:
- they have a shedload of traffic, and then some; and
- they have a limited time in which the campaign is running.
Let's address these in turn.
The regret bounds on contextual bandits are O(d sqrt(T)), where d is the dimensionality of the user profile you're using and T is time (see e.g. http://arxiv.org/abs/1209.3352). In contrast, the standard stochastic bandit has regret bounds of O(log(T)), which is much tighter. The summary is you need a lot more data to explore the bigger range of possibilities you've created. So I'm not sure contextual bandits are viable for smaller sites, but Wikipedia definitely has the traffic to make them work.
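A quick back-of-the-envelope comparison (the constants are made up; only the growth rates matter) shows how much faster the contextual bound grows, i.e. how much more traffic you need:

```python
import math

d = 10  # assumed dimensionality of the user profile
for T in (10**3, 10**5, 10**7):
    contextual = d * math.sqrt(T)   # O(d sqrt(T)) regret bound
    stochastic = math.log(T)        # O(log T) regret bound
    print(T, round(contextual), round(stochastic, 1))
```

Even at modest dimensionality the sqrt(T) term dwarfs log(T), which is why small sites may struggle but Wikipedia-scale traffic is comfortably enough.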
If your test is running for a limited time you can just delete the code when you're done. This has several advantages. People running A/B tests typically want to answer the question "what is best?" (To crush your enemies, obviously, but you can't usually say that) under the assumption that you'll then display the best variant forever. Contextual bandit algorithms don't answer this question; instead they minimise regret (equivalently, making you the most money) which makes more sense in the limited time setting. In fact there is no "best" with a contextual bandit -- the best variant depends on the user profile. This means you need to run the algorithm forever: good if you're collecting monthly subscriptions based on algorithm use, but bad if you're trying to manage a site with an exploding number of variants under test that you can never delete.
Ob plug: If you've found this post interesting you should consider signing up for the bandit course I'm running: http://noelwelsh.com/data/2013/08/16/free-bandit-course-draf... It will cover contextual bandits, and a whole heap of other interesting stuff.
It's definitely viable for smaller sites. Your argument there is about the bounds in a particular theoretical situation ("when the contexts are provided by an adaptive adversary").
In a real world situation, your contexts are provided by the normal behaviour of your users, not an adversary. And, rather than the patterns being maximally hard to find, there'll instead be a lot of low-hanging-fruit - in terms of very obvious combinations of features that define different segments of users.
What's important is not the bound of the contextual bandit in a theoretical situation, but rather what performance it gives in 90% of business situations. In many situations, doing some personalisation to automatically find the most obvious user groups is a huge win.
And later, as the amount of data increases, the granularity of segmentation will effectively increase.
I agree with your general comments on bandits, and love what Myna is doing.
The company whose ad this post is doesn't show much data about the difference between the solution it's selling and A/B testing. I wonder if there isn't any, or if it just wouldn't support its price?
Yes, it's currently difficult and expensive to bring complex ML to bear on this problem; we are trying to make that easy and cheap. The reason people adopted AB-testing was that it has a positive impact on the metrics they care about; we think the workflow can be easier and more valuable than AB testing. If it's easy to use, the extra value should cover the cost.
We're currently focused on trying to communicate that this approach can create value - this blog post, written for a technical audience, is meant to give some intuition for how the approach works - does that make sense?
We genuinely believe in making it easy to use ML optimisation in this context - that's why we're working on this.
Wikipedia's AB testing is correct in as far as the problem and approach have been defined. This article suggests an improvement.
Wrong is when they're trying to test for something but due to some error they aren't actually testing for what they think they are.
Something can be accurate/correct but also improvable.
That's why we're working on this problem.
As I said elsewhere, we think AB-testing is a big improvement over nothing, but, like any powerful tool, it can blind people to its downsides.
Our blog title is designed to point out, succinctly, that we think the vanilla AB-testing framework is wrong for Wikipedia, as an easy-to-understand example. We believe that.
I notice that you volunteer for Wikipedia; I want to be clear, we're not trying to pick on Wikipedia here; I personally love Wikipedia. It's just a good, well-known example of a successful AB-testing campaign that we think could be greatly improved.
Read more about the research done in Clinical Trials, and one of my favorite Professors here: http://web.eecs.umich.edu/~qstout/AdaptiveDesign.html
We made a case study of Wikipedia because their banners are a well-known example of A/B testing. Plus, because their banners make tens of millions of dollars each year, and they have an entire team of people working on the A/B tests over a period of months, I think they have capacity to use something more sophisticated. Maybe, like you suggest, they're just focused on other aspects of the fundraising campaign. I guess the only way to know would be to hear from some of the Wikipedia team!
There's the example of location, and I guess you can also get browser information (mobile device vs. desktop), but other than that, I feel there isn't that much in terms of context to work with for first-time anonymous users, which is the most common case for A/B testing.
Would love to hear how else you're classifying users, and what other buckets you're using.
Maybe people reading computer science articles prefer light blue, while people reading medicine articles prefer gold.
Of course, a quick look at Wikipedia's categories show there are many possible article categorisation schemes (are sneakers footwear like stilettos or sports equipment like cricket bats? Of course they're both), so this would introduce many additional variables - more than you could exhaustively A/B test.
There are many important pieces of information you can grab from a user who's never been on your site, based on basic information available to all web servers, such as the user's IP address and user agent:
- the time of day / day of week they're visiting - relevant because, e.g., night users or weekend users might be a different demographic
- information related to the particular page they're visiting, such as article category on Wikipedia
- operating system type and age
- device type - smartphone, tablet, PC
- geographic location
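A rough sketch of extracting some of these features with the standard library alone (real systems would use a proper user-agent parser and a GeoIP database for location; the string-matching rules below are deliberately crude):

```python
from datetime import datetime, timezone

def request_features(user_agent: str, ts: datetime) -> dict:
    """Crude feature extraction from request metadata."""
    ua = user_agent.lower()
    # Device type, roughly in order of specificity.
    if "ipad" in ua or "tablet" in ua:
        device = "tablet"
    elif "mobile" in ua or "iphone" in ua or "android" in ua:
        device = "smartphone"
    else:
        device = "pc"
    # Operating system family.
    if "windows" in ua:
        os_name = "windows"
    elif "iphone" in ua or "ipad" in ua:
        os_name = "ios"
    elif "android" in ua:
        os_name = "android"
    elif "mac os" in ua:
        os_name = "mac"
    elif "linux" in ua:
        os_name = "linux"
    else:
        os_name = "other"
    return {
        "hour_of_day": ts.hour,
        "is_weekend": ts.weekday() >= 5,
        "device": device,
        "os": os_name,
    }

f = request_features(
    "Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) Mobile Safari",
    datetime(2013, 8, 17, 22, 30, tzinfo=timezone.utc),
)
print(f)
```

The timestamp should be shifted into the user's local timezone (e.g. inferred from IP geolocation) before the hour/weekend features mean anything demographically.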
You can use location to infer a range of demographic features, such as income, education, etc. Due to its importance, location is an attribute that Synference adaptively coarsens/zooms in on. For example, if you have a lot of users from a certain city, then each user within that city will have a location whose granularity goes down to the neighborhood. But if you have relatively few users from Europe, then each user's location might only be at the level of the country.
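The coarsening idea might be sketched like this (illustrative threshold and counts, not Synference's actual rules): report a user's location only at the finest level of the country/city/neighbourhood hierarchy that has enough observed users.

```python
from collections import Counter

MIN_USERS = 50  # assumed threshold for "enough data at this granularity"

def coarsen(path, counts):
    """path: (country, city, neighbourhood); counts: users seen per prefix."""
    kept = []
    for i in range(len(path)):
        prefix = path[:i + 1]
        if counts[prefix] >= MIN_USERS:
            kept = list(prefix)   # this level has enough data -- keep it
        else:
            break                 # too sparse below here -- stop refining
    return tuple(kept) or (path[0],)  # never coarser than country

counts = Counter({
    ("US",): 10_000, ("US", "NYC"): 4_000, ("US", "NYC", "SoHo"): 120,
    ("FR",): 30,
})
print(coarsen(("US", "NYC", "SoHo"), counts))   # ('US', 'NYC', 'SoHo')
print(coarsen(("FR", "Paris", "Marais"), counts))  # ('FR',)
```

Dense regions get neighbourhood-level features; sparse ones fall back to country-level, matching the behaviour described above.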
The Synference API makes it easy to extract this information. All you have to do is send the API the user's IP address and user agent, and the API parses those two pieces of information into most of the features I've just listed. We also add a timestamp (corrected for the user's timezone using the IP address's geographic location).
We don't need to split users into buckets---machine learning classifiers build models that automatically take care of deciding how a user's attributes can be used to predict the target metric.
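For instance, even a single decision stump will discover a split point from the data rather than needing hand-defined buckets (the ages and conversion labels below are made up for illustration):

```python
# (age, converted) pairs, sorted by age.
users = [(25, 0), (30, 0), (35, 0), (55, 1), (60, 1), (65, 1)]

def best_split(data):
    """Find the age threshold that best separates converters from non-converters."""
    best = None
    for i in range(1, len(data)):
        thresh = (data[i - 1][0] + data[i][0]) / 2
        left = [y for a, y in data if a < thresh]
        right = [y for a, y in data if a >= thresh]
        # Misclassification if each side predicts its majority label.
        err = min(sum(left), len(left) - sum(left)) \
            + min(sum(right), len(right) - sum(right))
        if best is None or err < best[1]:
            best = (thresh, err)
    return best[0]

print(best_split(users))  # 45.0
```

The "bucket boundary" at 45 falls out of the data automatically; a full tree (or any other classifier) just repeats this kind of decision at scale, over all the attributes at once.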
I'm sure the A/B testing can be improved but not at all convinced how it's done now is "wrong".