Suppose I have a website with a "Click Me" button that's green in color. I want to increase clicks and think to myself, "perhaps if it was a red button instead of a green button, more people would click!" To test this, I would run an A-B test along the lines of:
import random
color = 'red' if random.randrange(2) == 0 else 'green'  # 50/50 split per visitor
In theory, I just push this code, track the number of clicks on the red button versus the green button, and then pick the best. But in practice, when I push the code, there might be 5 clicks on green and none on red in the first hour. Maybe green is better? Maybe I didn't wait long enough? Okay, let's wait longer. A few hours later, there are now 10 clicks on red and only 6 clicks on green. Okay, so red is better? Let's wait even longer. A week later, there are 5000 clicks on red and 4500 clicks on green. That seems like enough data to draw a conclusion about red vs. green. But is there a better way?
This is where A-A-B-B testing can help. Let's start by looking at just the A-A part of the test. If I split my audience into two equal groups (green1 and green2) and show them both green buttons, the results should be nearly identical because both buttons are green; any gap between them is noise. If I check back in an hour and the "green1" and "green2" groups are off by 20%, then I have a large margin of error and need to wait longer. If I check back in 6 hours and they're off by 10%, then I still need to wait longer. If I check back in a day and green1 and green2 are only off by 1%, then we've probably waited long enough and my margin of error is around 1%. I can now add up green1+green2, compare it to red1+red2, and see if there's a clear winner (e.g. red is 5% better). And this only took a day instead of a week!
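A minimal Python sketch of the check described above, not a formal significance test. The function names, the 1% tolerance default, and the idea of also comparing red1 against red2 are assumptions added for illustration; the groups are assumed to be equal in size so raw click counts are comparable.

```python
def relative_gap(clicks_a, clicks_b):
    """Gap between two click counts as a fraction of the larger count."""
    larger = max(clicks_a, clicks_b)
    if larger == 0:
        return 0.0
    return abs(clicks_a - clicks_b) / larger


def aabb_verdict(green1, green2, red1, red2, tolerance=0.01):
    """Apply the A-A-B-B heuristic to four click counts.

    tolerance is a hypothetical cutoff: the largest A-A (or B-B) gap
    we accept before trusting the A-vs-B comparison.
    """
    # Noise estimate: how far apart two identical groups ended up.
    noise = max(relative_gap(green1, green2), relative_gap(red1, red2))
    if noise > tolerance:
        return "keep waiting"

    green_total = green1 + green2
    red_total = red1 + red2
    # Only call a winner if the A-vs-B gap is larger than the noise.
    if relative_gap(green_total, red_total) <= noise:
        return "no clear winner"
    return "red" if red_total > green_total else "green"


if __name__ == "__main__":
    print(aabb_verdict(50, 40, 45, 45))          # identical groups still 20% apart
    print(aabb_verdict(1000, 995, 1050, 1045))   # groups agree, red ahead by ~5%
```

The design choice here is to treat the A-A gap as a rough noise floor: the A-vs-B difference only counts if it is larger than the difference seen between two groups that got the same button.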
Microsoft Research suggested (http://ai.stanford.edu/~ronnyk/2009controlledExperimentsOnTh...) that you continuously run A/A tests alongside your experiments. An A/A test can:
- collect data and assess its variability for power calculations
- test the experimentation system (the null hypothesis should be rejected about 5% of the time when a 95% confidence level is used)
- tell if users are split according to the planned percentages
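To illustrate the second point in the list above, here is a rough Python simulation of repeated A/A tests. It is a sketch under stated assumptions: the two-proportion z-test, the 10% conversion rate, the group size, and the function names are all chosen for illustration and do not come from the linked paper.

```python
import math
import random


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (conv_a / n_a - conv_b / n_b) / se
    # For a standard normal Z, P(|Z| > |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


def aa_false_positive_rate(rate=0.1, users_per_group=500, trials=2000,
                           alpha=0.05, seed=0):
    """Fraction of simulated A/A tests that wrongly report a difference."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        # Both groups see the same variant, so both use the same true rate.
        a = sum(rng.random() < rate for _ in range(users_per_group))
        b = sum(rng.random() < rate for _ in range(users_per_group))
        p = two_proportion_p_value(a, users_per_group, b, users_per_group)
        if p < alpha:
            rejections += 1
    return rejections / trials


if __name__ == "__main__":
    print(aa_false_positive_rate())  # should land near alpha
```

If a real system's A/A tests reject far more or far less often than alpha, that points at a problem in assignment, logging, or the test statistic rather than at the variants themselves.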
You'll probably have to ensure it applies sequentially too, at least to be sure the As and Bs are stable in their matching, but it seems to me an elegant solution for the problem (not that I'm a statistician, though).
I ran an online marketplace at a previous gig. Our service providers always complained that they didn't know what to charge to maximize their business. They couldn't see the forest as a tree. Because we had the data for all providers, we started letting them know if they were under- or over-priced, and we saw more conversions and revenue.
Dynamic pricing (like Uber does on holidays) alone could be hugely valuable.
Isn't that exactly what they're not doing?
As a conversion optimization guy, I see my fair share of "button color" stories--which are as much of a mockery as "growth hacking". So I was glad to see they're doing this right: Tracking conversions and paths by cohorts; cautious of false positives; testing big changes instead of, e.g., button colors; combining test results with qualitative data (usability tests) when making decisions; ...
I've previously thought about writing a little program that keeps an eye on the major hotels in the city, so I can work out when rooms are more in demand and amend my pricing.
Even a "this is the average cost of similar rentals within an X radius of you" would be helpful.
It would be beneficial to AirBnB too. They'd get more money in fees and money flowing through their system. It also works both ways - if I reduce pricing in low demand times, it's possible I'd get bookings I wouldn't have before - again increasing the amount for AirBnB.
Someone with an afternoon free should really step on this. Just make sure it doesn't end up starting price wars with itself!
Call it Autopilot, or something.
Did your TOS allow for this? At one of the early SaaS firms we really wanted to publish metrics (well, sell..) about how each firm was doing relative to their peers.
They all wanted this info, but at the same time were strongly opposed to our using their data in the studies.
When is AirBnB going to experiment with helping their hosts follow the law? I bet I can predict that graph. Why, look at all those illegal rentals in SF right there in the sample screenshots--oh the irony.
Remember, DON'T FUCK UP THE CULTURE! But it's OK to fuck up your host city for a buck or 2 billion.
Learn to deal with change instead of bitching about everyone who bucks the trend. I assume you feel the same way about Uber and Google Fiber displacing all those taxi commissions and telecom monopolies.
Now, I've used Airbnb a dozen times, but I can see the concern for people in the above situation. I've had people ask me when I'm renting certain Airbnb apartments if I live in the building, and they seemed a little upset with the idea of me just passing through for a week.
In case you haven't heard: http://en.wikipedia.org/wiki/Tragedy_of_the_commons
Also: there is a local law on the books, duly enacted by a democratic process, but you're arguing nobody needs to follow that law because you think nobody is being impacted? Is that your position? I have to assume you're an AirBnB host, so I wonder if you've contacted every one of your neighbors to see if they're cool with your gig.
Rate of deployment of experiments is a better focus; since all your opponents are bound to copy your winners anyway, you have to rely on the few months' edge you've earned before they do so, and constantly maintain that lead.
Funnily enough, the page they reference for calculating the right sample size actually talks about sequential analysis, but AirBnB doesn't mention this in describing their solution...
http://elem.com/~btilly/ab-testing-multiple-looks/part2-limi...
Try setting your significance threshold (the cutoff you compare each p-value against) to your Type 1 error rate divided by the number of tests you perform. It will be much smaller, and this is a good thing. Significance should really test for significance, not random chance.
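Dividing the Type 1 error rate by the number of tests is commonly known as the Bonferroni correction. A small Python sketch, with function names and example p-values made up for illustration:

```python
def bonferroni_threshold(alpha, num_tests):
    """Per-test cutoff: overall Type 1 error rate divided by the test count."""
    if num_tests < 1:
        raise ValueError("num_tests must be at least 1")
    return alpha / num_tests


def significant_after_correction(p_values, alpha=0.05):
    """Flag which p-values fall below the corrected per-test cutoff."""
    if not p_values:
        return []
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p < threshold for p in p_values]


if __name__ == "__main__":
    # Hypothetical p-values from five looks at the same experiment.
    looks = [0.001, 0.02, 0.04, 0.2, 0.5]
    print(bonferroni_threshold(0.05, len(looks)))
    print(significant_after_correction(looks))
```

With five tests and an overall rate of 0.05, only results below 0.01 count, so the 0.02 and 0.04 readings that would pass an uncorrected check no longer do.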