Experiments at Airbnb

(nerds.airbnb.com)

164 points · lennysan · 12y ago · 43 comments

wkonkel · 12y ago · 8 in thread

A simple hack is to run an A-A-B-B test instead of an A-B test. Rather than splitting 50-50, use 25-25-25-25 splits. When A1==A2 and B1==B2, then you know that you have statistically relevant data and you can compare A to B. Depending on the dataset, this could happen in minutes or weeks.

wkonkel · 12y ago

To explain this in a different way, let's use a simplified example:

Suppose I have a website with a "Click Me" button that's green in color. I want to increase clicks and think to myself, "perhaps if it was a red button instead of a green button, more people would click!" To test this, I would run an A-B test along the lines of:

if random(2) == 0 then color='red' else color='green';

In theory, I just push this code and track the number of clicks on the red button versus the green button and then pick the best. But in practice, when I push the code, there might be 5 clicks on green and none on red in the first hour. Maybe green is better? Maybe I didn't wait long enough? Okay, let's wait longer. A few hours later, there's now 10 clicks on red and only 6 clicks on green. Okay, so red is better? Let's wait even longer. A week later, there's 5000 clicks on red and 4500 clicks on green. That seems like enough data that I can make a conclusion about red vs. green. But is there a better way?

This is where A-A-B-B testing can help. Let's start by looking at just the A-A part of the test. If I split my audience into two groups (green1 and green2) and show them both green buttons, the results should be identical because both buttons are green. If I check back in an hour and the "green1" and the "green2" groups are off by 20%, then I have a large margin of error and need to wait longer. If I check back in 6 hours and they're off by 10%, then I need to wait longer. If I check back in a day and green1 and green2 are only off by 1% then that means we've probably waited long enough and my margin of error is around 1%. I can now add green1+green2 and compare it to red1+red2 groups and see if there's a clear winner (e.g. red is 5% better). And this only took a day instead of a week!
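The A-A-B-B split described above can be sketched as a small simulation. This is only an illustration, not wkonkel's actual setup: the bucket names, the visitor count, and the "true" click rates for green and red are all made-up assumptions.

```python
import random

BUCKETS = ("green1", "green2", "red1", "red2")

def simulate(visitors, click_rates, seed=0):
    """Return {bucket: (visits, clicks)} for a simulated 25-25-25-25 split.

    click_rates maps a color to a hypothetical true click probability.
    """
    rng = random.Random(seed)
    stats = {bucket: [0, 0] for bucket in BUCKETS}
    for _ in range(visitors):
        bucket = rng.choice(BUCKETS)   # each bucket is equally likely
        color = bucket[:-1]            # "green1" -> "green"
        stats[bucket][0] += 1
        if rng.random() < click_rates[color]:
            stats[bucket][1] += 1
    return {bucket: tuple(counts) for bucket, counts in stats.items()}

def rate(visits, clicks):
    return clicks / visits if visits else 0.0

def relative_gap(stats, a, b):
    """Relative difference between the click rates of two buckets."""
    ra, rb = rate(*stats[a]), rate(*stats[b])
    return abs(ra - rb) / max(ra, rb) if max(ra, rb) else 0.0

def pooled_rate(stats, color):
    """Click rate of color1 + color2 combined."""
    visits = stats[color + "1"][0] + stats[color + "2"][0]
    clicks = stats[color + "1"][1] + stats[color + "2"][1]
    return rate(visits, clicks)

# Hypothetical numbers: 200,000 visitors, green clicks 10%, red clicks 10.5%.
stats = simulate(200_000, {"green": 0.10, "red": 0.105})
print("green1 vs green2 gap:", round(relative_gap(stats, "green1", "green2"), 4))
print("red1 vs red2 gap:    ", round(relative_gap(stats, "red1", "red2"), 4))
print("pooled green rate:   ", round(pooled_rate(stats, "green"), 4))
print("pooled red rate:     ", round(pooled_rate(stats, "red"), 4))
```

The gap between the two identical buckets gives a rough feel for the noise level, but it is a heuristic rather than a significance test, which is what several replies in this thread go on to discuss.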

bjlorenzen · 12y ago

Using four buckets instead of two like that will improve your confidence in the results, but will also double the required sample / testing duration. You could just as easily use two buckets and wait twice as long to achieve the same effect.
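One way to check the arithmetic in this exchange is to compare standard errors. The sketch below assumes independent visitors and a fixed underlying click rate; the traffic figure and the 10% rate are hypothetical. It uses the textbook formula for the standard error of a proportion, sqrt(p(1-p)/n).

```python
import math

def standard_error(p, n):
    """Standard error of an observed proportion p measured on n visitors."""
    return math.sqrt(p * (1 - p) / n)

total_visitors = 40_000  # hypothetical traffic over the test period
p = 0.10                 # hypothetical click rate

half = standard_error(p, total_visitors // 2)          # one bucket of a 50-50 split
quarter = standard_error(p, total_visitors // 4)       # one bucket of a 25-25-25-25 split
pooled = standard_error(p, 2 * (total_visitors // 4))  # A1 and A2 added together

print(f"50% bucket SE:      {half:.5f}")
print(f"25% bucket SE:      {quarter:.5f}")
print(f"pooled A1+A2 SE:    {pooled:.5f}")
print(f"quarter/half ratio: {quarter / half:.3f}")
```

Under these assumptions, each 25% bucket on its own is noisier by a factor of sqrt(2), so matching the per-bucket precision of a 50-50 split would take twice the traffic. Once A1 and A2 are pooled, the final A-versus-B comparison has the same standard error as a plain A/B test on the same traffic.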

blauwbilgorgel · 12y ago

A/A testing (Null testing) or A/A/B testing gives a different effect than A/B testing.

Microsoft Research suggested (http://ai.stanford.edu/~ronnyk/2009controlledExperimentsOnTh...) that you continuously run A/A tests alongside your experiments. An A/A test can:

- collect data and assess its variability for power calculations

- test the experimentation system (the Null hypothesis should be rejected about 5% of the time when a 95% confidence level is used)

- tell if users are split according to the planned percentages
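The second point in the list above (the null hypothesis being rejected about 5% of the time) can be illustrated with a simulation. This is a minimal sketch, not the procedure from the linked paper: it assumes a two-proportion z-test with a normal approximation, and the click rate, sample sizes, and experiment count are made up.

```python
import math
import random
from statistics import NormalDist

def two_sided_p_value(clicks_a, n_a, clicks_b, n_b):
    """Two-proportion z-test p-value (normal approximation)."""
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation observed, nothing to reject
    z = (clicks_a / n_a - clicks_b / n_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def simulate_clicks(rng, n, click_rate):
    """Number of clicks among n visitors who each click with click_rate."""
    return sum(rng.random() < click_rate for _ in range(n))

def aa_rejection_rate(experiments=1000, n=1000, click_rate=0.10,
                      alpha=0.05, seed=1):
    """Fraction of A/A tests (identical variants) that reject the null."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(experiments):
        a = simulate_clicks(rng, n, click_rate)
        b = simulate_clicks(rng, n, click_rate)
        if two_sided_p_value(a, n, b, n) < alpha:
            rejections += 1
    return rejections / experiments

rejection_rate = aa_rejection_rate()
print(f"A/A tests rejecting the null: {rejection_rate:.1%}")
```

If a real experimentation system rejects the null far more or far less often than the chosen alpha on A/A tests, that suggests a problem with bucketing, logging, or the test itself.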

gingerlime · 12y ago

Can you explain why? I'm struggling with the math behind the whole thing as it is, but intuitively this sounds like a very clever hack. I wonder why it would double the experiment time if effectively people are seeing either A or B variants.

oelmekki · 12y ago

That comment is brilliant, thanks for contributing it.

You'll probably have to ensure it applies sequentially too, at least to be sure As and Bs are stable in their matching, but it seems to me an elegant solution for the problem (not that I'm a statistician, though).

kansface · 12y ago

This is better than stopping when you get a statistically significant finding which is nearly always the wrong thing to do. Do you have any math behind this?
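The claim that stopping at the first significant result is usually wrong can be checked with a simulation. The sketch below runs A/A experiments (both arms identical, so any "win" is a false positive) and compares checking once at the end against checking repeatedly. The z-test, the click rate, and the look schedule are illustrative assumptions.

```python
import math
import random
from statistics import NormalDist

def two_sided_p_value(clicks_a, n_a, clicks_b, n_b):
    """Two-proportion z-test p-value (normal approximation)."""
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (clicks_a / n_a - clicks_b / n_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def false_positive_rates(experiments=400, max_n=2000, look_every=100,
                         click_rate=0.10, alpha=0.05, seed=2):
    """Return (rate when stopping at any significant look, rate at the end only)."""
    rng = random.Random(seed)
    peeking = fixed = 0
    for _ in range(experiments):
        clicks_a = clicks_b = 0
        significant_at_some_look = False
        for n in range(1, max_n + 1):
            clicks_a += rng.random() < click_rate
            clicks_b += rng.random() < click_rate
            # Someone who stops at the first significant look would stop here.
            if n % look_every == 0 and two_sided_p_value(clicks_a, n, clicks_b, n) < alpha:
                significant_at_some_look = True
        peeking += significant_at_some_look
        fixed += two_sided_p_value(clicks_a, max_n, clicks_b, max_n) < alpha
    return peeking / experiments, fixed / experiments

peeking_rate, fixed_rate = false_positive_rates()
print(f"false positives, single look at the end: {fixed_rate:.1%}")
print(f"false positives, stop at first 'win':    {peeking_rate:.1%}")
```

With 20 looks per experiment, the stop-at-first-win rate comes out well above the nominal 5%, even though no real difference exists between the arms.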

intev · 12y ago

I'm not sure I understand - isn't that essentially an A/B test because 25 + 25 = 50?

DHowett · 12y ago

I believe it lets you compensate for the possibility that, say, all of your conversions might be coming from the bottom 1% of your users. Segmenting A into A1/A2 therefore insulates your interpretation of the results for A from being as heavily skewed.

nostromo · 12y ago · 7 in thread

Airbnb could likely get a lot more bang for their buck by letting hosts run experiments on pricing than by testing button colors and whatnot.

I ran an online marketplace at a previous gig. Our service providers always complained that they didn't know what to charge to maximize their business. They couldn't see the forest as a tree. Because we had the data for all providers, we started letting them know if they were under- or over-priced, and we saw more conversions and revenue.

Dynamic pricing (like Uber does on holidays) alone could be hugely valuable.

gk1 · 12y ago

> testing button colors and whatnot.

Isn't that exactly what they're not doing?

As a conversion optimization guy, I see my fair share of "button color" stories, which are as much of a mockery as "growth hacking". So I was glad to see they're doing this right: tracking conversions and paths by cohorts; being cautious of false positives; testing big changes instead of, e.g., button colors; combining test results with qualitative data (usability tests) when making decisions; ...

coffeecheque · 12y ago

I agree. I've had to spend some time lately on the host pricing page for an AirBnB rental and I would love some more advanced features.

I've previously thought about writing a little program that keeps an eye on the major hotels in the city, so I can work out when rooms are more in demand and amend my pricing.

Even a "this is the average cost of similar rentals within an X radius of you" would be helpful.

It would be beneficial to AirBnB too. They'd get more money in fees and money flowing through their system. It also works both ways - if I reduce pricing in low demand times, it's possible I'd get bookings I wouldn't have before - again increasing the amount for AirBnB.
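The "average cost of similar rentals within a radius" idea could look roughly like the sketch below. Everything here is hypothetical: the listing records, the coordinates, and the prices are invented, and "similar" is reduced to "nearby" for brevity. Distances use the standard haversine formula.

```python
import math
from statistics import mean

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def average_nearby_price(origin, listings, radius_km):
    """Mean nightly price of listings within radius_km of origin, or None."""
    lat, lon = origin
    prices = [
        listing["price"]
        for listing in listings
        if haversine_km(lat, lon, listing["lat"], listing["lon"]) <= radius_km
    ]
    return mean(prices) if prices else None

# Hypothetical listings: two close to the host, one in another city.
listings = [
    {"lat": 37.7750, "lon": -122.4180, "price": 120},
    {"lat": 37.7800, "lon": -122.4100, "price": 150},
    {"lat": 37.3382, "lon": -121.8863, "price": 90},
]
my_location = (37.7749, -122.4194)
print(average_nearby_price(my_location, listings, radius_km=2))  # 135
```

A real version would also need to filter by room type, capacity, and dates before averaging.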

zaroth · 12y ago

Classic 'push this button to make more money' app, with a nice path to acquisition baked in.

Someone with an afternoon free should really step on this. Just make sure it doesn't end up starting price wars with itself!

Call it Autopilot, or something.

chiph · 12y ago

> Because we had the data for all providers, we started letting them know if they were under- or over-priced, and we saw more conversions and revenue.

Did your TOS allow for this? At one of the early SaaS firms we really wanted to publish metrics (well, sell..) about how each firm was doing relative to their peers.

They all wanted this info, but at the same time were strongly opposed to our using their data in the studies.

jaredsohn · 12y ago

This talk may interest you (includes speakers from Uber and Airbnb.)

http://nerds.airbnb.com/openair-algorithmic-pricing/

biscarch · 12y ago

The phrase "forest as a tree" tripped me up. Did you mean "forest for the trees"?

arg01 · 12y ago

It will be based on the idiom "forest for the trees", effectively having the same meaning but also includes the fact that they are a part of what makes up the forest.

205guy · 12y ago · 5 in thread

Ok, I'll be "that" guy who heckles every AirBnB post, even if this one did have some nice graphs (and ideas).

When is AirBnB going to experiment with helping their hosts follow the law? I bet I can predict that graph. Why, look at all those illegal rentals in SF right there in the sample screenshots--oh the irony.

Remember, DON'T FUCK UP THE CULTURE! But it's OK to fuck up your host city for a buck or 2 billion.

meritt · 12y ago

Following the law would mean not renting out their properties in SF. People clearly find value in doing so. If AirBNB goes away, the same owners will just switch to Craigslist or the next alternative.

Learn to deal with change instead of bitching about everyone who bucks the trend. I assume you feel the same way about Uber and Google Fiber displacing all those taxi commissions and telecom monopolies.

chris_jg · 12y ago

You rant about "follow the law" but who exactly is being hurt? The person making rent off his/her extra room? Please give examples of real harm rather than "follow the law" statements. I suppose you're the kind of person that'd turn in Anne Frank. That'd be following the law.

User9812 · 12y ago

Well, if I lived in an apartment building, I'd like to have real neighbors, people that speak my language, enjoy a sense of community, and friends in the building. It would get quite annoying if I come home every evening to a new set of people on vacation or backpacking, and checking in and out of the surrounding apartments. The building shouldn't be a hotel, it's my home.

Now, I've used Airbnb a dozen times, but I can see the concern for people in the above situation. I've had people ask me when I'm renting certain Airbnb apartments if I live in the building, and they seemed a little upset with the idea of me just passing through for a week.

judk · 12y ago

http://en.m.wikipedia.org/wiki/Godwin's_law

205guy · 12y ago

We have a Godwinner!!!

In case you haven't heard: http://en.wikipedia.org/wiki/Tragedy_of_the_commons

Also: there is a local law on the books, duly enacted by a democratic process, but you're arguing nobody needs to follow that law because you think nobody is being impacted? Is that your position? I have to assume you're an AirBnB host, so I wonder if you've contacted every one of your neighbors to see if they're cool with your gig.

bjlorenzen · 12y ago · 2 in thread

As a developer working for a major competitor to airbnb on a shopping page, and having implemented hundreds of experiments on my page, I can say that these guys are way too obsessed with statistical certainty.

Rate of deployment of experiments is a better focus; since all your opponents are bound to copy your winners anyways, you have to rely on the few months' edge you've earned before they do so, and constantly maintain that lead.

thinkmoore · 12y ago

Unless random high bias means your "edge" is exactly the opposite.

001sky · 12y ago

This is about proportionality. Finite window trading strategies need to take into account the link between implementation overhead (time) and overall profitability (linked to time). Just the same way they need to link pricing and profits (linked to pricing). That seems to be the crux of the issue.

RA_Fisher · 12y ago · 2 in thread

The cult of statistical significance is alive and well. A 0.05 p-value implies a 1:20 chance of "alternative" performing worse upon final installation. That's rather risk averse. It also implies that "alternative" is worse from the get-go. When is that the case? Type 1 and Type 2 errors are much more balanced in web apps. Anyone care to show me why that's a bad mentality?

thinkmoore · 12y ago

No, it doesn't. It means that there is a 1 in 20 chance that you would have seen results as good or better if your change had no effect (assuming a standard one-sided hypothesis test). Thus if the effect appears to be good, you should take the test as some evidence that it is worth implementing.

RA_Fisher · 12y ago

Right, that's a specific use, but I'm speaking of a two-sided test where you're indifferent between alternatives.

jessriedel · 12y ago · 1 in thread

I wish AirBnB would make the cost scale logarithmic, to match the fact that this is roughly how the prices will be distributed too. I'm usually only using the left-most 5% of that slider.
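A logarithmic price slider of the kind suggested here can be sketched as a pair of mapping functions. The price bounds and the dollar formatting are hypothetical, and nothing here reflects how Airbnb's actual slider works.

```python
import math

MIN_PRICE = 10.0    # hypothetical lower bound of the slider
MAX_PRICE = 1000.0  # hypothetical upper bound of the slider

def slider_to_price(position):
    """Map a slider position in [0, 1] to a price on a logarithmic scale."""
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must be between 0 and 1")
    return MIN_PRICE * (MAX_PRICE / MIN_PRICE) ** position

def price_to_slider(price):
    """Inverse mapping: where on the slider a given price sits."""
    if not MIN_PRICE <= price <= MAX_PRICE:
        raise ValueError("price is outside the slider range")
    return math.log(price / MIN_PRICE) / math.log(MAX_PRICE / MIN_PRICE)

for position in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"{position:.2f} -> ${slider_to_price(position):,.2f}")
```

With these bounds the midpoint of the slider lands on $100 rather than the $505 a linear scale would give, so the cheaper end of the range gets far more of the slider's travel.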

ansimionescu · 12y ago

Isn't there any better alternative to sliders, though? I never use them, as I usually know exactly how much I'm willing to spend, and will probably not feel comfortable going out of my bounds anyway.

thinkmoore · 12y ago

Statisticians have long thought about the right way to deal with these sorts of problems: https://en.wikipedia.org/wiki/Sequential_analysis

Funnily enough, the page they reference for calculating the right sample size actually talks about sequential analysis, but AirBnB doesn't mention this in describing their solution...

sutterbomb · 12y ago

HN user btilly has a really helpful essay on the math behind stopping tests earlier than your predetermined sample size. It calls for setting a maximum duration, and provides stopping points along the way. It works similarly to the method AirBnB describes.

http://elem.com/~btilly/ab-testing-multiple-looks/part2-limi...

coherentpony · 12y ago

This article contains some serious p-value abuse. The p-value should be adjusted to account for multiple testing. You do this to minimise the chance that a hypothesis is accepted purely due to random chance.

Try setting your significance threshold to your Type 1 error rate divided by the number of tests you perform. It will be much smaller, and this is a good thing. Significance should really test for significance, not random chance.
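The adjustment described here is commonly known as the Bonferroni correction. A minimal sketch, with made-up p-values:

```python
def bonferroni_threshold(alpha, num_tests):
    """Per-test significance threshold that keeps the overall error rate near alpha."""
    if num_tests < 1:
        raise ValueError("num_tests must be at least 1")
    return alpha / num_tests

def significant(p_values, alpha=0.05):
    """Flag which p-values survive the Bonferroni-corrected threshold."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p < threshold for p in p_values]

# Hypothetical p-values from four metrics tested in the same experiment.
p_values = [0.001, 0.02, 0.04, 0.30]
print(bonferroni_threshold(0.05, len(p_values)))  # 0.0125
print(significant(p_values))                      # [True, False, False, False]
```

Three of these four results would pass an uncorrected 0.05 cutoff, but only one survives the corrected threshold. Bonferroni is conservative, so it trades some power for that protection.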

cbovis · 12y ago

Can anyone point out a good introduction to some of the methods used in the article? Terms such as the p-value, treatment effect etc.
