Airbnb open-sources Caravel: data exploration and visualization platform

(github.com)

422 points · caravel · 10y ago · 92 comments

70 comments · 18 top-level

twakefield · 10y ago · 17 in thread

If anyone involved with this is around, I'm curious why Airbnb would build something like this - cost, performance, features, all of the above? Data querying and visualization is a pretty crowded field with a lot of commercial options to choose from[1].

I'm not knocking Caravel (it looks amazing), just curious why build vs. buy in this case.

[1] Tableau, Looker, Periscope, Chartio, Qlikview, Gooddata are just some that come to mind.

caravel · OP · 10y ago

Well, none of the solutions mentioned are open source.

Free as in beer is one incentive as licenses are not cheap, and vendors know when they have you locked down and tend to milk everything they can.

More importantly, software for which we don't control the source is a risk. In this day and age, anyone who cares enough should be able to push a bugfix/hotfix overnight. What if you had to wait entire quarters or years for Tableau to parallelize their "live mode", or to get connectivity to Presto to work?

What if you want to integrate a new type of visualization that isn't supported? What if you want to integrate with your anomaly detection framework or your A/B testing framework or other internal or external facing applications?

Since this is a common need for most companies, it makes sense to have an open source solution that we can all use and collaborate on.

timr · 10y ago

"Free as in beer is one incentive as licenses are not cheap, and vendors know when they have you locked down and tend to milk everything they can."

Free as in beer is never the answer. A project like this takes multiple engineer-years to build and maintain. That's hundreds of thousands of dollars, at least. How much is a site license? Are you sure? Have you negotiated the rate?

Even for "expensive" services, buying it from someone else is almost always cheaper than paying someone to maintain it yourself, because expensive services are usually expensive for a good reason: they're niche, and finding someone with the expertise to build it is expensive. Having the source for a product so that you can customize it is certainly a better answer, but it rarely happens in practice. It's why we have gobs of open-source Apache-foundation products that nobody in their right mind wants to host in-house, unless they absolutely have to.

Developers have a real, well-documented resistance to paying for things, and it sucks. Because in reality, most development of open-source tools happens when someone gets paid to maintain the tool. If they don't, the tool falls into disrepair. Open-source software isn't free -- it's just paid for by someone else.

4 more replies

phunge · 10y ago

They mention that it was originally paired with Druid. The data volume that Druid excels at (and that Airbnb must have) is orders of magnitude larger than what Tableau and Looker do well with. It's probably just built for bigger-than-SQL OLAP use cases.

dwmintz · 10y ago

It's worth distinguishing between the tools that leave the data in your data warehouse (Caravel, Periscope, Mode, Looker (where I work)), and those that have their own data stores (Good Data, Qlikview, etc.). Tableau can connect directly to your datastore, but it's happier if it can operate on data that's stored locally in-memory.

Anyway, the ones where you bring your own database can scale as far as the database can take you.
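To make the distinction above concrete, here is a minimal sketch using Python's built-in sqlite3 as a stand-in for a warehouse. The table name and rows are made up for illustration and nothing here is taken from Caravel's or any vendor's code. The first approach asks the database to aggregate, the second pulls every row into local memory first.

```python
import sqlite3

# In-memory SQLite stands in for the warehouse; the data is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bookings (city TEXT, nights INTEGER)")
conn.executemany(
    "INSERT INTO bookings VALUES (?, ?)",
    [("Paris", 3), ("Paris", 2), ("Tokyo", 5), ("Lima", 1)],
)

# Approach 1: push the aggregation down; only one row per city comes back.
pushed_down = dict(
    conn.execute("SELECT city, SUM(nights) FROM bookings GROUP BY city")
)

# Approach 2: pull every row into local memory and aggregate in the tool.
pulled_local = {}
for city, nights in conn.execute("SELECT city, nights FROM bookings"):
    pulled_local[city] = pulled_local.get(city, 0) + nights

print(pushed_down == pulled_local)
```

Both approaches give the same totals; the difference is how many rows cross the wire, which is why the first one tends to scale with the database rather than with the client.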

1 more reply

jsiegz · 10y ago

Looker scales with plenty of data for us at Snapchat; it's more about the underlying database than the BI tool.

1 more reply

adrr · 10y ago

Looker performance comes from the underlying storage of data. You can store massive amounts of data on, say, a Redshift cluster and still be performant. Most visualization tools that I know of are directly tied to the storage tier in terms of performance.

1 more reply

tlrobinson · 10y ago

To throw out an open source one: Metabase http://www.metabase.com https://github.com/metabase/metabase

Disclaimer: I work on Metabase.

gedrap · 10y ago

Not really related to your product, but I find the trend of using emojis in commit messages really distracting. It's probably okay-ish very, very rarely for something huge, but a commit log reading like a Slack chat... I'm not sure about it.

dwmintz · 10y ago

Can't speak for Airbnb, but I'm not sure that any of the front-end clients that you mentioned (disclosure: I work at Looker) can talk to Druid. So if Airbnb already had a Druid warehouse in place, they may have decided it was easier to roll their own front-end than migrate to a different backend.

caravel · OP · 10y ago

Druid is definitely part of the equation.

Larger, data-driven companies with significant engineering teams prefer not to rely on third-party, closed-source vendors. That can represent a significant risk and an obstacle to deeper integration with other internal applications when needed.

Not that building always wins over buying, but the balance shifts relative to the size of the company.

Also, when you are on the receiving end of open source, you want to be a good citizen and contribute back to the ecosystem. It ties to pride and passion, and reflects a strong engineering culture, which can help with recruiting.

1 more reply

ernestbro · 10y ago

There is a study by a market analyst (Wayne Eckerson) about building vs. buying BI. He has some good insights, based on surveys, into why some companies choose to build and some choose to buy.

71% of those who chose to build the BI tools said they built because "We can customize the functionality better"

51% of those who buy say "Buying enables us to provide best-in-class BI functionality"

The study: http://www.jaspersoft.com/sites/default/files/confirmation_f...

ccozan · 10y ago

It is even more crowded than that on the commercial side [0]. But having such a tool as open source benefits us all, especially when trying to connect to obscure or non-standard data sources.

[0] https://en.wikipedia.org/wiki/Online_analytical_processing#M...

ljk · 10y ago

part of it is probably recruiting

andreasklinger · 10y ago

I am pretty sure they used multiple (and then most likely at some point too many) of those tools

maerF0x0 · 10y ago

A few more for your list: Leftronic (service) or Grafana (package)

hathym · 10y ago

investor's money, baby!

nedwin · 10y ago

aesthetics

cauthon · 10y ago · 7 in thread

What's the name of this style of plot?

https://camo.githubusercontent.com/c22acad6c1302c5da3236cb8e...

simonsarris · 10y ago

Sankey: http://gojs.net/latest/samples/sankey.html

devy · 10y ago

I believe it's called a "Sankey Diagram", as denoted in the dropdown menu at the top left.

Here is the original demo[1] from Mike Bostock, D3's author.

[1]: https://bost.ocks.org/mike/sankey/

nthitz · 10y ago

Though they have been around much longer than D3 has... https://en.wikipedia.org/wiki/Sankey_diagram

divideby0 · 10y ago

I believe it's called a Sankey diagram:

http://bl.ocks.org/d3noob/5028304

RobPfeifer · 10y ago

Sankey Diagrams. Useful for complex funnels!
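For the funnel use case, a rough sketch of how raw step-to-step transitions could be rolled up into the node and weighted-link lists that Sankey renderers commonly take as input. The funnel steps are invented, and the "name" / "source" / "target" / "value" field names are an assumption about the input shape, not something confirmed for Caravel.

```python
from collections import Counter

# Hypothetical funnel: each tuple is one user moving from one step to the next.
transitions = [
    ("landing", "search"),
    ("landing", "search"),
    ("landing", "exit"),
    ("search", "listing"),
    ("search", "exit"),
    ("listing", "booking"),
]


def to_sankey(pairs):
    """Aggregate (source, target) pairs into node and weighted-link lists."""
    counts = Counter(pairs)
    names = sorted({name for pair in counts for name in pair})
    index = {name: i for i, name in enumerate(names)}
    nodes = [{"name": name} for name in names]
    links = [
        {"source": index[src], "target": index[dst], "value": value}
        for (src, dst), value in sorted(counts.items())
    ]
    return {"nodes": nodes, "links": links}


graph = to_sankey(transitions)
print(len(graph["nodes"]), len(graph["links"]))
```

The width of each band in the diagram would then correspond to a link's "value", which is what makes drop-off between funnel steps easy to see.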

bduerst · 10y ago

We always just called them flow diagrams. I didn't know they had a specific name.

kayhi · 10y ago

Where is the data from? It's a great overview of CO2 emissions, etc.

mooneater · 10y ago · 7 in thread

I really need a great data explorer/dashboard for my Postgres-based systems. I was going to use Shiny, but this looks really nice -- I hope the docs can be built out very soon. Can anyone comment on other competing products? In the commercial space, I like Looker, but it's too pricey.

dwmintz · 10y ago

Really depends on your needs. There are lots of options out there that are happy to talk to Postgres, but each has different strengths and weaknesses. If all you need is a way to basically share and visualize the output of SQL queries and everyone who's using the tool can write SQL well, then look at Periscope or Mode.

If you're ok pulling the data out of Postgres into memory locally and mostly care about manipulation and beautiful dataviz, then look at Tableau.

If you're mostly interested in more data sciency/ML stuff, then Shiny or something else that's R-based is a good option.

If you're interested in being able to embed your business logic into the tool so that non-SQL folks can build their own queries and everybody's relying on the same data definitions, that's where Looker (disclosure: where I work) excels.

javiercr · 10y ago

Take a look at http://www.metabase.com/ (open source, made by Expa, an Uber co-founder's incubator).

dacort · 10y ago

Second this recommendation. Very easy to get up and running - may not be able to handle some complex use cases, but for the basics it's fantastic.

1 more reply

palmeida · 10y ago

Take a look at http://www.viurdata.com. It's a simple-to-use product with transparent pricing, and you can use drag & drop or add your own SQL queries.

Disclaimer: I am one of the founders.

jonbishop · 10y ago

Hey! I run marketing for Periscope Data, a data explorer/dashboard product. We have a lot of customers using Postgres and get compared to Looker a lot. We focus on optimizing for the analyst, whereas tools like Looker focus on business users. We have a lot of features for business users, but chart creation is all SQL-based.

Our site is here: https://www.periscopedata.com/ and if you have any questions, shoot me an email at jon@periscopedata.com.

Rapzid · 10y ago

Do you have a self-hosted option? What's your pricing?

1 more reply

qaq · 10y ago

+1 Having some upfront pricing info would be great

don_draper · 10y ago · 6 in thread

The code is so clean and simple. This is great PR for the company. I want to work there.

ultimoo · 10y ago

I think so too!

I did a fair bit of Ruby a few years ago, but I'm new to Python CRUD apps and trying to improve my knowledge here. Is defining all models in the same file[1] conventional in Python apps? Rails used to have separate files for each model, and most Ruby apps that I have seen advocate the one-class-one-file convention.

[1] https://github.com/airbnb/caravel/blob/master/caravel/models...

tedmiston · 10y ago

All models in one models.py file is common for Flask and Django.

If you use multiple apps within one Django project or the equivalent in Flask (Blueprints), that extends to one models.py per app (where a "project" is a collection of "apps").

Sometimes you'll see one file per model (with a models/__init__.py that imports them for use). While I think it keeps the dependency imports for each model very cleanly separated, you end up with a lot of redundancy from importing the same basic pieces in every model file.
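A small runnable sketch of the one-file-per-model layout with a re-exporting __init__.py. The package name demo_models_pkg and the User / Post classes are hypothetical placeholders (plain classes rather than real ORM models), and the package is written to a temporary directory only so the example can run on its own.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical package: one small module per model, re-exported by __init__.py.
PACKAGE = "demo_models_pkg"
FILES = {
    "users.py": "class User:\n    table = 'users'\n",
    "content.py": "class Post:\n    table = 'posts'\n",
    "__init__.py": "from .users import User\nfrom .content import Post\n",
}

# Write the package into a throwaway directory so the sketch is self-contained.
root = Path(tempfile.mkdtemp())
(root / PACKAGE).mkdir()
for filename, source in FILES.items():
    (root / PACKAGE / filename).write_text(source)

sys.path.insert(0, str(root))
importlib.invalidate_caches()
models = importlib.import_module(PACKAGE)

# Callers import from the package and never see the per-file split.
exported = sorted(name for name in ("User", "Post") if hasattr(models, name))
print(exported)
```

Because __init__.py re-exports the classes, calling code can keep writing "from models import User" whether the models live in one file or many, so the split can be introduced later without touching callers.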

Keats · 10y ago

It depends on how many models there are, but usually one app's models == 1 file (taking the app structure from Django), unless they are expected to grow.

For example, you have a comment app that could contain several models: Comment, Thread, Report, etc.; those can be in the same file. To continue with the Django example, I would personally prefer having a models folder in the comments app and one file per model, as some can get really big.

I also do 1 file / model in Flask, minus some specific cases where it just makes sense to have them in the same file.

gedrap · 10y ago

It depends on the project.

If you have a small number of models (e.g. <= 5), then it's fine to have them all in one file, as you will not benefit from multiple files, really.

When your application is growing, you can split the models into multiple files, grouped by feature (e.g. users.py, content.py, etc.).

I prefer this, as models are usually very small, and switching from one file to another can quickly become annoying when working on related models. However, it may be different for large classes.

umhan35 · 10y ago

Not from the number of lines of code. Some of them exceed 1000.

vermontdevil · 10y ago

# of lines does not indicate good or poor coding

johnieeboy · 10y ago · 4 in thread

seems a little bit sparse on the documentation or am I missing something?

caravel · OP · 10y ago

We'll be providing short user training videos very soon.

teej · 10y ago

As someone who is the core audience for this tool can I say that I strongly prefer clear documentation over videos? Videos are way too hard to maintain and end up being stale the minute after you post them in fast-moving projects. I can't text-search a video and I can't be linked directly to an answer in a StackOverflow response.

Written documentation is vastly superior to videos in my opinion.

2 more replies

dacort · 10y ago

Would very much appreciate that as well as walk-through docs.

Got it up and running easily enough, and connected to Redshift. But seemed like creating a new "slice" required custom JSON params to define it. Unless I missed something?

edit: yep, missed something. You can "explore" a table by clicking its link in the table listing.

robroy72 · 10y ago

Is there an API for building viz in code similar to something like Bokeh, or is it end user viz design only?

polskibus · 10y ago · 2 in thread

I'm wondering - what's the effort required to build such a BI tool (tables, charts, maps) these days assuming reusing open source components and focusing on SQL-speaking datastores? Could a small team of experienced devs accomplish such a feat in a year?

caravel · OP · 10y ago

Yes, look at the commit log. https://github.com/airbnb/caravel/graphs/contributors

polskibus · 10y ago

Does it cover the very beginning, when Caravel wasn't open source?

polartx · 10y ago · 2 in thread

What are the supported data sources? I saw SQL tables and I imagine flat files; can you write calls to web service endpoints?

wesd · 10y ago

From the readme file:

Database Support

Caravel was originally designed on top of Druid.io, but quickly broadened its scope to support other databases through the use of SqlAlchemy, a Python ORM that is compatible with most common databases[1].

[1]http://docs.sqlalchemy.org/en/rel_1_0/core/engines.html

tedmiston · 10y ago

Specifically, SQLAlchemy includes dialects out of the box for: Firebird, Microsoft SQL Server, MySQL, Oracle, PostgreSQL, SQLite, Sybase.

http://docs.sqlalchemy.org/en/rel_1_0/dialects/index.html

1 more reply

tedmiston · 10y ago · 1 in thread

I really like the Python style you guys have adopted.

Grouping imports into: standard lib, third party, local is a strong pattern that I don't see done consistently in many repos. Likewise with your use of wrapping long imports with ()s and a single tab.

Any chance of sharing your Python style guide? My startup is Python-based (Django and Flask) and I would really appreciate it!
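For readers who haven't seen the pattern, a minimal sketch of the import grouping and parenthesized wrapping described above. This is an illustration of the general convention, not a copy of Caravel's code; the third-party and local lines are commented out (and "myapp" is a made-up module name) so the snippet runs with only the standard library.

```python
# Group 1: standard library
import json
import os
from collections import (
    OrderedDict,
    defaultdict,
    namedtuple,
)

# Group 2: third party (commented out so this sketch runs without installs)
# from flask import Flask
# from sqlalchemy import Column, Integer, String

# Group 3: local application imports (hypothetical module name)
# from myapp import models

# Use each import once so nothing above is dead weight.
Point = namedtuple("Point", ["x", "y"])
counts = defaultdict(int)
counts["seen"] += 1
settings = OrderedDict(debug=False, home=os.sep)
summary = json.dumps({"point": Point(1, 2), "counts": counts})
print(summary)
```

Wrapping long imports in parentheses with one name per line keeps diffs small when a name is added or removed.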

kevinastone · 10y ago

The import ordering is pretty common: https://google.github.io/styleguide/pyguide.html?showone=Imp...

kfk · 10y ago · 1 in thread

Hey, this is very interesting, I will take a look. I am trying to shift a whole $700m division, and then hopefully a $3b segment, to a new workflow paradigm for data (automatically refreshed dashboards instead of sending around Excel files, a focus on building models rather than pasting data into spreadsheets), and unfortunately for now Tableau is my only option. I feel very uncomfortable going with a closed solution, since I know that we will have lots of edge cases and that being able to do your own coding is in the end the best way to deal with those. License cost is also incredibly high: we are talking $200 per user per year at minimum, which means 200 to 400 thousand per year for a big organization of between 1000 and 2000 users.

dwmintz · 10y ago

I'm a huge proponent of the idea of centralizing the data model. That's the core idea behind Looker (where I just came to work after 3 years as a customer), and I agree it's a hugely powerful change from the world of everybody-in-their-own spreadsheet.

On your other point, though, to echo the build vs. buy discussion from above, I think it's a bit misleading to say "oh, we'll just use an open-source solution and that'll be cheaper." Because if open source means a couple of internal developers and an analyst, that's easily $300k+/year in salaries that you might not spend if you were using a vendor.

Anyway, given your particular statement of the problem you're facing, I'd humbly suggest you take a look at Looker. The data modeling layer that's core to Looker is meant to solve EXACTLY that problem, by leaving your data where it lives and then embedding your business logic in the layer that sits between end users and the data.
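As a back-of-the-envelope check on the numbers in this exchange, a short sketch using only the two figures quoted: $200 per user per year for licenses (with the 200 to 400 thousand range corresponding to 1,000 to 2,000 users) and roughly $300k per year in salaries for an in-house team. It ignores everything else (hosting, negotiated discounts, opportunity cost), so treat it as arithmetic, not a recommendation.

```python
LICENSE_PER_USER = 200       # USD per user per year, as quoted in the thread
IN_HOUSE_SALARIES = 300_000  # USD per year, the "$300k+" estimate


def license_cost(users):
    """Yearly license bill for a given number of users."""
    return users * LICENSE_PER_USER


low, high = license_cost(1000), license_cost(2000)

# Number of users at which licenses cost as much as the in-house team.
break_even_users = IN_HOUSE_SALARIES / LICENSE_PER_USER

print(low, high, break_even_users)
```

On these assumptions the two options cost the same at 1,500 users, which sits in the middle of the range being discussed.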

jedisct1 · 10y ago · 1 in thread

Wasn't Panoramix already open source?

caravel · OP · 10y ago

It was (Panoramix got renamed to Caravel); it's just officially supported, maintained and grown by Airbnb now.

arikfr · 10y ago · 1 in thread

Have you considered using Re:dash[1] before writing your own tool?

[1] https://github.com/getredash/redash

gorkemcetin · 10y ago

Redash has only been around for 2.5 years, and Airbnb's engineers probably thought it was too early to take a look at it.

flashman · 10y ago · 1 in thread

Having some trouble getting it working on Windows, which I see you don't currently support (I need to create caravel_config.py to get past the fabmanager installation step). This looks really interesting but I might wait until someone has posted Windows instructions.

robroy72 · 10y ago

I got it to work on Windows using the Anaconda Python 2.7 installation; keep in mind that the caravel commands in the docs have to be run from your install dir, e.g. change dir to

<yourPythonInstallDir>\Lib\site-packages\caravel\bin

then run as

python caravel db upgrade

markovbling · 10y ago · 1 in thread

Awesome - great work!

A tutorial on how to link it to a MySQL database would be greatly appreciated :)

tedmiston · 10y ago

They're using SQLAlchemy as a database abstraction layer, which supports MySQL out of the box.

So, you just need to set the config param SQLALCHEMY_DATABASE_URI like this:

https://github.com/airbnb/caravel/blob/1b4e750b2aa111445703d...

The configuration guide explains it further:

https://github.com/airbnb/caravel/blob/master/docs/installat...
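As a hedged illustration of what such a setting might look like in a Python config file: the user, password, host and database name below are placeholders invented for this sketch, and only the SQLALCHEMY_DATABASE_URI name and the general SQLAlchemy URL shape (dialect://user:password@host/database) come from the comment and SQLAlchemy's documented format. Check the linked guide for what Caravel actually expects.

```python
from urllib.parse import quote_plus

# Placeholder credentials; none of these values come from the thread.
DB_USER = "caravel"
DB_PASSWORD = quote_plus("p@ss/word")  # escape characters that would break the URL
DB_HOST = "localhost"
DB_NAME = "caravel"

# SQLAlchemy database URLs follow dialect://user:password@host/database.
SQLALCHEMY_DATABASE_URI = "mysql://{}:{}@{}/{}".format(
    DB_USER, DB_PASSWORD, DB_HOST, DB_NAME
)

print(SQLALCHEMY_DATABASE_URI)
```

Escaping the password matters because characters such as "@" or "/" would otherwise be read as part of the URL structure.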

educar · 10y ago · 1 in thread

Is this like Piwik?

teej · 10y ago

Piwik is more of a Google Analytics replacement. It's a package that contains a data visualization platform, a data storage engine, and a data emitter (website tag) all in one.

This is just a data visualization platform. You need to bring your own data store and data.

dschiptsov · 10y ago

It is not Java, but Python. What a surprise!

gavin6252 · 10y ago

Great looking product! Has anyone figured out how to join tables yet, or do you need to define views in your SQL database?
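If views do turn out to be the route, here is a minimal sketch of the idea using Python's built-in sqlite3. The tables, columns and view name are invented, and whether Caravel itself can join tables is not answered in this thread. The point is only that a view hides the join, so a tool that works on one table at a time can query the view as if it were a table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE bookings (user_id INTEGER, nights INTEGER);
    INSERT INTO users VALUES (1, 'FR'), (2, 'JP');
    INSERT INTO bookings VALUES (1, 3), (1, 2), (2, 5);

    -- The view hides the join, so a BI tool can treat it as one table.
    CREATE VIEW bookings_by_country AS
    SELECT u.country AS country, b.nights AS nights
    FROM bookings AS b
    JOIN users AS u ON u.id = b.user_id;
""")

# Query the view exactly as if it were a single flat table.
rows = conn.execute(
    "SELECT country, SUM(nights) FROM bookings_by_country "
    "GROUP BY country ORDER BY country"
).fetchall()

print(rows)
```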

uberneo · 10y ago

It's Apache licensed, so does that mean I can use it directly in my company, replacing existing commercial products like Tableau / Looker?

vhhuhhfryuhgfh · 10y ago

Any pictures anywhere?
