Metamarkets Open Sources Druid, A Real-Time Analytics Data Store

(metamarkets.com)

71 points · brianm · 13y ago · 12 comments

scanr · 13y ago

This is great. I've been keeping an eye on the Metamarkets blog for entries about Druid because we've been doing something similar, and it's always useful to see how other folk are solving a shared problem.

Regarding what's on GitHub: a lot more documentation and examples would make it much easier to work out how to use it.

vosper · 13y ago

I too have been following these blog posts. I'm happy to see they've open-sourced Druid, but I agree that more documentation and examples would go a long way. Presumably this will come in the future; for now it's nice just to be able to grab the code and play.

I'd be interested to hear how you tackled the real-time analytics problem. We're doing a shootout between HBase, Riak (w/ map-reduce), Hypertable and Postgres at the moment, and so far there's no clear winner.

scanr · 13y ago

We went old school. We have a whole bunch of MySQL instances that we've sharded (using consistent hashing). Queries are sent to all of the shards and then aggregated on one of our query servers.
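For readers who haven't seen this pattern, here is a minimal sketch of consistent hashing plus scatter-gather aggregation. It is an illustration only, not the poster's code: the `HashRing` class, the `scatter_gather` helper, the shard names and the virtual-node count are all hypothetical, and real shards would be MySQL connections rather than strings.

```python
import bisect
import hashlib
from collections import Counter


def _hash(key: str) -> int:
    # Stable 64-bit hash; Python's built-in hash() is salted per process.
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring with virtual nodes (hypothetical helper)."""

    def __init__(self, shards, vnodes=64):
        # Each shard is placed on the ring at `vnodes` pseudo-random points.
        self._ring = sorted(
            (_hash(f"{shard}#{i}"), shard) for shard in shards for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def shard_for(self, key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


def scatter_gather(shard_results):
    """Merge per-shard partial counts, as a query server would."""
    total = Counter()
    for partial in shard_results:
        total.update(partial)  # adds counts key by key
    return dict(total)
```

Writes go to `ring.shard_for(key)`; reads fan out to every shard and the partial results are summed with `scatter_gather`. The point of the ring is that adding a shard only moves the keys that land on the new shard, rather than reshuffling everything.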

We also aggregate as we go, which means there's less data to store. The disadvantage is that we need to decide up front what we want to aggregate, but that has been less of an issue than expected. We can't do accurate unique counts on pre-aggregated data, but we've added HyperLogLog into the mix, as most of our use cases can tolerate a small amount of error.
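To show why HyperLogLog fits pre-aggregated data, here is a simplified sketch (no large-range or bias correction, and not the poster's implementation). The class name, the choice of SHA-1 as the hash, and the precision `p=12` are assumptions made for illustration. The key property is that two sketches merge by taking the register-wise maximum, so per-bucket sketches can be combined at query time.

```python
import hashlib
import math


class HyperLogLog:
    """Simplified HyperLogLog; with p=12 the typical error is around 1.6%."""

    def __init__(self, p=12):
        self.p = p                      # precision; the alpha below assumes p >= 7
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        h = int.from_bytes(hashlib.sha1(item.encode("utf-8")).digest()[:8], "big")
        idx = h >> (64 - self.p)                  # top p bits pick the register
        rest = h & ((1 << (64 - self.p)) - 1)     # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in `rest`.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other: "HyperLogLog") -> None:
        assert self.p == other.p, "sketches must share the same precision"
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw
```

For example, one sketch per day can be stored alongside the aggregated row, and a weekly unique count is obtained by merging seven sketches and calling `count()`, which plain per-day unique counts cannot provide because users overlap between days.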

We partition our MySQL tables, which means we can archive old data easily and don't need to run any repair jobs.

Disadvantages:

* A column-oriented store is a better fit for this kind of data (somewhat mitigated if the data isn't sparse and we store IDs instead of values, e.g. city_id rather than city_name).

* Schema changes are not pleasant. We hit a deadlock when running ALTER TABLE statements against partitioned tables, which locks up the entire server.

Advantages:

* It performs well enough for our needs

* Having SQL as a query language is very pleasant

* MySQL is well understood, which makes looking after it quite easy

jandrewrogers · 13y ago

This is great news for the broader community, even if you do not have a use for Druid per se.

Database engines designed for real-time analytics have a significantly different internal structure than either traditional OLTP or popular analytical systems like Hadoop. Most people just try to (badly) fit real-time analytic workloads into a database engine not designed for it. Druid is the first open source example I am aware of that has internals designed for these types of workloads.

Few software engineers know what the inside of a real-time analytical database looks like. This will provide a great starting point. (The only major missing component is the non-trivial, custom I/O scheduling engine required to back these engines to disk instead of in-memory.)

amalag · 13y ago

Infobright made waves as a column-store analytics database. Its performance was awesome when dealing with billions of rows. Having Druid use distributed machines with the data in memory is a different approach, though.

pyrotechnick · 13y ago

https://github.com/metamx/druid
