Citus Data (YC S11) Wants To Make Scalable Data Analytics Accessible To Anyone

(techcrunch.com)

114 points · umur · 14y ago · 27 comments

benbjohnson · 14y ago

Distributed SQL queries are cool and accessible to people but I feel like projects that apply relational languages to event data don't make much sense. If I have click stream data then I'm more interested in knowing what users are doing after they performed action "A", "B" & "C" than rolling up how many users performed a single action "A". SQL falls flat on its face for this type of analysis.

Also, the name threw me off. I thought it was "Citrus" and not "Citus".

[Full disclosure: I am writing an open source, distributed, behavioral database - https://github.com/skylandlabs/sky]

ozgune · 14y ago

There's value in both types of analyses. For knowing what users do after they perform action "A", "B" & "C", many people currently rely on implementing Map/Reduce programs.

That can be a bit heavyweight if you want to simply compare people who did action "A" or "B", filter based on complex criteria, or apply simple analytic functions. Also, apart from standard relational algebra operators, SQL provides a lot of convenience functions for math operations, string manipulations, date and time formatting, pattern matching, and so forth. These may come in handy to users who want to quickly gather insights out of their data.

benbjohnson · 14y ago

You're right, there is value in running SQL over event data. However, I feel like it's a missed opportunity to simply apply the same paradigms to a different type of data. I'm not suggesting that SQL be thrown out, but a new language needs to be available specifically for event data.

bgilroy26 · 14y ago

I'm working outside the bounds of my understanding here, but why can't that result be formed from a query that pulls from the user_history table and a subquery that pulls from the user_history table with different conditions on each one?

benbjohnson · 14y ago

Self joining is a good thought. It is possible to write a query like this, but handling n+ steps (A..B, A..B..C, A..B..C..n) is eventually going to make your query optimizer sh*t all over itself in all likelihood. You're not simply joining two tables; you also have a temporal relationship between each event. For example, if you're looking for users who performed events A then B then C, then you need to self join B to A, making sure that all events in B are after A, and then join C to B, making sure that all events in C are after B.

Beyond that you have to worry about whether A, B and C are all within the same session. Trying to define a session such as "all events that occurred until there is 30 minutes of idle time" is going to be damn near impossible in the SQL query.
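For illustration only (this is not from the original comment), here is a minimal Python sketch of the two ideas above: splitting one user's events into sessions on a 30-minute idle gap, then checking whether A, B and C occur in that order inside a single session. The function names, the `(user_id, timestamp, action)` row shape, and the choice to treat a gap of exactly 30 minutes as a session break are all assumptions made for the example.

```python
from datetime import datetime, timedelta
from itertools import groupby
from operator import itemgetter

# Assumption: a gap of 30 minutes or more between events ends a session.
IDLE_GAP = timedelta(minutes=30)


def split_sessions(events, idle_gap=IDLE_GAP):
    """Split one user's (timestamp, action) events, sorted by time, into sessions."""
    sessions = []
    current = []
    last_ts = None
    for ts, action in events:
        if last_ts is not None and ts - last_ts >= idle_gap:
            sessions.append(current)
            current = []
        current.append((ts, action))
        last_ts = ts
    if current:
        sessions.append(current)
    return sessions


def contains_sequence(session, steps):
    """True if `steps` occur in order within the session (other events may intervene)."""
    actions = iter(action for _, action in session)
    # `step in actions` consumes the iterator up to the match, so each
    # later step is only searched for after the previous one.
    return all(step in actions for step in steps)


def users_with_funnel(rows, steps, idle_gap=IDLE_GAP):
    """Return the set of user ids that completed `steps` in order within one session.

    `rows` is an iterable of (user_id, timestamp, action) tuples (hypothetical shape).
    """
    matched = set()
    ordered = sorted(rows, key=itemgetter(0, 1))
    for user_id, group in groupby(ordered, key=itemgetter(0)):
        events = [(ts, action) for _, ts, action in group]
        if any(contains_sequence(s, steps) for s in split_sessions(events, idle_gap)):
            matched.add(user_id)
    return matched


if __name__ == "__main__":
    t0 = datetime(2012, 1, 1, 12, 0)
    m = timedelta(minutes=1)
    rows = [
        ("u1", t0, "A"), ("u1", t0 + 5 * m, "B"), ("u1", t0 + 10 * m, "C"),
        ("u2", t0, "A"), ("u2", t0 + 5 * m, "B"), ("u2", t0 + 50 * m, "C"),
        ("u3", t0, "B"), ("u3", t0 + 1 * m, "A"), ("u3", t0 + 2 * m, "C"),
    ]
    print(users_with_funnel(rows, ["A", "B", "C"]))
```

In this sample data only u1 qualifies: u2's C falls into a new session after a 45-minute gap, and u3 performs the actions out of order. Adding a fourth step is just a longer `steps` list, whereas the SQL version needs another self-join per step.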

ozgune · 14y ago

It's doable, but the paradigm doesn't naturally fit into SQL. Users typically need to use sub-selects and self-joins for this kind of "A", "B" & "C" analysis; and that introduces inefficiencies.

smilliken · 14y ago

We've been using Citus at MixRank for storing our timeseries data, and it's worked out magnificently well for our use-case.

A few points:

(i) We can do ad-hoc realtime analytics on hundreds of millions of data points.

(ii) We can also do realtime analytics on billions of datapoints as long as we pre-compute along one dimension.

(iii) We could do a lot better at (i) and (ii) if we invested more heavily in hardware (and Citus would make this pretty painless, actually).

(iv) I'd normally not consider a closed-source solution personally, but since Citus is based so heavily on PostgreSQL (protocol-level compatibility, configuration, codebase), this has been a non-issue for us. We can still lean on the amazing PostgreSQL community, documentation, and for the parts we don't have the source code to, the Citus team has been very helpful in explaining how things work.

(v) Fault tolerance is immaculate. At the node level, PostgreSQL is notoriously one of the most reliable and robust databases available. At the cluster level, Citus will magically fall back to a replica mid-query when a server dies.

(vi) Although realtime inserts are not supported out of the box, the system is flexible enough that we were able to get this working on our own without help from Citus.

(vii) Schema migrations are also not supported out of the box, but we built a schema migration framework that takes care of this for us.

(viii) We're not worried about vendor lock-in, since the data is just stored on our servers, in the PostgreSQL serialization format. If we wanted to, we could just give up the features that Citus gives us and build our own data-access layer on top of our cluster.

Anyway, it won't be everything to everyone, but it works very well for our OLAP use-case of timeseries ad impression data. I'd definitely recommend looking into it if you're otherwise considering Hadoop, Vertica, Aster, Greenplum, or a sharded MySQL/PostgreSQL setup.

Full disclosure: I am extremely biased since I've gotten to know the team very well after using Citus. I'm definitely one of their biggest fans, if for no other reason than the amount of time they've saved us at MixRank.

nwyc2012 · 14y ago

Hi smilliken, would it be possible to elaborate a bit on how you do real-time inserts/updates? I'm interested in trying Citus but would most probably need the real-time feature for production use.

smilliken · 14y ago

I'd be happy to help, but this is probably out of scope of an HN comment. Feel free to reach out to the email in my profile.

johnpmayer · 14y ago

This sounds a lot like AsterData Database, which I know has been around for at least a few years. I'm interested to know whether you can write queries that define explicit parallelism, as in the SQL-MapReduce language extension, and also what the query preprocessor is able to do for a distributed workload.

ozgune · 14y ago

Hey, we currently don't have an SQL/MapReduce language extension, but we do have the Map & Reduce execution primitives implemented under the covers (for parallel query processing).

For the distributed query processor, we can efficiently parallelize SQL queries that involve look-ups, complex selections, groupings and orderings, analytic functions, and joins between one large and multiple small tables. We also have a lot more coming; are there any queries that you are particularly interested in?

pella · 14y ago

Any information about PostGIS compatibility?

pella · 14y ago

"Features Not in v1.0"

http://www.citusdata.com/documentation#missing-features

bsg75 · 14y ago

"Real-time insert, update, or deletes issued against the master node."

Is this a bulk/batch load only system then?

spathak · 14y ago

That is correct. This is primarily a bulk-load system. There are setups (as mentioned in smilliken's comment above) where it can be used for real-time inserts, but that requires more hands-on setup and configuration.

edouard1234567 · 14y ago

Congrats Umur and team. I look forward to trying it for ZeTrip analytics.

kt9 · 14y ago

Congratulations on the launch! I've been following this company for a while and it's great to see the public launch!

emre · 14y ago

Congrats to the Citus team! It looks like a great product, will definitely use it!

kolistivra · 14y ago

Looks very promising, I will definitely give it a shot! Good job!

seboavalin · 14y ago

Scalable data analytics accessible to anyone? Great!

pinarsezer · 14y ago

Love the video btw!
