It's also almost mandatory to read the internal design docs of Cassandra even if you're not the admin but just working with it. And modelling data is a lot less trivial than it looks - and almost always not what you assume.
Anyway, this is the best talk I've seen on Cassandra data modeling even if you know Cassandra but not on an expert level. I made sure everyone on my team saw it at least once. https://www.youtube.com/watch?v=qphhxujn5Es
Instead of using a map (or COMPACT STORAGE), they should have defined their schema upfront (one of the limitations brought on by not-SQL). However, if they didn't want to do that, then COMPACT STORAGE is obviously a better solution than using a Map.
To answer your question about COMPACT STORAGE, standard CQL basically does the work for you (in terms of parsing the "xsv") but you have to define the schema upfront (as in, you have to know all the map keys beforehand). The reason they tell you not to use COMPACT STORAGE is that, in cases where you don't need a dynamic schema, it doesn't really get you anything.
Lastly, IMO, I wouldn't touch CQL Collections or COMPACT STORAGE unless absolutely necessary. If you do need a dynamic schema I would rather encode the data as a msgpack or protobuf blob.
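A minimal sketch of the blob approach in Python, in case it helps. It uses the standard library's json module as a stand-in, since msgpack and protobuf both need third-party packages; the function names and the row layout are hypothetical, and the table would just see one opaque blob column.

```python
import json


def encode_attrs(attrs: dict) -> bytes:
    """Serialize a dict of dynamic attributes into one opaque blob."""
    # sort_keys gives identical bytes for identical input;
    # compact separators avoid storing padding spaces.
    text = json.dumps(attrs, sort_keys=True, separators=(",", ":"))
    return text.encode("utf-8")


def decode_attrs(blob: bytes) -> dict:
    """Inverse of encode_attrs."""
    return json.loads(blob.decode("utf-8"))


# Hypothetical row: fixed columns are declared in the schema,
# everything dynamic lives inside the single "attrs" blob column.
row = {
    "url": "http://example.com/a",
    "attrs": encode_attrs({"views": 3, "referrer": "search"}),
}
assert decode_attrs(row["attrs"]) == {"views": 3, "referrer": "search"}
```

The trade-off is that the database can no longer read or update individual keys; every change means reading, decoding, and rewriting the whole blob on the client.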
We tried. It was not only less disk efficient, but also slower for queries. Wasn't the result we expected, but alas.
I've run multi-gigabyte data sets in Cassandra without any config changes. And when I was responsible for a 40 node Cassandra cluster we didn't do any tweaks other than a few JVM settings here and there.
I have to admit, though, that Cassandra is very sensitive to how you model and store your data. The whole tombstone saga is never a fun one to go through.
How were you querying 40 nodes, or were you offloading the querying to Lucene or something [e.g. like Parse.ly is]?
If you care, the characters are:
FS -- file separator 1C
GS -- group separator 1D
RS -- record separator 1E
US -- unit separator 1F
As you can see, they also have the benefit of being self-describing (unlike \x01, as the article points out).
(The sad thing is I only learned about these characters because I had to parse files in a 1960s format originally designed to be stored on tape drives -- and they used these delimiters and they worked great.)
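For what it's worth, here is a rough Python sketch of using two of those characters as delimiters. The helper names are made up for illustration, and it assumes field values never contain the separators themselves (it raises if they do rather than trying to escape them).

```python
US = "\x1f"  # unit separator: sits between fields
RS = "\x1e"  # record separator: sits between records


def dump_records(records):
    """Join a list of records (lists of strings) into one string."""
    for rec in records:
        for field in rec:
            if US in field or RS in field:
                raise ValueError("field contains a separator character")
    return RS.join(US.join(rec) for rec in records)


def load_records(text):
    """Split a string produced by dump_records back into records."""
    if not text:
        return []
    return [rec.split(US) for rec in text.split(RS)]


records = [["id", "title"], ["1", "tabs\tand, commas are fine"]]
encoded = dump_records(records)
assert load_records(encoded) == records
```

Tabs, commas, and quotes inside a field need no escaping here, which is the whole appeal.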
IMO, there are ups and downs to using unprintable characters in your protocol. On one hand, yeah, you don't need to worry about someone putting a tab in a TSV field and messing up the format. On the other hand, it can make debugging a lot harder because what you see isn't necessarily what you get, and obviously it becomes harder to hand-write requests.
Of course, OP ended up using unprintable characters anyway, so I think they might as well use the ASCII ones. But even if you know about and use those characters, I think there is still a place for [CT]SV.
I'm glad you brought them up though, because I think the nonprintable ASCII characters occupy a very interesting place in computer science -- (almost) universally supported, but (almost) never used.
“You honestly expected that adopting a data store at your scale would not require you to learn all of its internals?”
These aren't even internals, these are basic facts about Cassandra. And I think it's worth noting that the "serious" things you would do on Cassandra, more often than not, you couldn't do on most other databases.
Great power, great responsibility.
Glad to answer questions. Ask me anything!
I would be interested in hearing more from someone who has used a third-party managed Cassandra service like Google's Cassandra product https://cloud.google.com/solutions/cassandra/. Did you still need deep internal knowledge to use Cassandra? This is important for me to know because my organization will not have the resources to manage Cassandra clusters but we need a Cassandra-like store.
I think the article author's problem was that they assumed Cassandra would be similar to the technology they were already familiar with, which was not the case; it's different in so many areas, especially with regard to its performance profile, and to someone used to relational databases it's outright alien.
At the scale they're describing, they do exceed the threshold point at which one has to learn the internals of the technology. In engineering, scale is everything. Riding a bicycle doesn't require that you know how bicycles work, but sending a rocket to the moon demands a lot of knowledge about a wide range of subjects.
I agree it wasn't tech's fault that Kevin Rose allowed VCs to control Digg, but there were issues with Cassandra, especially when the SQL vs NoSQL debate was at its peak.
The article doesn't even mention that there are a lot of places in the docs where it says: "If you configure a keyspace like this, your node will most likely crash". Not just degraded performance, any developer with some access might crash a node (and maybe the whole cluster) with a userspace error.
The main lesson I took away from Cassandra is that battle-tested@Netflix(/other bigcorp) doesn't mean resilient and might require an engineer on standby at all times to run correctly.
This is entirely true. We are big users/consumers/fans of Cassandra at Spotify. We have approximately one crap-ton of Cassandra clusters with several metric crap-tons of data. But we have an entire team that provides support and tooling around Cassandra. We contribute upstream, and have employed core Cassandra contributors in the past.
In return we get a datastore that scales pretty much infinitely with our data sets, has performance characteristics that we are now well aware of and are able to reason about, and provides us cross-DC replication and topology-awareness. It took years to get to this point though, and to build the operational expertise required to run Cassandra. Only recently have we gotten to the point where teams are able to self-service provision their own Cassandra clusters.
This is a resilient, scalable solution, but if I were to quit tomorrow and start a five-person startup, there is no way I would consider C* as a workable solution.
Bad users (read: developers) can break things. If you are worried about bad users, then put an API between users and the datastore to keep them from breaking things.
Parse.ly's customers are media companies so we deal with news / media / headlines all the time, so we were being cute in referencing this practice in the article. I used the headlines as a mechanism to break up the prose, too, and add some levity, since this article weighs in at over 4,000 words.
But I underestimated the ability of techies to scan the headline and paragraph one of an article and say, "not worth my time" and bounce instinctively. In fact, when I tried to submit this post to r/programming, the moderators instantly deleted it, referencing the headline, despite it being a parody.
One thing I shouldn't be surprised by: The Internet adapts!
Non-COMPACT storage by itself doesn't add any significant overhead (only 2 bytes per cell, and even less after compression). What really wastes space is using collection types. Even with those, I doubt it will ever reach the 30x mark the author stated.
For example, using Maps for anything other than storing a handful of values has always been discouraged.
Why in hell would you think that?
CQL is not SQL? No shit. Actually, the docs pretty clearly spell that out. Anyone with even a cursory knowledge of Cassandra or any big data store would know you need to understand how the system works, and what its limitations are, before you model your data for the system.
COMPACT STORAGE is for backwards compatibility. Turn it on and you are turning off CQL3: http://docs.datastax.com/en/cql/3.0/cql/cql_reference/create... http://www.datastax.com/dev/blog/whats-new-in-cql-3-0
Counters are only usable for things where you don't care if the counter gets increased exactly one time. It is for stuff that doesn't really matter, like "likes" on a page or something.
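A toy illustration of why, in Python. This is not Cassandra's implementation, just an in-memory model of the usual failure mode: the increment is applied, the acknowledgement is lost, and the client (which can't tell the difference) retries. ToyCounter and increment_with_retry are hypothetical names.

```python
class ToyCounter:
    """In-memory stand-in for a counter column (not Cassandra's code)."""

    def __init__(self):
        self.value = 0

    def increment(self, lose_ack=False):
        self.value += 1  # the write is applied on the "server"...
        if lose_ack:
            # ...but the client never hears back.
            raise TimeoutError("acknowledgement lost")


def increment_with_retry(counter, lose_first_ack):
    """Increment once, retrying blindly if the first attempt times out."""
    try:
        counter.increment(lose_ack=lose_first_ack)
    except TimeoutError:
        # The client cannot know whether the write landed, so it retries.
        counter.increment()


likes = ToyCounter()
increment_with_retry(likes, lose_first_ack=True)
print(likes.value)  # 2: one logical "like" was counted twice
```

An increment is not idempotent, so the only safe choices are "retry and maybe overcount" or "don't retry and maybe undercount".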
I'm not even reading "Check Your Row Size: Too Wide, Too Narrow, or Just Right?" because measuring the max number of columns and the ~10MB performance limit is trivial to do, and this is a datastore designed by data scientists for people who are willing to do the very little bit of learning required.
This whole article reads like someone who decided to incorporate Cassandra without actually spending any time learning how to do it correctly before trying to do it.
"He said, “You honestly expected that adopting a data store at your scale would not require you to learn all of its internals?” He has a point."
No shit.
I've deployed C* on some similarly large datasets and encountered none of these issues - even when storing terabytes of time-series data. The difference - I read through the documentation from head to foot before getting started, did numerous dry-runs to figure out the data modeling, and when I wasn't sure about something asked the community (who are usually incredibly quick to respond on IRC, Twitter, or elsewhere).
- CQL stands for Cassandra Query Language. As a user of C*, I don't remember the docs claiming anywhere that it conforms to the SQL standard. They do mention it being "similar to SQL", which is a fair comparison.
As a counter-example, take another somewhat similar database, AWS DynamoDB: the absence of a CQL-like syntax there really frustrates me.
- Data Modelling: Effective data modelling is difficult in all DB systems, inherently more so in NoSQL or non-ACID distributed DBs.
- Counters & Collections: I feel that criticism is legit. I felt similar pains too. I've learned the lesson there not to trust all marketing claims.