Relational Data, Document Databases and Schema Design (opens in new tab)

(paperplanes.de)

37 pointsbinbasti15y ago14 comments

14 comments

10 comments · 2 top-level

gfodor15y ago· 6 in thread

Document databases are a passing fad, a backlash against the flaws in the currently available open source databases. These flaws have nothing to do with the relational model.

If there was an open source RDBMS that allowed instant/lazy schema migrations, direct map-reduce access to its data, was able to automatically partition and shard, and somehow magically had performant joins you would never see these document database come to be so popular. Document databases have no mathematical grounding analagous to the relational model; the reason is because they are a hackneyed application of the relational model that allows implemention to bleed even further into the model than current RDBMses do. For example, the fact a developer now will have to choose between "embedding" or joining documents together application side has absolutely zero different in terms of the true data model. The data model is unchanged in either scenario, it is just being materialized differently on disk for implementation issues.

Edit: To be clear, the way things should work is you should model your data as a normalized, clean model. Then, you should be able to annotate the data or provide alternative views that "embed" documents and so on to reduce cross-table joins. This is an optimization layer and the actual materialization of these views should be maintained by the database. It should be part of the query optimizer, and your joins should simply speed up once they can access the "embedded versions". The fact that you have to do this all manually and store the data yourself, and basically take on the role of the query optimizer as well, shows that we've traded C for assembly code in the db world with these document databases.

It's truly a sad state of affairs that peoples horrible experiences with whatever particular RDBMS they've used are causing them to react by using one of these document databases. Now, they've got a whole series of new problems. There is no free lunch until someone addresses some of the pain points like schema evolution in a way that doesn't sacrifice everything else.

herewego15y ago

I disagree. You are talking about the relational model conceptually. The author is talking about the relational model as it is implemented in today's RDBMS solutions.

You elude to this in the last paragraph of your comment. Having a horrible experience horizontally scaling, for example, is ample reason enough to look into document databases. Largely because the current implementations of document databases allow for ease of partitioning in a way that the current implementations of RDBMS' don't.

The author doesn't imply that document datastores are free lunches. They're simply alternatives and he goes on to say that document datastores are better for some problem domains than others.

Also, you oversimplify the difficulties in implementing a highly-performant document datastore by implying that data should be normalized and that storing the data on disk is an optimization layer. This comment comes off as being written by someone who is ignorant of how document disk storage is implemented and why it is so fast. Doing what you suggest would very likely result in a datastore that was drastically slower than the currently implemented document datastores or even key/value datastores for that matter (MongoDB, CouchDB, Cassandra, etc).

Will we someday perhaps have the end-all-be-all of RDBMS's that merges the feature-set of current RDBMS's and the performance of document/key-value datastores? Maybe, but until then, they're alternatives to each other that generally succeed in providing solutions for the problems for which they were designed:

* RDBMS - consistency, transactions, predictable schema, etc. * Document/Key-Value - horizontal scalability, raw performance, flexible schema, etc.

Different solutions for different problem domains.

ljlolel15y ago

Hi, I think that you are completely right. Schema evolution and easy partition/sharding are broken on current open-source RDBMS systems. Relational data has strong mathematical guarantees as well as useful normalization properties for ensuring consistency.

I have a question about what good solution you have to solve these problems. Are you suggesting that Oracle does this well as a commercial solution? Is anybody at all working on a next-generation RDBMS that makes schema migration truly easy?

For the sharding/partitioning question, is it even possible to make a relational data scheme that stays available and consistent in the case of network partition?

gfodor15y ago

The solution is pretty clear, simply make it easy to do what the author suggests in their article, in a safe and repeatable way. You define your new schema, and you define the tranformation rules to run on the old data. In turn, the database applies these lazily. There'd be some formality necessary to prove certain operations could be applied lazily to provide a consistent view of the data. People are doing this by hand now; it should be part of the system. One can imagine a mathematical model that involves deferred relational operators. The constraints that would be imposed would probably be that certain operations would not be permissable until others complete. For example, it might be necessary, when running one of these lazy migrations, to "flush" the remaining untransformed rows to the end before you are able to apply another one. The upside is your migration scripts run instantly, becuase they have no side effects until the next transaction.

The CAP theorem I think often gets rolled up into these "NoSQL" discussions erroneously. One can still model data relationally but materialize it in a way that reduces ACID compliance to survive in the face of network partitions. Arguably this is a red herring though. In practice, the "network partition" problem to me seems to be only really worth worrying about if you are Google or Amazon and are running cross-data center data storage. For the majority of the world this is not something that needs to be worried about. (See: VoltDB's choice on the matter.)

Edit: By the way, I'm not saying any of this is easy. It's hard, really hard. (I couldn't do it.) But the point here is that it's easy to get caught up and think these 'new' document databases are solving these problems. They're not. They're putting you back down to the assembly code level and forcing you to do these things yourself. The whole point of the RDBMS is to provide an abstraction layer. They're removing this abstraction layer and heralding it as a step forward. It is, in a bizarre sense similar to how it's a "step forward" for you to be able to access the registers on your CPU, but it's surely a step backwards in many ways as well.

1 more reply

dalore15y ago

I disagree, RDF is gaining ground and is getting more popular. In the real world, schema changes are a fact of life and RDF handles it well. RDF also solves the relationship quandary you mention and even allows database (stores) to be distributed and disparate.

mjw15y ago

RDF is based on a highly-normalised data model though--everything is broken down into binary relations.

This is essentially just a rephrasing of the relational model (modulo a relatively simple translation which adds surrogate keys in order to split up >2-ary relations which aren't normalisable any further, into binary relations)

mjw15y ago

Couldn't agree more.

They key point which I think people need to realise is that document databases / key-value stores are essentially lower-level tools than relational databases.

Don't expect a free lunch from choosing lower-level tools. Choose them when you need the extra flexibility / performance / scalability, but be prepared for an increased risk of shooting yourself in the foot, and prepared to implement lots of lower-level pieces of the puzzle, which a high-level tool will handle correctly (if not as speedily) for you.

Rather like choosing C over (say) Python. Although I wouldn't push the analogy tooo far.

wanderr15y ago· 2 in thread

I stopped reading when he said consistency is in your hands (with noSQL). Since you have to manage relationships, and doing so requires several round trips to the datastore, and generally doesn't involve transactions, what happens when inserts 1 and 2 succeed, but insert 3 fails? What happens if you crash before insert 3? In high volume applications, these situations will arise and you will end up with inconsistent data. That doesn't prevent noSQL from being useful in some situations, but it's a far cry from putting control over consistency in your hands.

jchrisa15y ago

CouchDB has full ACID on a single document, so you can avoid these scenarios by writing everything that needs to happen in a transaction into a single document. Modeling your data like this means you are also more robust against partial replications (when using offline mode).

It is possible to do banking this way (CouchDB is probably better for banking than relational systems). Here's an example:

http://books.couchdb.org/relax/reference/recipes

wanderr15y ago

I'm not too familiar with CouchDB in particular or what exactly a "document" is. Let's say I have a site about books. I need to be able to look up books by author, title, isbn and users who have it in their catalogue. Is there a way in CouchDB to ensure consistency while maintaining those relationships?

1 more reply

j / k navigate · click thread line to collapse

14 comments

10 comments · 2 top-level

gfodor15y ago· 6 in thread

Document databases are a passing fad, a backlash against the flaws in the currently available open source databases. These flaws have nothing to do with the relational model.

herewego15y ago

I disagree. You are talking about the relational model conceptually. The author is talking about the relational model as it is implemented in today's RDBMS solutions.

The author doesn't imply that document datastores are free lunches. They're simply alternatives and he goes on to say that document datastores are better for some problem domains than others.

* RDBMS - consistency, transactions, predictable schema, etc. * Document/Key-Value - horizontal scalability, raw performance, flexible schema, etc.

Different solutions for different problem domains.

ljlolel15y ago

For the sharding/partitioning question, is it even possible to make a relational data scheme that stays available and consistent in the case of network partition?

gfodor15y ago

1 more reply

dalore15y ago

mjw15y ago

RDF is based on a highly-normalised data model though--everything is broken down into binary relations.

mjw15y ago

Couldn't agree more.

They key point which I think people need to realise is that document databases / key-value stores are essentially lower-level tools than relational databases.

Rather like choosing C over (say) Python. Although I wouldn't push the analogy tooo far.

wanderr15y ago· 2 in thread

jchrisa15y ago

It is possible to do banking this way (CouchDB is probably better for banking than relational systems). Here's an example:

http://books.couchdb.org/relax/reference/recipes

wanderr15y ago

1 more reply

j / k navigate · click thread line to collapse