With a relational database the complexity is hidden (more or less…) whereas with Big Data and NoSQL the developer needs to deal with this complexity himself/herself. As a result, most of the Big Data applications I’ve seen don’t work well.
I really like Big Data because it’s more complex but to be honest, most of the time my work does not require the “Big Data scale”.
The other thing document databases can offer that relational databases struggle with is taking subsets (which we use for offline sync.) [2]
[1] http://blog.couchbase.com/nosql-adoption-survey-surprises
That is my main issue. I use Cassandra over relational firstly for its linear scalability and multi-master-esque HA. But even ignoring those, I understand exactly what is being scanned and what is not; I don't have to fight with an optimizer that decides at runtime based on several parameters.
Teradata, Oracle and PostgreSQL, for example, are reasonably complex databases to cluster and manage yourself. Just as easy/hard as setting up HDFS and installing Hive. In all cases people who are at big data scale are buying OTS solutions, e.g. a Cloudera appliance. They aren't rolling their own.
And if you are using Hive then I can understand why you are not feeling the buzz. But play around with Spark for a while and it's easy to see the future. Being able to write Scala/Python/SQL/R against a data set that can be anywhere from 100MB to 100PB without any changes is pretty compelling.
For instance, you don't typically look at the size of the literals in the query when evaluating query complexity. If it's really unbounded, you probably shouldn't use a relational database.
The "problem" with malloc() (or any equivalent allocation function) is that it manages free blocks internally (it is a middleman between the OS and the user process, whose purpose is to reduce OS calls). If you have lots of them, dynamic memory allocation can take time. By "malloc" I meant malloc/realloc/free, the whole kit. Those operations are not free (in most cases you're not going to have millions of allocations in one process; that was just an example of hidden things that could make your algorithm not behave as expected).
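To make that concrete, here is a rough sketch of the idea, assuming a toy first-fit allocator that walks a list of free blocks. ToyAllocator, fragmented_heap and cost_of_one_allocation are made-up names for illustration, and real malloc implementations use much smarter bookkeeping (size bins, per-thread caches), so treat this only as a model of "the allocator does hidden work":

```python
# Toy first-fit allocator over a list of free blocks.
# Hypothetical model for illustration; NOT how glibc or any real malloc works.

class ToyAllocator:
    def __init__(self, free_blocks):
        # Each free block is (offset, size); this list stands in for the
        # allocator's internal bookkeeping between the OS and the process.
        self.free_blocks = list(free_blocks)
        self.blocks_inspected = 0  # hidden work done so far

    def malloc(self, size):
        """Return the offset of a block of `size` bytes, or None if none fits."""
        for i, (offset, block_size) in enumerate(self.free_blocks):
            self.blocks_inspected += 1
            if block_size >= size:
                if block_size == size:
                    del self.free_blocks[i]
                else:
                    # Split the block and keep the unused tail on the free list.
                    self.free_blocks[i] = (offset + size, block_size - size)
                return offset
        return None


def fragmented_heap(n_small_blocks):
    """n small 16-byte holes followed by one large 4096-byte block."""
    blocks = [(i * 32, 16) for i in range(n_small_blocks)]
    blocks.append((n_small_blocks * 32, 4096))
    return blocks


def cost_of_one_allocation(n_small_blocks, size=64):
    """Number of free blocks inspected to satisfy a single malloc(size)."""
    allocator = ToyAllocator(fragmented_heap(n_small_blocks))
    allocator.malloc(size)
    return allocator.blocks_inspected
```

In this model a single 64-byte request inspects 11 blocks when there are 10 small holes in front of the big block, and 1001 blocks when there are 1000 of them. The request is the same; only the hidden state of the allocator changed.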
Someone told me the same on the article comments and here is the answer I gave him:
You’re right and I agree with you. When I wrote this part, I REALLY hesitated to give the real asymptotic definition and what it means for the number of operations but I chose a simpler explanation since the aim of this post is not to become an expert but to have a good idea. I hope that this won’t mislead people but I thought the real definition was too hard for a “newcomer” and not important to understand a database. This is also why I added in this part “The time complexity doesn’t give the exact number of operations but a good idea.” and said at the end of the part “I didn’t give you the real definition of the big O notation but just the idea” with a link to the real definition.
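For anyone who wants to see the "good idea, not the exact number" point in practice, here is a small sketch that counts loop iterations for a linear search and a binary search over a sorted list. The function names are mine and are not from the article; the step counter is a simplification of "number of operations":

```python
def linear_search_steps(sorted_items, target):
    """Count how many items a linear scan looks at before stopping."""
    steps = 0
    for item in sorted_items:
        steps += 1
        if item == target:
            break
    return steps


def binary_search_steps(sorted_items, target):
    """Count how many midpoints a binary search probes before stopping."""
    steps = 0
    lo, hi = 0, len(sorted_items) - 1
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_items[mid] == target:
            break
        if sorted_items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return steps


items = list(range(1_000_000))
worst_linear = linear_search_steps(items, 999_999)
worst_binary = binary_search_steps(items, 999_999)
```

Looking for the last of 1,000,000 sorted items takes 1,000,000 steps with the linear scan and at most 20 with the binary search. O(N) and O(log N) don't tell you those exact counts (each step also hides a different amount of real work), but they do tell you which one is going to hurt when N grows.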
[1] http://db.cs.berkeley.edu/papers/fntdb07-architecture.pdf
That's because the inner workings are really old, as in: they emerged before blogging etc. was popular, hell, before the internet was invented.
In the 'before/early internet days', we read books like 'An introduction to Database Systems' by C.J. Date (I had to blow the dust off my copy to read the exact title ;)), which are more in depth than this article, but I like the article better, because it's more to the point and easier to understand. Well done!
That's a pretty bold statement...
Very thorough explanations though!
If you liked this article, maybe you'll like my article on Shazam. I used the same pattern: I start from the basics of sound processing and computer science and finish with an in-depth explanation of Shazam.
The thing I love most about database engines is that there is probably more hardcore computer science per line of code than any other software system of similar scope. It is a very rich ecosystem for an algorithms and data structures geek.
http://www.se-radio.net/2013/12/episode-199-michael-stonebra...
[1]: http://www.theregister.co.uk/2015/03/25/mike_stonebraker_win...
It's really understandable and very straightforward; even if a lot of it refers to SQLite v2, it still seems very relevant.