I must have missed the boat on this one. I remember in 2010 there was a brief period of time where NoSQL was in fashion, but it rightfully died pretty quickly to a small set of specialized use cases. There have been some cases where big data systems have replaced more traditional rdbms systems, but now you can use SQL for those too (like Hive SQL).
SQL is the one skill that has not become obsolete in the course of my career. Frankly I've started relying on it more, because it never goes obsolete. Also it's fast as hell. When I first started my career I did C#, and the .net Framework 2 was fairly new. Since then WinForms, and WebForms have gone away. ORM's changed, Javascript changed. Then I moved to Ruby, and Python, and PHP. Those ecosystems have evolved too. But the one thing that I learned 15 years ago, that I still use every day is SQL.
A few weeks ago I had a lead server-side architect call me up and ask if my client could tolerate eventual consistency. My answer was that "he could get away with that 10 years ago but that NoSQL went out of fashion and SQL came back in again."
To put it in a different context: Some people just go through periods in their life where they don't keep up and don't realize what was "new and exciting" 10 years ago turned out to be smoke and mirrors.
As others have said, if you operate at a certain scale and can get away without complex joins and aggregations (which you should avoid anyway if you’re operating a high availability service), then NoSQL is a great option. It’s easier to scale out (often you just turn a knob) and its behavior is simpler and easier to predict.
Since then, developers have been coming up with clever ways to implement distributed joins and transactions more efficiently and scalably on top of distributed database engines, and thus "the return of SQL".
I am probably see as one of these old developers that refuses to get with the times (rather than seeing NoSQL for a step backwards that it is). Its this constant churn of bullshit that makes me want to do a different job after nearly 20 years (as well as the nonsense that is the hiring process).
It reminds me a bit of a paper I saw recently on The Morning Paper called Ironies of Automation: https://blog.acolyer.org/2020/01/08/ironies-of-automation/
That what is not standardized, dies
compared to what? excel?
Data scientist here. I think some of this problem is handled by effective use of views. Oh, everybody is constantly joining these three accounting-related tables and aggregating by, say, order number? Have your Data Engineer/DBA/analyst/whoever create a view that takes care of that. Boom. Now everybody's using the same data, calculated the same way, nobody's reinventing the wheel, and you don't have to worry about somebody fat-fingering something when they re-write that query for the 10th time.
With that being said, I still think there's some truth to this criticism, in that it's not as easy/common to be able to build an abstract query that does a common operation on arbitrary data. You can't import trend_forecast.sql, hand it arbitrary time-series data, and generate an N-month linear forecast from your historical data points. At least, not easily in ANSI SQL.
To the point on views - they can seem really useful, but many people dont understand that views dont eliminate "useless code" - most engines are going to evaluate everything in the view, and the nested subviews, and of course, the 10 depth views that were created because they are such a great abstraction.
This is where (to me) other non-database programming languages definitely come in - you can codify codegen or other methods that are locked in, not just a huge string and get a lot of the same value as your views (unless of course, everyone is querying the db directly.) If that's the case, then templating out code is still a good plan, but you can usually accomplish it via the client tooling people are using (most of the DB IDEs have pretty decent snippet support.)
My strategy for these systems is to first create a view that captures the complete business' desired cardinality (which may be surprisingly complicated), and all the reporting views and stored procedures start with that view as the first table to join off of. Ideally you can implement it with left joins in such a way that allows you to use it everywhere you need any part of the heirarchy and the SQL engine will trim the parts that you don't use in each particular query that uses it.
With this you get the abstraction / reusability that enhances productivity and you get more reliable & reproducible results everywhere. It's much more likely that cross checking to separate reports works when everything is starting with the same cardinality.
For what it's worth, MSSQL, PostgreSQL and Oracle don't have that limitation.
I know plenty of people who are comfortable in both. But I don’t know any “data science” people as good at sql as a good dba. But that’s ok.
I think a good analysis requires both sql and some flat file, if only for reproducibility. I’ve encountered many point in time analyses that can’t be recreated because they were just sql against db and there’s no way to reproduce the result months later. Are their methods accurate? Or auditable? It’s much harder because the source data is changed.
So I like talking about only-sqlers about the importantance of incorporating the data management of extracts and archives into the overall solution. And doing so in a way that doesn’t try to redesign the database trying to avoid the need for users to file away point in time snapshots of data.
I’ve also tried to do things in sql only to realize that I could have done it easier in pandas.
I think it’s usually a mix of both for me, but I try to at least get a decent “tidy” dataset out of sql and then do all the manipulation in pandas because a python pipeline is more portable than a sql pipeline, I think.
I wrote a fairly complex mango query a while back, and trying to understand it now is a nightmare. Compare that to SQL, which is one of the most readable languages out there in my opinion.
I'm now officially a bleeding edge DevOps with 10 years expertise on the brand new "old school" ELT (Extract,Load, Transform).
LoL
Long scripts of SQL with intermediate steps (e.g. building temp tables) are hard to reason about and debug, and especially painful when they fail only in rare production scenarios.
Developers find it a lot easier to reason on an item by item basis, and not on a set or batch basis, and it's very tricky to get error handling and transactions right on a batch basis unless you want to fail the whole batch.
Putting all the work on the database also means you need to scale up that layer. From another POV, it can be simpler to have cheap point lookup, secondary indexes in stores that are tuned for the specifics of the queries, and get performance back by scaling out the compute.
2.Testing for duplicate is a SQL one-liner
3.If your data model is well conceived you can actually unwanted 1-n relationship to be inserted in the first place. Yeah sharding database is coming late to the game but ACID compliance is a killer feature. Code to manage inconsistencies in NoSQL is way less trivial as far as I know... unless it's ok to have an inconsistent database... Guess it depend on use case but ACID seems nice for any important dataset.
(0) https://www.researchgate.net/publication/337336089_Concept-o...
```
It’s better to have 100 functions operate on one data structure than 10 functions on 10 data structures.
---- Alan Perlis
```RDBMS works great up until the point that it doesn't.
Web analytics was one of the first applications for Greenplum. My understanding is that Yahoo collected tens of billions of events per day in the mid-2000s.
> Or when you are trying to build a wide table and you run out of columns.
HAWQ can run SQL queries over Hadoop clusters. Clickhouse's table width is limited by how much RAM you give it.
> Or when your favourite SaaS products gives you highly nested JSON data.
This is why major databases have JSON querying capabilities and why it's been added to the next SQL standard. PostgreSQL even allows you to define indices on fields inside your JSON structures.
Better yet: decompose the highly nested data. Relational databases begin to shine when you get past at least first normal form.
> RDBMS works great up until the point that it doesn't.
RDBMSes do work great until they don't. Which means they are almost always the best solution and almost always remain so.
Folks regularly overestimate the size of their problem and underestimate the capabilities of the literally dozens of RDBMSes now available for use. Yes, it irks me.
Disclosure: I work for VMware, which sponsors Greenplum development.
The good business folks know SQL and aren't afraid of it. I used not to be sure of this until I worked in an organization where most PM types use SQL with comfort.
I created a mini-training for the onboarding process, so they could at least look for things without killing all our read replicas. But that’s just one step; some people are just not good at it.
Fear... fear is having everyone blocked because one person is killing all the replicas for them to work on.
It’s not their job to optimize the database, and it’s no different from seeing that a dev pushed a new feature that causes bad performance as well
That person should reach its quota limits and ask you to help him optimize the query.
Having timely and detailed data available to a wide range of people in the organisation is now seen as a competitive advantage in many industries. There's a lot of tech out there to help with this.
But I could have written this article talking about how companies like Reltio are relying on NoSQL solutions to "empower the enterprise", or how Firebase is allowing startups to not worry about data structures, or how HSBC is deploying blockchain solutions, or how Spark means you can combine data from all different sources. It would still be just as accurate and meaningful. As it is, it just sounds like an infomercial for Looker.
I haven't benchmarked this yet, but after my first experiences with somewhat complex data transformations in numpy and pandas, I was left with the feeling that despite them being optimized, any modern RDBMS would still have run circles around them.
They've been optimized to this kind of stuff for decades, after all.
Remove the assumption that the end result is going to be a row or ten (which usually have good obvious plans with clear indexes), and the query planner is forced to make decisions about choosing the best plan amongst very costly alternatives. The exact costs are sensitive to assumptions about distribution, I/O costs, memory costs etc. and those assumptions need delicate tuning, especially when you don't have hinting as a tool (e.g. Postgres basically doesn't let you give many hints - you can kind of force join order, but the biggest tool is the CTE optimization barrier - I can get 100x speedups by moving subqueries out into CTEs and ensuring they're opaque).
To write efficient big queries, you need to think in terms of data flow and write declarative SQL with an execution plan in mind; it reminds me of my days tweaking source code to encourage a compiler to make certain register allocation decisions.
PG is a bitch here. It's often easier to get good performance by fetching one query, transforming it into a comma-delimited string, and then injecting it into the subsequent query (possibly in batches) to be sure you're getting the plan you need.
MySQL's planner is kind of stupid, but it's predictably stupid and there's lots of hinting tools available. I find it a lot easier to make it work fast for big queries. Sometimes it simply doesn't implement the best strategy (e.g. no hash joins), of course, so sometimes it's not as fast as you could get with PG, but it's usually easier to outperform.
That mirrors my experience with Oracle. The only difference is that there's a hinting mechanism provided which will almost always allow you to force the execution plan you want.
Any column store warehouse (Redshift, Bigquery, Snowflake) is terrific for these kinds of queries, though.
if you don't measure, you're just guessing.
> after my first experiences with somewhat complex data transformations in numpy and pandas
there are fast ways and slow ways to use numpy/pandas, but generally speaking, it's easy to get order of magnitude improvements using pandas compared to an RDBMS.
In pandas, for example, the data from a column is a contiguous in-memory array. Generally RDBMS data is row-oriented, possibly in memory or possibly on disk. Performing some numerical operation on the contiguous array is going to play to the cpu's strengths much better.
This ignorance is especially surprising given that it's essentially a solved problem (Kimball), yet if you talk about data modelling, people usually think regression, not schema.
I'm not sure I would 100% agree with that - e.g., denormalization, while useful for many things, isn't always your best option. But I would say that that there are a lot of tools in the box, and that is absolutely one of the critical ones to know.
I too very like a common interface to various data sources. OctoSQL allows you to use plain SQL to transform, analyse and join data from various datasources, currently including MySQL, PostgreSQL, Redis, JSON files, CSV, Excel.
However, we're very inspired with yesteryears paper "One SQL to rule them all" and should have ergonomic steaming support with Kafka as a new datasource available very soon.
q is a command line tool that allows direct execution of SQL-like queries on CSVs/TSVs (and any other tabular text files).
q treats ordinary files as database tables, and supports all SQL constructs, such as WHERE, GROUP BY, JOINs etc. It supports automatic column name and column type detection, and provides full support for multiple encodings.
Apache MADLib[1] for PostgreSQL and Greenplum. Everything is done inside of SQL; for example, to fit a linear regression model, you execute the query "SELECT madlib.linregr_train(...)".
MLlib[2] for Spark now has a DataFrame based API (spark.ml) for easy integration with Spark SQL.
Every such library I've seen has been a little behind the times, usually offering nothing more advanced than logistic regression, random forest, maybe a multi-layer neural network. I've also never seen a SQL-based ML library that offered GPU training.
For reference, state-of-the-art neural network libraries like TensorFlow or PyTorch support GPU-based training (something like an order of magnitude improvment in training performance) and automatic differentiation (allowing for easy specification of complex models.) For trees, state-of-the-art now includes boosted tree algorithms like XGBoost or CatBoost. Random forest is still a workhorse, but a library which only provides random forest (like the two mentioned above) are a little behind the curve.
It seems that libraries that focus on doing one thing well (ML) are able to offer more features and have an easier time keeping up with times than libraries that also take on the burden of SQL integration. Which isn't to say MADlib and Spark MLlib are bad - not everyone needs state-of-the-art algorithms all the time, and it can be convenient to be able to do so simple things fully inside the SQL environment - but they're never going to be cutting edge, so they're not going to get much attention.
Aside from that, some of the simple models can be fitted using SQL.
It used to be that you needed tools like Informatica and Kimball inspired datamarts, but databases are now bigger and faster. Whatever data modeling problems you may have, you can easily clean up in an ETL or a BI layer, with relatively little effort. This makes tools like Looker, dbt, and Holistics a luxury and not something you need to have. I wish the industry would put more effort into defining clean data models, but I think that ship has sailed. The prevailing trend seems to be to create Data Lakes, add a BI layer, then call it a day.
[edit] Also...some database points. The industry never shifted to using NoSQL to replace RDBMS systems. But event processing matured, NoSQL db's are ideal for storing unstructured data, so you see them in data engineering stacks. Greenplum is a free MPP database that has been around for nearly a decade. The point about Spanner SQL is interesting for the fact that Spanner evolved from NoSQL like methods to a SQL like dialect, but Spanner is a unique flower in the industry, due to being an HTAP db.
But given how many companies have setup data lakes with unstructured and semi-structured data (think SaaS exports) and how SQL layers have largely been unimpressive not sure it's the case.
I guess they target both EL from the ELT, while dbt relies on data being in your data warehouse.
I don’t mean that in a negative way. I mean it in the sense that many of the newest developers don’t know where their cloud based NoSQL database came from (for example). They never were taught the history of what came before, during and after RDBMS. They are only now rediscovering some of what the “old” tools could do.
Many of these developers seem eager to learn, and I am happy to mentor them and teach them the history that I know.
But it has surprised me, it almost feels a little like so many years of waves of marketing and hype maybe actually had a real detriment to teaching people what is real, what is the best tool for the job in different cases, etc.
I have no real evidence other than my anecdotal experience, but this article lends credence to the argument that some never learned or never were given the time to understand the discipline of databases.
Possibly, the discipline of databases and related development has just been continually developing and never settled, so that is why the curriculum hasn’t kept up. But, it really does concern me when a new developer doesn’t even understand what a JOIN is.
Edit: Or even moreso that SQL is an interface to a data engine, and was not necessarily always tightly coupled to relational databases (although it evolved often in lock step with them which is why you see them there more often).
HBase did not - the project has always been very clear that they cater towards a very specific set of use cases - fast writes with little schema constraints, fast single-key and range/fuzzy lookups, not big ETL pipelines.
Even during the rise of Hadoop (everything is a file... I mean file based!) and the subsequent absorption of that into the Public Cloud vendors, SQL has always been there, just wrapped in different tools. These days, someone else hosts it and it's now called Athena instead of Hive, but fundamentally the same thing and has been the same thing.
Even Apache Sparks entire Dataset/Datframe interface yields SQL-like execution plans, exposing the same functions that an RDMBS would, just in Scala/Python/R.
Also, never underestimate the power and speed of a well tuned SQL-family server/cluster, even at surprisingly large scale. A lot of use cases for the older “Hadoop Cluster” type stuff have been overtaken by these approaches. I’ve seen a lot of operations spend silly sums to build ultimately quite clunky Hadoop-based systems when really they probably just needed one half decent SQL admin and a well tuned cluster.
It's so easy and quick to get started, it should be most people's first choice.
https://gist.github.com/nomsolence/69bc0b5fe1ba943d82cd37fdb...
Being able to focus on the relationships without worrying about commas is nice.
I'm still writing the compiler (it's my first, I'm sure it will be awful), but I'm starting from the finish so it's been easy for me to pick up where I left off; I started by deciding the language, then writing the outputs by hand for the tokenizer, my two stages of AST, and the actual SQL.