The Rise of SQL-Based Data Modeling (opens in new tab)

(holistics.io)

192 pointshuy6y ago125 comments

125 comments

97 comments · 26 top-level

swalsh6y ago· 14 in thread

" Up till a few years ago, the traditional way of managing data (in SQL-based RDBMSs) was considered a relic of the past, as these systems couldn't scale to cope with such a huge amount of data. "

I must have missed the boat on this one. I remember in 2010 there was a brief period of time where NoSQL was in fashion, but it rightfully died pretty quickly to a small set of specialized use cases. There have been some cases where big data systems have replaced more traditional rdbms systems, but now you can use SQL for those too (like Hive SQL).

SQL is the one skill that has not become obsolete in the course of my career. Frankly I've started relying on it more, because it never goes obsolete. Also it's fast as hell. When I first started my career I did C#, and the .net Framework 2 was fairly new. Since then WinForms, and WebForms have gone away. ORM's changed, Javascript changed. Then I moved to Ruby, and Python, and PHP. Those ecosystems have evolved too. But the one thing that I learned 15 years ago, that I still use every day is SQL.

gwbas1c6y ago

> I must have missed the boat on this one. I remember in 2010 there was a brief period of time where NoSQL was in fashion, but it rightfully died pretty quickly to a small set of specialized use cases. There have been some cases where big data systems have replaced more traditional rdbms systems, but now you can use SQL for those too (like Hive SQL).

A few weeks ago I had a lead server-side architect call me up and ask if my client could tolerate eventual consistency. My answer was that "he could get away with that 10 years ago but that NoSQL went out of fashion and SQL came back in again."

To put it in a different context: Some people just go through periods in their life where they don't keep up and don't realize what was "new and exciting" 10 years ago turned out to be smoke and mirrors.

pdexter6y ago

What's the connection between NoSQL vs SQL and eventual consistency?

3 more replies

cle6y ago

NoSQL is very much alive and kicking, it’s just past the peak in the hype curve. There is a FAANG in particular who is notorious for banning relational databases outright because of their maintenance and scaling problems for their use cases.

As others have said, if you operate at a certain scale and can get away without complex joins and aggregations (which you should avoid anyway if you’re operating a high availability service), then NoSQL is a great option. It’s easier to scale out (often you just turn a knob) and its behavior is simpler and easier to predict.

2 more replies

jimbokun6y ago

The nugget of truth was that joins and transactions are very difficult to implement in a performant manner in large scale, distributed systems. So it was possible to write very highly scalable databases if you give up on "SQL", which was a stand in for joins and transactions.

Since then, developers have been coming up with clever ways to implement distributed joins and transactions more efficiently and scalably on top of distributed database engines, and thus "the return of SQL".

6gvONxR4sf7o6y ago

Where can I learn more about these clever developments?

2 more replies

collyw6y ago

The youngsters in my work don't really know SQL.

I am probably see as one of these old developers that refuses to get with the times (rather than seeing NoSQL for a step backwards that it is). Its this constant churn of bullshit that makes me want to do a different job after nearly 20 years (as well as the nonsense that is the hiring process).

mumblemumble6y ago

This, I think, is the true cost of technologies like ORM. And it's hard to notice, because it doesn't show up right away; it's a generational thing.

It reminds me a bit of a paper I saw recently on The Morning Paper called Ironies of Automation: https://blog.acolyer.org/2020/01/08/ironies-of-automation/

larrydag6y ago

Unfortunately for this data analyst NoSQL is alive and well. It is being rolled out to enterprise levels without the downstream impacts fully realized. This requires all sorts of ad-hoc ETL data methods which in turn puts more time and strain on budgets and resources. At least this is true in my organization.

baldfat6y ago

I think there is much more non-SQL then the OP realizes. I have been protesting non-SQL, SQL is more powerful than most realize. Also the term "Big Data" is being used 99% of the time about data bases I could load into my ram but counted in millions. This then causes people to search for things and find non-SQL.

2 more replies

exabrial6y ago

> brief period of time where NoSQL was in fashion, but it rightfully died pretty quickly

That what is not standardized, dies

oyoun6y ago

I totally agree with you. And SQL keep evolving

pastor_elm6y ago

So long as fresh faced developers arrive on the scene, there will always be people looking to get around using SQL.

oyoun6y ago

And database could scale up at this time: Pan-STARRS is one of the bigger database with PB

jstrong6y ago

> Also it's fast as hell

compared to what? excel?

dbatten6y ago· 8 in thread

> This implies that SQL is not reusable, causing similar code with slightly different logic to be repeated all over the place. For example, one cannot easily write a SQL ‘library’ for accounting purposes and distribute it to the rest of your team for reuse whenever accounting-related analysis is required.

Data scientist here. I think some of this problem is handled by effective use of views. Oh, everybody is constantly joining these three accounting-related tables and aggregating by, say, order number? Have your Data Engineer/DBA/analyst/whoever create a view that takes care of that. Boom. Now everybody's using the same data, calculated the same way, nobody's reinventing the wheel, and you don't have to worry about somebody fat-fingering something when they re-write that query for the 10th time.

With that being said, I still think there's some truth to this criticism, in that it's not as easy/common to be able to build an abstract query that does a common operation on arbitrary data. You can't import trend_forecast.sql, hand it arbitrary time-series data, and generate an N-month linear forecast from your historical data points. At least, not easily in ANSI SQL.

hobs6y ago

imo any SQL that requires regression (I have implemented least squares in TSQL w/ recursive CTEs) is probably a bad use case for SQL in general - projecting, filtering, grouping, and sorting is what SQL engines are great at - repeating row access or referencing the previous row on the next row and then running a function is not going to be fast in most of the databases.

To the point on views - they can seem really useful, but many people dont understand that views dont eliminate "useless code" - most engines are going to evaluate everything in the view, and the nested subviews, and of course, the 10 depth views that were created because they are such a great abstraction.

This is where (to me) other non-database programming languages definitely come in - you can codify codegen or other methods that are locked in, not just a huge string and get a lot of the same value as your views (unless of course, everyone is querying the db directly.) If that's the case, then templating out code is still a good plan, but you can usually accomplish it via the client tooling people are using (most of the DB IDEs have pretty decent snippet support.)

dbatten6y ago

Agreed on regression. You should probably be doing that level of analytical work in a stats package. Perhaps I should have chosen a better example to illustrate the "SQL doesn't facilitate creating abstract libraries that operate on arbitrary data" point.

infogulch6y ago

I've found views views wildly useful for defining a single definition of a complex heirarchy defined by business rules. Especially for reporting in 'pre-packaged' software with a fixed table structure. Often in these generic CRM or ERP systems it's surprisingly hard to actually get at the root cardinally-1 data and it's 1-n categories.

My strategy for these systems is to first create a view that captures the complete business' desired cardinality (which may be surprisingly complicated), and all the reporting views and stored procedures start with that view as the first table to join off of. Ideally you can implement it with left joins in such a way that allows you to use it everywhere you need any part of the heirarchy and the SQL engine will trim the parts that you don't use in each particular query that uses it.

With this you get the abstraction / reusability that enhances productivity and you get more reliable & reproducible results everywhere. It's much more likely that cross checking to separate reports works when everything is starting with the same cardinality.

tenken6y ago

Are you saying you build a completely de-normalized View of data? Could you please provide a small example?

wodenokoto6y ago

My problem with views is that you do the join before the filtering.

solidangle6y ago

Any query optimizer worth its salt will push the filters below the join.

mumblemumble6y ago

I'm curious, what DBMS are you using whose optimizer can't see through views?

For what it's worth, MSSQL, PostgreSQL and Oracle don't have that limitation.

1 more reply

buremba6y ago

Predicate pushdown is one of the first optimizations implemented in SQL planners and all the OLAP and OLTP solutions that I'm aware of already have this feature in place.

trollied6y ago· 8 in thread

Don't really get this article. SQL never went away or got less popular. Always been in the background doing its awesome thing.

mnky9800n6y ago

In data science it has been an uphill battle between the people who only think csv exists and those who switch to a new nosql flavor every 6 months

prepend6y ago

I think it’s artificially presented as an exclusive option between sql and csv/nosql/whatever in data science.

I know plenty of people who are comfortable in both. But I don’t know any “data science” people as good at sql as a good dba. But that’s ok.

I think a good analysis requires both sql and some flat file, if only for reproducibility. I’ve encountered many point in time analyses that can’t be recreated because they were just sql against db and there’s no way to reproduce the result months later. Are their methods accurate? Or auditable? It’s much harder because the source data is changed.

So I like talking about only-sqlers about the importantance of incorporating the data management of extracts and archives into the overall solution. And doing so in a way that doesn’t try to redesign the database trying to avoid the need for users to file away point in time snapshots of data.

overcast6y ago

CSV/Excel has run the world since the beginning, that's going to be a never ending battle. Large orgs are still entrenched into SQL when their enterprise ERP/MRP systems run off Oracle, and their analysts are extracting reports to Excel.

1 more reply

d4nt6y ago

I agree, NoSQL is still working it’s way along the hype curve but a huge amount of business data still lives in relational databases. The fact that there’s money to be made supporting this market is not big news.

reilly30006y ago

I think the big contender has been dataframe processing using Pandas, R and others. These have their own methods for doing the type of data munging and analysis that has been the traditional territory of SQL. I won’t soon forget the half day I plowed into getting the right Pandas pipeline dialed in only to realize I could have written a better SQL query to fetch the correct data to begin with.

prepend6y ago

I’m sure there will be an xkcd about this soon.

I’ve also tried to do things in sql only to realize that I could have done it easier in pandas.

I think it’s usually a mix of both for me, but I try to at least get a decent “tidy” dataset out of sql and then do all the manipulation in pandas because a python pipeline is more portable than a sql pipeline, I think.

angry_cactus6y ago

Google Trends confirms, even MongoDB is only a fraction of SQL.

collyw6y ago

Thank god. I am working with it now and its crap in comparison.

I wrote a fairly complex mango query a while back, and trying to understand it now is a nightmare. Compare that to SQL, which is one of the most readable languages out there in my opinion.

1 more reply

Twisell6y ago· 6 in thread

Being a total stubborn asshole to my boss over last decade because I refused to even investigate about how we could replace our old and "complicated" SQL integration/export workflows with a modern and "intuitive" visual proprietary ETL this article is a real relief for me!

I'm now officially a bleeding edge DevOps with 10 years expertise on the brand new "old school" ELT (Extract,Load, Transform).

LoL

barrkel6y ago

Complicated SQL is definitely complicated, though. Get an expected 1:1 join wrong (so it ends up 1:n) and duplicates muck things up; and typically unit tests don't catch these kinds of errors because the test set is minimal.

Long scripts of SQL with intermediate steps (e.g. building temp tables) are hard to reason about and debug, and especially painful when they fail only in rare production scenarios.

Developers find it a lot easier to reason on an item by item basis, and not on a set or batch basis, and it's very tricky to get error handling and transactions right on a batch basis unless you want to fail the whole batch.

Putting all the work on the database also means you need to scale up that layer. From another POV, it can be simpler to have cheap point lookup, secondary indexes in stores that are tuned for the specifics of the queries, and get performance back by scaling out the compute.

Twisell6y ago

1.If you rely on an ETL to automagically remove duplicates you are lying to yourself because you treat external symptoms, the illness is the underlying query and hiding this fact can only lead to more headaches later on.

2.Testing for duplicate is a SQL one-liner

3.If your data model is well conceived you can actually unwanted 1-n relationship to be inserted in the first place. Yeah sharding database is coming late to the game but ACID compliance is a killer feature. Code to manage inconsistencies in NoSQL is way less trivial as far as I know... unless it's ok to have an inconsistent database... Guess it depend on use case but ACID seems nice for any important dataset.

1 more reply

dkersten6y ago

Hah. I guess that happens semi regularly in our industry, that by being stubborn and sticking with old technologies you may one day wake up and find out its suddenly bleeding edge again (kind of, in some form, anyway)

pjmlp6y ago

Like writing microservices in Cobol on Amazon Cloud. :)

https://www.infoq.com/articles/migrate-mainframe-aws/

pjmlp6y ago

Yeah, that is why I long learned to be skeptical in a pop culture driven industry, always looking after the new fad to sell stuff.

kingludite6y ago

Like that it's funny to be a human. When all features have been added its so complicated one might as well start from scratch.... Wait I can put it even worse: What the Father Does, the Son Does - John 5:19-23

lincpa6y ago· 6 in thread

[Everything is RMDB](https://github.com/linpengcheng/PurefunctionPipelineDataflow...)

conceptoriented6y ago

The concept-oriented model tries to overcome some problems of RM by relying on two constructs: sets and functions. In contrast, RM uses only sets. The idea is that data can be stored in functions and transformed via operations with functions.

(0) https://www.researchgate.net/publication/337336089_Concept-o...

lincpa6y ago

In `Everything is RMDB`, Data is dominant, It combines the advantages of RMDB and NoSQL. and it emphasizes the following points:

```

It’s better to have 100 functions operate on one data structure than 10 functions on 10 data structures.

        ---- Alan Perlis

```

threeseed6y ago

Except when you have large, telemetry style datasets e.g. web/product analytics which won't fit. Or when you are trying to build a wide table and you run out of columns. Or when your favourite SaaS products gives you highly nested JSON data.

RDBMS works great up until the point that it doesn't.

jacques_chester6y ago

> Except when you have large, telemetry style datasets e.g. web/product analytics which won't fit.

Web analytics was one of the first applications for Greenplum. My understanding is that Yahoo collected tens of billions of events per day in the mid-2000s.

> Or when you are trying to build a wide table and you run out of columns.

HAWQ can run SQL queries over Hadoop clusters. Clickhouse's table width is limited by how much RAM you give it.

> Or when your favourite SaaS products gives you highly nested JSON data.

This is why major databases have JSON querying capabilities and why it's been added to the next SQL standard. PostgreSQL even allows you to define indices on fields inside your JSON structures.

Better yet: decompose the highly nested data. Relational databases begin to shine when you get past at least first normal form.

> RDBMS works great up until the point that it doesn't.

RDBMSes do work great until they don't. Which means they are almost always the best solution and almost always remain so.

Folks regularly overestimate the size of their problem and underestimate the capabilities of the literally dozens of RDBMSes now available for use. Yes, it irks me.

Disclosure: I work for VMware, which sponsors Greenplum development.

buremba6y ago

RDBMS doesn't work but columnar databases such as BigQuery, Snowflake, and Presto make it work. I agree that web/product analytics data is hard to get it done the right way thus people rely on specialized solutions such as Mixpanel and Amplitude but I believe that as companies started to be more privacy-concerned and want to fill the gap in between the product and other company data such as marketing, finance, etc, it will change over time. See how it's possible to run funnel, retention, segmentation queries in Firebase Analytics data on BigQuery with SQL: https://www.youtube.com/watch?v=7tXVG785_0M

zip12346y ago

Even Sqlite has json support these days...

harshaw6y ago· 4 in thread

This quote irks me: "Any tool that relies heavily on SQL is off-limits to business users who want more than viewing static charts and dashboards but don't know the language"

The good business folks know SQL and aren't afraid of it. I used not to be sure of this until I worked in an organization where most PM types use SQL with comfort.

IpV86y ago

Yea I'm not sure what the alternative would be. You can do some drag and drop dashboard creation or export small data sets to excel, but at some point you gotta get to SQL if you want all of the expressability.

anbotero6y ago

know SQL is kind of hard to describe. I also have business people that know SQL, yet don't use (or request, if they knew there are not) indexes. So we get this amalgamation of monstrous subqueries all hitting the same unoptimized fields, of course, blocking the database.

I created a mini-training for the onboarding process, so they could at least look for things without killing all our read replicas. But that’s just one step; some people are just not good at it.

Fear... fear is having everyone blocked because one person is killing all the replicas for them to work on.

nemothekid6y ago

That sounds more like an ops problem than a business one. I’ve worked with a number of a non-technical roles that dabble in SQL, and it’s almost always they are querying a data warehousing solution (e.g Redshift, Pig, BigQuery) that doesn’t support indexes to begin with and most times just have to live with the performance.

It’s not their job to optimize the database, and it’s no different from seeing that a dev pushed a new feature that causes bad performance as well

1 more reply

xiphias26y ago

The solution to this problem should be setting up resource quotas.

That person should reach its quota limits and ask you to help him optimize the query.

1 more reply

blowski6y ago· 4 in thread

I don't really understand what this article is trying to say, other than "isn't Looker great!".

Having timely and detailed data available to a wide range of people in the organisation is now seen as a competitive advantage in many industries. There's a lot of tech out there to help with this.

But I could have written this article talking about how companies like Reltio are relying on NoSQL solutions to "empower the enterprise", or how Firebase is allowing startups to not worry about data structures, or how HSBC is deploying blockchain solutions, or how Spark means you can combine data from all different sources. It would still be just as accurate and meaningful. As it is, it just sounds like an infomercial for Looker.

contravariant6y ago

The weirdest part to me is that they went from 'Everyone is back to using SQL because it's good at querying and you don't need to learn a proprietary language' to 'Let's use LookML instead'.

buremba6y ago

Correct but the problem with raw SQL is that it's not composable easily. LookML essentially provides that composability on top of SQL.

2 more replies

willvarfar6y ago

Its a blogvertisment by Holistics, which is a competitor to Looker.

blowski6y ago

Oh, I've never heard of them, thanks for the heads up. Explains why it reads like an advert - because it is.

1 more reply

ckastner6y ago· 4 in thread

The article doesn't mention performance.

I haven't benchmarked this yet, but after my first experiences with somewhat complex data transformations in numpy and pandas, I was left with the feeling that despite them being optimized, any modern RDBMS would still have run circles around them.

They've been optimized to this kind of stuff for decades, after all.

barrkel6y ago

Most modern RDBMS are optimized for point-lookup queries and have a very hard time choosing good plans when you're selecting millions of rows amongst billions.

Remove the assumption that the end result is going to be a row or ten (which usually have good obvious plans with clear indexes), and the query planner is forced to make decisions about choosing the best plan amongst very costly alternatives. The exact costs are sensitive to assumptions about distribution, I/O costs, memory costs etc. and those assumptions need delicate tuning, especially when you don't have hinting as a tool (e.g. Postgres basically doesn't let you give many hints - you can kind of force join order, but the biggest tool is the CTE optimization barrier - I can get 100x speedups by moving subqueries out into CTEs and ensuring they're opaque).

To write efficient big queries, you need to think in terms of data flow and write declarative SQL with an execution plan in mind; it reminds me of my days tweaking source code to encourage a compiler to make certain register allocation decisions.

PG is a bitch here. It's often easier to get good performance by fetching one query, transforming it into a comma-delimited string, and then injecting it into the subsequent query (possibly in batches) to be sure you're getting the plan you need.

MySQL's planner is kind of stupid, but it's predictably stupid and there's lots of hinting tools available. I find it a lot easier to make it work fast for big queries. Sometimes it simply doesn't implement the best strategy (e.g. no hash joins), of course, so sometimes it's not as fast as you could get with PG, but it's usually easier to outperform.

keanzu6y ago

> To write efficient big queries, you need to think in terms of data flow and write declarative SQL with an execution plan in mind; it reminds me of my days tweaking source code to encourage a compiler to make certain register allocation decisions.

That mirrors my experience with Oracle. The only difference is that there's a hinting mechanism provided which will almost always allow you to force the execution plan you want.

1 more reply

oarabbus_6y ago

>Most modern RDBMS are optimized for point-lookup queries and have a very hard time choosing good plans when you're selecting millions of rows amongst billions.

Any column store warehouse (Redshift, Bigquery, Snowflake) is terrific for these kinds of queries, though.

jstrong6y ago

> I haven't benchmarked this yet

if you don't measure, you're just guessing.

> after my first experiences with somewhat complex data transformations in numpy and pandas

there are fast ways and slow ways to use numpy/pandas, but generally speaking, it's easy to get order of magnitude improvements using pandas compared to an RDBMS.

In pandas, for example, the data from a column is a contiguous in-memory array. Generally RDBMS data is row-oriented, possibly in memory or possibly on disk. Performing some numerical operation on the contiguous array is going to play to the cpu's strengths much better.

cjf46y ago· 3 in thread

One of the things I'm constantly baffled by given the growth of data science and analytics is that data modelling isn't treated as a first class concern. It's absolutely fundamental to doing quantitative work efficiently at an organization with any amount of complexity, yet the majority of people in the field seem to be unaware of the concepts.

This ignorance is especially surprising given that it's essentially a solved problem (Kimball), yet if you talk about data modelling, people usually think regression, not schema.

conceptoriented6y ago

I agree that data modeling is underestimated but it hardly can be considered a solved problem. It is very hard because there are numerous alternative understandings and formal defs of what we mean by data (RM, 00, OR, MD etc.) In additiin, there are several levels of representation (physical, logical, semantic). In real projects, they all are mixed.

boublepop6y ago

Could you elaborate on why you think data-modeling is a solved problem? It’s seems like an opinion warranting more than a one word explanation.

mumblemumble6y ago

I'm guessing that the implication was that Kimball's dimensional modeling techniques (so, in a nutshell, snowflake and star schemata) will get the job done in almost any case.

I'm not sure I would 100% agree with that - e.g., denormalization, while useful for many things, isn't always your best option. But I would say that that there are a lot of tools in the box, and that is absolutely one of the critical ones to know.

2 more replies

cube22226y ago· 3 in thread

Hey, just wanted to chime with a tool I'm a co-author of, OctoSQL[1].

I too very like a common interface to various data sources. OctoSQL allows you to use plain SQL to transform, analyse and join data from various datasources, currently including MySQL, PostgreSQL, Redis, JSON files, CSV, Excel.

However, we're very inspired with yesteryears paper "One SQL to rule them all" and should have ergonomic steaming support with Kafka as a new datasource available very soon.

[1]: https://github.com/cube2222/octosql

keanzu6y ago

q - Text as Data

q is a command line tool that allows direct execution of SQL-like queries on CSVs/TSVs (and any other tabular text files).

q treats ordinary files as database tables, and supports all SQL constructs, such as WHERE, GROUP BY, JOINs etc. It supports automatic column name and column type detection, and provides full support for multiple encodings.

http://harelba.github.io/q/

marvel_boy6y ago

Do you plan support for Sqlite?

cube22226y ago

I see no reason to not add it, but we're currently busy so it'll have to wait till we either have time or somebody contributes it. Should be an hour of copypasting one of the sql data sources and adapting it for sqlite.

exabrial6y ago· 3 in thread

Does anyone have any ML Packages that directly integrate with SQL Databases? I feel like the ETL market is well covered with a myriad of tools, but our data scientists at the end of the day still want to extract giant CSVs b/c their python progs "just don't work natively" on an RDBMS (they probably do, but it's not the way it's done, and these guys aren't programmers for their first job anyway).

olooney6y ago

They exist, they're just not really state-of-the-art.

Apache MADLib[1] for PostgreSQL and Greenplum. Everything is done inside of SQL; for example, to fit a linear regression model, you execute the query "SELECT madlib.linregr_train(...)".

MLlib[2] for Spark now has a DataFrame based API (spark.ml) for easy integration with Spark SQL.

Every such library I've seen has been a little behind the times, usually offering nothing more advanced than logistic regression, random forest, maybe a multi-layer neural network. I've also never seen a SQL-based ML library that offered GPU training.

For reference, state-of-the-art neural network libraries like TensorFlow or PyTorch support GPU-based training (something like an order of magnitude improvment in training performance) and automatic differentiation (allowing for easy specification of complex models.) For trees, state-of-the-art now includes boosted tree algorithms like XGBoost or CatBoost. Random forest is still a workhorse, but a library which only provides random forest (like the two mentioned above) are a little behind the curve.

It seems that libraries that focus on doing one thing well (ML) are able to offer more features and have an easier time keeping up with times than libraries that also take on the burden of SQL integration. Which isn't to say MADlib and Spark MLlib are bad - not everyone needs state-of-the-art algorithms all the time, and it can be convenient to be able to do so simple things fully inside the SQL environment - but they're never going to be cutting edge, so they're not going to get much attention.

[1] https://madlib.apache.org/index.html

[2] http://spark.apache.org/docs/latest/ml-guide.html

sixdimensional6y ago

SQL Server has had standard ML functions in SSAS for years “SQL Server Data Mining” extensions. In fact, I recall it may have been the first relational database platform to actually do this. It is obviously a commercial DB though.

mmeasic6y ago

I believe that more and more providers will offer ML inside the database. You can check what Oracle is doing.

Aside from that, some of the simple models can be fitted using SQL.

iblaine6y ago· 2 in thread

The problem with modeling data isn't the lack of tools or the need to approach this problem differently. Plenty of solutions exist. People just don't care, and this is ok.

It used to be that you needed tools like Informatica and Kimball inspired datamarts, but databases are now bigger and faster. Whatever data modeling problems you may have, you can easily clean up in an ETL or a BI layer, with relatively little effort. This makes tools like Looker, dbt, and Holistics a luxury and not something you need to have. I wish the industry would put more effort into defining clean data models, but I think that ship has sailed. The prevailing trend seems to be to create Data Lakes, add a BI layer, then call it a day.

[edit] Also...some database points. The industry never shifted to using NoSQL to replace RDBMS systems. But event processing matured, NoSQL db's are ideal for storing unstructured data, so you see them in data engineering stacks. Greenplum is a free MPP database that has been around for nearly a decade. The point about Spanner SQL is interesting for the fact that Spanner evolved from NoSQL like methods to a SQL like dialect, but Spanner is a unique flower in the industry, due to being an HTAP db.

huyOP6y ago

Can you elaborate more on what you meAn by “creating clean data models”?

iblaine6y ago

Using Inmon or Kimball's approach to modeling data structures is clean. The thing with modeling data is there's no one right way but there are many wrong ways.

threeseed6y ago· 2 in thread

Just a slight conflict of interest having a company that makes SQL-based data modelling tools telling us that SQL-based data modelling in on the rise.

But given how many companies have setup data lakes with unstructured and semi-structured data (think SaaS exports) and how SQL layers have largely been unimpressive not sure it's the case.

tkyjonathan6y ago

And a lot of those lakes remain untouched for years..

EdwardDiego6y ago

It's not too far from datalake to dataswamp.

oneofthose6y ago· 1 in thread

DBT (data build tool) [0] embraces this idea - it's like make for data transformation. Just like make its syntax is sub-optimal. But that's the only draw-back. There is an open source version, it generates documentation automatically, you can materialize tables on a schedule, it allows you to write unit tests for your data ("this select must return 0 rows" kind of tests). I'm not affiliated with them, just happy user.

[0] https://www.getdbt.com/

mmeasic6y ago

Yes, that was my first thought when I opened the article. dbt is amazing and basically converts your SELECT statements to CREATE TABLE statements.

I guess they target both EL from the ELT, while dbt relies on data being in your data warehouse.

__ian__6y ago· 1 in thread

BigQuery is good. Looker is OK. This reads like paid-for PR content.

keanzu6y ago

It's a blog page on holistics.io so it is paid-for PR content. Or a blogvert.

sashavingardt26y ago· 1 in thread

This article was written by yet another kid a few years out of college who doesn't know the history, the tooling and thinks he and his teammates are providing a solution to an existing problem. What he doesn't realize is that SQL reusability and data modeling have been solved 20 years ago.

jszymborski6y ago

Have you seen their tool? Without having used it, it looks pretty slick. I think you're being a little too dismissive, considering "It's All Been Done Before(TM)" and that they seem to be building a rather nice UI for BI... a business with multiple players all making a pretty penny.

IpV86y ago· 1 in thread

I'm surprised to see only a cursory mention of Snowflake in this article. In my experience, they are really the pioneers of the new distributed, cloud-first database. They really enabled large scale relational data warehouses, and are still miles ahead of even the big cloud players.

edmundsauto6y ago

What is it you like about them? IIRC, their base plan starts at $1500/mo, so it's a high end offering that I haven't played with too much.

sixdimensional6y ago

It’s a little bit off topic, but the tone of this discussion brings up an interesting conversation I’ve been having with some of my colleagues in the same age group - a perceived skills gap between the newest developers and ourselves.

I don’t mean that in a negative way. I mean it in the sense that many of the newest developers don’t know where their cloud based NoSQL database came from (for example). They never were taught the history of what came before, during and after RDBMS. They are only now rediscovering some of what the “old” tools could do.

Many of these developers seem eager to learn, and I am happy to mentor them and teach them the history that I know.

But it has surprised me, it almost feels a little like so many years of waves of marketing and hype maybe actually had a real detriment to teaching people what is real, what is the best tool for the job in different cases, etc.

I have no real evidence other than my anecdotal experience, but this article lends credence to the argument that some never learned or never were given the time to understand the discipline of databases.

Possibly, the discipline of databases and related development has just been continually developing and never settled, so that is why the curriculum hasn’t kept up. But, it really does concern me when a new developer doesn’t even understand what a JOIN is.

Edit: Or even moreso that SQL is an interface to a data engine, and was not necessarily always tightly coupled to relational databases (although it evolved often in lock step with them which is why you see them there more often).

0x50026y ago

> Instead, NoSQL systems like HBase, Cassandra, and MongoDB became fashionable as they marketed themselves as being scalable and easier to use

HBase did not - the project has always been very clear that they cater towards a very specific set of use cases - fast writes with little schema constraints, fast single-key and range/fuzzy lookups, not big ETL pipelines.

Even during the rise of Hadoop (everything is a file... I mean file based!) and the subsequent absorption of that into the Public Cloud vendors, SQL has always been there, just wrapped in different tools. These days, someone else hosts it and it's now called Athena instead of Hive, but fundamentally the same thing and has been the same thing.

Even Apache Sparks entire Dataset/Datframe interface yields SQL-like execution plans, exposing the same functions that an RDMBS would, just in Scala/Python/R.

code4tee6y ago

Yes. Plus all the SQL-like things you can do now. For example storing data on S3 and then querying it with AWS Athena is a simple and powerful way to keep a huge archive of data that you may want to query at some point.

Also, never underestimate the power and speed of a well tuned SQL-family server/cluster, even at surprisingly large scale. A lot of use cases for the older “Hadoop Cluster” type stuff have been overtaken by these approaches. I’ve seen a lot of operations spend silly sums to build ultimately quite clunky Hadoop-based systems when really they probably just needed one half decent SQL admin and a well tuned cluster.

davewritescode6y ago

I think part of the real shift back to SQL is in part because of the fact that modern streaming platforms like Kafka give relatively simple mechanisms to implement eventually consistent databases. Aside from the simple key-value store use case, this was often a very good reason to use something like Cassandra instead of MySQL.

lostsoul82826y ago

Great article. SQL is so widely known and relatively simple to implement that I'm baffled business don't start with it as a solution and then work their way into our solutions if they find it doesn't meet their needs.

It's so easy and quick to get started, it should be most people's first choice.

kohtatsu6y ago

A few days ago I designed a DSL for CREATE TABLE statements.

https://gist.github.com/nomsolence/69bc0b5fe1ba943d82cd37fdb...

Being able to focus on the relationships without worrying about commas is nice.

I'm still writing the compiler (it's my first, I'm sure it will be awful), but I'm starting from the finish so it's been easy for me to pick up where I left off; I started by deciding the language, then writing the outputs by hand for the tokenizer, my two stages of AST, and the actual SQL.

32gbsd6y ago

Looks like a promo

tkyjonathan6y ago

Already started a DataOps team in my company!

cryptonector6y ago

That's because SQL is pretty awesome.

j / k navigate · click thread line to collapse

125 comments

97 comments · 26 top-level

swalsh6y ago· 14 in thread

" Up till a few years ago, the traditional way of managing data (in SQL-based RDBMSs) was considered a relic of the past, as these systems couldn't scale to cope with such a huge amount of data. "

gwbas1c6y ago

pdexter6y ago

What's the connection between NoSQL vs SQL and eventual consistency?

3 more replies

cle6y ago

2 more replies

jimbokun6y ago

6gvONxR4sf7o6y ago

Where can I learn more about these clever developments?

2 more replies

collyw6y ago

The youngsters in my work don't really know SQL.

mumblemumble6y ago

This, I think, is the true cost of technologies like ORM. And it's hard to notice, because it doesn't show up right away; it's a generational thing.

It reminds me a bit of a paper I saw recently on The Morning Paper called Ironies of Automation: https://blog.acolyer.org/2020/01/08/ironies-of-automation/

larrydag6y ago

baldfat6y ago

2 more replies

exabrial6y ago

> brief period of time where NoSQL was in fashion, but it rightfully died pretty quickly

That what is not standardized, dies

oyoun6y ago

I totally agree with you. And SQL keep evolving

pastor_elm6y ago

So long as fresh faced developers arrive on the scene, there will always be people looking to get around using SQL.

oyoun6y ago

And database could scale up at this time: Pan-STARRS is one of the bigger database with PB

jstrong6y ago

> Also it's fast as hell

compared to what? excel?

dbatten6y ago· 8 in thread

hobs6y ago

dbatten6y ago

infogulch6y ago

tenken6y ago

Are you saying you build a completely de-normalized View of data? Could you please provide a small example?

wodenokoto6y ago

My problem with views is that you do the join before the filtering.

solidangle6y ago

Any query optimizer worth its salt will push the filters below the join.

mumblemumble6y ago

I'm curious, what DBMS are you using whose optimizer can't see through views?

For what it's worth, MSSQL, PostgreSQL and Oracle don't have that limitation.

1 more reply

buremba6y ago

Predicate pushdown is one of the first optimizations implemented in SQL planners and all the OLAP and OLTP solutions that I'm aware of already have this feature in place.

trollied6y ago· 8 in thread

Don't really get this article. SQL never went away or got less popular. Always been in the background doing its awesome thing.

mnky9800n6y ago

In data science it has been an uphill battle between the people who only think csv exists and those who switch to a new nosql flavor every 6 months

prepend6y ago

I think it’s artificially presented as an exclusive option between sql and csv/nosql/whatever in data science.

I know plenty of people who are comfortable in both. But I don’t know any “data science” people as good at sql as a good dba. But that’s ok.

overcast6y ago

1 more reply

d4nt6y ago

reilly30006y ago

prepend6y ago

I’m sure there will be an xkcd about this soon.

I’ve also tried to do things in sql only to realize that I could have done it easier in pandas.

angry_cactus6y ago

Google Trends confirms, even MongoDB is only a fraction of SQL.

collyw6y ago

Thank god. I am working with it now and its crap in comparison.

I wrote a fairly complex mango query a while back, and trying to understand it now is a nightmare. Compare that to SQL, which is one of the most readable languages out there in my opinion.

1 more reply

Twisell6y ago· 6 in thread

I'm now officially a bleeding edge DevOps with 10 years expertise on the brand new "old school" ELT (Extract,Load, Transform).

LoL

barrkel6y ago

Long scripts of SQL with intermediate steps (e.g. building temp tables) are hard to reason about and debug, and especially painful when they fail only in rare production scenarios.

Twisell6y ago

2.Testing for duplicate is a SQL one-liner

1 more reply

dkersten6y ago

pjmlp6y ago

Like writing microservices in Cobol on Amazon Cloud. :)

https://www.infoq.com/articles/migrate-mainframe-aws/

pjmlp6y ago

Yeah, that is why I long learned to be skeptical in a pop culture driven industry, always looking after the new fad to sell stuff.

kingludite6y ago

lincpa6y ago· 6 in thread

[Everything is RMDB](https://github.com/linpengcheng/PurefunctionPipelineDataflow...)

conceptoriented6y ago

(0) https://www.researchgate.net/publication/337336089_Concept-o...

lincpa6y ago

In `Everything is RMDB`, Data is dominant, It combines the advantages of RMDB and NoSQL. and it emphasizes the following points:

```

It’s better to have 100 functions operate on one data structure than 10 functions on 10 data structures.

        ---- Alan Perlis

```

threeseed6y ago

RDBMS works great up until the point that it doesn't.

jacques_chester6y ago

> Except when you have large, telemetry style datasets e.g. web/product analytics which won't fit.

Web analytics was one of the first applications for Greenplum. My understanding is that Yahoo collected tens of billions of events per day in the mid-2000s.

> Or when you are trying to build a wide table and you run out of columns.

HAWQ can run SQL queries over Hadoop clusters. Clickhouse's table width is limited by how much RAM you give it.

> Or when your favourite SaaS products gives you highly nested JSON data.

This is why major databases have JSON querying capabilities and why it's been added to the next SQL standard. PostgreSQL even allows you to define indices on fields inside your JSON structures.

Better yet: decompose the highly nested data. Relational databases begin to shine when you get past at least first normal form.

> RDBMS works great up until the point that it doesn't.

RDBMSes do work great until they don't. Which means they are almost always the best solution and almost always remain so.

Folks regularly overestimate the size of their problem and underestimate the capabilities of the literally dozens of RDBMSes now available for use. Yes, it irks me.

Disclosure: I work for VMware, which sponsors Greenplum development.

buremba6y ago

zip12346y ago

Even Sqlite has json support these days...

harshaw6y ago· 4 in thread

This quote irks me: "Any tool that relies heavily on SQL is off-limits to business users who want more than viewing static charts and dashboards but don't know the language"

The good business folks know SQL and aren't afraid of it. I used not to be sure of this until I worked in an organization where most PM types use SQL with comfort.

IpV86y ago

anbotero6y ago

I created a mini-training for the onboarding process, so they could at least look for things without killing all our read replicas. But that’s just one step; some people are just not good at it.

Fear... fear is having everyone blocked because one person is killing all the replicas for them to work on.

nemothekid6y ago

It’s not their job to optimize the database, and it’s no different from seeing that a dev pushed a new feature that causes bad performance as well

1 more reply

xiphias26y ago

The solution to this problem should be setting up resource quotas.

That person should reach its quota limits and ask you to help him optimize the query.

1 more reply

blowski6y ago· 4 in thread

I don't really understand what this article is trying to say, other than "isn't Looker great!".

Having timely and detailed data available to a wide range of people in the organisation is now seen as a competitive advantage in many industries. There's a lot of tech out there to help with this.

contravariant6y ago

The weirdest part to me is that they went from 'Everyone is back to using SQL because it's good at querying and you don't need to learn a proprietary language' to 'Let's use LookML instead'.

buremba6y ago

Correct but the problem with raw SQL is that it's not composable easily. LookML essentially provides that composability on top of SQL.

2 more replies

willvarfar6y ago

Its a blogvertisment by Holistics, which is a competitor to Looker.

blowski6y ago

Oh, I've never heard of them, thanks for the heads up. Explains why it reads like an advert - because it is.

1 more reply

ckastner6y ago· 4 in thread

The article doesn't mention performance.

They've been optimized to this kind of stuff for decades, after all.

barrkel6y ago

Most modern RDBMS are optimized for point-lookup queries and have a very hard time choosing good plans when you're selecting millions of rows amongst billions.

keanzu6y ago

That mirrors my experience with Oracle. The only difference is that there's a hinting mechanism provided which will almost always allow you to force the execution plan you want.

1 more reply

oarabbus_6y ago

>Most modern RDBMS are optimized for point-lookup queries and have a very hard time choosing good plans when you're selecting millions of rows amongst billions.

Any column store warehouse (Redshift, Bigquery, Snowflake) is terrific for these kinds of queries, though.

jstrong6y ago

> I haven't benchmarked this yet

if you don't measure, you're just guessing.

> after my first experiences with somewhat complex data transformations in numpy and pandas

there are fast ways and slow ways to use numpy/pandas, but generally speaking, it's easy to get order of magnitude improvements using pandas compared to an RDBMS.

cjf46y ago· 3 in thread

This ignorance is especially surprising given that it's essentially a solved problem (Kimball), yet if you talk about data modelling, people usually think regression, not schema.

conceptoriented6y ago

boublepop6y ago

Could you elaborate on why you think data-modeling is a solved problem? It’s seems like an opinion warranting more than a one word explanation.

mumblemumble6y ago

I'm guessing that the implication was that Kimball's dimensional modeling techniques (so, in a nutshell, snowflake and star schemata) will get the job done in almost any case.

2 more replies

cube22226y ago· 3 in thread

Hey, just wanted to chime with a tool I'm a co-author of, OctoSQL[1].

However, we're very inspired with yesteryears paper "One SQL to rule them all" and should have ergonomic steaming support with Kafka as a new datasource available very soon.

[1]: https://github.com/cube2222/octosql

keanzu6y ago

q - Text as Data

q is a command line tool that allows direct execution of SQL-like queries on CSVs/TSVs (and any other tabular text files).

http://harelba.github.io/q/

marvel_boy6y ago

Do you plan support for Sqlite?

cube22226y ago

exabrial6y ago· 3 in thread

olooney6y ago

They exist, they're just not really state-of-the-art.

Apache MADLib[1] for PostgreSQL and Greenplum. Everything is done inside of SQL; for example, to fit a linear regression model, you execute the query "SELECT madlib.linregr_train(...)".

MLlib[2] for Spark now has a DataFrame based API (spark.ml) for easy integration with Spark SQL.

[1] https://madlib.apache.org/index.html

[2] http://spark.apache.org/docs/latest/ml-guide.html

sixdimensional6y ago

mmeasic6y ago

I believe that more and more providers will offer ML inside the database. You can check what Oracle is doing.

Aside from that, some of the simple models can be fitted using SQL.

iblaine6y ago· 2 in thread

The problem with modeling data isn't the lack of tools or the need to approach this problem differently. Plenty of solutions exist. People just don't care, and this is ok.

huyOP6y ago

Can you elaborate more on what you meAn by “creating clean data models”?

iblaine6y ago

Using Inmon or Kimball's approach to modeling data structures is clean. The thing with modeling data is there's no one right way but there are many wrong ways.

threeseed6y ago· 2 in thread

Just a slight conflict of interest having a company that makes SQL-based data modelling tools telling us that SQL-based data modelling in on the rise.

But given how many companies have setup data lakes with unstructured and semi-structured data (think SaaS exports) and how SQL layers have largely been unimpressive not sure it's the case.

tkyjonathan6y ago

And a lot of those lakes remain untouched for years..

EdwardDiego6y ago

It's not too far from datalake to dataswamp.

oneofthose6y ago· 1 in thread

[0] https://www.getdbt.com/

mmeasic6y ago

Yes, that was my first thought when I opened the article. dbt is amazing and basically converts your SELECT statements to CREATE TABLE statements.

I guess they target both EL from the ELT, while dbt relies on data being in your data warehouse.

__ian__6y ago· 1 in thread

BigQuery is good. Looker is OK. This reads like paid-for PR content.

keanzu6y ago

It's a blog page on holistics.io so it is paid-for PR content. Or a blogvert.

sashavingardt26y ago· 1 in thread

jszymborski6y ago

IpV86y ago· 1 in thread

edmundsauto6y ago

What is it you like about them? IIRC, their base plan starts at $1500/mo, so it's a high end offering that I haven't played with too much.

sixdimensional6y ago

Many of these developers seem eager to learn, and I am happy to mentor them and teach them the history that I know.

0x50026y ago

> Instead, NoSQL systems like HBase, Cassandra, and MongoDB became fashionable as they marketed themselves as being scalable and easier to use

Even Apache Sparks entire Dataset/Datframe interface yields SQL-like execution plans, exposing the same functions that an RDMBS would, just in Scala/Python/R.

code4tee6y ago

davewritescode6y ago

lostsoul82826y ago

It's so easy and quick to get started, it should be most people's first choice.

kohtatsu6y ago

A few days ago I designed a DSL for CREATE TABLE statements.

https://gist.github.com/nomsolence/69bc0b5fe1ba943d82cd37fdb...

Being able to focus on the relationships without worrying about commas is nice.

32gbsd6y ago

Looks like a promo

tkyjonathan6y ago

Already started a DataOps team in my company!

cryptonector6y ago

That's because SQL is pretty awesome.

j / k navigate · click thread line to collapse