- create tables, update the schema, insert rows, add an index
- select, filters, joins, order by, limit, inner queries
It takes forever to be comfortable with:
- anything that involves summarizing, grouping, having, min, max, windows
The other tip is to sketch the problem in excel/google sheets when it gets hairy. Not the actual code (I don't have a clue how to do that, others have), just the values in the different steps. In the end it is only about rows and columns.
But that said, these days a lot of it happens intuitively for me, I pretty much know the solution before I can spell it out. It certainly was not like that when I started.
When you begin, "programming without for loops" feels like programming with your right hand tied behind your back. But in hindsight you get a lot of exercise in the immutable paradigms of functional programming, working with comprehensions, sets, maps folds comes very natural.
Oh and lest I forget you can't just re-use a CTE in another query. But you can of course re-use a view. Also given what another user here remarked, Postgres might internally treat (and optimize) CTEs like views, so to me that makes views superior to CTEs in more or less all respects.
I've used SQL enough to have to write nested queries, haven't dove further than that.
For starters, i can never actually tests parts of those queries without rewriting the query up to the part that i want to test, for example:
WITH
query_one AS (SELECT ...),
query_two AS (SELECT ...),
query_three AS (SELECT ...)
SELECT ... /* main query */
If i want to test the second query, i need to take the first and second ones, copy them into a new worksheet and then rewrite the second one not to have the alias but instead be the main query. This is annoying when you have 5-10 CTEs and you need to test something in the middle.Then, working with SQL and CTEs feels like going back from a language where functions are first class citizens to one where no such thing exists, just in regards to querying data. It would be nice if i could store parts of queries under packages, to be able to write dynamic SQL more easily, instead of having to use tools like myBatis for this purpose: https://mybatis.org/mybatis-3/sqlmap-xml.html (see the bit about SQL fragments)
So i'd like to do the following:
PACKAGE my_snippets BODY IS
SNIPPET query_one
SELECT ... /* probably 500 lines long but often used snippet */
END query_one;
END my_snippets;
/* and then, somewhere in code */
WITH
query_one AS my_snippets.query_one,
query_two AS (SELECT ...),
query_three AS (SELECT ...)
SELECT ... /* main query */
Now, you might suggest that using views works for this intent, but what about most DBMSes out there having silly naming rules and restrictions? I don't want to work with v_mtz_wg_priv_prod_attr because someone thought that having just a few dozen characters makes sense as a restriction. Furthermore, you really can't group views into logical packages based on their intent, now can you? So, with views you end up with something that's very much like your cluttered list of tables, which gets really hard to get a good overview of when you have about 300 of them.Next up, debugging in databases is just really bad. How am i supposed to put logging in the queries, without mixing the logging code with the other triggers and tables? What about debugging long running processes? What about adding breakpoints that i can trigger when a particular view or table is accessed? What about doing this on the server while i have a local app instance connected to the DB, or maybe even another app server? Why can't i step through the query execution and see how the filtered record count changes with each "step"?
Apart from that, my problems are largely with the tooling around databases. There are relatively few universal (cross language) DB migration solutions out there, for example dbmate, every framework seems to have its own approach. There seems to be this odd division between procedural SQL and regular SQL statements, where what you can do differs based on context, which is inconsistent. Procedural languages as a whole vary wildly in what they can do - you won't be doing complex logic with custom types on MySQL/MariaDB anytime soon, whereas Oracle or PostgreSQL will suffice. But even those two have different dialects, it's never "just SQL". There are oddities with selecting certain kinds of data, only pgAdmin seems to work nicely with geospatial data, but apart from that i've also seen problems with using lower level JDBC logic which you can't really test outside of the app, in something like SQL Developer. But even apart from that, as much as we like ER diagrams, MySQL Workbench is the only tool that i've seen which allows you to actually do model driven development properly and synchronize schemas and do forward/backward engineering - even pgAdmin fails at doing this. Oh, and the tools themselves are really inconsistent - you'll see a world of difference between MySQL Workbench, pgAdmin, SQL Developer, JetBrains DataGrip and others.
And now those DBMSes are attempting to add more functionality, such as exposing REST interfaces, instead of fixing the underlying and dated problems, because people out there are relying on those and therefore the logic is set into stone. It's no wonder that every year there's a new product or two that attempt to improve upon these, even if most of the time those products die out.
Perhaps the above is a stream of consciousness with some annoying things that i've dealt with over the years, but personally, relational databases are something that i use because they're often the least horrible tool for the job, even if they are not pleasant or easy to use, at least as easy as they should be. That's where i think the main problem lies - tools should be good for solving the problems on which they'll be used, these ones aren't.
Someone with 20 years of experience might have a different outlook, but personally i'd suggest that you utilize DBMSes for what they're good for - storing, retrieving and manipulating data and don't get too carried away with in database processing otherwise, since doing certain things within the app code seems to scale horizontally far more easier in some situations, has better auditability, debugging etc.
To make a non-trivial SQL query scale to non-trivial amounts of data, you have to understand the physical data organization and how query optimizer is likely to use it, which is kind of contradictory to the idea of the SQL as a "declarative" language where you just say what you want, and let the query optimizer figure out how to get it.
Instead, you have to design your indexes carefully to coax the optimizer into choosing a reasonable access path for your particular query. And do the same for all queries where performance is important.
Indexes are fundamentally not about data, but about access patterns. Which is what the developers are responsible for. That's why physical database design is a development task, not database administration task.
And I always forget which join does what.
However WINDOW queries definitely have a learning curve. Not the least because useful examples almost always require you to use a nested query.
Another one that caught me by surprise was NULL vs unknown[1]. That bit me in a couple of queries.
I found that the most important success factors in learning SQL is the analytical thinking of the trainee and the way the trainer is explaining the concepts, in what order and what examples are used (the best examples are the ones the trainees meet in their regular work).
The functions are simple, the only difficulty is to remember the ones that are not used often enough (ex: some window functions). Even in that case, a quick check in the documentation is enough to get up to speed. The major difficulty with SQL is to write efficient queries on large data volumes, covered by the right indexes. This is very specific to each RDBMS, especially because of the tools helping with the work are specific (ex: SSMS, SQL Sentry Plan Explorer, statistics parser etc).
select c.cid, c2.cid
from customer as c
inner join customer as c2 on c.street = c2.street
where c.city <> c2.city
though that has reflective duplicates say (1, 5) would also have (5, 1) in the output. So I'm not sure if that's "allowed"Was this done on paper? Typing ≠ takes some doing.
> Participants wrote their notes and answers on paper, which they showed in front of the webcam.
Yes it seems they did
Moreover, IDE would have helped those students for sure: SQL keywords are highlighted in different color + autocomplete.
First, they're there, and it's absolutely wonderful! I use far more semantically accurate Unicode rather than lossy ASCII approximations than I did back in my old Windows days. (If you don't know the special characters you can get, turn on Keyboard Viewer and whack your keyboard, especially modifier keys, a bit.)
But … I can't customise them. Even back in the days when macOS was OS X and believed in user customisation, these specific shortcuts were frozen and un-customizable. (Like the folder shortcuts in Finder. Maybe it makes sense to you for CMD-SHIFT-D to open the Downloads folder, not the Desktop. Too bad!)
(Boy, I hope I'm wrong and someone will come along and explain my stupidity to me.)
-------
Rather than improve how SQL is taught (which seems to be the paper's objective), why not improve SQL so it isn't as horrible to try to learn in the first place?
The barriers to grokking SQL could be lowered considerably if SQL made minor adjustments like moving the projection part of a SELECT query to being below or syntactically after the WHERE clause instead of being at the top, and making SQL more "natural" to write-in without needing excessively verbose inner-derived-table expressions when all you want is to do perform some repetitive calculation which will be reused in later query steps.
Also, the GROUP BY clause really needs to be renamed to "AGGREGATE BY" or similar, because when normal people think "group" they're probably thinking of sorting/ORDER BY or PARTITION BY and they certainly don't imagine "don't display these rows at all, lol".
I just don't understand what drives the ISO SQL language design committee - I'd have thought that the newer revisions (e.g. SQL-2003) would have improved the language's ergonomics - on the contrary: the language's grammar and verbosity gets worse every release, and the team has strange priorities: apparently they feel needing to generate in-memory XML is more important than deferrable constraints - and I only ever see ISO SQL's XML features being abused to make-up for a lack of decent string-aggregation functions.
(...I could talk for hours about everything wrong with SQL.)
This requires experience which can only be gained by rolling up your sleves and working on stuff until the high ambition has been satisfied. Sometimes when I see old code from myself, and I can follow what I have been doing I get really proud of myself. Many times I end up slightly improving it based on new knowledge I have acquired since I initially wrote it.
I’m guessing the SQL and C++ committees looked at that transition and decided that such transformative changes really need to be done in new languages (like the Perl -> Raku change) rather than in a new version which risks alienating your existing base.
To be clear: I am not advocating for a brand new query-language syntax or any kind of Python3-style overhaul, but I'd like to see SQL start to take small steps towards integrating the lessons learned from the past 60+ years of language design rather than doing the complete opposite.
This is not to minimize the pain of switching. But it does not seem to ahve limited the success of the language.
Understanding the difference between declarative and imperative programming is rather hard with all the abstractions we have today.
People always say, declarative programming is defining what you want, not doing the steps needed to get it. But today no imperative interface requires you to do all the steps either, plus, most programming languages use both paradigms at the same time.
The distinction is are you telling the computer *how* to do it, or telling it *what* steps to take.
If, even with access to all of the code, you'd have to ask the computer how it chose to do it to figure out what it did, you have a declarative system. If the code reads like instructions for a recipe, it is imperative.
The complications come with the fact that these two paradigms do not describe all of the possibilities. Notably object oriented and functional designs are neither imperative or declarative. (But may share some features with both.)
Some tasks that are seemingly simple in a normal programming language can sometimes be impossible to achieve in SQL (e.g. dynamically generated columns...)
SQL, while not perfect, is very compact and direct. It allows you to express what you want succinctly and without boilerplate. No classes, no variable declarations (sure you can DECLARE a variable, but it is rarely needed), no dependencies or imports.
There's a reason why, despite the promises of every BI tool that it will "simplify" your database and "empower users", none of them have toppled SQL or even added anything useful that SQL could incorporate.
Graphical query designers are nice but have limited capabilities. SQL could occasionally be less verbose and IDEs could probably do better in reducing keystrokes (better autocomplete), but SQL itself, overall, is pure and brilliant.
Figuring out what you want, and understanding your data well enough to know what's viable and what's nonsense, is infinitely harder. And it changes every time.
Getting good at that part is "expertise" in a nutshell - gradually learning what strategies work and when, and getting better and better at your guesses. That takes more than a few weeks; that's an entire career.
Pivoting data is still extremely painful using raw SQL
Similarly, you can't use SQL to apply color formatting to any of the result rows or values -- but that was never a goal SQL intended to achieve.
Should we think about an alternative, at least for ML ETL workloads?
Does it? I think SQL just sucks and its tooling sucks too
Even SQL Management Studio which felt way better than PGAdmin is miles behind IntelliSense that's offered by Visual Studio for C# (when it comes to reliability)
SQL would benefit a lot from being like C#'s LINQ (Query syntax) e.g:
var result = from s in stringList
where s.Contains("Tutorials")
select s;
some SQLv2 is something we need select s from stringList where s like '%Tutorials%'Overall I don't think it is that steep, though maybe I'm blinded by having worked with various implementations of it for more than two decades. The key sticking point is jumping to thinking in a set based manner to get best results. The rest of the difficult parts are when you need to think about implementation details because the query planners are no perfect (index hints and such) or being aware of limitations (like postgres before the latest major version having optimisation fences around CTEs).
> It expose almost zero insights into the underlying query execution.
That is pretty much by design. It is intended that you say what you want and let the query planner worry about implementation details. Of course how you tell it what you want involves learning to express those intentions in SQL. It does fall apart a bit when implementation limitations become an issue, at which point you are forced to think about the underlying implementation and how you might prod this more imperative code so that it interprets and process your relational descriptions most efficiently.
> As a result, increasingly amount of inefficient queries are being written by ML engineers
That isn't specific to ML. I see a lot of inefficient data interaction from code written by other devs. This seems to be for two reasons:
1. People seem to have taken to heart “make it work, make it work correctly, only then worry about making it work fast” to heart but tend to skip that last part and assume because all is well with their test sets of data at hundreds or thousands of rows (or sometimes tens and singles) that it'll scale just find to the hundreds of thousands or more that the clients datasets will eventually contain.
2. People using further abstractions without much care for how they implement their directives (again, in an ideal world they shouldn't have to), resulting in massively overcomplex queries as the framework tries to be clever and helpful and preempt what might be needed, getting everything whether needed or not (effectively `SELECT `) meaning the query planner can't apply families of its internal tricks for better performance, or getting many rows individually instead of as a set which sometimes means a lot of extra work for each row.
There is a definite “we'll worry about that when it happens attitude in both cases which is dangerous. While a live system has practically ground to a halt and the client needs their report by EOP or someone will get it in the neck (and be sure: they will pass that on to you!) is not a good time to be optimising data access, or worse finding out the structure just doesn't support efficient generation of the required data. Another common failing is applying what would idealy be UI or BLL concerns (timezone conversions etc) in the SQL statements in a way that blocks index use.
> Should we think about an alternative, at least for ML ETL workloads?*I don't work with ML so that is a little outside my day-to-day wexpertise, but I'd wager ETL there has the same problem as everywhere: the basics are all well known and very well optimised for already. The rest differ so much between applications that no one abstraction would be optimal for more than a small portion of real world needs.
I'd be wary of a separate team for optimising queries. I suggest a reasonable understanding in the whole dev team with a data expert embedded who is involved in design work and code reviews so issues are caught early and junior devs can be tutored as needed so by the time they are seniors they don't need the data expert except for really gnarly problems or long-term planning.
SQL's pseudo-natural language syntax is an embarrassment and its lack of composability is even more of an embarrassment.
I find them to be a huge step forwards in terms of adding composability to complex queries.
As data munging is about combining/correlating/sorting/grouping data, why not have a sound (bag) algebra to do that? Such algebra would give us equational reasoning, proofs, etc.
And consequently: students would be learning an algebra which is easier to learn IMO.
It started out with defining relations as a mathematical construct, and continued with various operators on such relations. Then they continued explaining the various normal forms up the fifth normal form. I was completely out of my depth, but at least it was good and solid theory that could be learned.
What really messed with my head is they then introduced SQL as a 'practical' implementation of relational algebra. I'm still having nightmares where I try to understand nested HAVING statements that where asked at the exams.
Hey relations don't contain duplicates! But that's OK. We should call (modern) SQL 'BAG ALGEBRA'.
Usually I see people struggling to formulate questions. They know what they want, but don’t understand how to get there. Left to their own devices, they hack up some nightmare in Excel.
I worked with a summer intern on creating reports and learning SQL. She was a really smart business major who ended up with the wrong work assignment. I was getting 5-7 questions a day from her in June, 1-2 a week in July and by the time I got back from vacation in August, she had basically done about 90% of a project that was going to be hired out and was showing me some features of the database we were using that I didn’t know!
It inspired her to switch majors and she is a fancy data scientist somewhere! Awesome mentor experience.
Step 2: Let's make a database engine that doesn't use SQL.
Step 3: "This is hard!"
Step 4: Make SQL access layer.
Wash. Rinse. Repeat.
See: PartiQL
Those who ignore the lessons of SQL are doomed to reimplement them...poorly.
1. "SQL doesn't scale!"
2. We made this database engine that's "web scale"!
3. "This is hard to use!"
4. Make SQL access layer.
See: Spanner
Those who blame SQL for their performance problems are doomed to repeat them using a proprietary syntax.
Students that have zero prior knowledge in programming, are able to pick up functional programming pretty easy. Students that have studied and used paradigms like OOP, seem to have a hard time grokking functional programming - as they see everything through the lens of OOP (and the languages they've used).
http://steve-yegge.blogspot.com/2006/03/execution-in-kingdom... uses Java to discuss what this can look like in an extreme case.
Before learning interpreters/compilers/prolog, I'd spend a lot of time trying to figure out about naming/namespaces in queries, while after doing some PLT, it all becomes very very obvious, you can now focus on the operators and since you'd know how far can programming go, you'd see faster how nested queries could make sense, what aggregating functions meants etc
This shows that those participants lack very basic foundational knowledge. It doesn't surprise me, because in my experience all programming courses that taught SQL early have been terrible.
Nobody who already has a basic understanding of computer science would make this mistake.
On the flipside there's really no point of devoting much time to teaching SQL later, because once you have a good understanding of data structures and algorithms, it is rather easy to make educated guesses of what is happening behind the scenes in a database - and you would have no trouble of teaching yourself SQL if necessary at some point.
Not to mention that teaching databases before what makes up their implementation is teaching software development in precisely the opposite way it is practiced: the composition of lower level concepts into higher level abstractions.
Last but not least, when you're teaching future software engineers, at the end you don't want them to just say "I can use this", you want them to say "I could build this". Teaching SQL early smells like surrender.
- SELECT name, address
- FROM user, location
- WHERE user.locationid = location.id