Avoid functions on the column side of expressions. This is called sargability [1], and refers to the ability of the query engine to limit the search to a specific index entry or data range. For example, WHERE SUBSTRING(field, 1, 1) = "A" will still cause a full table scan, with the SUBSTRING function evaluated for every row, while WHERE field LIKE "A%" can use a partial index scan, provided an index on the field column exists.
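A quick way to see the difference is EXPLAIN QUERY PLAN. Below is a minimal sketch using SQLite through Python's sqlite3 module (the table and column names are made up); the LIKE "A%" case is written as the equivalent range predicate to sidestep SQLite's LIKE-specific rules:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT);
    CREATE INDEX idx_people_name ON people(name);
""")

def plan(sql):
    # Concatenate the human-readable steps of the query plan.
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Function on the column side: not sargable, the index can't narrow the search.
p1 = plan("SELECT id FROM people WHERE substr(name, 1, 1) = 'A'")
# Bare column with a range (what LIKE 'A%' boils down to): index search.
p2 = plan("SELECT id FROM people WHERE name >= 'A' AND name < 'B'")

print(p1)  # a SCAN step: every row/entry is visited
print(p2)  # a SEARCH step: the index narrows the range
```

The plan wording differs between engines, but the SCAN-versus-SEARCH distinction is the thing to look for.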
Prefer = to LIKE
And therefore this advice is wrong. As long as your LIKE expression doesn't start with a wildcard, LIKE can use an index just fine.
Filter with WHERE before HAVING
This usually isn't an issue, because the search terms you would use under HAVING can't be used in the WHERE clause. But yes, the other way around is possible, so the rule of thumb is: if the condition can be evaluated in the WHERE clause, it should be.
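For instance (a minimal SQLite sketch via Python; the sales table is invented), the same non-aggregate condition works in either clause, but in WHERE it filters rows before any grouping happens:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount INTEGER);
    INSERT INTO sales VALUES ('EU', 10), ('EU', 20), ('US', 5), ('US', 7);
""")

# Filter before grouping: the 'US' rows are never grouped at all.
w = conn.execute("""SELECT region, SUM(amount) FROM sales
                    WHERE region = 'EU' GROUP BY region""").fetchall()
# Filter after grouping: every group is built, then all but 'EU' discarded.
h = conn.execute("""SELECT region, SUM(amount) FROM sales
                    GROUP BY region HAVING region = 'EU'""").fetchall()

print(w)  # [('EU', 30)]
print(h)  # [('EU', 30)]
```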
WITH
Be aware that not all database engines perform predicate propagation across CTE boundaries. That is, a query like this:
WITH allRows AS (
    SELECT id,
           difficult_calculation(col) AS result
    FROM table
)
SELECT result
FROM allRows
WHERE id = 15;
might cause the database engine to perform difficult_calculation() on all rows, not just row 15. All big databases support this nowadays, but it's not a given.

And MySQL seems to support "generated columns", which can be "virtual" and can have indexes. (Although in that case the expression lives in the column definition, so it's not actually in a WHERE clause.)
Also, I guess some databases probably let you have an index on a view, which could be another way.
So if you really need a function in your where clause, there may very well be a way to do it efficiently. Of course, the usual caveat applies that it requires more I/O to maintain more indexes.
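One way to check what your engine actually does with a CTE like the one above is to count invocations with a user-defined function. A sketch with SQLite and Python (the table and function are stand-ins; results will differ per engine):

```python
import sqlite3

calls = 0
def difficult_calculation(x):
    global calls
    calls += 1
    return x * 2

conn = sqlite3.connect(":memory:")
# deterministic=True lets the engine treat the UDF like a pure function.
conn.create_function("difficult_calculation", 1, difficult_calculation,
                     deterministic=True)
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, col INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(i, i) for i in range(1, 101)])

result = conn.execute("""
    WITH allRows AS (
        SELECT id, difficult_calculation(col) AS result FROM t
    )
    SELECT result FROM allRows WHERE id = 15
""").fetchone()[0]

# If calls == 1, the predicate was pushed into the CTE;
# if it equals the row count (100), the whole CTE was evaluated first.
print(result, calls)
```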
I found this article on the topic to be helpful:
https://saveriomiroddi.github.io/An-introduction-to-function...
For example, in Apache Impala and Spark, "Prefer = to LIKE" is good advice, especially in join conditions, where an equijoin allows the query planner to use a Hash Join, whereas a non-equijoin limits the query planner to a Nested Loop join.
People treat deciding your app will have a database as a design decision when in reality it is only about 10% of a design decision.
Good blog post about it : https://about.gitlab.com/blog/2016/03/18/fast-search-using-p...
The docs: https://www.postgresql.org/docs/current/pgtrgm.html
Basically, instead of writing
"inner join table2 on"
you can just write "and" and put it after your other where clauses.
It doesn't result in errors because the query will fail if referring to fields from the second table when there is no join.
Could even put the join where clauses on a separate line to split them out from the other where clauses.
It does require reworking the query if not doing an inner join but that is what at least I usually want when doing ad hoc queries. Agree that the join syntax should be used in production code.
Usually, this doesn't make a difference as an EXISTS check is mostly used on the primary key (or business key) of a table, which is (hopefully) non-nullable. But it can sometimes give surprising results when using EXISTS on a nullable column, or for worse results, NOT IN.
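The NOT IN pitfall is easy to reproduce (SQLite sketch via Python, invented tables): a single NULL in the subquery result makes every NOT IN comparison UNKNOWN, so no rows come back at all:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, cust_id INTEGER);
    INSERT INTO orders VALUES (1, 10), (2, 20);
    CREATE TABLE blocked (cust_id INTEGER);       -- nullable column
    INSERT INTO blocked VALUES (10), (NULL);      -- one NULL poisons NOT IN
""")

not_in = conn.execute("""
    SELECT id FROM orders
    WHERE cust_id NOT IN (SELECT cust_id FROM blocked)
""").fetchall()

not_exists = conn.execute("""
    SELECT id FROM orders o
    WHERE NOT EXISTS (SELECT 1 FROM blocked b WHERE b.cust_id = o.cust_id)
""").fetchall()

print(not_in)      # [] -- NULL makes every NOT IN comparison UNKNOWN
print(not_exists)  # [(2,)] -- NOT EXISTS behaves as intended
```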
SELECT a.foo, b.bar, g.zed
FROM alpha a
JOIN beta b ON a.id = b.alpha_id
LEFT JOIN gamma g ON b.id = g.beta_id
WHERE a.val > 1
AND b.col < 2
ORDER BY a.foo

select
a.foo,
b.bar,
g.zed
from
alpha a
join
beta b
on b.alpha_id = a.id
left join
gamma g
on g.beta_id = b.id
where
a.val > 1
and b.col < 2
order by
a.foo
It's really easy, for me anyway, to get an overview of the query with this style compared to styles that are more cramped or that are inconsistently wrapped/indented.

For simpler queries, I think this is okay too, but only if the clauses easily fit on a single line:
select ...
from ...
join ...
where ...

from
alpha a
inner join beta b on b.id = a.id
left outer join gamma g on g.id = a.id
left outer join (
select z.id, count(*) as cnt from zeta z group by z.id
) delta on delta.id = a.id
where
...

https://github.com/republicwireless-open/sql-style-guide
I think Mozilla's does as well.
SELECT a.foo, b.bar, g.zed
FROM alpha a
JOIN beta b ON a.id = b.alpha_id
LEFT JOIN gamma g ON b.id = g.beta_id
WHERE a.val > 1
AND b.col < 2
ORDER BY a.foo
Usually only the conditions get deep and can also use extra indented parenthesized parts.

Reminder: don't use 'OUTER'; it's pure noise.
SELECT
a.foo,
b.bar,
g.zed
FROM alpha a
JOIN beta b ON a.id = b.alpha_id
LEFT JOIN gamma g
ON b.id = g.beta_id
AND a.id = g.alpha_id
WHERE
a.val > 1
AND b.col < 2
ORDER BY
a.foo;

Oddly, DataGrip doesn't provide this option.
SELECT a.foo
, b.bar
, g.zed
FROM alpha a
JOIN beta b ON a.id = b.alpha_id AND a.another = b.thing
LEFT JOIN gamma g ON b.id = g.beta_id
WHERE a.val > 1
AND b.col < 2
ORDER BY a.foo
or

SELECT a.foo
, b.bar
, g.zed
FROM alpha a
JOIN beta b
ON a.id = b.alpha_id
AND a.another = b.thing
LEFT JOIN gamma g
ON b.id = g.beta_id
WHERE a.val > 1
AND b.col < 2
ORDER BY a.foo
I'm not consistent with the layout of my joining predicates - I go for whatever seems clearer given the current circumstances, and that varies due to several factors (number of parts, length of column names and/or functions, ...). How sub-queries and instances of CASE are broken into lines and indented is something I also vary on.

I can see why. The indentation of the whole statement is not determined by the first line, but by the 6th column on the first example and the 8th on the second for a `JOIN` clause. It's really arbitrary, and when you have that statement between other code, it's going to be weird how the start of the statement is much more indented than its preceding code. I really dislike it, too.
I prefer the use of indentation to signal what's inside another syntax structure. So, for example, I also dislike how you aligned `ON` and `AND` when the `AND` is inside the `ON` expression. It makes it seem like the two lines are on the same syntactic level.
Here's how I do it:
SELECT a.foo
, b.bar
, g.zed
FROM alpha a
JOIN beta b
ON a.id = b.alpha_id
AND a.another = b.thing
LEFT JOIN gamma g
ON b.id = g.beta_id
WHERE a.val > 1
AND b.col < 2
ORDER BY a.foo
You might also notice that I removed the padding you used to align the `=` signs. I dislike big changes where the only thing that changed for a line is the whitespace padding. It obscures the real change. It might not seem like a big thing when you only have 2 lines in alignment, but it's a real bother when reading a diff that does that for more lines. You have to compare between the - and + lines to find what really changed instead of the diff telling you outright.

SELECT a.foo, b.bar, g.zed
FROM alpha a
JOIN beta b ON a.id = b.alpha_id AND a.another = b.thing
LEFT JOIN gamma g ON b.id = g.beta_id
WHERE a.val > 1
AND b.col < 2
ORDER BY a.foo
(bonus: "AND" got accidentally aligned with the end of "WHERE")

select
a.foo
, b.bar
, g.zed
from alpha a
inner join beta b on
b.alpha_id = a.id
and b.thing = a.another
left join gamma g on
g.beta_id = b.id
where
a.val > 1
and b.col < 2
order by
a.foo

SELECT a.foo
, b.bar
, g.zed
FROM ...
While the comma placement may seem weird, it makes this exactly identical to the "AND" or "OR" placement in WHERE clauses, and the primary benefit is that it's easy to comment out any column except the first.

This implies that WHERE style join can't use indices.
I can understand why some would prefer either syntax for readability/style reasons. But the idea that one uses indices and the other not, seems highly dubious.
Looking at the postgres manual [1], the WHERE syntax is clearly presented as the main way of inner joining tables. The JOIN syntax is described as an "alternative syntax":
> This [INNER JOIN] syntax is not as commonly used as the one above, but we show it here to help you understand the following topics.
Maybe some database somewhere cannot optimise queries properly unless JOIN is used? Or is this just FUD?
This is going to need some sources. Is it true today? And why did they put parentheses in the ON condition?
Worth noting that there were variants of the WHERE syntax to support left joins using vendor-specific operators such as A *= B (SQL Server) or A = B (+) (Oracle) -- those are clearly deprecated today. [1] [2]
I have a really hard time finding any source on the internet that recommends using the WHERE style joins. So by extension, I wouldn't expect them to be used much anymore except for legacy projects. MS SQL Server docs mention the ON syntax being "preferred" [3], and MySQL says "Generally, the ON clause serves for conditions that specify how to join tables, and the WHERE clause restricts which rows to include in the result set." [4]
The PostgreSQL docs seem misleading and outdated to me.
[1] https://docs.microsoft.com/en-us/archive/blogs/wardpond/depr...
[2] https://docs.oracle.com/cd/B19306_01/server.102/b14200/queri...
[3] https://docs.microsoft.com/en-us/sql/relational-databases/pe...
My goal was only to cast doubt on the idea that WHERE clauses in general can't use indices.
Sure, let's debate what the nicest style is. But let's not claim that our preferred style somehow makes the DB go faster (without some kind of proof).
As someone who debugs a lot of SQL, I prefer the ON clause for one major reason - it is easy to notice it if it is missing.
So there was a customer who wrote 300 terabytes of intermediate data out of a badly written query, which wasn't caught because they ran the equivalent of
"select * from orders, customers ... "
and ended up commenting out the o_cust_id = c_cust_id in the bottom of the where clause while they were messing about with it.
And another example of a CTE which was missing the ON clause, but the outer query did have a where clause and that was great till someone else cut-pasted the CTE into a materialized view & tried to rebuild it.
> Maybe some database somewhere cannot optimise queries properly unless JOIN is used?
Until Apache Calcite came in, Apache Hive could not always find the table which was joined against out of a where clause (so you'd find TPC-H queries which were planned as cross-products between the same table with different aliases, etc. - Query 7, for example, was badly planned from the beginning of Hive until Hive 1.2 and only optimally planned in Hive 3.x).
But SQL engines have gotten much better over the years & most people should write it for readability than for execution order, but the readability is really why I like the ON syntax, particularly while chop-debugging a big query written by some poor guy who was translating COBOL into SQL.
And at least for the database we use at work, if the sole reason for a join is to reduce the data, prefer EXISTS.
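The reason, illustrated with a throwaway SQLite example: an inner join used purely as a filter multiplies rows (one per match), while EXISTS returns each outer row at most once:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY);
    INSERT INTO customers VALUES (1), (2);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, cust_id INTEGER);
    INSERT INTO orders VALUES (100, 1), (101, 1);  -- customer 1 has 2 orders
""")

joined = conn.execute("""
    SELECT c.id FROM customers c JOIN orders o ON o.cust_id = c.id
""").fetchall()

exists = conn.execute("""
    SELECT c.id FROM customers c
    WHERE EXISTS (SELECT 1 FROM orders o WHERE o.cust_id = c.id)
""").fetchall()

print(joined)  # [(1,), (1,)] -- customer 1 duplicated, one row per order
print(exists)  # [(1,)]
```

You can get the same effect from the join with SELECT DISTINCT, but EXISTS states the intent directly and lets the engine stop at the first match.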
Yes absolutely, and not just for performance benefits. It's much easier to track what, how, and why you're joining to something when it's not jumbled together in a list of a dozen conditions in the WHERE clause.
I can't tell you how much bad data I've had to fix because when I break apart the implicit conditions into explicit joins it is absolutely not doing what the original author intended and it would have been obvious with an explicit join.
And then in the explicit join, always be explicit about the join type. Don't just use JOIN when you want an INNER JOIN; otherwise I have to wonder if the author accidentally left something off.
I've seen a few websites, but I don't know which ones to use. Or maybe there is a dataset with practice question I could download?
Edit: I found https://pgexercises.com and it's been fantastic so far. Much more responsive than other sites, clear questions, and free.
Not having to fight assumptions along lines of improperly denormalized columns (i.e. which table is source of truth for a specific fact) can auto-magically simplify a lot of really horrible joins and other SQL hack-arounds that otherwise wouldn't be necessary. The essential vs accidental complexity battle begins right here with domain modeling.
You should be seeking something around 3rd normal form when developing a SQL schema for any arbitrary problem domain. Worry about performance after it's actually slow. A business expert who understands basic SQL should be able to look at and understand what every fact & relation table in your schema are for. They might even be able to help confirm the correctness of business logic throughout or even author some of it themselves. SQL can be an extremely powerful contract between the technology wizards and the business people.
More along lines of the original topic - I would strongly advocate for views in cases where repetitive, complex queries are being made throughout the application. These serve as single points of reference for a particular projection of facts and can dramatically simplify downstream queries.
Here's a better general performance tuning handbook - https://use-the-index-luke.com/
Use-the-index-luke is an altogether deeper, more technical article aimed at data engineers and going into the details and differences between databases.
I disagree. Professional developers should know their database of choice inside and out, and use-the-index-luke helps with that. You can skip the details about databases that aren't relevant to you.
Also, a few things are dead wrong: "make the haystack small" is optimization (it should be at the end, as the first rule says), "prefer UNION ALL to UNION" is missing context (a good dev knows what is needed, not what to prefer), and the usage of CTEs is nice, but sometimes slower than other options - and in SQL, slower can easily be orders of magnitude, so nice is not enough. Same for 'avoid sorting where possible, especially in subqueries' or "use composite indexes" (really? that's a basic thing, not a best practice).
In the past few months I interviewed and hired several DBAs, this list is ok-ish for a junior but a fail for a senior. I am not working for FAANG, so the bar is pretty low, this article would not even pass for a junior there.
CTEs were for a very long time an optimization fence in PostgreSQL, were not inlined and behaved more like temporary materialized views.
Only with the release of PostgreSQL 12 did some CTE inlining start happening - with limitations: not recursive, no side effects, and referenced only once in a later part of the query.
More info: https://hakibenita.com/be-careful-with-cte-in-postgre-sql
I would love to know, so that I can know what optimizations and WHERE / JOIN conditions I should really be careful about making more efficient, versus others that I don't have to worry because the optimizer will take care of it.
For example, if I'm joining 2 long tables together, should I be very careful to create 2 subtables with restrictive WHERE conditions first, so that it doesn't try to join the whole thing, or is the optimizer taking care of that if lump that query all into one entire join and only WHERE it afterwards? How do you tell what columns are indexed and inexpensive to query frequently, and which are not? Is it better to avoid joining on floating point value BETWEEN conditions?
And other questions like this.
Databases have ways to query the schema, including the index definitions, so you can know which columns are indexed (and the order of the columns in those indexes).
Unless you materialize a temporary table or materialized view or use a CTE with a planner that doesn't look inside CTEs, the planner will just "inline" your subqueries (what are "subtables"?) and it will not affect the way the join is performed.
Join on floating point value is quite rare. Why do you need to do that?
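To make the schema-querying point concrete: in SQLite the index catalog is exposed through PRAGMAs (Postgres has the pg_indexes view instead); a small sketch with a made-up table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (a INTEGER, b INTEGER, c INTEGER);
    CREATE INDEX idx_t_a_b ON t(a, b);
""")

indexes = {}
for (_, name, *_rest) in conn.execute("PRAGMA index_list('t')"):
    # index_info rows are (position, column_id, column_name).
    indexes[name] = [r[2] for r in conn.execute(f"PRAGMA index_info('{name}')")]

# Column order matters: an index on (a, b) can serve lookups on a alone,
# but not lookups on b alone.
print(indexes)
```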
Ah, thanks for noticing this. They are, for example, (1) tables of timestamped events, and (2) tables of time ranges with which those events need to be associated (but which unfortunately were not created with that in mind at the time)...
So for example FROM tableA LEFT JOIN tableB ON (timestampA BETWEEN timestampB1 AND timestampB2)
(and where the timestamps can be either floating point or integer nanoseconds)
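That kind of range association can be sketched like this (SQLite via Python, integer timestamps, made-up tables); an index on the range-start column at least gives the engine something to seek on:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (id INTEGER PRIMARY KEY, ts INTEGER);
    INSERT INTO events VALUES (1, 5), (2, 15), (3, 25);
    CREATE TABLE ranges (label TEXT, t1 INTEGER, t2 INTEGER);
    INSERT INTO ranges VALUES ('early', 0, 9), ('late', 10, 19);
    CREATE INDEX idx_ranges_t1 ON ranges(t1);
""")

rows = conn.execute("""
    SELECT e.id, r.label
    FROM events e
    LEFT JOIN ranges r ON e.ts BETWEEN r.t1 AND r.t2
    ORDER BY e.id
""").fetchall()

print(rows)  # [(1, 'early'), (2, 'late'), (3, None)] -- event 3 matches no range
```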
One of the more important parts is simply understanding which indexes can be used in a query. The other part is understanding when the database will intentionally not use an index, this is mostly related to column statistics. The basics of indexes are pretty simple, but then there is a whole bunch of subtle details that can mean the index can't actually be used for your query.
Another useful part to understand is how much IO a query requires, EXPLAIN (ANALYZE, BUFFERS) is helpful for that. But you also need to understand a bit the layout Postgres uses to store data, how it is stored in pages, TOAST and related stuff.
For Postgres I'd really start with reading the manual on index types and on the statistics collector. After that I'd just play with explain analyze for queries you're writing.
The order of JOINs is optimized automatically in Postgres, but only up to a point; for a large number of joins it has to fall back to heuristics.
No database can find perfect join order when you have more than about 8 to 10 tables in the join.
10 or so years ago when SQL Server, Oracle & MySQL dominated the industry, you could talk about SQL optimization with the expectation that all advice was good advice. There are too many flavors of databases to do that today.
I despise table aliases and usually remove them from queries. To me, they add a level of abstraction that obscures the purpose of the query. They're usually meaningless strings generated automatically by the tools used by data analysts who rarely inspect the underlying SQL for readability. I fully agree that you should reference columns explicitly with the table name, which I think is the real point they're trying to make in the article.
While it's true that sorting is expensive, the downstream benefits can be huge. The ability to easily diff sorted result sets helps with troubleshooting and can also save significant storage space whenever the results are archived.
Short aliases - I tend to use the first letter of each word in the table name - work best, IMO.
In code, I like the length of the variable name to be proportional to the size of the scope. Small scope -- short variable names.
I feel it would be rather noisy to have to specify such table names in front of the 15+ column references in a query, compared to using aliases.
Then again I've never had to diff the result sets, so I guess our usage is quite different.
SELECT
title,
last_name,
first_name
FROM books
LEFT JOIN authors
ON books.author_id = authors.id
> Prefer SELECT
b.title,
a.last_name,
a.first_name
FROM books AS b
LEFT JOIN authors AS a
ON b.author_id = a.id
Couldn't disagree more. One-letter abbreviations hurt readability IMO.

SELECT t1.thing
, t2.stuff
, t3.stuffagain
, t4.more
FROM SomeTable t1
JOIN TableThatLinksToSelf t2 ON <join predicate>
JOIN TableThatLinksToSelf t3 ON <join predicate>
JOIN AnotherTable t4 ON <join predicate>
That is not the point that is being made here. The point is that explicitly naming tables is beneficial to understanding and reduces issues later. A short alias is preferable to not specifying column sources at all.

I prefer descriptive table and other object names, and abbreviate them in aliases within queries (though usually not to single letters).
Sure it does not help to understand the origins of a given field without aliases, unless someone is very familiar with the schema.
If the tables names are long, by all means abbreviate them a little, but never just use 1 letter aliases.
I wonder if the author has ever worked with a system that has more than a handful of tables.
SELECT
bo.title,
au.last_name,
au.first_name
FROM books AS bo
LEFT JOIN authors AS au ON
bo.author_id = au.id

In simple cases that may be just the singular of the table name, e.g.:
SELECT
book.title,
author.last_name,
author.first_name
FROM books AS book
LEFT JOIN authors AS author ON
book.author_id = author.id
But in other cases, it will be different, e.g.:

SELECT
manager.last_name || ', ' || manager.first_name AS manager_name,
coalesce(employee.title, 'All Titles') AS staff_title,
count(employee.id) AS count
FROM employees AS manager
LEFT JOIN employees AS employee ON
manager.id = employee.manager_id
GROUP BY manager.id, ROLLUP(employee.title)
HAVING employee.title IS NOT NULL
OR GROUPING(employee.title) = 1

Unless this is specific to certain databases, LIKE can take advantage of indexes too; without wildcards, LIKE should be nearly identical in performance to =, both seeking the index.
>Using wildcards for searching can be expensive. Prefer adding wildcards to the end of strings. Prefixing a string with a wildcard can lead to a full table scan.
Which contradicts the first quote; it seems you recognize that a wildcard at the end can take advantage of an index. A full table scan is the same thing as not taking advantage of an index, hence LIKE can take advantage of normal indexes so long as there are characters before the first wildcard, or there are no wildcards at all.
At least for the latest versions of every database. If you go back to a version from 10+ years ago there's no guarantees.
It can only do a seek if there are characters before the wildcard: 'ab%c', 'abc%' and 'abc' get progressively faster due to fewer index entries traversed.
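SQLite makes this observable (a sketch via Python with an invented table; note that SQLite additionally requires case-sensitive LIKE, or a NOCASE index, before it applies its LIKE optimization):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA case_sensitive_like = ON")  # required for SQLite's LIKE optimization
conn.executescript("""
    CREATE TABLE t (name TEXT);
    CREATE INDEX idx_t_name ON t(name);
""")

def plan(sql):
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Prefix pattern: rewritten internally as a range, so the index can be seeked.
prefix  = plan("SELECT name FROM t WHERE name LIKE 'abc%'")
# Leading wildcard: no usable lower bound, every entry must be examined.
leading = plan("SELECT name FROM t WHERE name LIKE '%abc'")

print(prefix)   # a SEARCH step
print(leading)  # a SCAN step
```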
https://en.wikipedia.org/wiki/FLWOR
SELECT first_name FROM person WHERE first_name LIKE 'john'
becomes:
FROM person WHERE first_name LIKE 'john' SELECT first_name
SQL reads more English-like, while from-first is more Yoda-speak, but the auto-complete is worth more to me.
Don’t most databases figure this out as part of the query planner anyway? Postgres has no problems using indexes for joins inside WHERE.
My guess is the author heard something about not using implicit inner joins (deprecated decades ago) and misunderstood.
E.g. this old syntax: SELECT * FROM a, b WHERE a.id = b.a_id
I wonder what the results would be if I ran the queries from this article through that tool.
SELECT foo
FROM bar
WHERE TRUE
AND baz > boom
For OR conditions it's a bit different: SELECT foo
FROM bar
WHERE FALSE
OR baz > boom

If you're already writing:
WHERE foo=bar
AND biz=baz
It's not clear to me how:

WHERE TRUE
AND foo=bar
AND biz=baz
is worse.

SELECT foo
FROM bar
WHERE 1=1
AND (1<>1
OR baz > boom
OR fizz >= bang
)
AND foo is not null
So you can comment out lines starting with OR individually. Some people might hate it, but it makes sense conceptually for me since almost every query I write takes a chain of ANDs in the where clause as a starting point.

For OR, I like to keep the "1=1" and do
AND (1=2
OR ...
)

"a = 'foo'" is exactly the same performance as "a like 'foo'" and very close to the performance of "a like 'foo%'", and is fully indexed. When you put a wildcard in the front, the entire index is avoided, so you gotta switch to full text search.