Avoid functions on the column side of expressions. This is called sargability [1], and refers to the ability of the query engine to limit the search to a specific index entry or data range. For example, WHERE SUBSTRING(field, 1, 1) = "A" will still cause a full table scan, with the SUBSTRING function evaluated for every row, while WHERE field LIKE "A%" can use a partial index scan, provided an index on the field column exists.
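A quick way to see the difference is EXPLAIN QUERY PLAN. Below is a minimal sketch using SQLite through Python's sqlite3 module (the table and column names are made up); the LIKE "A%" case is written as the equivalent range predicate to sidestep SQLite's LIKE-specific rules:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT);
    CREATE INDEX idx_people_name ON people(name);
""")

def plan(sql):
    # Concatenate the human-readable steps of the query plan.
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Function on the column side: not sargable, the index can't narrow the search.
p1 = plan("SELECT id FROM people WHERE substr(name, 1, 1) = 'A'")
# Bare column with a range (what LIKE 'A%' boils down to): index search.
p2 = plan("SELECT id FROM people WHERE name >= 'A' AND name < 'B'")

print(p1)  # a SCAN step: every row/entry is visited
print(p2)  # a SEARCH step: the index narrows the range
```

The plan wording differs between engines, but the SCAN-versus-SEARCH distinction is the thing to look for.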
Prefer = to LIKE
And therefore this advice is wrong. As long as your LIKE expression doesn't start with a wildcard, LIKE can use an index just fine.
Filter with WHERE before HAVING
This usually isn't an issue, because the search terms you would use under HAVING can't be used in the WHERE clause. But yes, the other way around is possible, so the rule of thumb is: if the condition can be evaluated in the WHERE clause, it should be.
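For instance (a minimal SQLite sketch via Python; the sales table is invented), the same non-aggregate condition works in either clause, but in WHERE it filters rows before any grouping happens:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount INTEGER);
    INSERT INTO sales VALUES ('EU', 10), ('EU', 20), ('US', 5), ('US', 7);
""")

# Filter before grouping: the 'US' rows are never grouped at all.
w = conn.execute("""SELECT region, SUM(amount) FROM sales
                    WHERE region = 'EU' GROUP BY region""").fetchall()
# Filter after grouping: every group is built, then all but 'EU' discarded.
h = conn.execute("""SELECT region, SUM(amount) FROM sales
                    GROUP BY region HAVING region = 'EU'""").fetchall()

print(w)  # [('EU', 30)]
print(h)  # [('EU', 30)]
```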
WITH
Be aware that not all database engines perform predicate propagation across CTE boundaries. That is, a query like this:
WITH allRows AS (
    SELECT id,
           difficult_calculation(col) AS result
    FROM table
)
SELECT result
FROM allRows
WHERE id = 15;
might cause the database engine to perform difficult_calculation() on all rows, not just row 15. All big databases support this nowadays, but it's not a given.

And MySQL seems to support "generated columns", which can be "virtual" and can have indexes. (Although in that case the expression lives in the column definition, so it's not actually in a WHERE clause.)
Also, I guess some databases probably let you have an index on a view, which could be another way.
So if you really need a function in your where clause, there may very well be a way to do it efficiently. Of course, the usual caveat applies that it requires more I/O to maintain more indexes.
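One way to check what your engine actually does with a CTE like the one above is to count invocations with a user-defined function. A sketch with SQLite and Python (the table and function are stand-ins; results will differ per engine):

```python
import sqlite3

calls = 0
def difficult_calculation(x):
    global calls
    calls += 1
    return x * 2

conn = sqlite3.connect(":memory:")
# deterministic=True lets the engine treat the UDF like a pure function.
conn.create_function("difficult_calculation", 1, difficult_calculation,
                     deterministic=True)
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, col INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(i, i) for i in range(1, 101)])

result = conn.execute("""
    WITH allRows AS (
        SELECT id, difficult_calculation(col) AS result FROM t
    )
    SELECT result FROM allRows WHERE id = 15
""").fetchone()[0]

# If calls == 1, the predicate was pushed into the CTE;
# if it equals the row count (100), the whole CTE was evaluated first.
print(result, calls)
```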
I found this article on the topic to be helpful:
https://saveriomiroddi.github.io/An-introduction-to-function...
For example, in Apache Impala and Spark, "Prefer = to LIKE" is good advice, especially in join conditions, where an equijoin allows the query planner to use a Hash Join, whereas a non-equijoin limits the query planner to a Nested Loop join.
People treat deciding your app will have a database as a design decision when in reality it is only about 10% of a design decision.
Good blog post about it : https://about.gitlab.com/blog/2016/03/18/fast-search-using-p...
The docs: https://www.postgresql.org/docs/current/pgtrgm.html
Basically, instead of writing
"inner join table2 on"
you can just write "and" and put it after your other where clauses.
It doesn't result in errors because the query will fail if referring to fields from the second table when there is no join.
Could even put the join where clauses on a separate line to split them out from the other where clauses.
It does require reworking the query if not doing an inner join but that is what at least I usually want when doing ad hoc queries. Agree that the join syntax should be used in production code.
Usually, this doesn't make a difference as an EXISTS check is mostly used on the primary key (or business key) of a table, which is (hopefully) non-nullable. But it can sometimes give surprising results when using EXISTS on a nullable column, or for worse results, NOT IN.
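The NOT IN pitfall is easy to reproduce (SQLite sketch via Python, invented tables): a single NULL in the subquery result makes every NOT IN comparison UNKNOWN, so no rows come back at all:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, cust_id INTEGER);
    INSERT INTO orders VALUES (1, 10), (2, 20);
    CREATE TABLE blocked (cust_id INTEGER);       -- nullable column
    INSERT INTO blocked VALUES (10), (NULL);      -- one NULL poisons NOT IN
""")

not_in = conn.execute("""
    SELECT id FROM orders
    WHERE cust_id NOT IN (SELECT cust_id FROM blocked)
""").fetchall()

not_exists = conn.execute("""
    SELECT id FROM orders o
    WHERE NOT EXISTS (SELECT 1 FROM blocked b WHERE b.cust_id = o.cust_id)
""").fetchall()

print(not_in)      # [] -- NULL makes every NOT IN comparison UNKNOWN
print(not_exists)  # [(2,)] -- NOT EXISTS behaves as intended
```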
SELECT a.foo, b.bar, g.zed
FROM alpha a
JOIN beta b ON a.id = b.alpha_id
LEFT JOIN gamma g ON b.id = g.beta_id
WHERE a.val > 1
AND b.col < 2
ORDER BY a.foo

select
a.foo,
b.bar,
g.zed
from
alpha a
join
beta b
on b.alpha_id = a.id
left join
gamma g
on g.beta_id = b.id
where
a.val > 1
and b.col < 2
order by
a.foo
It's really easy, for me anyway, to get an overview of the query with this style compared to styles that are more cramped or that are inconsistently wrapped/indented.

For simpler queries, I think this is okay too, but only if the clauses easily fit on a single line:
select ...
from ...
join ...
where ...

from
alpha a
inner join beta b on b.id = a.id
left outer join gamma g on g.id = a.id
left outer join (
select z.id, count(*) as cnt from zeta z group by z.id
) delta on delta.id = a.id
where
...

https://github.com/republicwireless-open/sql-style-guide
I think Mozilla's does as well.
SELECT a.foo, b.bar, g.zed
FROM alpha a
JOIN beta b ON a.id = b.alpha_id
LEFT JOIN gamma g ON b.id = g.beta_id
WHERE a.val > 1
AND b.col < 2
ORDER BY a.foo
Usually only the conditions get deep and can also use extra indented parenthesized parts.

Reminder: don't use 'OUTER'; it's pure noise.
SELECT
a.foo,
b.bar,
g.zed
FROM alpha a
JOIN beta b ON a.id = b.alpha_id
LEFT JOIN gamma g
ON b.id = g.beta_id
AND a.id = g.alpha_id
WHERE
a.val > 1
AND b.col < 2
ORDER BY
a.foo;

Oddly, DataGrip doesn't provide this option.
SELECT a.foo
, b.bar
, g.zed
FROM alpha a
JOIN beta b ON a.id = b.alpha_id AND a.another = b.thing
LEFT JOIN gamma g ON b.id = g.beta_id
WHERE a.val > 1
AND b.col < 2
ORDER BY a.foo
or

SELECT a.foo
, b.bar
, g.zed
FROM alpha a
JOIN beta b
ON a.id = b.alpha_id
AND a.another = b.thing
LEFT JOIN gamma g
ON b.id = g.beta_id
WHERE a.val > 1
AND b.col < 2
ORDER BY a.foo
I'm not consistent with the layout of my joining predicates - I go for whatever seems clearer given the current circumstances, and that varies due to several factors (number of parts, length of column names and/or functions, ...). How sub-queries and instances of CASE are broken into lines and indented is something I also vary on.

I can see why. The indentation of the whole statement is not determined by the first line, but by the 6th column on the first example and the 8th on the second for a `JOIN` clause. It's really arbitrary, and when you have that statement between other code, it's going to be weird how the start of the statement is much more indented than its preceding code. I really dislike it, too.
I prefer the use of indentation to signal what's inside another syntax structure. So, for example, I also dislike how you aligned `ON` and `AND` when the `AND` is inside the `ON` expression. It makes it seem like the two lines are on the same syntactic level.
Here's how I do it:
SELECT a.foo
, b.bar
, g.zed
FROM alpha a
JOIN beta b
ON a.id = b.alpha_id
AND a.another = b.thing
LEFT JOIN gamma g
ON b.id = g.beta_id
WHERE a.val > 1
AND b.col < 2
ORDER BY a.foo
You might also notice that I removed the padding you used to align the `=` signs. I dislike big changes where the only thing that changed for a line is the whitespace padding. It obscures the real change. It might not seem like a big thing when you only have 2 lines in alignment, but it's a real bother when reading a diff that does that for more lines. You have to compare between the - and + lines to find what really changed instead of the diff telling you outright.

SELECT a.foo, b.bar, g.zed
FROM alpha a
JOIN beta b ON a.id = b.alpha_id AND a.another = b.thing
LEFT JOIN gamma g ON b.id = g.beta_id
WHERE a.val > 1
AND b.col < 2
ORDER BY a.foo
(bonus: "AND" got accidentally aligned with the end of "WHERE")

select
a.foo
, b.bar
, g.zed
from alpha a
inner join beta b on
b.alpha_id = a.id
and b.thing = a.another
left join gamma g on
g.beta_id = b.id
where
a.val > 1
and b.col < 2
order by
a.foo

SELECT a.foo
, b.bar
, g.zed
FROM ...
While the comma placement may seem weird, it makes this exactly identical to the "AND" or "OR" placement in WHERE clauses, and the primary benefit is that it's easy to comment out any column except the first.

This implies that WHERE style join can't use indices.
I can understand why some would prefer either syntax for readability/style reasons. But the idea that one uses indices and the other not, seems highly dubious.
Looking at the postgres manual [1], the WHERE syntax is clearly presented as the main way of inner joining tables. The JOIN syntax is described as an "alternative syntax":
> This [INNER JOIN] syntax is not as commonly used as the one above, but we show it here to help you understand the following topics.
Maybe some database somewhere cannot optimise queries properly unless JOIN is used? Or is this just FUD?
This is going to need some sources. Is it true today? And why did they put parentheses in the ON condition?
Worth noting that there were variants of the WHERE syntax to support left joins using vendor-specific operators such as A *= B (SQL Server) or A = B (+) (Oracle) -- those are clearly deprecated today. [1] [2]
I have a really hard time finding any source on the internet that recommends using the WHERE style joins. So by extension, I wouldn't expect them to be used much anymore except for legacy projects. MS SQL Server docs mention the ON syntax being "preferred" [3], and MySQL says "Generally, the ON clause serves for conditions that specify how to join tables, and the WHERE clause restricts which rows to include in the result set." [4]
The PostgreSQL docs seem misleading and outdated to me.
[1] https://docs.microsoft.com/en-us/archive/blogs/wardpond/depr...
[2] https://docs.oracle.com/cd/B19306_01/server.102/b14200/queri...
[3] https://docs.microsoft.com/en-us/sql/relational-databases/pe...
My goal was only to cast doubt on the idea that WHERE clauses in general can't use indices.
Sure, let's debate what the nicest style is. But let's not claim that our preferred style somehow makes the DB go faster (without some kind of proof).
As someone who debugs a lot of SQL, I prefer the ON clause for one major reason - it is easy to notice it if it is missing.
So there was a customer who wrote 300 terabytes of intermediate data out of a badly written query, which wasn't caught because they ran the equivalent of
"select * from orders, customers ... "
and ended up commenting out the o_cust_id = c_cust_id in the bottom of the where clause while they were messing about with it.
And another example of a CTE which was missing the ON clause, but the outer query did have a where clause and that was great till someone else cut-pasted the CTE into a materialized view & tried to rebuild it.
> Maybe some database somewhere cannot optimise queries properly unless JOIN is used?
Until Apache Calcite came in, Apache Hive could not always find the table which was joined against out of a where clause (so you'd find TPC-H queries which were planned as cross-products between the same table with different aliases, etc. - Query 7, for example, was badly planned from the beginning of Hive until Hive 1.2 and only optimally planned in Hive 3.x).
But SQL engines have gotten much better over the years & most people should write it for readability than for execution order, but the readability is really why I like the ON syntax, particularly while chop-debugging a big query written by some poor guy who was translating COBOL into SQL.
And at least for the database we use at work, if the sole reason for a join is to reduce the data, prefer EXISTS.
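The reason, illustrated with a throwaway SQLite example: an inner join used purely as a filter multiplies rows (one per match), while EXISTS returns each outer row at most once:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY);
    INSERT INTO customers VALUES (1), (2);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, cust_id INTEGER);
    INSERT INTO orders VALUES (100, 1), (101, 1);  -- customer 1 has 2 orders
""")

joined = conn.execute("""
    SELECT c.id FROM customers c JOIN orders o ON o.cust_id = c.id
""").fetchall()

exists = conn.execute("""
    SELECT c.id FROM customers c
    WHERE EXISTS (SELECT 1 FROM orders o WHERE o.cust_id = c.id)
""").fetchall()

print(joined)  # [(1,), (1,)] -- customer 1 duplicated, one row per order
print(exists)  # [(1,)]
```

You can get the same effect from the join with SELECT DISTINCT, but EXISTS states the intent directly and lets the engine stop at the first match.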
Yes absolutely, and not just for performance benefits. It's much easier to track what, how, and why you're joining to something when it's not jumbled together in a list of a dozen conditions in the WHERE clause.
I can't tell you how much bad data I've had to fix because when I break apart the implicit conditions into explicit joins it is absolutely not doing what the original author intended and it would have been obvious with an explicit join.
And then in the explicit join, always be explicit about the join type. Don't just use JOIN when you want an INNER JOIN; otherwise I have to wonder if the author accidentally left something off.
I've seen a few websites, but I don't know which ones to use. Or maybe there is a dataset with practice question I could download?
Edit: I found https://pgexercises.com and it's been fantastic so far. Much more responsive than other sites, clear questions, and free.
Not having to fight assumptions along lines of improperly denormalized columns (i.e. which table is source of truth for a specific fact) can auto-magically simplify a lot of really horrible joins and other SQL hack-arounds that otherwise wouldn't be necessary. The essential vs accidental complexity battle begins right here with domain modeling.
You should be seeking something around 3rd normal form when developing a SQL schema for any arbitrary problem domain. Worry about performance after it's actually slow. A business expert who understands basic SQL should be able to look at and understand what every fact & relation table in your schema are for. They might even be able to help confirm the correctness of business logic throughout or even author some of it themselves. SQL can be an extremely powerful contract between the technology wizards and the business people.
More along lines of the original topic - I would strongly advocate for views in cases where repetitive, complex queries are being made throughout the application. These serve as single points of reference for a particular projection of facts and can dramatically simplify downstream queries.
Here's a better general performance tuning handbook - https://use-the-index-luke.com/
Use-the-index-luke is an altogether deeper, more technical article aimed at data engineers and going into the details and differences between databases.
I disagree. Professional developers should know their database of choice inside and out, and use-the-index-luke helps with that. You can skip the details about databases that aren't relevant to you.
Also, a few things are dead wrong: "make the haystack small" is optimization (it should be at the end, as the first rule says), "prefer UNION ALL to UNION" is missing context (a good dev knows what is needed, not what to prefer), and the usage of CTEs is nice, but sometimes slower than other options - and in SQL, slower can easily be orders of magnitude, so nice is not enough. Same for 'avoid sorting where possible, especially in subqueries' or "use composite indexes" (really? that's a basic thing, not a best practice).
In the past few months I interviewed and hired several DBAs, this list is ok-ish for a junior but a fail for a senior. I am not working for FAANG, so the bar is pretty low, this article would not even pass for a junior there.
CTEs were for a very long time an optimization fence in PostgreSQL, were not inlined and behaved more like temporary materialized views.
Only with the release of PostgreSQL 12 did some CTE inlining start happening - with limitations: not recursive, no side effects, and referenced only once in a later part of the query.
More info: https://hakibenita.com/be-careful-with-cte-in-postgre-sql
I would love to know, so that I can know what optimizations and WHERE / JOIN conditions I should really be careful about making more efficient, versus others that I don't have to worry because the optimizer will take care of it.
For example, if I'm joining 2 long tables together, should I be very careful to create 2 subtables with restrictive WHERE conditions first, so that it doesn't try to join the whole thing, or is the optimizer taking care of that if lump that query all into one entire join and only WHERE it afterwards? How do you tell what columns are indexed and inexpensive to query frequently, and which are not? Is it better to avoid joining on floating point value BETWEEN conditions?
And other questions like this.
Databases have ways to query the schema, including the index definitions, so you can know which columns are indexed (and the order of the columns in those indexes).
Unless you materialize a temporary table or materialized view or use a CTE with a planner that doesn't look inside CTEs, the planner will just "inline" your subqueries (what are "subtables"?) and it will not affect the way the join is performed.
Join on floating point value is quite rare. Why do you need to do that?
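To make the schema-querying point concrete: in SQLite the index catalog is exposed through PRAGMAs (Postgres has the pg_indexes view instead); a small sketch with a made-up table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (a INTEGER, b INTEGER, c INTEGER);
    CREATE INDEX idx_t_a_b ON t(a, b);
""")

indexes = {}
for (_, name, *_rest) in conn.execute("PRAGMA index_list('t')"):
    # index_info rows are (position, column_id, column_name).
    indexes[name] = [r[2] for r in conn.execute(f"PRAGMA index_info('{name}')")]

# Column order matters: an index on (a, b) can serve lookups on a alone,
# but not lookups on b alone.
print(indexes)
```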
Ah, thanks for noticing this. They are, for example, (1) tables of timestamped events, and (2) tables of time ranges with which those events need to be associated (but which unfortunately were not created with that in mind at the time)...
So for example FROM tableA LEFT JOIN tableB ON (timestampA BETWEEN timestampB1 AND timestampB2)
(and where the timestamps can be either floating point or integer nanoseconds)
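That kind of range association can be sketched like this (SQLite via Python, integer timestamps, made-up tables); an index on the range-start column at least gives the engine something to seek on:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (id INTEGER PRIMARY KEY, ts INTEGER);
    INSERT INTO events VALUES (1, 5), (2, 15), (3, 25);
    CREATE TABLE ranges (label TEXT, t1 INTEGER, t2 INTEGER);
    INSERT INTO ranges VALUES ('early', 0, 9), ('late', 10, 19);
    CREATE INDEX idx_ranges_t1 ON ranges(t1);
""")

rows = conn.execute("""
    SELECT e.id, r.label
    FROM events e
    LEFT JOIN ranges r ON e.ts BETWEEN r.t1 AND r.t2
    ORDER BY e.id
""").fetchall()

print(rows)  # [(1, 'early'), (2, 'late'), (3, None)] -- event 3 matches no range
```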
One of the more important parts is simply understanding which indexes can be used in a query. The other part is understanding when the database will intentionally not use an index, this is mostly related to column statistics. The basics of indexes are pretty simple, but then there is a whole bunch of subtle details that can mean the index can't actually be used for your query.
Another useful part to understand is how much IO a query requires, EXPLAIN (ANALYZE, BUFFERS) is helpful for that. But you also need to understand a bit the layout Postgres uses to store data, how it is stored in pages, TOAST and related stuff.
For Postgres I'd really start with reading the manual on index types and on the statistics collector. After that I'd just play with explain analyze for queries you're writing.
The order of JOINs is optimized automatically in Postgres, but only up to a point; for a large number of joins it has to fall back to heuristics.
No database can find perfect join order when you have more than about 8 to 10 tables in the join.
10 or so years ago when SQL Server, Oracle & MySQL dominated the industry, you could talk about SQL optimization with the expectation that all advice was good advice. There are too many flavors of databases to do that today.
I despise table aliases and usually remove them from queries. To me, they add a level of abstraction that obscures the purpose of the query. They're usually meaningless strings generated automatically by the tools used by data analysts who rarely inspect the underlying SQL for readability. I fully agree that you should reference columns explicitly with the table name, which I think is the real point they're trying to make in the article.
While it's true that sorting is expensive, the downstream benefits can be huge. The ability to easily diff sorted result sets helps with troubleshooting and can also save significant storage space whenever the results are archived.
Short aliases - I tend to use the first letter of each word in the table name - work best, IMO.
In code, I like the length of the variable name to be proportional to the size of the scope. Small scope -- short variable names.
I feel it would be rather noisy to have to specify such table names in front of the 15+ column references in a query, compared to using aliases.
Then again I've never had to diff the result sets, so I guess our usage is quite different.
SELECT
title,
last_name,
first_name
FROM books
LEFT JOIN authors
ON books.author_id = authors.id
> Prefer SELECT
b.title,
a.last_name,
a.first_name
FROM books AS b
LEFT JOIN authors AS a
ON b.author_id = a.id
Couldn't disagree more. One-letter abbreviations hurt readability IMO.

SELECT t1.thing
, t2.stuff
, t3.stuffagain
, t4.more
FROM SomeTable t1
JOIN TableThatLinksToSelf t2 ON <join predicate>
JOIN TableThatLinksToSelf t3 ON <join predicate>
JOIN AnotherTable t4 ON <join predicate>
That is not the point that is being made here. The point is that explicitly naming tables is beneficial to understanding and reduces issues later. A short alias is preferable to not specifying column sources at all.

I prefer descriptive table and other object names, and abbreviate them in aliases within queries (though usually not to single letters).
Sure it does not help to understand the origins of a given field without aliases, unless someone is very familiar with the schema.
If the tables names are long, by all means abbreviate them a little, but never just use 1 letter aliases.
I wonder if the author has ever worked with a system that has more than a handful of tables.
SELECT
bo.title,
au.last_name,
au.first_name
FROM books AS bo
LEFT JOIN authors AS au ON
bo.author_id = au.id

In simple cases that may be just the singular of the table name, e.g.:
SELECT
book.title,
author.last_name,
author.first_name
FROM books AS book
LEFT JOIN authors AS author ON
book.author_id = author.id
But in other cases, it will be different, e.g.:

SELECT
manager.last_name || ', ' || manager.first_name AS manager_name,
coalesce(employee.title, 'All Titles') AS staff_title,
count(employee.id) AS count
FROM employees AS manager
LEFT JOIN employees AS employee ON
manager.id = employee.manager_id
GROUP BY manager.id, ROLLUP(employee.title)
HAVING employee.title IS NOT NULL
OR GROUPING(employee.title) = 1

Unless this is specific to certain databases, LIKE can take advantage of indexes too; without wildcards, LIKE should be nearly identical in performance to =, both seeking the index.
>Using wildcards for searching can be expensive. Prefer adding wildcards to the end of strings. Prefixing a string with a wildcard can lead to a full table scan.
Which contradicts the first quote; it seems you recognize that a wildcard at the end can take advantage of an index. A full table scan is the same thing as not taking advantage of an index, hence LIKE can take advantage of normal indexes so long as there are characters before the first wildcard, or there are no wildcards at all.
At least for the latest versions of every database. If you go back to a version from 10+ years ago there's no guarantees.
It can only do a seek if there are characters before the wildcard: 'ab%c', 'abc%' and 'abc' get progressively faster due to fewer index entries traversed.
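SQLite makes this observable (a sketch via Python with an invented table; note that SQLite additionally requires case-sensitive LIKE, or a NOCASE index, before it applies its LIKE optimization):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA case_sensitive_like = ON")  # required for SQLite's LIKE optimization
conn.executescript("""
    CREATE TABLE t (name TEXT);
    CREATE INDEX idx_t_name ON t(name);
""")

def plan(sql):
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Prefix pattern: rewritten internally as a range, so the index can be seeked.
prefix  = plan("SELECT name FROM t WHERE name LIKE 'abc%'")
# Leading wildcard: no usable lower bound, every entry must be examined.
leading = plan("SELECT name FROM t WHERE name LIKE '%abc'")

print(prefix)   # a SEARCH step
print(leading)  # a SCAN step
```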
https://en.wikipedia.org/wiki/FLWOR
SELECT first_name FROM person WHERE first_name LIKE 'john'
becomes:
FROM person WHERE first_name LIKE 'john' SELECT first_name
SQL reads more English-like, while from-first is more Yoda-speak, but the auto-complete is worth more to me.
Don’t most databases figure this out as part of the query planner anyway? Postgres has no problems using indexes for joins inside WHERE.
My guess is the author heard something about not using implicit inner joins (deprecated decades ago) and misunderstood.
E.g. this old syntax: SELECT * FROM a, b WHERE a.id = b.a_id
I wonder what the results would be if I ran the queries from this article through that tool.
SELECT foo
FROM bar
WHERE TRUE
AND baz > boom
For OR conditions it's a bit different: SELECT foo
FROM bar
WHERE FALSE
OR baz > boom

If you're already writing:
WHERE foo=bar
AND biz=baz
It's not clear to me how:

WHERE TRUE
AND foo=bar
AND biz=baz
is worse.

SELECT foo
FROM bar
WHERE 1=1
AND (1<>1
OR baz > boom
OR fizz >= bang
)
AND foo is not null
So you can comment out lines starting with OR individually. Some people might hate it, but it makes sense conceptually for me since almost every query I write takes a chain of ANDs in the where clause as a starting point.

For OR, I like to keep the "1=1" and do
AND (1=2
OR ...
)

"a = 'foo'" is exactly the same performance as "a like 'foo'" and very close to the performance of "a like 'foo%'", and is fully indexed. When you put a wildcard in the front, the entire index is avoided, so you gotta switch to full text search.