How We Built a Cost-Based SQL Optimizer (opens in new tab)

(cockroachlabs.com)

198 pointsandydb7y ago52 comments

52 comments

43 comments · 13 top-level

jzelinskie7y ago· 9 in thread

I tend to dislike cost-based optimizers because they add a layer of uncertainty when you're attempting to optimize database queries. I know what I want the optimizer to do. The problem is that when I run test queries against a local database or even a staging database, the statistics used to calculate costs differ. This means production can do something totally undesirable.

andydbOP7y ago

Yes, that's a real problem that DBAs routinely grapple with. To address the problem, RDBMS vendors typically support hints and "plan management" features that allow DBAs to "pin" a particular plan once they've determined it's best. However, this usually requires a knowledgeable DBA; many developers would prefer that the database automatically chooses the plan, even if it changes from time to time. CockroachDB will definitely want to support both kinds of users.

mamcx7y ago

I think is best if it was a per-session hint, like:

SET PLANNER TO COST

or similar, and default to a more "static" one.

jzelinskie7y ago

Yeah, the problem is that basically none of this is portable, though. So if you're attempting to support both MySQL variants and Postgres variants, it gets nasty fast.

1 more reply

tkyjonathan7y ago

Hear, hear

wolf550e7y ago

Some databases let you just set the stats numbers to what you think they are and pin them, so changes to data won't change the stats and the optimizer will always use your stats and not stats calculated from the data. You'll get the "prod" execution plans on an empty database!

bobbydreamer7y ago

You can do that in Db2 z/os. In our shop we do it daily. In prod stats are up to date, we have job that would get all the stats from prod required for accesspaths and apply it on Dev empty tables modelling production. So Dev team when they bind packages which would use these stats to generate accesspaths 99.99% of the time same accesspaths would be generated in Dev and we know how the program would perform. Also for some tables we set pre defined stats, if that table a had real stats, oboy.

mars4rp7y ago

Sql server let you to import meta data from prod to dev. And you will get reliable execution plans.

grumpydba7y ago

> I know what I want the optimizer to do

For a cute little select statement, it might be true. However it's just not manageable most of the time.

jcriddle47y ago

Also Cost-Based can do some really dumb things if a table's contents change by quite a bit before stats are run again. I like being able to give hints so that you can switch between Cost or hierarchy based queries.

edmundhuber7y ago· 5 in thread

Great article, and it sounds like it took a pretty big leap of faith to go from heuristics to cost-based optimization. If this was addressed in the article I apologize: do you only ever select transformations that yield equivalent results, with immediately lower cost, or do you also explore transformations that incur in immediate cost increase, but then later pay off? i.e. are you doing simple hill climbing, or something more interesting?

andydbOP7y ago

Good question, and something I didn't have the space to cover in this article. The answer involves "physical properties". Physical properties are interesting characteristics of an expression that impact its layout, presentation, or location, but not its logical content. Examples include row order, column naming, and data distribution (physical location of data ranges in the cluster).

As an example, consider a memo group that contains both a MergeJoin and a HashJoin operator. The MergeJoin "requires" its inputs to be sorted on the same key. If an input is not already sorted, then we insert a "sort enforcer" operator that will provide the required ordering. Of course, that increases the cost of producing the input, which the optimizer must take into account.

So, getting back to your question, for any given memo group, we always select the equivalent plan with the lowest cost, for a given set of physical properties. However, that has the effect of sometimes selecting a memo group operator that has a higher cost, because it "pays off" higher up the operator tree. Back to my example, say the SQL query is something like this:

  SELECT * FROM a, b WHERE a.name=b.name ORDER BY a.name

The lowest cost plan might use a MergeJoin, because its output is already naturally sorted on "name". However, the input to the MergeJoin might be higher-cost than logical equivalents, because it must be sorted. Therefore, we have the situation you alluded to, where we choose a higher-cost subtree because it minimizes the cost of the larger tree that contains it.

SaifAlHarthi7y ago

Great article!. While reading it I see some similarities with "Plan Stitching" from MSR as shown here https://www.microsoft.com/en-us/research/uploads/prod/2018/0.... It is interesting to see what the performance differences between the two. Also, I am curious of why build a query optimizer from scratch when you can adopt or port something like Orca?

1 more reply

setr7y ago

>it sounds like it took a pretty big leap of faith to go from heuristics to cost-based optimization

Im pretty sure cost-based optimization is common enough and practically proven effective enough that its not really fair to call it a “leap of faith”; eg im pretty sure mysql and postgres both do it (and you can it see it in their EXPLAINs), and I imagine most commercial dbs do as well

Its more of just a natural progression as the db gets improved

Its a good article/work nonetheless, but nothing out of the ordinary in db implementation afaik

andydbOP7y ago

Postgres and MySQL both do calculate costs. One difference is that they are not based around the Memo data structure (someone correct me if I'm wrong about that!), which gives an extra measure of flexibility, power, and elegance to solving the problem of SQL optimization. The Memo structure was introduced by Goetz Graefe in his seminal papers on top-down, rule-based query optimization (Exodus, Volcano, Cascades), which are the chief inspiration for CockroachDB's new optimizer. In the late 90's, Goetz joined Microsoft, and assisted the SQL Server team in rewriting their optimizer from scratch based on the techniques he had pioneered. I believe this foundation is a big reason why the SQL Server optimizer has become one of the best (if not the best) in the world. We are indeed building on the shoulders of giants.

1 more reply

Tostino7y ago

Yes, Postgres, MySQL, MSSQL, Oracle, and i'm sure DB2 all use a cost based optimizer.

+1 on the sentiment, good to see them continuing to improve, they have lots of competition to look at for examples of what to do / not to do. Progress is good none the less.

shoo7y ago· 3 in thread

Maybe of interest - postgres planner documentation

https://www.postgresql.org/docs/current/planner-optimizer.ht...

Genetic query optimiser: https://www.postgresql.org/docs/current/geqo.html

postgres has a number of hooks, which can be used to override the default behaviour. In particular, there are a bunch of hooks that can be used to install a custom query planner. https://github.com/AmatanHead/psql-hooks/blob/master/Readme....

more generally, ignoring database query optimisation specifically, if you are interested in learning about discrete optimisation techniques, I recommend this course: https://www.coursera.org/learn/discrete-optimization

timClicks7y ago

Also of interest perhaps is a podcast episode from Software Engineering Radio with Bruce Momjian, who maintains the PostgreSQL query planner: http://www.se-radio.net/2018/06/se-radio-episode-328-bruce-m...

anarazel7y ago

He does not maintain the postgresql planner. He's a PostgreSQL developer, but he has not focused on the planner for at least a decade.

Edit: Grammar.

Tostino7y ago

Just listened to the podcast, really good listen which explains the basics really well. Highly recommend.

Thanks for the link. Wasn't aware of this podcast before.

ryanworl7y ago· 3 in thread

How do you manage statistics in the system catalog in CockroachDB?

radub7y ago

[Cockroach Labs engineer here, I worked on statistics]

All table statistics are stored in a system table (system.table_statistics). CREATE STATISTICS scans the table, calculating statistics on a column, and adds a new row to system.table_statistics.

We still have a ways to go in making statistics automatic (right now it requires a manual call to CREATE STATISTICS) and more useful to the optimizer (which currently only uses basic stats). We're working on that right now for the 2.2 release.

SaifAlHarthi7y ago

What if the tables were partitioned, do you calculate local statistics or you aggregate the stats and expose it to the optimizer?

Lemaxoxo7y ago

Are you planning to support multivariate statistics?

1 more reply

atombender7y ago· 2 in thread

I've been looking for books that cover database implementation techniques such as this — the "memo" algorithm was new to me, but I'm also generally interested in query planning and optimization. Anyone got any recommendations? I started a thread: https://news.ycombinator.com/item?id=18410692.

andydbOP7y ago

This is not a book, but we did write up some documentation for developers that gives a nice overview of how the optimizer works:

https://github.com/cockroachdb/cockroach/blob/master/pkg/sql...

Might be worth a read if you're interested in implementation techniques.

manigandham7y ago

Posted in that thread: https://news.ycombinator.com/item?id=18411110

TLDR: check out CMU DB group at https://db.cs.cmu.edu/

radiowave7y ago· 2 in thread

Excellent write-up. Thanks.

Has any thought been given to whether this query planner could be adapted (much further down the road, I'd guess) to support dynamic replanning? (That is, altering the plan mid-query, if it should be found that the row-count estimates were way off.)

andydbOP7y ago

We've talked about ideas along those lines. However, rather than altering the plan mid-query, we'd be more likely to have some kind of feedback mechanism that corrects estimates for the benefit of future plans.

Another commenter posted an interesting paper that's related to your question:

https://www.microsoft.com/en-us/research/uploads/prod/2018/0...

radiowave7y ago

Thanks, I'll check that out.

georgewfraser7y ago· 2 in thread

Is the cost based planner relevant to OLTP-style queries, where everything is an indexed lookup, or only to OLAP-style queries that involve scans?

manigandham7y ago

Sure, there are plenty of OLTP queries involving joins and complex queries that can have severe performance impacts from bad query plans.

georgewfraser7y ago

I’m struggling to think of an example of an OLTP query where simple heuristics don’t work but I’m not an expert, can you give an example?

1 more reply

kunthar7y ago· 2 in thread

is still no avail to use sqlalchemy within your cockroach? ok go ahead with plans.

manigandham7y ago

CockroachDB uses the PostgreSQL protocol so you can reuse any Postgres drivers and tooling to connect.

They have an entire section in their documentation for Python and SQLAlchemy: https://www.cockroachlabs.com/docs/stable/build-a-python-app...

poooogles7y ago

sqlalchemy has a cockroach dialect.

cy6erlion7y ago· 1 in thread

This is like studying chess, the approach to studying the complexities that arise from the number of routes a query can be executed is like how one will study an opening, also the tree data structure mentioned is reminiscent to chess, statistics also hence studying GM games. I feel like someone with a good understanding of chess will enjoy such a project. Anyways great post and breakdown.

andydbOP7y ago

I think you're right; I'm a chess player, and working on the optimizer does hit some of the same neural pathways.

dzidol7y ago· 1 in thread

Shouldn't "unless both inputs are sorted on the same key" be "unless both inputs are sorted on the join key"? And similarly later.

andydbOP7y ago

Yes, that would be clearer language. I'll look into updating the text.

DreamSpinner7y ago

People generally interested in this topic might also find the CMU Intro to DB Systems youtube of interest.

Note that it's an introduction to building database systems (not just using them).

https://www.youtube.com/playlist?list=PLSE8ODhjZXja3hgmuwhf8...

millrawr7y ago

> outside expert on database optimizers run a months-long bootcamp, with the goal of teaching our developers how state-of-the-art optimizers work, complete with homework assignments to read seminal papers on the subject

Any chance you could share more about this? This general area of how to build a production grade SQL optimizer seems to be a thing that's more scattered in tiny pieces across a wide number of papers, or held in peoples' heads, than aggregated in a teaching manner. It seemed that the realistic general advice on how to build a SQL optimizer was to poach someone from the SQL Server team. ;)

I've generally just gone referring back to the unfinished Building Query Compilers[1] when pondering this subject. Not that the hundreds of references don't provide sufficient material to read though as well, but it'd be interesting to hear what a bootcamp like this presented as state of the art.

[1]: http://pi3.informatik.uni-mannheim.de/~moer/querycompiler.pd...

altitudinous7y ago

Ah, my past life and knowledge developing Oracle systems comes flooding back.

I don't know if Oracle still uses CBO now, or even SQL or PL/SQL but I am sure that a large chunk of their revenue still comes from supporting these legacy systems.

j / k navigate · click thread line to collapse

52 comments

43 comments · 13 top-level

jzelinskie7y ago· 9 in thread

andydbOP7y ago

mamcx7y ago

I think is best if it was a per-session hint, like:

SET PLANNER TO COST

or similar, and default to a more "static" one.

jzelinskie7y ago

Yeah, the problem is that basically none of this is portable, though. So if you're attempting to support both MySQL variants and Postgres variants, it gets nasty fast.

1 more reply

tkyjonathan7y ago

Hear, hear

wolf550e7y ago

bobbydreamer7y ago

mars4rp7y ago

Sql server let you to import meta data from prod to dev. And you will get reliable execution plans.

grumpydba7y ago

> I know what I want the optimizer to do

For a cute little select statement, it might be true. However it's just not manageable most of the time.

jcriddle47y ago

edmundhuber7y ago· 5 in thread

andydbOP7y ago

  SELECT * FROM a, b WHERE a.name=b.name ORDER BY a.name

SaifAlHarthi7y ago

1 more reply

setr7y ago

>it sounds like it took a pretty big leap of faith to go from heuristics to cost-based optimization

Its more of just a natural progression as the db gets improved

Its a good article/work nonetheless, but nothing out of the ordinary in db implementation afaik

andydbOP7y ago

1 more reply

Tostino7y ago

Yes, Postgres, MySQL, MSSQL, Oracle, and i'm sure DB2 all use a cost based optimizer.

+1 on the sentiment, good to see them continuing to improve, they have lots of competition to look at for examples of what to do / not to do. Progress is good none the less.

shoo7y ago· 3 in thread

Maybe of interest - postgres planner documentation

https://www.postgresql.org/docs/current/planner-optimizer.ht...

Genetic query optimiser: https://www.postgresql.org/docs/current/geqo.html

timClicks7y ago

anarazel7y ago

He does not maintain the postgresql planner. He's a PostgreSQL developer, but he has not focused on the planner for at least a decade.

Edit: Grammar.

Tostino7y ago

Just listened to the podcast, really good listen which explains the basics really well. Highly recommend.

Thanks for the link. Wasn't aware of this podcast before.

ryanworl7y ago· 3 in thread

How do you manage statistics in the system catalog in CockroachDB?

radub7y ago

[Cockroach Labs engineer here, I worked on statistics]

All table statistics are stored in a system table (system.table_statistics). CREATE STATISTICS scans the table, calculating statistics on a column, and adds a new row to system.table_statistics.

SaifAlHarthi7y ago

What if the tables were partitioned, do you calculate local statistics or you aggregate the stats and expose it to the optimizer?

Lemaxoxo7y ago

Are you planning to support multivariate statistics?

1 more reply

atombender7y ago· 2 in thread

andydbOP7y ago

This is not a book, but we did write up some documentation for developers that gives a nice overview of how the optimizer works:

https://github.com/cockroachdb/cockroach/blob/master/pkg/sql...

Might be worth a read if you're interested in implementation techniques.

manigandham7y ago

Posted in that thread: https://news.ycombinator.com/item?id=18411110

TLDR: check out CMU DB group at https://db.cs.cmu.edu/

radiowave7y ago· 2 in thread

Excellent write-up. Thanks.

andydbOP7y ago

Another commenter posted an interesting paper that's related to your question:

https://www.microsoft.com/en-us/research/uploads/prod/2018/0...

radiowave7y ago

Thanks, I'll check that out.

georgewfraser7y ago· 2 in thread

Is the cost based planner relevant to OLTP-style queries, where everything is an indexed lookup, or only to OLAP-style queries that involve scans?

manigandham7y ago

Sure, there are plenty of OLTP queries involving joins and complex queries that can have severe performance impacts from bad query plans.

georgewfraser7y ago

I’m struggling to think of an example of an OLTP query where simple heuristics don’t work but I’m not an expert, can you give an example?

1 more reply

kunthar7y ago· 2 in thread

is still no avail to use sqlalchemy within your cockroach? ok go ahead with plans.

manigandham7y ago

CockroachDB uses the PostgreSQL protocol so you can reuse any Postgres drivers and tooling to connect.

They have an entire section in their documentation for Python and SQLAlchemy: https://www.cockroachlabs.com/docs/stable/build-a-python-app...

poooogles7y ago

sqlalchemy has a cockroach dialect.

cy6erlion7y ago· 1 in thread

andydbOP7y ago

I think you're right; I'm a chess player, and working on the optimizer does hit some of the same neural pathways.

dzidol7y ago· 1 in thread

Shouldn't "unless both inputs are sorted on the same key" be "unless both inputs are sorted on the join key"? And similarly later.

andydbOP7y ago

Yes, that would be clearer language. I'll look into updating the text.

DreamSpinner7y ago

People generally interested in this topic might also find the CMU Intro to DB Systems youtube of interest.

Note that it's an introduction to building database systems (not just using them).

https://www.youtube.com/playlist?list=PLSE8ODhjZXja3hgmuwhf8...

millrawr7y ago

[1]: http://pi3.informatik.uni-mannheim.de/~moer/querycompiler.pd...

altitudinous7y ago

Ah, my past life and knowledge developing Oracle systems comes flooding back.

I don't know if Oracle still uses CBO now, or even SQL or PL/SQL but I am sure that a large chunk of their revenue still comes from supporting these legacy systems.

j / k navigate · click thread line to collapse