Showdown: MySQL 8 vs. PostgreSQL 10 (opens in new tab)

(blog.dumper.io)

101 pointskenn8y ago97 comments

97 comments

59 comments · 13 top-level

foxylion8y ago· 26 in thread

The article contains a footnote about UUIDs as primary keys.

> UUID as a primary key is a terrible idea, by the way — cryptographic randomness is utterly designed to kill locality of reference, hence the performance penalty

Is there anyone who can go a little bit more in detail?

We planned to migrate our database to use UUIDs as primary keys. This will allow creating new rows on clients knowing the new primary key before sending them to the server (simplifying client and server code).

default-kramer8y ago

(The author is assuming that the primary key controls disk layout, which is usually true.) One advantage of using an incrementing integer is that rows will be ordered on disk based on when they were created. This often helps performance. If a query asks for 25 consecutive rows, there is a good chance they will all be on the same page. If you use UUIDs, then they could be on 25 different pages and you will have to do 25x the disk IO to handle the query.

da_chicken8y ago

> One advantage of using an incrementing integer is that rows will be ordered on disk based on when they were created.

Well, kind of. A lot of people think the auto incrementing integer function in many RDBMSs will always increase, or will never have gaps. It's likely but not guaranteed that n+k was created after n. If you really need to store the creation date, then you should store that in a datetime/timestamp column.

> If a query asks for 25 consecutive rows, there is a good chance they will all be on the same page. If you use UUIDs, then they could be on 25 different pages and you will have to do 25x the disk IO to handle the query.

This is true, but it also means that if you need to write 25 different rows, it will be in 25 different pages. That sounds bad because non-sequential writes are slower, but you have to remember that it could be 25 different connections trying to write! In other words, you create a hot spot with sequential inserts. If that's the end of the table, you'll have threads constantly waiting for other processes to do inserts since inserts lock the page being inserted.

So, yes, clustering on a UUID can cause problems (fragmented indexes, inefficient reads), but clustering on an autoincrement can also cause issues depending on your work load.

In reality, what you need to do (in the general case) is cluster on your business key even if it's not the primary key for your table.

1 more reply

zzzcpan8y ago

> One advantage of using an incrementing integer is that rows will be ordered on disk based on when they were created

If each identifier starts with a logical time, say lamport timestamp, you can still get the same ordering effect without incrementing integers in a centralized place somewhere.

2 more replies

masklinn8y ago

> Is there anyone who can go a little bit more in detail?

* UUIDs are way more painful than serials to recognise, remember, input or transmit especially if you're not dealing with huge tables. "18574" is easy to read/grok, "21caeffa-0fca-4f4e-b845-46ef0576e42a" is not.

* UUID are 128 bit instead of 32 for most serial PKs by default, this may or may not matter. Note that this doesn't just impact the table itself (lowering data locality and thus performanes: less stuff fits into caches) but also any FK as well as wire transmission (where the size explodes as you're going to transmit a hex version of the UUID so always at least 32 bytes, a decimalised u32 maxes out at 10 bytes).

* because UUID are random data they're intrinsically non-ordered (as opposed to serial ids which… are), this means your writes are all over the place and clustering is defeated, this adversely impacts your reads and writes in SQL dbs (some other DB techs especially distributed ones prefer the scattering: https://news.ycombinator.com/item?id=14524174)

The latter can he mitigated by using "ordered UUIDs": you can generate UUID1 (nominally time-based) such that the final value has a sequential "head" and a random "tail", either by taking control over the process or by generating a regular UUID1 and rejiggering it a bit: https://stackoverflow.com/questions/412341/how-should-i-stor...

scott1133418y ago

> UUIDs are way more painful than serials to recognise, remember, input or transmit...

Completely agree, but I've found this can be a non-technical "feature" too. Serial integer primary keys are much more susceptible to human error when doing any sort of direct database manipulation.

Make a typo on a integer PK? Wrong user gets deleted. UUID typo? Row not found (almost certainly).

Another source of error I've seen is when someone in sales asks "Hey, can you remove User #1234?", but they really meant Customer #1234." With UUIDs, there's no "collision" between the tables.

Clearly there are better process/tool-based ways to prevent these types of mistakes, but it's a useful side effect of UUIDs.

1 more reply

netcraft8y ago

> UUIDs are way more painful than serials to recognise, remember, input or transmit especially if you're not dealing with huge tables. "18574" is easy to read/grok, "21caeffa-0fca-4f4e-b845-46ef0576e42a" is not.

One of the reasons we use them is because theyre not easily recognized or sequential.

RedCrowbar8y ago

> The latter can he mitigated by using "ordered UUIDs": you can generate UUID1 (nominally time-based) such that the final value has a sequential "head" and a random "tail"

Specifically, in PostgreSQL such UUIDs can be generated using uuid_generate_v1mc()

1 more reply

doh8y ago

I would suggest to write your own UUID generator. Instagram published their own approach [0], we are using something very similar to it. It seems to originate at Facebook, as the IDs generated from both instagram and FB are very similar. It can be completely automated in Postgres (probably in Mysql too).

On the plus side, because it's timestamp based, you can use the generated IDs in sorting and paging as there is a guarantee that each passing second will yield larger bigint. One caveat. If you are going to use it in Javascript, make sure you send it as string as Javascript only supports 53bit integers (due the fact that all integers in Javascript are floating points).

[0] https://instagram-engineering.com/sharding-ids-at-instagram-...

th3sly8y ago

Or you can shuffle it and convert to binary using uuid_to_bin like in this example uuid_to_bin(uuid(), true): https://dev.mysql.com/doc/refman/8.0/en/miscellaneous-functi...

simonw8y ago

The one time I used UUIDs as primary keys I quickly regretted it. They're huge, they're not possible to manually enter ("which row ID was causing that bug?") and they make foreign key relationships ugly and large too.

If you want to generate IDs independently of the database you can do so using a "ID generator" mechanism.

Set up three redis instances (or MySQL or anything else that can increment a counter). Have each one increment by three each time. Start then at 0, 1 and 2.

Now you can ask any of those theee instances for a new ID and you'll get one that has not been used before, thanks to them being offset from each other.

I first saw this technique used by Flickr when they switched to a Shaffer database.

gldalmaso8y ago

I believe one of the sought after features when people want to generate ids outside the database is for offline apps, so there is no way to connect to outside counters. The client should have a way to locally generate an id that is final and will be used for the rest of the entity's life. This simplifies the synchronization with the backend because otherwise there must be a (temporary) client id and a final true id. This leads to very convoluted code that deals with two logically different entities, the "pending" ones and the "synced" ones.

parvenu748y ago

UUID (or GUID on SqlServer) are essentially 128-bit integers. The fact they don't have inherent meaning beyond uniquely identifying a row and establishing referential integrity is a plus. In my opinion the primary key is for the database's needs; a human-friendly number or code for lookup is both optional and trivial to implement so when there's a requirement for an easy reference number or code for a record I use both a GUID for the primary key and a unique (or autonumber) field for human reference.

firebacon8y ago

Is "Shaffer" a typo of "sharded"? If not I'm curious what the term means (all that google turns up are a bunch of papers by a guy with that name)?

mbroshi8y ago

There are functional benefits for having UUIDs as a primary key, but yes there are performance impacts on writes and ORDER BY. The best way to find out how it will impact your application is to have performance tests in place, and test out the primary key change in a development environment. I do not think you'll able to determine the impact on performance/scalability based on "pure thought".

parvenu748y ago

Why would you want to ORDER BY on a UUID field? Not trolling, I honestly can't think of a reason why you would want to do this. Secondly, aren't UUIDs treated by the database engine as a 128-bit integer? If they are being treated as varchar fields then I can see how this would affect performance negatively but again, I question if this is really the case.

1 more reply

pletnes8y ago

Why not generate an UUID field that’s unique, but keep a surrogate integer primary key on the server? You can join on either one on the server, but keep track of the rows using the UUID. Asking as someone with no SQL expertise (relative to the HN crowd).

cosmie8y ago

That's generally the route I've gone in the past, particularly for complex data models. The entity has a uuid field with a unique constraint along with an auto-incrementing int/bigint for a primary key. The UUID is what gets exposed for any external/API usage, but all internal join logic leverages the int primary key.

That way you mitigate the potential performance impact of doing complex joins on UUID fields. While also gaining a bit of flexibility in any future change management process by decoupling your internal database ids from any publicly exposed id. So instead of e.g. having to coordinate with external applications to ensure things don't break when you switch your id from an int to a bigint, you keep the uuid consistent and internal database optimizations and changes stay transparent to stakeholders of the database.

nvivo8y ago

I use UUIDs in mysql as primary keys in tables with a few million records, and I'm about to migrate a few tables with billions of records from bigint to uuid.

Never saw any real problem. I use char(36), as it's easier to query manually when needed, but I' looking into binary(16) for those billion row tables.

I think most issues with fragmented tables are old problems since ssd, and the overhead is something you will only notice in benchmarks.

If your keys ate supposed to be uuids, the just use them and get the hardware to handle it. In reality, you're most likely to be affected by a zillion other things before an uuid as pk.

da_chicken8y ago

> I use char(36), as it's easier to query manually when needed, but I' looking into binary(16) for those billion row tables.

I would use whatever data type your RBDMS's UUID generator returns or the programming language your application is written in. If your RDBMS supports a UUID or GUID data type, however, I would 100% use that because you'll invariably have functions which help you deal with it.

Remember, however, that many (most?) RDBMSs store records in pages (or blocks) of a fixed size typically between 4KB or 8 KB, and they won't allow a record to span a page (usually when a record is too long for one page, non-key data will be moved to non-paged storage which is slower). In other words, if you reduce your record size by 20 bytes you might not actually see as big a change as you'd expect. You'd be storing less data per record, but you're maybe not changing the records per page. You're not increasing the efficiency of your data store at all because of how the data is physically stored. It also means that the answer might be different for each table since each table has a different row size.

Bottom line, however, is that I would favor storing UUIDs the way your particular RDBMS vendor tells you they should be stored. If your application has particular problems with storing UUIDs that way I would look at alternatives, but generally the RDBMS vendors have thought about this a little bit at least.

1 more reply

diaz8y ago

I can't comment on uuid as I never used that for ids. But, I can tell you that asking for a next ID from a postresql primary key sequence will make sure it is never used again because it is an atomic operation. Once a next value is requested and you keep it, you can be sure it will not be generated again. I've used this for many things.

netcraft8y ago

we use both uuid's created in our application layer and stored as char(35) and also uuid's produced in the database stored as uuid data types. We have tens of millions of records and have no issues. Once we get upgraded to PG10 and can use hash indexes for them we expect a further speedup.

I don't know anything about the author, but every time in the past ive heard someone say that uuid's as keys are a problem it turns out theyve never actually tried it, theyre just saying what theyve heard in the past. Theoretically they should be worse than they are - and theres no doubt that a regular int/bigint would be faster - but the truth is there are so many other things that are going to slow you down before that.

jerrysievert8y ago

uuid's aren't guaranteed to be unique at generation, there is still a non-zero chance of it having a collision. using it as a primary key to be generated by the database helps mitigate that, as there will normally be a uniqueness clause on the index.

creating that uuid on the client likely will not accomplish what you're hoping.

mbroshi8y ago

It's pretty unlikely to get a collision[1], and the client should be able to handle the gracefully (regenerate the UUID and retry).

[1] https://www.quora.com/Has-there-ever-been-a-UUID-collision

2 more replies

dfee8y ago

Primary keys on postgresql are unique (and non null). It’s not possible to change that behavior as it’s the definition of a PK field.

You can thus create more UUID fields that are unique and non-null by specifying that instead of “primary key” on column creation.

https://www.postgresql.org/docs/8.1/static/ddl-constraints.h...

firenotire8y ago

Can't you just enforce uniqueness anyways for that field? And then have the client retry?

1 more reply

zzzcpan8y ago

> uuid's aren't guaranteed to be unique at generation

If they are not unique, they are not really UUIDs. In which case you should tweak the algorithm to make uniqueness guaranteed. Like add a client id and its logical time in there.

1 more reply

parvenu748y ago· 7 in thread

I like that PostgreSQL can have both relational table and JSONB document collections (NoSQL) in the same database. Use NoSQL where it makes sense and relational tables for data that is inherently relational and query and join both (or batch-process from one to the other). I find this very cool.

Of course I wonder if it's too much cool and in trying to do everything it's falling short in some significant and fundamental way.

heyoni8y ago

Wow, I had no idea it could store jsonb. Any obvious advantages to using mongodb for nosql that I should consider? I’m way more comfortable with Postgres and would rather stick to that

tscs378y ago

Keep in mind that JSONB is different than JSON in PG. The later will not transform the JSON simply check consistency and has limits and performance hits.

JSONB is encoded and will transform your JSON, it will drop duplicate keys and order might not be preserved. But it's also faster and easier to use (IMO).

There is no real reason to use MongoDB tbh, PG JSONB can be handled like any other field and you can even index into JSON (partial indices even; only index rows where a field is present in the JSONB)

1 more reply

amunicio8y ago

Warning: I haven't used MongoDB in several years, so this info can be outdated!

The main advantage of using JSONB on Postgres over MongoDB is that you can create tables that mix regular fields (varchar, number, date, etc...) with JSONB fields. Then you can do joins of your table with other tables (no need to map/reduce or other insane processing for a simple join).

2 more replies

DrJosiah8y ago

I have not found a situation where mongodb has been a better choice than postgresql.

1 more reply

netcraft8y ago

jsonb functionality in PG is fantastic. It does everything youd expect, plus you can do a lot of other nifty things with it that you normally cant do with relational dbs very easily.

quetzthecoatl8y ago

Storage and thus speed at the expense of memory utilisation. RAM vs HDD.

derekperkins8y ago

MySQL has close to feature parity in terms of JSON functionality too.

asah8y ago· 4 in thread

Great to see MySQL adding this stuff! There's still a ton of reasons to choose Postgres and more and more silicon valley startups seem to be choosing pg - I don't remember the last time I met a startup choosing MySQL.

DBMSs are giant complex pieces of software with a million features - it's really hard to compare them. But if I had to sum it up, you can dump freaking line noise into Postgres and then hide the nastiness and manage and query it like a well-designed database. If you have a pretty database, good for you, but today's app writers have gotten lazy with NoSQL record stores and their databases resemble vomitoriums - and let's not talk about what people used to do in the 80s and early 90s.

Without further ado, Postgres vomitorium cleanup features:

- user defined functions/aggegrates/windowfuncs in your favorite language incl JavaScript, which means you can write tricky business logic once and run it where the data is, vs pulling out millions of records from the database. Language list: https://www.postgresql.org/docs/10/static/external-pl.html

- foreign data wrappers with hundreds of connectors AND a 5 minute toolkit for authoring new wrappers in python and other scripting languages. https://wiki.postgresql.org/wiki/Foreign_data_wrappers http://multicorn.org/

- index goddamned anything. Postgres has the most array of index types of any open source database AND if you need, you can easily write a function (in javascript or python!!!) and create an index that's the result of that function call. Postgres even has a full range of partial indices and block range indices, which make it practical to index massive and sparse datasets. https://www.postgresql.org/docs/current/static/indexes-parti... https://www.postgresql.org/docs/current/static/indexes-expre... https://www.postgresql.org/docs/current/static/brin-intro.ht...

EXAMPLE: I was once handed a MySQL database of IoT signals where timestamps were in seconds since the epoch and asked to report on this data without changing the database. No biggie: 15 minutes to replicate this OLTP database to a read-only Postgres replica (dumb schema mapping) - then a Postgres VIEW to hide this nastiness and a function index on the timestamp column (to_timestamp).

(from memory) CREATE VIEW foo AS SELECT *, to_timestamp(mysql_ts_col) as ts_timestamp FROM replicated_mysql_table; CREATE INDEX foo_ts_inx ON replicated_mysql_table(to_timestamp(mysql_ts_col));

At another company, we got dumped a load of JSON and weren't sure how we'd need to parse it. No biggie, I just created indices using function calls that parsed the JSON.

- tons of native datatypes and extensible datatypes. In cases where you're handed complex structures or "weird" data that doesn't behave like most programmers expect, you can define new datatypes, then create a library of user defined functions around them. https://www.postgresql.org/docs/9.5/static/xtypes.html

- sampling. Postgres has native, low-level support for queries that sample the data, which makes it super fast to explore data while preserving some semblance of statistics. https://blog.2ndquadrant.com/tablesample-in-postgresql-9-5-2...

- EXPLAIN. The Postgres planner/optimizer is still the king at explaining why your query is taking forever and what you can do about it. Admittedly, this stuff quickly gets arcane, but you can post your EXPLAIN output to a forum and guys like me will tell you how to override the JOIN order, update statistics, etc. https://www.postgresql.org/docs/current/static/using-explain...

Finally, no conversation about Postgres vs <x> is complete without mentioning that ALMOST EVERY FEATURE IN POSTGRES WORKS WITH EVERY OTHER FEATURE, which means you don't waste hours investing in something and then "oops" your carefully written user defined function can't be invoked in some obscure place - with Postgres, you can assume everything just works and will keep working. There's few pieces of software that can claim this.

meritt8y ago

> EXAMPLE: I was once handed a MySQL database of IoT signals where timestamps were in seconds since the epoch and asked to report on this data without changing the database

I'm not following why you moved the data into postgres other than to say you did? Are you suggesting that because you were restricted from making schema changes to the MySQL instance that that's a reason why postgres is superior?

cosmie8y ago

The reason for the migration was to be able to leverage the Functional/Expression Index[1] capability of Postgres. Any type of aggregated time series reporting queries were likely painful from a performance perspective, since an index on the epoch timestamp would have limited usefulness from an optimization standpoint.

So instead, he leveraged the expression index functionality of Postgres to pre-materialize an index against the converted timestamp. He didn't touch the table structure itself so it's transparent, but gets the performance benefits of that index already existing.

MySQL doesn't support function based indexes directly, although you can achieve a similar result in newer versions of MySQL with an intermediate step. You can create a Generated Column[2] first, and then build an index against that. If you specify it as a virtual generated column, then it's essentially the same as the above process where the column isn't physically stored, but you can index it. That said, asah may still not have been able to do that if the version of MySQL was too old or even that level of schema change was not allowed.

[1] https://www.postgresql.org/docs/current/static/indexes-expre...

[2] https://dev.mysql.com/doc/refman/5.7/en/create-table-generat...

fermuch8y ago

I read it as he had no way to alter mysql, and he needed to work with time functions, which is tricky without a real timestamp instead of an integer, and is impossible without correct indexes.

To me it seems like, in that case, he chose postgres since he could sync the data and use a custom schema over it, with proper types and indexes.

atombender8y ago

> their databases resemble vomitoriums

Curious what you mean by this. Vomitoriums had nothing to do with vomit, except in an etymological sense.

patrickg_zill8y ago· 3 in thread

I am pretty sure that PG has had clustered indexes for a decade or more... ? e.g. https://www.postgresql.org/docs/9.1/static/sql-cluster.html

Or is this term referring to a different feature/method than this?

One thing not mentioned: PL/SQL vs. whatever the MySQL equivalent is.

da_chicken8y ago

You're correct that PostgreSQL has had clustered indexes for quite a long time. The only difference here is that PostgreSQL allows a HEAP table. That said, you do have to re-cluster a PostgreSQL and that requires an exclusive lock on the table, which is obviously not idea for a massive table. MySQL will always cluster on the PRIMARY KEY or (if there isn't one) the first UNIQUE key[0], but as far as I can tell it always clusters on write. I don't see any way to configure the padding of the clustered index like you can on SQL Server, so I'm not sure how this is accomplished.

MySQL only supports SQL in procedures[1]. There is no PL/pgSQL[2] equivalent, and except for custom UDFs written in C/C++, there's no support for external procedures, either. PostgreSQL[3] supports PL/pgSQL, PL/Python, PL/Tcl, and PL/Perl in base, plus there's external modules for PL/Java, PL/Lua, PL/R, PL/sh, and PL/v8.

[0]: https://dev.mysql.com/doc/refman/5.7/en/innodb-index-types.h...

[1]: https://dev.mysql.com/doc/refman/8.0/en/adding-functions.htm...

[2]: https://www.postgresql.org/docs/current/static/plpgsql-overv...

[3]: https://www.postgresql.org/docs/current/static/xplang.html

dspillett8y ago

This is not the same thing - it is rearranging an existing index for efficiency as a one-off process. It needs to be repeated when the data is substantially changed.

With a true clustered index the clustering property is as far as possible (it can get somewhat fragmented in the presence of random data) maintained during normal operation without the need for a full rebuild every now and then to keep the benefits for new data.

> When a table is being clustered, an ACCESS EXCLUSIVE lock is acquired on it. This prevents any other database operations (both reads and writes) from operating on the table until the CLUSTER is finished.

This makes that operation very nasty. For a large amount of data you are looking at locking your applications out of the database for some time, and the delay is relative to the total data size in the table being acted upon, not the amount of data that has recently arrived or changed.

nvivo8y ago

I think it's different. From the docs, this is a one time update to the table. The table itself is not clustered, it's just ordered based on a clustered index, so selects by default should come in that order. But once an insert is done, it's again ouy of order.

Also, an order by using the cluster field most probably invokes an index scan, while a clustered table doesn't.

mbell8y ago· 2 in thread

The connection comments are a bit dubious. MySQL will use less memory for 1000 connections but performance will still drop due to contention and context switching. In both systems you want a small number of connections to the actual database, something on the order of 1-2x cpu cores usually, and something on top pooling client connections if you need a lot of them, pgbouncer or the equivalent for MySQL.

nvivo8y ago

I had some real issues with mysql handling more than 2000 connections. Once past that limit, cpu usage increases exponentially with few connections due to context switching. At 3000 connections, 80% of thr cpu was used gor context switching and it got unusable. That was a 64 core/256gb ram server.

stevenwoo8y ago

What did you do to address the mysql connection problem?

1 more reply

hodgesrm8y ago· 2 in thread

> [2] When I say Postgres is great for analytics, I mean it. In case you don’t know about TimescaleDB, it’s a wrapper on top of PostgreSQL that allows you to INSERT 1 million records per second, 100+ billion rows per server. Crazy stuff. No wonder why Amazon chose PostgreSQL as its base for Redshift.

Correction: Amazon chose ParAccel, which was a data warehouse forked from PostgreSQL.

Many data warehouse products have followed this path due to licensing. MySQL is GPLv2 which means you can't ship derivative works without releasing your code. PostgreSQL has a permissive license similar to MIT/BSD. You can do anything you want with the code. That's still a major consideration which the article omitted.

(Cross-posted from another HN link to same article.)

chrisjc8y ago

Also, inserting directly into Redshift is strongly discouraged as it's extremely non-performant.

>An anti-pattern is to insert data directly into Amazon Redshift, with single record inserts or the use of a multi-value INSERT statement, which allows up to 16 MB of data to be inserted at one time. These are leader node–based operations, and can create significant performance bottlenecks by maxing out the leader node network as data is distributed by the leader to the compute nodes.

https://aws.amazon.com/blogs/big-data/top-10-performance-tun...

olavgg8y ago

Postgresql is good for analytics, but it doesn't scale really well with a lot of data. I have moved my analytics to Clickhouse, 1000x better performance.

sufehmi8y ago· 2 in thread

What about upgrading to a newer version of PostgreSQL ? Does that still require upgrading the whole databases ?

I evaluated PostgreSQL several times in the past, and cancelled once I found out that upgrading to a new version requires upgrading the whole databases - our databases are too big and our uptime requirement are too strict, we can not afford it

manquer8y ago

You can use logical replication between 2 versions and fail over to the new version without any downtime . In pg 10+ it is built into core, for earlier versions you can use pglogical or similar tools.

sufehmi8y ago

Great idea - I'll try to see if this is doable in our situation. Thanks.

garyclarke278y ago

Interesting but obviously biased in favour of MySQL. I disagree with some Postgres (which I know v well not MySQL though) comments. Wrong everyone does not use Sequential Integer Primary Keys, because they are completely useless as a Set constraint, I use real world logic. Update is irrelevant to me, so Vacuum not a problem, I use append only ie immutable database, the upside (not mentioned) is that this architecture makes PostgreSQL rock solid, can stay up for years unlike for example MS SQL Server that requires regular reboots and often fails to come backup cleanly. Also provide DDL transactions with rollback. I’m curious does MySQL now have Check Constraints, Arrays, Drop Schema with cascade (a huge timesaver), V8 PL, Lateral Joins, Range Types, Custom Aggregates (so powerful) SQL Function inlining, Parallel Query, Functional Indexes, Exclusion Constraints (eg no period overlap), Notify/Listen, Foreign Data Wrappers?

saosebastiao8y ago

Referenced in the article:

https://wiki.postgresql.org/wiki/Future_of_storage

This reads like all my dreams come true.

cat1998y ago

0 mention of language or datatype extensibility, which is huge.

cat1998y ago

this states all of the drawbacks to process vs thread but none of the benefits (resiliency / compartmentalization of errors, less need for lock coordination and less risk of locking related bottlenecks with scale, somewhat better host OS CPU & IO utilization, etc. )

dspillett8y ago

> With a clustered index, when you look up a record by the primary key

To nit-pic - it might be the case in mySQL but some DBs (SQL Server for instance) allow the clustering key to be something other than the primary key, and for some analytical workloads this can be much more efficient.

dspillett8y ago

> UUID as a primary key is a terrible idea, by the way

Unless you use a v1 UUID, i.e. via NEWSEQUENTIALID() in SQL Server. IIRC postgres has an equivelant available as a standard module.

j / k navigate · click thread line to collapse

97 comments

59 comments · 13 top-level

foxylion8y ago· 26 in thread

The article contains a footnote about UUIDs as primary keys.

> UUID as a primary key is a terrible idea, by the way — cryptographic randomness is utterly designed to kill locality of reference, hence the performance penalty

Is there anyone who can go a little bit more in detail?

default-kramer8y ago

da_chicken8y ago

> One advantage of using an incrementing integer is that rows will be ordered on disk based on when they were created.

So, yes, clustering on a UUID can cause problems (fragmented indexes, inefficient reads), but clustering on an autoincrement can also cause issues depending on your work load.

In reality, what you need to do (in the general case) is cluster on your business key even if it's not the primary key for your table.

1 more reply

zzzcpan8y ago

> One advantage of using an incrementing integer is that rows will be ordered on disk based on when they were created

If each identifier starts with a logical time, say lamport timestamp, you can still get the same ordering effect without incrementing integers in a centralized place somewhere.

2 more replies

masklinn8y ago

> Is there anyone who can go a little bit more in detail?

scott1133418y ago

> UUIDs are way more painful than serials to recognise, remember, input or transmit...

Completely agree, but I've found this can be a non-technical "feature" too. Serial integer primary keys are much more susceptible to human error when doing any sort of direct database manipulation.

Make a typo on a integer PK? Wrong user gets deleted. UUID typo? Row not found (almost certainly).

Another source of error I've seen is when someone in sales asks "Hey, can you remove User #1234?", but they really meant Customer #1234." With UUIDs, there's no "collision" between the tables.

Clearly there are better process/tool-based ways to prevent these types of mistakes, but it's a useful side effect of UUIDs.

1 more reply

netcraft8y ago

One of the reasons we use them is because theyre not easily recognized or sequential.

RedCrowbar8y ago

> The latter can he mitigated by using "ordered UUIDs": you can generate UUID1 (nominally time-based) such that the final value has a sequential "head" and a random "tail"

Specifically, in PostgreSQL such UUIDs can be generated using uuid_generate_v1mc()

1 more reply

doh8y ago

[0] https://instagram-engineering.com/sharding-ids-at-instagram-...

th3sly8y ago

Or you can shuffle it and convert to binary using uuid_to_bin like in this example uuid_to_bin(uuid(), true): https://dev.mysql.com/doc/refman/8.0/en/miscellaneous-functi...

simonw8y ago

If you want to generate IDs independently of the database you can do so using a "ID generator" mechanism.

Set up three redis instances (or MySQL or anything else that can increment a counter). Have each one increment by three each time. Start then at 0, 1 and 2.

Now you can ask any of those theee instances for a new ID and you'll get one that has not been used before, thanks to them being offset from each other.

I first saw this technique used by Flickr when they switched to a Shaffer database.

gldalmaso8y ago

parvenu748y ago

firebacon8y ago

Is "Shaffer" a typo of "sharded"? If not I'm curious what the term means (all that google turns up are a bunch of papers by a guy with that name)?

mbroshi8y ago

parvenu748y ago

1 more reply

pletnes8y ago

cosmie8y ago

nvivo8y ago

I use UUIDs in mysql as primary keys in tables with a few million records, and I'm about to migrate a few tables with billions of records from bigint to uuid.

Never saw any real problem. I use char(36), as it's easier to query manually when needed, but I' looking into binary(16) for those billion row tables.

I think most issues with fragmented tables are old problems since ssd, and the overhead is something you will only notice in benchmarks.

If your keys ate supposed to be uuids, the just use them and get the hardware to handle it. In reality, you're most likely to be affected by a zillion other things before an uuid as pk.

da_chicken8y ago

> I use char(36), as it's easier to query manually when needed, but I' looking into binary(16) for those billion row tables.

1 more reply

diaz8y ago

netcraft8y ago

jerrysievert8y ago

creating that uuid on the client likely will not accomplish what you're hoping.

mbroshi8y ago

It's pretty unlikely to get a collision[1], and the client should be able to handle the gracefully (regenerate the UUID and retry).

[1] https://www.quora.com/Has-there-ever-been-a-UUID-collision

2 more replies

dfee8y ago

Primary keys on postgresql are unique (and non null). It’s not possible to change that behavior as it’s the definition of a PK field.

You can thus create more UUID fields that are unique and non-null by specifying that instead of “primary key” on column creation.

https://www.postgresql.org/docs/8.1/static/ddl-constraints.h...

firenotire8y ago

Can't you just enforce uniqueness anyways for that field? And then have the client retry?

1 more reply

zzzcpan8y ago

> uuid's aren't guaranteed to be unique at generation

If they are not unique, they are not really UUIDs. In which case you should tweak the algorithm to make uniqueness guaranteed. Like add a client id and its logical time in there.

1 more reply

parvenu748y ago· 7 in thread

Of course I wonder if it's too much cool and in trying to do everything it's falling short in some significant and fundamental way.

heyoni8y ago

Wow, I had no idea it could store jsonb. Any obvious advantages to using mongodb for nosql that I should consider? I’m way more comfortable with Postgres and would rather stick to that

tscs378y ago

Keep in mind that JSONB is different than JSON in PG. The later will not transform the JSON simply check consistency and has limits and performance hits.

JSONB is encoded and will transform your JSON, it will drop duplicate keys and order might not be preserved. But it's also faster and easier to use (IMO).

There is no real reason to use MongoDB tbh, PG JSONB can be handled like any other field and you can even index into JSON (partial indices even; only index rows where a field is present in the JSONB)

1 more reply

amunicio8y ago

Warning: I haven't used MongoDB in several years, so this info can be outdated!

2 more replies

DrJosiah8y ago

I have not found a situation where mongodb has been a better choice than postgresql.

1 more reply

netcraft8y ago

jsonb functionality in PG is fantastic. It does everything youd expect, plus you can do a lot of other nifty things with it that you normally cant do with relational dbs very easily.

quetzthecoatl8y ago

Storage and thus speed at the expense of memory utilisation. RAM vs HDD.

derekperkins8y ago

MySQL has close to feature parity in terms of JSON functionality too.

asah8y ago· 4 in thread

Without further ado, Postgres vomitorium cleanup features:

(from memory) CREATE VIEW foo AS SELECT *, to_timestamp(mysql_ts_col) as ts_timestamp FROM replicated_mysql_table; CREATE INDEX foo_ts_inx ON replicated_mysql_table(to_timestamp(mysql_ts_col));

At another company, we got dumped a load of JSON and weren't sure how we'd need to parse it. No biggie, I just created indices using function calls that parsed the JSON.

meritt8y ago

> EXAMPLE: I was once handed a MySQL database of IoT signals where timestamps were in seconds since the epoch and asked to report on this data without changing the database

cosmie8y ago

[1] https://www.postgresql.org/docs/current/static/indexes-expre...

[2] https://dev.mysql.com/doc/refman/5.7/en/create-table-generat...

fermuch8y ago

I read it as he had no way to alter mysql, and he needed to work with time functions, which is tricky without a real timestamp instead of an integer, and is impossible without correct indexes.

To me it seems like, in that case, he chose postgres since he could sync the data and use a custom schema over it, with proper types and indexes.

atombender8y ago

> their databases resemble vomitoriums

Curious what you mean by this. Vomitoriums had nothing to do with vomit, except in an etymological sense.

patrickg_zill8y ago· 3 in thread

I am pretty sure that PG has had clustered indexes for a decade or more... ? e.g. https://www.postgresql.org/docs/9.1/static/sql-cluster.html

Or is this term referring to a different feature/method than this?

One thing not mentioned: PL/SQL vs. whatever the MySQL equivalent is.

da_chicken8y ago

[0]: https://dev.mysql.com/doc/refman/5.7/en/innodb-index-types.h...

[1]: https://dev.mysql.com/doc/refman/8.0/en/adding-functions.htm...

[2]: https://www.postgresql.org/docs/current/static/plpgsql-overv...

[3]: https://www.postgresql.org/docs/current/static/xplang.html

dspillett8y ago

This is not the same thing - it is rearranging an existing index for efficiency as a one-off process. It needs to be repeated when the data is substantially changed.

nvivo8y ago

Also, an order by using the cluster field most probably invokes an index scan, while a clustered table doesn't.

mbell8y ago· 2 in thread

nvivo8y ago

stevenwoo8y ago

What did you do to address the mysql connection problem?

1 more reply

hodgesrm8y ago· 2 in thread

Correction: Amazon chose ParAccel, which was a data warehouse forked from PostgreSQL.

(Cross-posted from another HN link to same article.)

chrisjc8y ago

Also, inserting directly into Redshift is strongly discouraged as it's extremely non-performant.

https://aws.amazon.com/blogs/big-data/top-10-performance-tun...

olavgg8y ago

Postgresql is good for analytics, but it doesn't scale really well with a lot of data. I have moved my analytics to Clickhouse, 1000x better performance.

sufehmi8y ago· 2 in thread

What about upgrading to a newer version of PostgreSQL ? Does that still require upgrading the whole databases ?

manquer8y ago

sufehmi8y ago

Great idea - I'll try to see if this is doable in our situation. Thanks.

garyclarke278y ago

saosebastiao8y ago

Referenced in the article:

https://wiki.postgresql.org/wiki/Future_of_storage

This reads like all my dreams come true.

cat1998y ago

0 mention of language or datatype extensibility, which is huge.