UPDATE table SET some_jsonb_column['person']['bio']['age'] = '99';
[0] https://erthalion.info/2021/03/03/subscripting/

JSONB capabilities in Postgres are amazing, but the syntax is really annoying - for example, I'm forever mixing up `->` and `->>`. This new syntax feels far more intuitive.
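For anyone else who mixes these up, a quick side-by-side sketch (table and data invented for illustration):

```sql
-- Hypothetical table for illustration
CREATE TABLE docs (data jsonb);
INSERT INTO docs VALUES ('{"person": {"name": "Ada"}}');

-- `->` returns jsonb, `->>` returns text
SELECT data->'person'->'name' FROM docs;   -- "Ada" (jsonb, with quotes)
SELECT data->'person'->>'name' FROM docs;  -- Ada (text)

-- The new Postgres 14 subscript syntax reads like most languages
SELECT data['person']['name'] FROM docs;   -- "Ada" (jsonb)
```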
Glad to see this super intuitive and familiar syntax added. Will make writing these updates a lot, lot easier. Not even close!
For updates, it looks nice I guess.
But this is going to be a classic example of bad design. Databases are a bad place to be storing JSON, which is a good interface and a bad storage standard. It is pretty easy to see how JSON will play out: some bright young coder will use JSON because it is easier, then over the course of 12 months discover the benefits of a constrained schema, and then have a table-in-a-table JSON column.
It isn't so out there to think that ongoing calls for JSON support will lead Postgres to re-implement tables in JSON. We've already got people trying to build indexes on fields inside a JSON field.
This is needless complexity engineered by people who insist on relearning schemas from scratch, badly, rather than trusting the database people who say "you need to be explicit about the schema, do data modelling up front".
I think PostgreSQL has always been very pragmatic. It's supported JSON natively since 9.2 (Sep 2012).
> Databases are a bad place to be storing JSON
You're right that "mature" features and projects have a very good understanding of the schema. But not everything is like that.
Suppose I want to collect info from the Github API about a bunch of repos. I can just store the entire JSON response in a table and then query it at my leisure.
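A sketch of that workflow (table and column names are invented; the specific GitHub response fields shown are assumptions, so treat the details loosely):

```sql
-- Dump raw API responses into a single jsonb column, no up-front schema
CREATE TABLE gh_repos (
    fetched_at timestamptz DEFAULT now(),
    payload    jsonb
);

-- Later, query at leisure
SELECT payload->>'full_name' AS repo,
       (payload->>'stargazers_count')::int AS stars
FROM gh_repos
ORDER BY stars DESC;
```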
There's also something to be said for contiguous access. Joining tons of little records together has performance problems. Composite types and arrays can also fill this void, but they both have their own usability quirks.
That's also why in 99% of cases PostgreSQL users reach for jsonb as the storage format, which is binary and compressed.
> This is needless complexity engineered by people who insist on relearning schemas from scratch
No, this is the right tool for situations where schemas are polymorphic, fluid, or even completely absent (like raw third-party data). I love SQL and following normal forms, and it is the right tool for most situations, but not all.
If you've got hierarchical data and you just want to store, update and retrieve it as a whole (which is my use case), JSON is a good choice. Granted, it could be stored as a string/blob in my case. I don't really need to search within.
You don't even need JSONB to commit war crimes on a Postgres database. There's many things that Postgres can do, but probably shouldn't be done:
- Storing "foreign keys" in an array column, instead of using a join table
- Storing binary files as base64 encoded strings in text columns
- Using a table with `key` and `value` string columns instead of using redis
- Pub/sub using NOTIFY/LISTEN
- Message queueing
- Other forms of IPC in general
- Storing executable code
- God tables
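To make the first item concrete, a sketch of the array-of-IDs anti-pattern next to the join table it should be (all names invented):

```sql
-- Anti-pattern: "foreign keys" in an array -- no referential integrity,
-- awkward joins, no cascade behaviour
CREATE TABLE posts_bad (id int PRIMARY KEY, tag_ids int[]);

-- What it should be: a join table with real foreign keys
CREATE TABLE tags  (id int PRIMARY KEY, name text);
CREATE TABLE posts (id int PRIMARY KEY);
CREATE TABLE post_tags (
    post_id int REFERENCES posts(id),
    tag_id  int REFERENCES tags(id),
    PRIMARY KEY (post_id, tag_id)
);
```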
Even when trying to use Postgres appropriately, plenty of engineers don't get it right: unnecessary indices, missing indices, denormalised data, etc.
This isn't unique to Postgres, or relational databases in general. Any form of storage can and will be used to do things it's not designed or appropriate for. You can just as easily use S3 or Elasticsearch for message queuing, and can even find official guides to help you do so. Go back 20 years or so, and you can find implementations of message buses using SOAP over SMTP.
The problem isn't JSONB (or any other feature). It's bad engineering. Usually it's an incarnation of Maslow's Hammer: when all you have is a hammer, everything looks like a nail.
JSON makes perfect sense in a database that already supports all of: BLOB, TEXT, XML, ARRAY, and composite datatypes, including any datatype, including those on this list, for members of ARRAY and composites.
OTOH, Postgres has had XML since v8.2 (2006) and JSON since 9.2 (2012), and “tables in <supported structured serialization format>” hasn’t happened yet, even as discussion item AFAIK, so perhaps it would be bad, but even so it seems to be just fantasizing something to worry about.
The reason is that with some projects/data it's hard to be explicit about the schema, which is why NoSQL had its popularity phase.
Now most applications don't have either entirely structured or entirely unstructured data, they will have a mix - so it's absolutely brilliant for one tool to do both. If they didn't support JSON I have a strong suspicion that they wouldn't have had some of the growth we have seen for Postgres across the last few years.
Do any other entrenched software projects come to mind? The only thing comparable I can think of are Git and Linux.
There's been a lot of exciting work in this area over the last decade or so. Andy Pavlo's classes are great surveys of the latest work: https://15721.courses.cs.cmu.edu/spring2020/
CosmosDB is an example of a relational (more properly, multi-paradigm) database with a quite different architecture from the classic design that has moved into production status quite rapidly.
FaunaDB and CockroachDB are moving with solid momentum too.
- scaling is non-trivial (you can't just add a node and have PostgreSQL automagically Do The Right Thing™)
- you can only have so many connections open to the database, causing issues with things such as AWS Lambda
- I don't remember if this was changed, but I got the impression a while ago that setting up dynamic DB users was a bit cumbersome (plugging PostgreSQL into AD/LDAP)
All of these require a different architecture so expect to see newer databases push things even further.
Even Croach would be a massive branding improvement.
This is similar to how gimp is a terrible brand.
The machine being used for this benchmark has 96 vCPUs, 192G of RAM, and costs $3k/mo.
My business runs just fine on a 3.75G, 1 vCPU instance. But idle connections eat up a huge amount of RAM and I sometimes find myself hitting the limits when a load spike spins up extra frontend instances.
Sure, I could probably set up pgbouncer and some other tools, but that's a lot of headache. I'm acutely aware that MySQL (which I dislike because it has no transactional DDL) does not suffer from this issue. I also don't see this being solved without a major rewrite, which seems unlikely.
So Postgres has at least one very serious fault that makes room in the marketplace. The poor replication story is another.
As for the per-connection memory usage, the big question is whether there really is a problem (and perhaps if there's a reasonable workaround). It's not quite clear to me why you think the issues in your case are due to idle connections, but OK.
There are two things to consider:
1) The fixed per-connection memory (tracking state, locks, ..., a couple kBs or so). You'll pay this even for unused connections.
2) Per-process memory (each connection is handled by a separate process).
It's difficult to significantly reduce (1) because that state would be needed no matter what the architecture is, mostly. Dealing with (2) would probably require abandoning the current architecture (process per connection) and switching to threads. IMO that's unlikely to happen, because:
(a) the process isolation is actually a nice thing from the developer perspective (less locking, fewer data races, ...)
(b) processes work quite fine for a reasonable number of long-lived connections, and connection pools address a lot of the other cases
(c) PostgreSQL supports a lot of platforms, some of which may not have very good multi-threading support (and supporting both architectures would be quite a burden)
But that's just my assessment, of course.
Elasticsearch is underrated here, IMO. Yes, there are alternatives for simple fulltext search. But there’s a lot more it can do (adhoc aggregations incorporating complex fulltext searches, with custom scripted components; geospatial; index lifecycle management) and if you’re using those features, there’s nothing else comparable.
It’s pretty stable, too, once you’ve got the cluster configured. We don’t have outages due to problems with Elasticsearch.
I’m currently evaluating typesense vs ES for a fts project and typesense is winning so far by simply being “not painful” to deal with.
They are great in some cases and terrible in others, and over time, use cases push database systems into their worst cases. Use cases rarely stay in the sweet spot of a special-purpose system.
That being said, if the integration is great, and/or the special system is a secondary one (fed from a general-purpose system), then it's often fine.
It’s totally understandable that you’d need developers to have expertise in patterns and anti-patterns, as well as needing an expert to set things up in the first place, but you shouldn’t have to have a dedicated ES monitoring / tuning / babysitting team like Oracle DBAs of yore. That you do, means it isn’t there yet as a product.
SQLite.
But saying "so much better" is too strong.
https://www.postgresql.org/docs/devel/btree-implementation.h...
One benchmark involving a mix of queue-like inserts, updates, and deletes showed that it was practically 100% effective at controlling index bloat:
https://www.postgresql.org/message-id/CAGnEbogATZS1mWMVX8FzZ...
The Postgres 13 baseline for the benchmark/test case (actually HEAD before the patch was committed, but close enough to 13) showed that certain indexes grew by 20% - 60% over several hours. That went down to 0.5% growth over the same period. The index growth is also much more predictable, in that it matches what you'd expect for this workload if you thought about it from first principles. In other words, you'd expect about the same low amount of index growth if you were using a traditional two-phase locking database that doesn't use MVCC at all.
Full disclosure: I am the author of this feature.
MySQL and Oracle exist. Mercurial and Perforce exist. I'm not sure it's a terrible stretch to compare git and postgres.
There's almost no reason to pick MySQL for a new project.
Excel for business spreadsheets.
Java for enterprise server software.
Not an expert, but it is my understanding that Julia is becoming an ever more serious competitor day by day.
> Excel for business spreadsheets.
Honest question, what does LibreOffice miss compared to Excel? In any case, (again not an expert) spreadsheets seem quite inferior to a combination of Julia, CSV and Vega (Lite); although there are certainly more people that are familiar with operating Excel.
Big corporations are horribly inefficient, and enterprise software is necessarily so as a result... if you're saying Java is terrible by nature of it being the go-to for enterprise, then that makes sense. It took 20 years for it to swap places with COBOL and I expect it will be something else in 20 more.
Last time I was responsible for setting up a HA Postgres cluster it was a garbage fire, but that was nearly 10 years ago now. I ask every so often to see if it has improved and each time, so far, the answer has been no.
Otherwise use MySQL, Oracle, MongoDB, Cassandra etc if you want to run it on your own.
Any other database that invested in a native and supported HA/clustering implementation.
Logical replication for streaming to read-only replicas and streaming replication for fail-over. My client app still needs to know try-A-then-try-B (via DNS or config).
I manage both a cockroachdb cluster and a few PG setups. Our Postgres instances have streaming replication to a standby with barman running on the standby. They are night and day.
Sure 2021 PG is way better than 2010 PG. But relative to available options, it's much worse.
And 200+ other improvements in the Postgres 14 release!
These are just some of the many improvements in the new Postgres release. You can find more on what's new in the release notes, such as:
The new predefined roles pg_read_all_data/pg_write_all_data give global read or write access
Automatic cancellation of long-running queries if the client disconnects
Vacuum now skips index vacuuming when the number of removable index entries is insignificant
Per-index information is now included in autovacuum logging output
Partitions can now be detached in a non-blocking manner with ALTER TABLE ... DETACH PARTITION ... CONCURRENTLY
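A couple of those in action (role, table, and partition names are invented; the statements follow the Postgres 14 syntax described in the release notes):

```sql
-- Grant read-only access to everything with one predefined role
GRANT pg_read_all_data TO reporting_user;

-- Detach a partition without holding a blocking lock on the parent
ALTER TABLE measurements
    DETACH PARTITION measurements_2020 CONCURRENTLY;
```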
The killing of queries when the client disconnects is really nice imo -- the others are great too.

I see no mention of addressing transaction id wraparound, but these are in the release notes:
Cause vacuum operations to be aggressive if the table is near xid or multixact wraparound (Masahiko Sawada, Peter Geoghegan)
This is controlled by vacuum_failsafe_age and vacuum_multixact_failsafe_age.
Increase warning time and hard limit before transaction id and multi-transaction wraparound (Noah Misch)
This should reduce the possibility of failures that occur without having issued warnings about wraparound.
Clearly it doesn't eliminate the possibility of wraparound failure entirely. Say for example you had a leaked replication slot that blocks cleanup by VACUUM for days or months. It'll also block freezing completely, and so a wraparound failure (where the system won't accept writes) becomes almost inevitable. This is a scenario where the failsafe mechanism won't make any difference at all, since it's just as inevitable (in the absence of DBA intervention).
A more interesting question is how much of a reduction in risk there is if you make certain modest assumptions about the running system, such as assuming that VACUUM can freeze the tuples that need to be frozen to avert wraparound. Then it becomes a question of VACUUM keeping up with the ongoing consumption of XIDs by the system -- the ability of VACUUM to freeze tuples and advance the relfrozenxid for the "oldest" table before XID consumption makes the relfrozenxid dangerously far in the past. It's very hard to model that and make any generalizations, but I believe in practice that the failsafe makes a huge difference, because it stops VACUUM from performing further index vacuuming.
In cases at real risk of wraparound failure, the risk tends to come from the variability in how long index vacuuming takes -- index vacuuming has a pretty non-linear cost, whereas all the other overheads are much more linear and therefore much more predictable. Having the ability to just drop those steps if and only if the situation visibly starts to get out of hand is therefore something I expect to be very useful in practice. Though it's hard to prove it.
Long term, the way to fix this is to come up with a design that doesn't need to freeze at all. But that's much harder.
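For anyone who wants to keep an eye on this themselves, the standard way to see which tables are furthest from their last freeze is a catalog query along these lines (a sketch, not from the thread):

```sql
-- Tables whose oldest unfrozen XID is furthest in the past;
-- the closer age() gets to ~2 billion, the closer you are to wraparound
SELECT relname, age(relfrozenxid) AS xid_age
FROM pg_class
WHERE relkind = 'r'
ORDER BY xid_age DESC
LIMIT 5;
```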
It's a pity this wasn't listed in the announcement as I think a lot of people are interested in this issue.
>> Long term, the way to fix this is to come up with a design that doesn't need to freeze at all.
Do you know if anyone is turning their attention to this or is it not currently being tackled by anyone?
Sweet! I often screw up a query and need to cancel it with
pg_cancel_backend(pid)
because Ctrl-C rarely works. With this I can just ragequit and reconnect. Sweet!

The problem usually isn't that it doesn't work ever, just that it can take a very long time, especially if the query is reading some crazy amount of data. I've always found pg_cancel_backend() to be almost instant though.
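For reference, finding and cancelling a stuck backend from another session looks something like this (the pid shown is a placeholder):

```sql
-- Find the offending backend's pid
SELECT pid, state, query
FROM pg_stat_activity
WHERE state = 'active' AND pid <> pg_backend_pid();

-- Cancel just the running query (the connection survives)
SELECT pg_cancel_backend(12345);  -- pid from the query above

-- Or, more drastically, terminate the whole backend
SELECT pg_terminate_backend(12345);
```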
Episode website: https://www.dataengineeringpodcast.com/postgresql-data-wareh...
Direct: (apple) https://podcasts.apple.com/us/podcast/data-engineering-podca...
In general I find postgres "just works" a lot more than MySQL. MySQL has a really bad habit of sticking with bad defaults for a long period, while having better configuration available. On the other hand postgres devs actively remove/change defaults so you're always getting the best it has to offer.
If you pick one, and you don't like it there are plenty of tools to change between them. If you're curious you could even deploy both of them.
IMO, the biggest shock in the MSSQL/MySQL to PostgreSQL migration was not having 1 or 2 specific files per database, especially if you used to back up the files instead of doing a formal database backup.
Postgres by itself doesn't have a great horizontal scaling strategy as of now, I think. You need Citus or something like that on top; maybe your friend was referencing that?
[1,2] https://www.postgresql.org/docs/13/routine-vacuuming.html https://www.postgresql.org/docs/current/planner-stats.html
The main issue we get is the 1 connection = 1 process issue although there are ways to mitigate that (namely pgbouncer).
Both are a degree more difficult than NoSQL. The main issue is maintaining schemas.
nowadays postgres in the cloud does all of this for you.
And don't take my word for it, see for yourself here:
https://en.wikipedia.org/wiki/Comparison_of_relational_datab...
And MySQL is an Oracle product these days, go with MariaDB instead as this one is a MySQL fork made by the original papa of MySQL.
If I have a Django + PG query that takes 1 second and I want to deeply inspect the breakdown of that entire second, where might I begin reading to learn what tools to use and how?
Take the result of `EXPLAIN ANALYZE` on the query and paste it into https://explain.depesz.com/
which will make it human readable.
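Assuming you've already pulled the raw SQL out of Django (e.g. via the debug toolbar), the command being discussed is just (table and filter invented):

```sql
-- ANALYZE actually executes the query and reports real timings;
-- BUFFERS adds I/O detail per plan node
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM app_order WHERE customer_id = 42;
```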
Understanding this is sometimes very easy, but if you want to understand what they _really_ mean, you can read depesz.com
It covers all major databases and is a good start to dive into database interna and how to interpret output from query analyzers.
Other than that, I highly recommend joining the mailing list and IRC (#postgresql on libera.chat).
Lots of valuable tricks being shared there by people with decades of experience.
It can also be used with Django Rest Framework via the browsable api.
May be parent is looking for deeper insight than this but it is useful to do quick visual query inspection.
django debug toolbar (or similar) should be the first thing you go to because these tools understand the django ORM well.
the other thing that comes to mind is enabling query timing in your django shell. i believe you might need an extension for this.
then you can look at the postgres itself. but i would keep it at the django layer at first because it might reveal something about the ORM.
Note that this isn't valid SQL, just an approximation, because Django doesn't generate a single SQL string, but uses the underlying library's parameterization. So you'll have to fiddle with quotes and such to get SQL you can run the EXPLAIN on that's mentioned in the other replies.
Instead you are likely to be forced to use a cloud hosted PostgreSQL instance in order to get HA/clustering.
The concepts behind Vitess are sufficiently general to simply apply them to PostgreSQL now that PostgreSQL has logical replication. In some ways it can be even better due to things like replication slots being a good fit for these sorts of architectures.
The work to port Vitess to PostgreSQL is quite substantial however. Here is a ticket tracking the required tasks at a high level: https://github.com/vitessio/vitess/issues/7084
I know this is not optimized SQL. But this takes about 5 seconds in Postgres while the same command runs in milliseconds in MSSQL Server. The APCRoleTableColumn table has only about 5000 records. The above query deletes from APCRoleTableColumn all columns not present in the schema.
I used to be a heavy MSSQL user. I do love Postgres and have switched over to using it in all my projects and am not looking back. I wish it was as performant as MSSQL, though. This is just one example; I can list a number of others too.
If you don't need the NOT IN weirdness around NULL values then I'd suggest you just use a NOT EXISTS. That'll allow something more efficient like a Hash Anti Join to be used during the DELETE. Something like:
Delete From "APCRoleTableColumn"
Where Not Exists (
    Select 1
    From information_schema.columns SC
    Inner Join "APCRoleTable" RT On SC.table_name = RT."TableName"
    Where RT."TableName" = "APCRoleTableColumn"."TableName"
      And SC.column_name = "APCRoleTableColumn"."ColumnName"
      And SC.table_schema = 'public'
);
Is that faster now?
As others have said, explain analyze will show you what’s going on. I’m fairly sure this query would be fixed by flipping and / or adding an index. 5k records is nothing to pg.
In an ideal world PostgreSQL would handle an infinite number of connections without a connection pool. Unlikely in practice, though.
There are good practical reasons to actually limit the number of connections:
(a) CPU efficiency (optimal number of active connections is 1-2x number of cores)
(b) allows higher memory limits
(c) lower risk of connection storms
(d) ... probably more
Some applications simply ignore this and expect a rather high number of connections, with the assumption that most of them will be idle. Sometimes the connections are opened/closed frequently, making it worse.
Eliminating the need for a connection pool in those cases would probably require significant changes to the architecture, so that e.g. forking a process is not needed.
But my guess is that's not going to happen. A more likely solution is having a built-in connection pool which is easier to configure / operate.
Separate connection pools (like pgbouncer) are unlikely to go away, though, because being able to run them on a separate machine is a big advantage.
- 12-10-2020: "Most regression tests are passing, but write-speeds are still low."