Create a functional index.
If you have rows like pid, {name: 'val', ...} and an index on name, with a million Johns and one Jane, good luck getting fast results.
Fast results: https://dbfiddle.uk/?rdbms=postgres_10&fiddle=dd9370966b1528....
Postgres does actually keep statistics on json columns, but if you've got a functional index on the table and the query uses it, then it doesn't matter whether there is one "Jane" and a million "Johns". You're looking up a key in a btree index.
There are ways around the statistics issue in some cases, e.g. defining a functional index on a jsonb property will collect proper statistics.
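A minimal sketch of that, with hypothetical table and column names. After `ANALYZE`, PostgreSQL gathers statistics on the indexed expression itself, so the planner can estimate selectivity for a given name:

```sql
-- Hypothetical schema: a jsonb document keyed by pid
CREATE TABLE people (
    pid bigint PRIMARY KEY,
    doc jsonb NOT NULL
);

-- Functional (expression) index on one jsonb property
CREATE INDEX people_name_idx ON people ((doc ->> 'name'));

-- Collects statistics, including on the indexed expression
ANALYZE people;

-- The predicate matches the index expression, so it can use the index
SELECT pid FROM people WHERE doc ->> 'name' = 'Jane';
```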
* Extract the attributes you're interested in into their own columns, index these. With the extraction happening outside the database, this is the most flexible option.
* Similar to above, use a trigger to automatically extract these attributes.
* Also similar to above, use a generated column[0] to automatically extract these attributes.
* Create an index on the expression[1] you use to extract the attributes.
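The last two options might look something like this (table and column names are illustrative; generated columns need PostgreSQL 12+):

```sql
-- Option 3: a stored generated column extracted from the jsonb document
CREATE TABLE docs (
    id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    body jsonb NOT NULL,
    name text GENERATED ALWAYS AS (body ->> 'name') STORED
);
CREATE INDEX docs_name_idx ON docs (name);

-- Option 4: no extra column, just an index on the extraction expression
CREATE INDEX docs_name_expr_idx ON docs ((body ->> 'name'));
```

With either approach, updating `body` keeps the indexed value in sync automatically.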
My use of JSON in PostgreSQL tends towards the first option. This works well enough for cases where documents are ingested and queried, but not updated. The last three options are automatic - add or change the JSON document and the extracted/indexed values are automatically updated.
[0] https://www.postgresql.org/docs/12/ddl-generated-columns.htm...
[1] https://www.postgresql.org/docs/12/indexes-expressional.html
There is always a trade-off. If the column is important enough, then you are right, it should stand on its own, but then you lose the JSON flexibility. I personally use jsonb almost exclusively when I only care about the object as a whole and rarely need to poke around to find an exact value. As the grandparent comment mentions, if you do need a particular value, it might be slower if your JSON records are too different (if you think about it, how can you calculate selectivity stats on a value when you have no idea how wide or how varied the JSON records are?).