That's frightening they don't know that. So, burn this into your minds: no ordering is guaranteed in SQL unless an ORDER BY is applied to the outermost query (and any ordering in a subquery is lost). Even if it seems to work today, it will eventually fail. No guarantees on order unless... scratch that onto your eyeballs so you never forget it.
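A minimal sketch of the pitfall, using a hypothetical `events` table (the names are illustrative, not from the thread):

```sql
-- The inner ORDER BY is NOT preserved by the outer query;
-- the planner is free to return rows in any order it likes.
SELECT id, created_at
FROM (
    SELECT id, created_at
    FROM events
    ORDER BY created_at DESC  -- lost once the subquery is wrapped
) AS recent
LIMIT 10;

-- Correct: the ORDER BY belongs on the outermost query.
SELECT id, created_at
FROM events
ORDER BY created_at DESC
LIMIT 10;
```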
Also, will people please stop posting rainbow-on-black screenshots, especially when the screenshotted text is smaller than the surrounding text.
Or even better: have it as a connection/session parameter.
interesting! I like it.
> or appends a "random()" clause to an existing "order by" clause
That one I don't understand. If the ORDER BY is there, why randomise it? I guess
order by p, q, r, random()
then it sort of allows you to find out if it's been ordered sufficiently for your needs - is that it?

It's a common problem when moving from MySQL. We personally ran into it shortly after switching to Postgres, but it seems like something they should've discovered a lot earlier for sure.
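One reading of the trick, sketched with hypothetical column names: `random()` as the last sort key only shuffles rows that tie on the explicit keys, so it randomises exactly the part of the order the query never guaranteed.

```sql
-- If (p, q, r) does not uniquely order the rows, tied rows come back
-- in arbitrary, unstable order anyway. Appending random() reshuffles
-- exactly those ties on every run, so any hidden dependence on their
-- order breaks immediately in testing instead of sporadically later.
SELECT p, q, r, payload
FROM t
ORDER BY p, q, r, random();
```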
It would never have occurred to me to even try this, I would have assumed postgres would do as well as could be done at what looks like a very standard join. So I guess that's a lesson not to make assumptions?
Not sure, without even having a theory of why pg would behave this way, I'd be worried that as the data changes it might regress again. Or if the phase of the moon changes or something -- since it seems like magic!
In this specific case, I’d bet that having the FROM and JOIN tables reversed would be enough to get even better performance than 2 queries: `SELECT * FROM os JOIN o …`.
Frankly, it's such a simple case that I'm hesitant to pin the problem on Postgres. I'm now inclined to believe that the ORM/query builder they're using (which I'm not familiar with) is generating a query that does something unusual. I'm also having a hard time understanding the code, so it may just be a bad query in the first place. If the generated SQL was shown, it would be much easier to look into.
Generally speaking, if you're just joining some data onto the results of an expensive operation, it's exactly the same as doing a second query, just without the round trip and overhead.
The lesson here is to always look at EXPLAIN ANALYZE any time you're doing Postgres optimization work. There's a wealth of useful information that will often point you to your mistake. Databases are quite often extremely good at doing common-sense operations well.
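A sketch of what that looks like in practice, with made-up table names loosely echoing the snippet upthread:

```sql
-- Prefix the real query with EXPLAIN (ANALYZE, ...) to see the plan
-- the planner actually chose, plus per-node actual row counts,
-- timings, and buffer usage -- mismatches between estimated and
-- actual rows usually point straight at the problem.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM orders o
JOIN order_statuses os ON os.order_id = o.id
WHERE o.created_at > now() - interval '7 days';
```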
I can certainly see it when running into memory limits per table, because doing the full join might require more memory at once, but it frequently happens far below where I'd expect that to occur (in the low thousands of rows). Dunno why. Maybe it uses caches more efficiently? Many simple operations are often more cache-able than a few complex ones.
That and the automatic stemming and lemmatisation of search words even in phrase searches makes Postgres awful for any software where accurate search is critical.
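A quick illustration of the stemming point: with the built-in `english` configuration, both the document and the query are reduced to stems before matching, so there is no way to ask for the literal surface form.

```sql
-- 'running' is stemmed to 'run' at index time, so a search for the
-- exact word 'run' matches it anyway -- this should return true.
SELECT to_tsvector('english', 'she was running')
       @@ phraseto_tsquery('english', 'run');
```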
Oh well. The vast majority of forum posts don't hit that limit, so I just excluded the exceptionally long posts from the index. I never revisited it.
But Asian languages such as Thai, Japanese, Korean, etc. are not going to work at all.
PostgreSQL is weird about joins. Joining on certain columns can be super fast while others are dog slow. And this can flip depending on table size and index cardinality.
That’s why it’s important on databases that grow quickly to check the performance of even simple queries as those can balloon in execution time as the profile of the data changes.
It's, however, a database within a database. PostgreSQL does not natively support text search for these kinds of languages.
I understand that there are likely to be many significant differences in graphemes in these languages from Roman text but I'm not familiar enough with any Asian language to construct an example. Can you give an illustrative example that explains why the search doesn't work as well (or at all) in those cases?
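One illustrative sketch, under the assumption that the default parser is in play: many CJK languages write text without spaces between words, and Postgres's built-in parser splits tokens on whitespace and punctuation, so a whole run of characters comes out as a single lexeme.

```sql
-- Japanese for roughly "this is a test", written without spaces as
-- is normal. The default parser has no dictionary-based word
-- segmentation, so the entire sentence is treated as one token --
-- a search for any individual word inside it can never match.
SELECT to_tsvector('simple', 'これはテストです');
```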
Yes we join on indexed integers too.
The thing is, we believe that as the amount of data changes, the planner will choose different plans. Even after lots of analyzing and vacuuming.
It's happened so many times, and so randomly, that because of growth we really have to re-examine joins that were very fast 6 months ago and are now really slow.
You say that like it’s a bad thing. If the hunch works does that cheapen the effect? The why can come after. It often does.
Mastery isn’t pondering things faster, it’s using system 1 thinking for most of the process with some system 2 sprinkled on top. Or to use different terminology, intuition with a bit of high level executive function guiding things.
Or a lot of hunches and a little thinking.
Postgres should be able to do that join in a single query faster than two queries with the latency in between.
So why isn't it?
I use postgres FTS heavily on large datasets, have done a lot of performance tuning, so I was excited by the title. Unfortunately the article failed to deliver any useful information.
“From a technical perspective, ZomboDB is a 100% native Postgres extension that implements Postgres' Index Access Method API. As a native Postgres index type, ZomboDB allows you to CREATE INDEX ... USING zombodb on your existing Postgres tables. At that point, ZomboDB takes over and fully manages the remote Elasticsearch index and guarantees transactionally-correct text-search query results.”
https://docs.amplify.aws/cli-legacy/graphql-transformer/sear...
iirc pgbouncer saved the day there.
Otherwise it worked fine.
select * from query1 as q
where exists (
    select * from query2 as q2
    where q.col = q2.col
)