I'm able to host a service on a Pi at home with full-text search and a regular peak load of a few rps (not much, admittedly), with a CPU that barely spikes above a few percent. I've load tested searches on the Pi up to ~100rps and it held up. I keep thinking I should write up my experiences with it. It was pretty much a drop-in, super-useful library and the team was very responsive with bug reports, of which there were very few.
If you want to see how responsive the search is on such a small device, try clicking the labels on each story -- it's virtually instantaneous to query, and this is hitting up to 10 years * 12 months of search shards! https://progscrape.com/?search=javascript
I'd recommend looking at it over Lucene for modern projects. I am a big fan, as you might be able to tell. Given how well it scales on a tiny little ARM64, I'd wager your experiences on bigger iron will be even more fantastic.
I wanted users to be able to search their backups. As I'm using Rust, Tantivy looked like just the right thing for the job. Indexing a single email happens so fast that I didn't bother to move the work to a separate thread. And searching across thousands of emails seems to be no problem.
If anyone wants full-text search for their Rust application, they should take a look at Tantivy.
I basically just need full-text search.
https://www.postgresql.org/docs/current/textsearch.html
https://www.crunchydata.com/blog/postgres-full-text-search-a...
https://github.com/paradedb/paradedb/blob/dev/pg_search/Carg...
After listening to "Extending Postgres for High Performance Analytics (with Philippe Noël)": https://www.youtube.com/watch?v=NbOAEJrsbaM
And Tantivy sits inside the main thing, Quickwit (logs, traces, and soon metrics): https://github.com/quickwit-oss/quickwit
I had a surprisingly good experience with the combined power of Quickwit and ClickHouse for a multilingual search pet project. Finally something usable for Chinese, Japanese, and Korean.
https://quickwit.io/docs/guides/add-full-text-search-to-your...
to_tsvector in PG never worked well for my use cases
SELECT * FROM dump WHERE to_tsvector('english'::regconfig, hh_fullname) @@ to_tsquery('english'::regconfig, 'query');
I wish them success. I'll automatically upvote any post with Tantivy as a keyword.
Perhaps most importantly, separation of compute and storage has proven invaluable. Being able to spin up a new search service over a few billion objects in object storage (complete with complex aggregations) without having to pay for long-running beefy servers has enabled some new use cases that otherwise would have been quite expensive. If/when the use case justifies beefy servers, Quickwit also provides an option to improve performance by caching data on each server.
Huge bonus: the team is very responsive and helpful on Discord.
There are different use cases for alternatives to Lucene, depending on your needs.
The only way to add fields is to reindex all data into a different search index.
The Java SDK for Meilisearch was also nice, same thing: no need for a CLI or manual configuration. I just pointed it at a DB entity and indexed whole tables...
I'd love to have that for Tantivy.
Yes, that's how you use Tantivy normally; I'm not sure which JSON config you mean.
tantivy-cli is more of a showcase; https://github.com/quickwit-oss/tantivy is the actual project.
Some of us have specific principles of which things like opt-out telemetry might run afoul.
OP will choose their software, I choose mine and you choose yours; none of us need to call each other petty or otherwise cast such negative judgement; a free market is a free market.
Mad props to the whole team! I'm a firm believer that they will succeed in their quest!
There are cases where this will probably never be possible (fields with lossy indexing where the datatype's indexing algorithm changed), but in many cases all the information is there, and it would be really nice if such indexes could be identified and upgraded.
Beyond its runtime characteristics, the codebase is well organized and a great resource for learning about information retrieval.
You have a big list of separate libraries supporting a variety of languages? Great. Unfortunately, that doesn't help me build a real multi-language app. Doing that work right now, with multiple indexes and query routing, seems very difficult.
- Their website :)
Often you take the results from both vector search and lexical search and merge them with an algorithm like Reciprocal Rank Fusion (RRF).
Lexical search works great when you want to retrieve documents that actually contain the specific keywords in your query, as opposed to using embeddings to find something roughly in the same semantic ballpark.
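To illustrate the merge step, here's a minimal sketch of RRF in Rust: each document's fused score is the sum of 1/(k + rank) over every ranked list it appears in, with k = 60 as the commonly used constant. The document IDs and rankings below are toy values:

```rust
use std::collections::HashMap;

/// Merge several ranked result lists with Reciprocal Rank Fusion.
/// Returns (doc_id, fused_score) pairs, best first.
fn rrf_merge(rankings: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for ranking in rankings {
        for (i, doc) in ranking.iter().enumerate() {
            // Ranks are 1-based in the RRF formula: score += 1 / (k + rank).
            *scores.entry(doc.to_string()).or_insert(0.0) += 1.0 / (k + (i + 1) as f64);
        }
    }
    let mut merged: Vec<(String, f64)> = scores.into_iter().collect();
    // Sort by fused score, highest first.
    merged.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    merged
}

fn main() {
    let lexical = vec!["a", "b", "c"]; // keyword-based ranking
    let vector = vec!["b", "d", "a"]; // embedding-based ranking
    let merged = rrf_merge(&[lexical, vector], 60.0);
    // "b" and "a" appear in both lists, so they outrank "c" and "d".
    assert_eq!(merged[0].0, "b");
    assert_eq!(merged[1].0, "a");
    println!("{:?}", merged);
}
```

Because RRF only looks at ranks, not raw scores, it sidesteps the problem that BM25 scores and cosine similarities live on incomparable scales.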