Quickwit 0.8: Indexing and Search at Petabyte Scale (opens in new tab)

(quickwit.io)

115 pointsvvoyer2y ago28 comments

28 comments

Never had the chance to use Quickwit at a $DAYJOB (yet?), but I really appreciate the fact that it scales down quite well too. Currently running it on my homelab, after a number of small annoyances using Loki in a single-node cluster, and it's been working very well with very reasonable resource usage.

I also decide to use Tantivy (the rust library powering/written by Quickwit) for my own bookmarking search tool by embedding it in Elixir, and the API and docs have been quite pleasant to work with. Hats of to the team, looking forward to what's coming next!

tecleandor2y ago

Ah Loki, I wanted to try it at my homelab bit it wasn't as simple as it says. Now I wanted to try Zincsearch or Openobserve. Have you tried that?

pranay012y ago

You might want to have a look at SigNoz [1] as well. We have also published some perf benchmark wrt Elastic & Loki [2] and have some cool features like logs pipeline for manipulating logs before ingestion

[1] https://github.com/signoz/signoz [2] https://signoz.io/blog/logs-performance-benchmark/

mdaniel2y ago

in case it matters to others, https://github.com/openobserve/openobserve/tree/v0.7.0 is the last Apache2 licensed copy before they went AGPL with 0.7.1

https://github.com/openobserve/openobserve/blob/v0.7.0/.env.... is some "onoz" for me, but just recently someone submitted https://github.com/aenix-io/etcd-operator to the CNCF sandbox so maybe things have gotten better around keeping that PoS alive

netingle2y ago

> it wasn't as simple as it says

mind elaborating? we built loki for some pretty massive scale but I've always tried to make it work at super small scale to. what went wrong?

bbkane2y ago

I use OpenObserve and I quite enjoy it

nextaccountic2y ago

Tantivity is great!

Here is a postgres extension that uses it to provide full text search

https://blog.paradedb.com/pages/introducing_bm25

https://news.ycombinator.com/item?id=37557127

francoismassot2y ago

tantivy, not tantivity!!!!!

francoismassot2y ago

Some companies are using it with AWS Lambda to scale to 0.

godber2y ago

We did some experimentation with quickwit about a year ago, writing about 1m docs/second of data into it for several months. It worked well and was pretty straight forward to learn and operate. If we didn’t also manage our own S3/Ceph it might be a big win, once feature complete. It’s definitely worth a look.

lrx2y ago

I think you can use quickwit with a self-hosted S3-compatible object store.

halvorbo2y ago

Amazing to see how far Tantiviy has come. Remember using and making some smaller contributions to this 3 years ago - slop to phrase queries for example. Curious how the design has changed to enable large scale production usage.

francoismassot2y ago

Thanks! Quickwit is the distributed engine built on top of tantivy, we basically separated compute and storage for search, I wrote this blog post to introduce the architecture: https://quickwit.io/blog/quickwit-101

PS: it’s tantivy!!!

fulmicoton2y ago

Very valuable contribution!

up2isomorphism2y ago

13.4GB/s with 200x6 vcpus, gives 11MB/s per core, it is good but hard to say impressive.

francoismassot2y ago

Building the inverted index is quite CPU-intensive, and we are also merging index files called "splits".

kikimora2y ago

I never being able to understand why log indexing has to build inverted index. Decent columnar store with partitioning by date should be enough to quickly filter gigabytes of logs.

nh22y ago

Because you want to find all occurrences of "error abc123" over the last year, immediately?

fulmicoton2y ago

Quickwit co-founder here... I actually agree. For a few GBs, done right, columnar works fine AND is cost efficient.

After all, it does not matter much if a log search query answers in 300ms or 1s. However, there are use cases where a few GB just does not cut it.

The tale saying that you can always prune your dataset using timestamp and tags is simply not always valid.

1 more reply

fulmicoton2y ago

What is your frame of reference?

dist1ll2y ago

Per-core store bandwidth is at least 14GB/s on Zen3, 35GB/s for non-temporal stores. Parsing JSON can be done at +2GB/s.

It's very healthy to take maximum bandwidth limits into consideration when reasoning about performance. For instance, for temporal stores, the bottlenecks you see are due to RAM latency and memory parallelism, because of the write-allocate. The load/store uarch can actually retire way more data from SIMD registers.

So there's already some headroom for CPU-bound tasks. For instance 11MB/s is very slow for JIT baseline compiler. But if your particular problem demands arbitrary random access that exceed L3 regularly, maybe that speed is justified.

fulmicoton2y ago

What we do is CPU bound and we are not just parsing JSON here.

The largest work we do is building an inverted index. Oversimplified, it is equivalent to this:

  inverted_index = defaultdict(list)
  for (doc_id, doc_json) in enumerate(doc_jsons):
    c = json.loads(payload)
    for (field, field_text) in c.items():
      for (position, token) in enumerate():
        inverted_index[token].push((doc, position))

serialize_in_compressed_way_that_allows_lookup(inverted_index)

You can implement it in a couple of hours in the language of your choice to get a proper baseline.

I am sure we can still improve our indexing throughput... but I have never seen any search engine indexing as fast as tantivy.

If someone knows a project I should know of, I'd be genuinely keen on learning from it.

2 more replies

arisudesu2y ago

musl support would be highly appreciated.

fulmicoton2y ago

We used to have one. Maybe we can revive it. What is your use case?

j / k navigate · click thread line to collapse

28 comments

dracyr2y ago

tecleandor2y ago

Ah Loki, I wanted to try it at my homelab bit it wasn't as simple as it says. Now I wanted to try Zincsearch or Openobserve. Have you tried that?

pranay012y ago

[1] https://github.com/signoz/signoz [2] https://signoz.io/blog/logs-performance-benchmark/

mdaniel2y ago

in case it matters to others, https://github.com/openobserve/openobserve/tree/v0.7.0 is the last Apache2 licensed copy before they went AGPL with 0.7.1

netingle2y ago

> it wasn't as simple as it says

mind elaborating? we built loki for some pretty massive scale but I've always tried to make it work at super small scale to. what went wrong?

bbkane2y ago

I use OpenObserve and I quite enjoy it

nextaccountic2y ago

Tantivity is great!

Here is a postgres extension that uses it to provide full text search

https://blog.paradedb.com/pages/introducing_bm25

https://news.ycombinator.com/item?id=37557127

francoismassot2y ago

tantivy, not tantivity!!!!!

francoismassot2y ago

Some companies are using it with AWS Lambda to scale to 0.

godber2y ago

lrx2y ago

I think you can use quickwit with a self-hosted S3-compatible object store.

halvorbo2y ago

francoismassot2y ago

PS: it’s tantivy!!!

fulmicoton2y ago

Very valuable contribution!

up2isomorphism2y ago

13.4GB/s with 200x6 vcpus, gives 11MB/s per core, it is good but hard to say impressive.

francoismassot2y ago

Building the inverted index is quite CPU-intensive, and we are also merging index files called "splits".

kikimora2y ago

I never being able to understand why log indexing has to build inverted index. Decent columnar store with partitioning by date should be enough to quickly filter gigabytes of logs.

nh22y ago

Because you want to find all occurrences of "error abc123" over the last year, immediately?

fulmicoton2y ago

Quickwit co-founder here... I actually agree. For a few GBs, done right, columnar works fine AND is cost efficient.

After all, it does not matter much if a log search query answers in 300ms or 1s. However, there are use cases where a few GB just does not cut it.

The tale saying that you can always prune your dataset using timestamp and tags is simply not always valid.

1 more reply

fulmicoton2y ago

What is your frame of reference?

dist1ll2y ago

Per-core store bandwidth is at least 14GB/s on Zen3, 35GB/s for non-temporal stores. Parsing JSON can be done at +2GB/s.

fulmicoton2y ago

What we do is CPU bound and we are not just parsing JSON here.

The largest work we do is building an inverted index. Oversimplified, it is equivalent to this:

  inverted_index = defaultdict(list)
  for (doc_id, doc_json) in enumerate(doc_jsons):
    c = json.loads(payload)
    for (field, field_text) in c.items():
      for (position, token) in enumerate():
        inverted_index[token].push((doc, position))

serialize_in_compressed_way_that_allows_lookup(inverted_index)

You can implement it in a couple of hours in the language of your choice to get a proper baseline.

I am sure we can still improve our indexing throughput... but I have never seen any search engine indexing as fast as tantivy.

If someone knows a project I should know of, I'd be genuinely keen on learning from it.

2 more replies

arisudesu2y ago

musl support would be highly appreciated.

fulmicoton2y ago

We used to have one. Maybe we can revive it. What is your use case?

j / k navigate · click thread line to collapse