We started Quickwit 3 years ago with a POC, "Searching the web for under $1000/month" (see HN discussions [0]), with the goal of making a robust OSS alternative to Elasticsearch / Splunk / Datadog.
We have reached a significant milestone with our latest release (0.7) [1], as we have witnessed users of the nightly version of Quickwit deploy clusters with hundreds of nodes, ingest hundreds of terabytes of data daily, and enjoy considerable cost savings.
To give you a concrete example, one company is ingesting hundreds of terabytes of logs daily and migrating from Elasticsearch to Quickwit. They cut their compute costs by 5x and storage costs by 2x while increasing retention from 3 to 30 days. They also improved durability, accuracy (exactly-once semantics thanks to the native Kafka support), and elasticity.
The 0.7 release also brings better integrations with the Observability ecosystem: improvements to the Elasticsearch-compatible API and better support for OpenTelemetry standards, Grafana, and Jaeger.
Of course, we still have a lot of work to do to become a fully-fledged observability engine, and we would love to get some feedback or suggestions.
To give you a glance at our 2024 roadmap, we plan to focus on Kibana/OpenDashboard integration, metrics support, and a pipe-based query language.
[0] Searching the web for under $1000/month: https://news.ycombinator.com/item?id=27074481
[1] Release blog post: https://quickwit.io/blog/quickwit-0.7
[2] Open Source Repo: https://github.com/quickwit-oss/quickwit
[3] Home Page: https://quickwit.io
Unfortunately, I think that is a large part of why immutable indexed documents were made a design choice (IIRC), which doesn't work for my use case.
It's possible to add updates/deletes to Quickwit, but this is a lot of work, and for now, we have not prioritized this development.
Do you mind sharing your use case?
Some web pages update frequently, so I would need something that can handle that.
I do understand that you seem to be targeting the log use case of Elasticsearch more so than the "Apache Solr" use case.
In passing, I would be curious what about the GDPR makes infrequent deletes a design choice. My understanding of GDPR is that the "right to be forgotten" aspects would, if anything, require deletions to be inexpensive.
For use cases where you have a lot of QPS, we recommend using Garage or MinIO, or the feature that caches index data on local disks (new in 0.7).
I wrote a blog post that explains how we do that: https://quickwit.io/blog/quickwit-101
Let's just consider IO, as it is the main effect here. With a columnar store, you have to read all of the fields targeted by your query.
If these are logs, let's assume:
- 200B per line of logs
- 100TB of logs = 500 billion lines of logs
- 30 days of retention.
- a body text field taking 70% of your data.
- highly compressible (10x compression ratio). Note this will come with a higher CPU cost, but let's focus on an IO lower bound.
If you have 100TB of data, regardless of the query, you will have to read (and decompress, but let's not talk about CPU) 7TB worth of data (70% of 100TB, compressed 10x).
Now with an inverted index? You will have to read the posting lists only. The posting lists are delta-encoded and bitpacked.
Assuming the probability of presence of a given term in a log line is p, the worst thing that can happen is having a token that is in 99% of the documents.
In that case, you will have to read 2.06 bits per document. That's 128GB for the worst posting list.
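To make the 2.06 figure more concrete, here is a minimal Python sketch of delta encoding followed by block bitpacking. The block size of 128 postings and the one-byte bit-width header per block are assumptions made for illustration (they are not stated above), and the helper names are hypothetical. The synthetic posting list skips every 100th document, so the term is present in 99% of them.

```python
BLOCK_SIZE = 128   # assumption: postings are bitpacked in blocks of 128
HEADER_BITS = 8    # assumption: one byte per block stores the bit width


def delta_encode(doc_ids):
    """Turn a sorted list of doc IDs into gaps between consecutive IDs."""
    deltas = []
    previous = -1
    for doc_id in doc_ids:
        deltas.append(doc_id - previous)
        previous = doc_id
    return deltas


def bitpacked_size_bits(deltas):
    """Size in bits when each block uses the width of its largest delta."""
    total = 0
    for start in range(0, len(deltas), BLOCK_SIZE):
        block = deltas[start:start + BLOCK_SIZE]
        width = max(block).bit_length()
        total += HEADER_BITS + width * len(block)
    return total


# Synthetic worst case: the term appears in 99 documents out of every 100.
num_docs = 12_800
doc_ids = [d for d in range(num_docs) if d % 100 != 99]

deltas = delta_encode(doc_ids)
bits_per_posting = bitpacked_size_bits(deltas) / len(doc_ids)
print(f"{bits_per_posting:.4f} bits per posting")
```

Under these assumptions, almost every delta is 1, but each block contains at least one delta of 2, so every block needs 2 bits per entry plus the small header. That is where a figure slightly above 2 bits comes from.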
If you are looking for a single keyword, in the worst possible case, you will have to read 54 times less data than with the columnar solution.
In practice, users search for several keywords, but those keywords are also considerably less pathological than the example I just gave. Overall you will typically end up reading 20 to 100 times less data than with the grep solution.
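The worst-case numbers above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: all inputs are the figures from the example, the variable names are made up for illustration, and decimal units (1TB = 10^12 bytes) are assumed.

```python
# Inputs taken from the example above (decimal units assumed).
TOTAL_BYTES = 100e12        # 100TB of logs
BYTES_PER_LINE = 200        # 200B per line of logs
BODY_FRACTION = 0.70        # the body field takes 70% of the data
COMPRESSION_RATIO = 10      # 10x compression ratio
WORST_BITS_PER_DOC = 2.06   # worst-case posting list cost per document

num_lines = TOTAL_BYTES / BYTES_PER_LINE

# Columnar scan: the whole compressed body column is read for any query.
columnar_bytes = TOTAL_BYTES * BODY_FRACTION / COMPRESSION_RATIO

# Inverted index: only the posting list of the searched term is read.
posting_bytes = num_lines * WORST_BITS_PER_DOC / 8

ratio = columnar_bytes / posting_bytes

print(f"log lines:        {num_lines:.3g}")
print(f"columnar read:    {columnar_bytes / 1e12:.1f} TB")
print(f"worst posting:    {posting_bytes / 1e9:.0f} GB")
print(f"columnar / index: {ratio:.1f}x")
```

The ratio lands at roughly 54, matching the single-keyword worst case.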
I left CPU aside, but actual search engines are also much more CPU efficient.
But I said there was a trade-off... where is it? Well, you have to pay a much higher cost at indexing.
Some search engine implementations make it seem like indexing is more expensive than it should be. Quickwit/tantivy are especially efficient there. With a 4-vCPU VM, you can expect to index at 2TB/day. So in the example above, you will have to dedicate 8 vCPUs for indexing. This is perfectly reasonable.
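For the indexing side, a quick sizing sketch in Python. It assumes that the 100TB in the example is the total retained over the 30-day window and is ingested evenly, which is one reading that yields the 8 vCPUs mentioned above. The variable names are hypothetical.

```python
import math

TOTAL_TB = 100          # total retained data in the example
RETENTION_DAYS = 30     # retention window in the example
TB_PER_DAY_PER_VM = 2   # indexing throughput of one 4-vCPU VM
VCPUS_PER_VM = 4

# Assumption: ingestion is spread evenly over the retention window.
daily_ingest_tb = TOTAL_TB / RETENTION_DAYS

# Round up to whole VMs, then convert to vCPUs.
vms_needed = math.ceil(daily_ingest_tb / TB_PER_DAY_PER_VM)
vcpus_needed = vms_needed * VCPUS_PER_VM

print(f"daily ingest: {daily_ingest_tb:.2f} TB/day")
print(f"indexing:     {vms_needed} VMs, {vcpus_needed} vCPUs")
```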
BUT if your retention is much, much shorter (a few days), indexing might not be worth it.
If your volume of data is small too, you probably do not need to care at all about efficiency.
The inverted index and columnar storage are part of tantivy [0], which is the fastest OSS search library out there (except for the academic project pisa) [1]. We maintain it, and we decided to build the distributed engine on top of it.
[0] tantivy github repo: https://github.com/quickwit-oss/tantivy
[1] tantivy bench: https://tantivy-search.github.io/bench/
I like the example of Grafana with all their AGPL projects (Grafana, Loki, Tempo, ...). There are a LOT of companies using Grafana with the AGPL version.
You'd need Kafka, ZooKeeper, and Jaeger. All would need to be HA. Then also this service. Not mentioning Postgres because in theory you can use Aurora or the like.
How quickly have your current customers been able to get up and running so far? And how much maintenance have they needed?
For the crazy large use cases, like the one described in the blog post, Kafka becomes necessary. At that scale, our users usually already have their data in Kafka or RedPanda and are actually happy to be able to get native integration:
- their data does not need to be copied/replicated in a WAL "again"
- we get exactly-once semantics
Also, in 0.8, we will be adding proper support for distributed ingest. The feature is actually already implemented and was originally scheduled for 0.7, but we preferred to test it more before actually shipping it.
PostgreSQL is not mandatory; it's also possible to use Quickwit with a metastore on S3. For large use cases, PostgreSQL is the way to go. I've seen users using Quickwit with a metastore on S3, RDS, and Aurora.
On the UI side, we have several users who have their own UI. Jaeger is used just for the UI part, so it's quite simple to run it in HA. I don't think it's hard to have HA for Grafana, but I'm not sure on this point.
Which docker compose did you look at?
I wonder if it's going to be the "python2.7" (or I guess more relevant "Java8") of running Kafka :-(
A few questions come to mind. Firstly, is Quickwit compatible with any S3-compatible object storage, such as Cloudflare's R2? Are there particular considerations to keep in mind for this kind of setup? Secondly, do you see Quickwit being used for analytics, such as tracking daily visits or analyzing user retention?
Your insights on these would be greatly appreciated.
> Secondly, do you see Quickwit being used for analytics, such as tracking daily visits or analyzing user retention?
Excellent question. Quickwit is very fast on Elasticsearch aggregations. We do not support the cardinality aggregation yet, but it is scheduled for version 0.8.
Analyzing user retention and, generally speaking, running complex analytics will not be possible any time soon. Maybe next year?
There is a great presentation from their recent conference [1]. I am asking this because I find it quite similar to what you are doing: decoupling storage from compute and utilising S3. I would really appreciate it if you could share your insights.
[1] starts at 30:45 https://m.youtube.com/watch?v=ZX7rA78BYS0
For OLTP, the equation is less obvious. I'd be scared to suffer from the cost associated with PUT/GET requests. (Pushing once every 30s or so equates to a few dollars per month.)
Since Scylla is based on an LSM-Tree (I think), I would have expected the talk to be about saving sstables on S3 but keeping the WAL on disk... But the slides seem to point to pushing the WAL to S3.
For Quickwit, our users proved to us that it scales up to petabytes, so we count that scale as part of what "alternative" means. But... we don't have a dedicated metrics storage engine yet, so if you want to store metrics in Quickwit, it won't be efficient in the current version. It will come later this year.
> it's comparable to Datadog feature wise but nowhere close in performance and scalability
Would love to understand how you tested SigNoz and what issues you found in performance and scalability.
Is it for a field with a high cardinality? If you tell us more about your use case, maybe we can find a workaround.
Quickwit is its own data engine.
We also have traces, metrics and logs in a single application, which makes correlation across them much easier. From what I can understand from the Quickwit website, they use Grafana and Jaeger for UI.
Here's our github repo if you want to check it out. https://github.com/signoz/signoz
I guess that's to be expected. Almost anything is more storage-efficient than Elasticsearch; FTS is so expensive.
I tried tapping into this with Grafana but never learned to build graphs as easily as with Kibana. Maybe I was not trying hard enough?
Has anyone replaced Kibana with Grafana for non-time-based graphs?
That being said, we also hear users complaining about OpenDashboard/Kibana and looking for an alternative to the Kibana/Grafana Explore view (the view used for log and tracing search). You will also find users satisfied with the Grafana Explore view.
Personally, I don't find the Grafana Explore view great for log searches. I saw that Grafana recently made some improvements, and I need to dig into that to adapt the Quickwit Grafana plugin. I don't have a clear opinion on Kibana. One of my dreams is to build a better UI for log/traces search anyway, though it's not on the roadmap yet :)