Not the archetypal time series use case, but TimescaleDB is still really useful.
TSDB's compression means we can store a huge volume of data in a fraction of the space of a standard Postgres table. You can achieve even better compression ratios and performance if you spend time designing your schema carefully, but honestly we didn't see the need, as just throwing data in gets us something like 10:1 compression and great performance.
TSDB's chunked storage engine means that queries along chunking dimensions (e.g. timestamp) are super-fast, as it knows exactly which files to read.
Chunking also means that data retention policies execute nearly instantaneously, as it's literally just deleting files from disk, rather than deleting rows one-by-one - millions of rows are gone in an instant!
And best of all, this all works in Postgres, and we can query TSDB data just the same as regular data.
All that combined easily justified the decision to use TSDB - and if you're familiar with Postgres, it's actually really simple to get started with. Really, we'd of needed a business justification not to use it!
Much love for the TimescaleDB team!
We wrote more about this for an earlier Timescale blog post: https://blog.timescale.com/blog/how-everactive-powers-a-dens...
That is where a time series DB is very handy
1. Render job logs, combined with host metrics to determine if any systems are being under utilized. 2. Intranet network usage for monitoring purposes. 3. General monitoring of log events, system metrics, etc with alerting based on specific query criteria. 4. Temperature / Humidity monitoring and trend analysis.
Basically if there are things that happen somewhat often (http logs, system logs, host metrics, power usage, temperature change, etc) and you want to analyze that data, and one of the factors in your analysis is time, then time series databases are for you.
In the end each TSDB has different trade offs, and there are a number of considerations to make, and how you weigh them will change a lot depending on scale. If you are writing 1-20GB/Day vs 1-20TB/Day, you will have different considerations. If you have relatively simple queries vs complex queries with complex joins on other data you will have different considerations. In the end you need to have some idea what your data is, how much of it there is, what the growth looks like and what your write/query usage patterns are going to look like. Insert / Query performance isn't everything, unless for your business it is.
Large scale infrastructure monitoring?
If you run a data center with 10K machines in it, whitebox monitoring of these machines and what runs on it generates tons of timestamped data.
These time series can be used to inform an automated alerting system (eg using trends to forecast bad things before they happen).
They can also be analyzed in batch mode to figure out how to optimize many things (power / cooling / workload assignment / etc ...)
FRED (https://fred.stlouisfed.org/) is a great resource for publicly available economic time series data.
This is a great post to give you some talking points:
https://blog.timescale.com/blog/what-the-heck-is-time-series...
I also love this recent one we did with some non-standard time-series data that the NFL provided! Really fun working on that data set.
https://blog.timescale.com/blog/hacking-nfl-data-with-postgr...
Simple avg, etc were better, but always clickhouse was an order of magnitude faster than timescale. We didn't invest a whole bunch into optimization other than trying some indexing strategies in timescaledb.
So for our use case the choice is clear.
Thanks for the feedback. Without knowing your situation, one of the things we show in the blog post is that TimescaleDB compression often changes the game on those kinds of queries (data is transformed to columnar storage when you compress). You don't mention if you did that or not, but it's something we've seen/noticed in every other benchmark at this point - that folks don't enable it for the benchmark.
And second point of the article is that you have lots of options for whatever works in your specific situation. But, make sure you're using the chosen database features before counting it out. :-)
From what I can tell it comes down to execution engine differences. TimeScale, even with compressed tables, uses a row by row execution engine architecturally resembling IE6 era JS engines. ClickHouse uses a batched and vectorized execution engine utilizing SIMD. Difference is one to two orders of magnitude of throughput in terms raw number of rows per core pushed through the execution engine.
Postgres/TimeScale could certainly also implement a similar model of execution, but to call it an undertaking would be an understatement considering the breadth and extensibility of features that the execution engine would need to support. To my knowledge no one is seriously working on this outside of limited capability hacks like vops or PG-Strom extensions.
You do a great job summarizing some of the benefits of ClickHouse we mentioned in the post, including the vectorized engine!
That said, I'm not sure I'd refer to PostgreSQL/TimescaleDB engine architecture as resembling IE6 JS support. Obviously YMMV, but every release of PG and TimescaleDB bring new advancements to query optimizations for the architecture they are designed for, which was the focus of the post.
I'm personally still impressed, after 20+ years of working with SQL, relational databases, when any optimization engine can use statistics to find the "best" plan among (potentially) thousands in a few ms. Maybe I'm too easily impressed. :-D
I'm struggling to imagine a case where these are the two things being considered; Timescale is the obvious choice for a primary database, Clickhouse the obvious choice for a warehouse. I wouldn't let my user-facing app write to Clickhouse, and while I could potentially get away with a read-only Timescale replica for internal-facing reports I would expect to eventually outgrow that and reach for Clickhouse/Snowflake/Redshift.
I’ve been thinking of doing exactly that. What are your concerns?
but clickhouse is very unlike the other two. when i think of a warehouse i think star schema, data modeling ect not something that hates joins.
I'd still argue Clickhouse is closer to Snowflake/Redshift than anything OLTP, and their name is intentionally chosen to evoke warehouse-like scenarios.
Howdy! We provided all of those details in the post and you're welcome to join us next week when we live-stream our setup and test!
https://blog.timescale.com/blog/what-is-clickhouse-how-does-...
From what I see, the trade off in disk space usage would point me toward Timescale for most of my workloads. The insert performance tradeoff just wouldn’t justify the difference for me.
Thanks for the compliment! It's becoming a habit with us and benchmarks. We just really want to dig in and understand what's going on and why things work the way they do. ;-)
There really are so many nuances and as we tried to say a number of times, ClickHouse is really great at what it does well. But it's still OLAP at heart (which precludes OLTP features many apps take for granted) and after enabling TimescaleDB compression, the query story isn't as cut and dry. We don't claim that TimescaleDB is the fastest in all circumstances, or that it absolutely has to be for every workload. Features and versatility play a major part in the decision.
I would assume it does but reading the article implies that it does not.
The code that's currently used by TSBS was submitted by Altinity, a heavy supporter of ClickHouse in the U.S., but TSBS is open source and anyone is welcome to contribute and make the process/test better!
May be worth pointing that out in the article since the increased disk usage has been mentioned multiple times in the article without any indication that it's only temporary until ClickHouse merges the parts.
How much you save is again exactly the same as when dealing directly with files: it depends on the data and on the compression algo.
https://github.com/timescale/tsbs/blob/master/scripts/load/l...
vs
https://github.com/timescale/tsbs/blob/master/scripts/load/l...
Agreed. At a previous work, clickhouse outperformed timescale by several orders of magnitude, under about every condition.
The timescale team seems to recognize that (look for the comment about clickhouse being a bulldozer) but they seem to say timescale can be better suited.
In my experience, in about 1% of the cases, yes, timescale will be a better choice (ex: if you do very small batches of insertions, if you need to remove some datapoints) but in 99% of the usecases for a time series database, clickhouse is the right answer.
There seems to have been several improvements to timescale since 2018, with columnar storage, compression, etc. and that's good because more competition is always better.
But in 2021, clickhouse vs timescale for a timeseries is like postgres vs mongo for a regular database: unless you have special constraints [*], the "cool" solution (timescale or mongo) is the wrong one.
[*]: you may think you have a unique problem and you need unique features, but odds are, YAGNI
I always find comments like this interesting :-). Things are better for different use cases. If you find yourself inserting a lot of data in batches for OLAP-style analysis, then ClickHouse is a better choice today.
If you find yourself performing a lot of time-series related queries, and needing to build an application on top (e.g., where you might want the OLTP features of Postgres), then Timescale is the better choice.
YMMV! And that's OK :-)
> But in 2021, clickhouse vs timescale for a timeseries is like postgres vs mongo for a regular database: unless you have special constraints [*], the "cool" solution (timescale or mongo) is the wrong one.
This is also a funny statement, because TimescaleDB is built on PostgreSQL.
We actually take great pride in being a "boring" option [0] - in fact I think TimescaleDB is many ways is more "boring" than ClickHouse (again, because of its PostgreSQL foundation). But I think that's actually a good thing - because you should want your database to be "boring" - ie you shouldn't have to worry about it!
(Disclaimer: TimescaleDB co-founder)
[0] https://blog.timescale.com/blog/when-boring-is-awesome-build...
I know, but doing timeseries with postgres is "cool", not standard, not boring. I'd even say "risky".
> We actually take great pride in being a "boring" option
No, you're not there yet: doing timeseries with timescale is way riskier than with clickhouse, which is both a bit older (not much) and more mature (much more), while also being more widely used (even if you are doing a lot of outreach like these posts)
https://github.com/timescale/tsbs/blob/a045665d9c94426bbc405...
https://github.com/timescale/tsbs/blob/a045665d9c94426bbc405...
I'm not sure why you think that's creative engineering. What you're pointing to is the depth of available configuration that the contributors to TSBS have exposed for each database. It's totally open source and anyone is welcome to add more configuration and options! I believe (although not totally sure) that Altinity and ClickHouse folks added their code a few years ago - at least it wasn't anyone on the Timescale team.
That said, we didn't actually use those scripts to run our tests. Please join us next Wednesday (10AM ET/4PM CET) to see how we set the databases up and ran the benchmarks. We'd be delighted to have you try it on your own too!
https://github.com/ClickHouse/ClickHouse/graphs/contributors
https://github.com/elastic/elasticsearch/graphs/contributors
Edit: I was very off. The Github contributor graph does not show all actual contributors. ElasticSearch has somewhere around 2-3 times as many contributors as ClickHouse.
-- Elastic vs CH in a single table.
SELECT toYear(created_at) Year,
uniqIf(creator_user_login, repo_name in ('elastic/elasticsearch')) "Elastic",
uniqIf(creator_user_login, repo_name in ('yandex/ClickHouse', 'ClickHouse/ClickHouse')) "CLickHouse"
FROM github_events
WHERE event_type = 'PullRequestEvent'
AND merged = 1
AND repo_name in ('yandex/ClickHouse', 'ClickHouse/ClickHouse', 'elastic/elasticsearch')
GROUP BY Year ORDER BY Year
You can access the ClickHouse web UI for this dataset here: https://github.demo.trial.altinity.cloud:8443/play?user=demo. The password is "demo" (type it in the left side.) This is the Altinity.Cloud copy of Alexey Milovidov's excellent github_events dataset.When I run this query I get the following numbers.
Year|Elastic|CLickHouse|
----|-------|----------|
2015| 191| 0|
2016| 299| 40|
2017| 296| 85|
2018| 284| 142|
2019| 341| 232|
2020| 339| 300|
2021| 243| 294|
Just speculation on my part but the drop Elastic contributors may be a side effect of the licensing change.The blog post benchmarks used `--use-case=cpu-only` case for data ingestion. You can see the table definition here: https://github.com/timescale/tsbs/blob/1eb7705ff921fd31784c0... coming from here: https://github.com/timescale/tsbs/blob/master/pkg/targets/cl...
Howdy! All of the details about our TSBS settings in the performance section of the docs. Also, we'll be streaming a sample benchmark of the two databases next Wednesday at 10AM ET/4PM CET.
https://blog.timescale.com/blog/what-is-clickhouse-how-does-...
twitch.tv/timescaledb
- The CH table schema generated by TSBS isn't optimized for the queries. First of all, it doesn't uses CODEC (https://altinity.com/blog/2019/7/new-encodings-to-improve-cl...) and many other optimizations CH have.
> We tried multiple batch sizes and found that in most cases there was little difference in overall insert efficiency
This is wrong in CH world where batch size matters a lot. I would recommend keep this even more higher around (10x of current value).
Humble Suggestion: There are many things not quite properly interpreted about CH and reading through the blog it seems like you're focusing more on areas which CH is lacking/missing. Please don't do these things.
- The code that TSBS uses was contributed by Altinity[1]. If there is a better setup, please feel free to submit a PR. As stated elsewhere, we did have a former CH engineer review and even updated ClickHouse to the newest version __yesterday__ based on his suggestion to ensure we had the best numbers. (and some queries did improve after upgrading, which are the numbers we presented)
- It seems like you read the article (great job - it was long!!), so I'm sure you understand that we were trying to answer performance and feature questions at a deeper level than almost any benchmark we've seen to date. Many just show a few graphs and walk away. We fully acknowledged that smaller batches are not recommended by CH, but something many (normally OLTP) users would probably have. It matters and nobody (that we know of) has shown those numbers before. And in our test, larger batch sizes do work well, but not to some great magnitude in this one server setup. Did 10k or 20k rows maybe go a little faster for CH? Sometimes yes, sometimes negligible. The illustration was that we literally spent months and hundreds of benchmark cycles trying to understand the nuances.
I think we're pretty clear in the post that CH is a great database for the intended cases, but it has shortcomings just like TimescaleDB does and we tried to faithfully explore each side.
We used the Time Series Benchmark Suite for all these tests https://github.com/timescale/tsbs. Also, Ryan (post author) will be giving all the config details in a Twitch stream happening next Wednesday. We'll be uploading the video to Youtube immediately afterwards too >>
twitch.tv/timescaledb youtube.com/timescaledb
6,000,000 rows inserted per second is great! And if you need that for your workload, then you probably should choose ClickHouse over TimescaleDB (well at least, for now ;-)
The reason we don't include that in the benchmark is that most developers _don't need_ 6,000,000 rows inserted per second.
And also - that performance doesn't come for free, but requires giving up a lot of things that most developers may need: e.g., no transactions, immutable tables (can't easily update / delete data), SQL-like but not quite SQL query language, inefficient joins, inefficient for point queries retrieving single rows by their keys, etc. (We go into much more detail in the blog post.)
So it comes down to the fundamental analogy used in the post: Do you need a car (versatility) or a bull dozer (narrow specialization)?
If the answer is that you need to support 6,000,000 rows inserted per second, then by all means, choose the bull dozer.
> ClickHouse shines at scales that timescale has no hope of ever supporting.
I'm not sure if this was a throwaway line, or if it was the result of a detailed analysis of TimescaleDB's architecture, but if you don't mind, I'll share this: with TimescaleDB multi-node [0] we are getting close to that performance, and the product keeps getting better.
[0] https://blog.timescale.com/blog/building-a-distributed-time-...
Those are great, impressive numbers. We certainly don't claim to be all things to all people, but the benchmark was run using single instances mostly because that is what most other benchmarks published by others have done.
With a multi-node TimescaleDB cluster, ingest does literally scale to millions of rows/second and we have a numerous users achieving great numbers. One Fortune 100 company has a 25+ TimescaleDB cluster backing their network monitoring stack and loves it for their use case.
At some point, when we can, I'm sure we'll start to do more with multi-node benchmarking too to give some input to the conversation.
Interestingly, we were just testing a multi-node TimescaleDB cluster the other day and found that 75k rows/batch was the optimal size as nodes increased.
So you're completely correct. I tried to be very clear that we were not intentionally "cooking the books" and there's surely other optimizations we could have made. Most of the suggestions so far, however, require further setup of CH features that haven't been used in other benchmarks, so we tried to over communicate our strategy and process.
We also fully acknowledged in the post that an siloed "insert", wait, then "query" test is not real world. But, it's the current way TSBS has been used and other DB engines have come along and used the methodology for now. Maybe that process will change in time to come with other contributions.
BTW, we'll discuss some of this next week during the live-stream and the video will be available after.
* One series of events per user
* Each series grows at about 10 events/minute while the user is active
* Fancy queries are not required, typically a user's event series is consumed in order to update aggregate state for that user
* Either used online, adding events one at a time and needing to immediately update the aggregate state
* Used offline syncing a batch of hours or days at once. When a large time interval, eventually consistent state updates are acceptable
* It must be possible to delete a user's data, regardless how old it is (a nightly batch job deleting multiple users at once is fine, if it helps performance)
* Migrating old data should be possible with reasonable performance and without consuming excessive temporary memory
* Compact storage is important (simple zstd compression should suffice, though columnar compression might be slightly better)
* Being able to use a cheaper object store like S3 for old data would be nice
At a glance timescale community appears to meet most requirements. The userid can be used as `segmentby` key, and the data compressed via columnar compression. But it seems to have limitations with migration (sounds like it requires me to manually decompress and recompress chunks, instead of simply transforming one (chunk, segment) piece at a time) and deletion (I need to delete everything with a specific `segmentby` key).
Alternatively there is the DIY approach, of serializing each entry in a compact format, one file per user, and then once data is old enough compress it (e.g. with zstd) and upload it to S3.
Compression in TimescaleDB used to mean all compressed data was immutable and the table schema couldn't be altered. Since TimescaleDB 2.1, and 2.3 that has changed in a few ways.
- Schema can now have columns added or renamed - Compressed chunks can now have rows inserted into them (partially compressed, the background job will fully compress when it runs)
Row deletion is not possible yet, but I've personally been having some internal conversations around ways to do exactly as you're suggesting in the near-term; deleting rows based on a "segmentby" column. I have some testing to do... but my focus has been taken up by a certain, 33-minute long read, blog post.
Feel free to join our Slack and DM me if you want to talk about it further.
slack.timescale.com
What do you mean by "migrating old data"? Don't want to make assumptions before answering further.
I know timescale has native support for the most common cases (adding nullable columns/renaming columns). But sometimes the transformation is more complex. Sometimes an sql update statement would suffice, sometimes streaming each segment in chronological order to the application returning the updated row might be required.
PostgreSQL, TimescaleDB, and ClickHouse are all impressive pieces of software. We use both PostgreSQL and ClickHouse at Logtail.
ClickHouse shines for true OLAP use-cases and is very hard to beat performance-wise when configured properly.
Example:
> Poor inserts and much higher disk usage (e.g., 2.7x higher disk usage than TimescaleDB) at small batch sizes (e.g., 100-300 rows/batch).
If your consistency requirements allow, you could use the Buffer Table Engine to get blazing fast inserts: https://clickhouse.com/docs/en/engines/table-engines/special...
Horizontal scalability and compression are also unbeatable from what I've seen, to name a few.
There's a hefty price tag, however: ClickHouse is quite ops heavy and its observability has a seriously steep learning curve. Only go for ClickHouse in production if you really know what you're doing :)
A classic tenet of computer science is that you can trade speed for space and vice versa. A database index is an example of this concept.
While the article complains about "benchmarketing", its tone is rather dismissive of ClickHouse and parts such as this line caught my eye as unfair.
* PostGIS
* Timescale
* Citus
* Zedstore
This truly would be the relational DB to end all relational DBs. Unfortunately, we run into a couple problems:
* Managing multiple extensions is a burdensome task, which should be in the wheelhouse of cloud providers, but...
* Timescale and Citus are are open core, holding back features for customers. Their primary revenue channels are their cloud offerings. Unfortunately you can't get Citus and Timescale in the same cloud offering, cause you're dealing with two separate companies.
* PostGIS has multiple cloud providers, but none of them have Timescale or Citus available.
* Citus only has cloud offerings on Azure, excluding the other two major players that often have exclusive relationships with companies.
* Zedstore is really cool and together with Citus could be a massive gamechanger by having columnstore and rowstores in the same distributed database. However, development has stalled, and nobody seems to be able to explain what happened.
Sigh...maybe 5 years from now.
One clarification. While TimescaleDB is open-core, our community version is source-available and 100% free to use. We do not "hold back features for customers". You do not need to pay to use any of TimescaleDB's best features, it's all free via the Timescale Community license.
You only pay if you'd like to use our hosted offerings (and save the hassle of self-managing your DB): Timescale Cloud or Managed Service for TimescaleDB.
For more see: https://www.timescale.com/products
(Disclaimer: I work at Timescale)
Regarding distributed (Citus) and columnar (Zedstore):
- TimescaleDB's compression actually takes a columnar approach (including that it only reads the individual compressed columns that you SELECT), and so combines both row- and column-oriented data. [0]
- TimescaleDB 2.0 also supports distributed deployment, and Timescale Cloud will (very soon) offer one-click deployment of fully-managed multi-node TimescaleDB. [1]
[0] https://blog.timescale.com/blog/building-columnar-compressio...
[1] https://blog.timescale.com/blog/building-a-distributed-time-...
I also really wish ClickHouse would prioritize PostGIS support - IIRC it has been on their roadmap for a while but keeps getting kicked around every year or so. Same thing with CockroachDB - PostGIS support kicked down the road every year.
We've been working really hard on our launches / releases this month! We called it "Always Be Launching" - we've been aiming for releasing multiple things per week during October :)
However, as a DB where users may store critical data, should you really be "Always be launching"? That sounds a little like FB's "move fast and break things". There's a reason why some of the mission critical open source technologies move slowly.
Timescale (the company) also provides a managed cloud offering, as well as Promscale (an observability product built on top of TimescaleDB).
So #AlwaysBeLaunching is a company-wide effort across different product & engineering teams, as well as folks in Developer Advocacy and others (e.g., who worked on this comparison benchmarks).
What might be also interesting is our introduction of Experimental Schema features in TimescaleDB - explicitly so that we can "Move fast, but don't break things" (which is also key to getting good community feedback):
https://blog.timescale.com/blog/move-fast-but-dont-break-thi...
(Timescale co-founder)
Within the core database, we offer features that are carefully marked as "experimental", which we discuss at length in this blog post [1].
Beyond TimescaleDB, we also offer other products that are more SaaS-y in nature. While they're all based on the rock-solid foundation of TimescaleDB, we are also able to ship new features more quickly because they are UI components that make using the database even easier.
Finally, some of our "launches" are more textual in nature, such as this benchmark, which we have spent months researching and compiling.
[0]: https://blog.timescale.com/blog/when-boring-is-awesome-build...
[1]: https://blog.timescale.com/blog/move-fast-but-dont-break-thi...
That post was written in November 2018 - 3 years ago - when TimescaleDB was barely 1.0.
A lot has changed since then:
1. TimescaleDB launched native columnar compression in 2019, which completely changed its story around storage footprint and query performance [0]
2. TimescaleDB has gotten much better
3. PostgreSQL has also gotten better (which in turn makes TimescaleDB better)
In fact, IIRC Altinity used and contributed ClickHouse to the TSBS [1], which is also what this newer benchmark uses as well
(Disclaimer: TimescaleDB co-founder)
[0] https://blog.timescale.com/blog/building-columnar-compressio...
The two big things, which we discuss at length in the post, are:
- Altinity (and others) did not enable compression in TimescaleDB (which converts data into columnar storage) and provides improvement in querying historical data because it can retrieve individual columns in compressed format similar to CH
- They didn't explore different batch sizes to help understand how each database is impacted at various batch sizes.
Clickhouse design in particular suggests doing ingest request approximately once per second and if you do much more than that when you use it outside of intended usage and if you need that you usually have some sort of queue between whatever produces the data and Clickhouse.
Note the ingest in small batches also can significantly affect query performance
Actually Altinity is the one that contributed the bits to TSBS for benchmarking ClickHouse[1], so we are using the work that they contributed (and anyone is welcome to make a PR for updates or changes). We also had a former ClickHouse engineer look at the setup to verify it matched best practices with how CH is currently designed, given the TSBS dataset.
As for the optimizations in the article you pointed to from 2019 (specifically how to query "last point" data more efficiently in ClickHouse), it uses a different table type (AggregatedMergeTree) and a materialized view to get better query response times for this query type.
We (or someone in the community) could certainly add that optimization to the benchmark, but it wouldn't be using raw data - which we didn't think was appropriate for the benchmark analysis. But if one wanted to use that optimization, then one should also use Continuous Aggregates for TimescaleDB - ie for an apples to apples comparison - which I think would also lead to similar results to what we show today.
It's actually something we've talked about adding to TSBS for TimescaleDB (as an option to turn on/off) and maybe other DBs could do the same.
I think the most important thing is Clickhouse is NOT designed for small batch insertion, if you need to do 1000s of Inserts/sec you do queue in front of clickhouse. And query speed can be impacted by batch side a lot. So have you looked at query performance with optimal batch size ?
https://blog.timescale.com/blog/what-is-clickhouse-how-does-...
Compression is seriously impressive, we see ~90% compression rate on our real world datasets. Having that data right next to our regular postgres tables and being able to operate on it all transactionally definitely simplifies our application logic.
Where I see a lot of folks run into issues with TimescaleDB is that it does require that your related data models hold on to relevant timestamps. If you want to query a hypertable efficiently, you always want to be able to specify the relevant time range so that it can ignore irrelevant chunks. This may mean that you need to put data_starts_at, data_ends_at columns on various other tables in your database to make sure you always know where you find your data. This is actually just fine though, because it also means you have an easy record of those min / max values on hand and don't need to hit the hypertable at all just to go "When did I last get data for this device".
Does this effect your query performance ?
> This may mean that you need to put data_starts_at, data_ends_at columns on various other tables in your database to make sure you always know where you find your data.
Do you have a link to docs for this ? Does this mean literally put a first column named (xstartx) and an end column (xendx) as the last column ? How do you then utilize it ?
Thanks so much!
They are both fantastic engines, I really like that both have made very specific tradeoffs and can be very clear in what they are good and bad at. Having worked on database engines, I can appreciate the complexity that they are solving.
My most recent use is with ClickHouse, which is great and I think a complete game-changer for the company. However there's a lot of issues (that are being worked on, the core team is great, though there are a few personalities that are a bit frosty to deal with). All of these comments come with love for the system.
1. Joins really need some work, both in the kinds of algorithms (pk aware, merge joins that don't do a full sort etc.), and in query optimizer work to make them better. We have analysts that use our system, and telling them to constantly write subqueries for simple joins is a total PITA. Not having PK aware joins is a massive blocker for higher utilization at our company, which really loves CH otherwise.
2. Some personalities will tell you that not having a query optimizer is a feature, and from an operational standpoint, it is nice to know that a query plan won't change, or try and force the optimizer to do the right thing. However, given #1, making joins performant (we have one huge table with trillions of rows, and a few smaller ones with billions) is really rough.
3. The operations story really needs some work, especially the distribution model. The model of local tables with a distributed table over it is difficult to work with personally. It would be nice to just be able to plug servers in without alot of work, like Scylla, and not have two tables that you have to keep schemas consistent with. THere's also just some odd behavior, like if you insert async into a distributed table, and only have a few shards, it'll only use a thread per shard to move that data over. It would be nice if there wasn't as much to think about.
4. Following #3, there's just too many knobs, maybe if they had a tuning tool or something that would help, but configuring thread pools is difficult to get right. I suspect CH could use a dedicated scheduler like Scylla's, that could dispatch the work, instead of relying on the OS.
5. The storage system relies a lot on the underlying FS and settings on when to fsync etc. I suspect if they had a more dedicated storage engine (controlled by the scheduler above), things could be more reliable. I still don't fully trust data being safe with CH.
6. Deduplication - This is a hard problem, but one that is really difficult to solve in CH. We solve it by having our inserters coordinate so that they always produce identical blocks, using replacing merge trees to catch stragglers (maybe), but it isn't perfect. A suggestion if possible is to try and put the same keys into the same parts, so they'll always get merged out by the replacing merge tree (I understand this is difficult).
The CH team is great, and these will be fixed in time, but these were the problems we ran into with CH.
TSDB was really solid, but we never used it at a scale where it would tip over. Our use case is really aligned with Yandex's so a lot of the functionality they have built is useful to us in a way that TSDB's isn't. (Also, being able to page data to S3 is amazing).
Thanks for the great, thoughtful feedback. We (Timescale) couldn't agree more that there is a lot to love about ClickHouse, especially where it truly excels.
Information like this is helpful for others currently in the "choose the right tool" part of the job and to the developers of the product. I can't imagine how different all of our offerings will look in a few more years! :-)
I’ve found SingleStore column scans at parity with ClickHouse in speed. At same time SingleStore uses a hybrid skip-list, columnstore data structure in their universal storage (which is default table format).
So you have high throughput transactions, as well as insanely fast aggregate scans.
Usually in column stores, they are great at append, not so much with updates and deletes.
https://www.singlestore.com/assets/contracts/singlestore-fre...
2. Restrictions. You acknowledge that the Software, and its structure, organization, and source code, constitute SingleStore’s and its suppliers’ valuable trade secrets, and the Software is subject to the following restrictions. Specifically, Customer shall not, and shall not ... conduct any competitive analysis, publish or share with any third party any results any results of any technical evaluation or benchmark tests performed on the Software, or disclose Software features, errors or bugs to a third party without SingleStore’s prior written consent (“Benchmarking”); or
What is the most funny though I remember them in MemSQL days showing benchmarks vs Oracle (which as I understand also has the same restriction)
ClickHouse has a bit more ops overhead, but requires very little in the way of pre-planning. Just throw whatever you want at it and it seems to sing by default.
Regarding ops overhead, ClickHouse also has a "local" mode where you can query a huge range of compressed formats without ever performing any kind of ETL step. That means queries can run e.g. directly against S3 storage. For batch logs analysis, IMHO local is a gamechanger. Most bulk logging systems produce massive amounts of S3 objects, and ClickHouse lets you tear through these infrequently (but at high speed) when desired without any bulky/opsey ETL step, and no cluster running idle just waiting for a handful of daily jobs to arrive.
(I love this style of system in general, but the clear engineering work behind ClickHouse completely won me over, I'd apply it anywhere I could)
Thank you for the feedback - it is conversations like this that enable us to understand how we can continue to make TimescaleDB better.
And some of the ideas you are discussion are on our roadmap - if anyone wants to help, we are hiring :-)
This is generally true for most databases these days. Use an operator if it's available. Helm can't handle the dynamic management required to run databases properly.
e.g. https://clickhouse.com/docs/en/interfaces/third-party/proxy/
There's also comparisons of TimescaleDB vs MongoDB[1] and AWS Timestream [2].
[0]: https://blog.timescale.com/blog/timescaledb-vs-influxdb-for-...
[1]: https://blog.timescale.com/blog/how-to-store-time-series-dat...
[2]:https://blog.timescale.com/blog/timescaledb-vs-amazon-timest...
SELECT r.country, m.time, SUM(m.measurement) FROM measurement_table AS m INNER JOIN refence_table AS r ON m.device_id = r.device_id
TimescaleDB is PostgreSQL, so it can easily handle this kind of join aggregate like you would expect. If "m.measurement" was compressed, historical queries with a time predicate would likely be faster than uncompressed state.
We have a few users that have larger multi-node clusters setup this way (one at 40+ nodes so far) and happy with the offering overall. Obviously YMMV depending on requirements/IT support.
Remember, TSBS is open-source and we've had some great contributions from many teams/databases. :-)
* Create a setup which is production grade i.e. run a multi-node HA setup of those systems.
* Understand the best practices of those systems otherwise result gets biased.
* Validate the results with experts of those systems before publishing.
https://altinity.com/blog/2019/7/new-encodings-to-improve-cl...
In particularly low_cardinality() for strings and time series specific compression many be very valuable
1) Clickhouse allows to trivially setup a batch buffer on server side. It will always be faster on insert with low disk usage.
2) Suspicious axis on query benchmark. It seems it represents diff in query latencies. So if one executes query in 1.00 and another in 1.34 on graph it will be whopping 34.
It is very very hard to beat clickhouse in terms of performance if set up properly.