Lesser known features of ClickHouse (opens in new tab)

(pradeepchhetri.xyz)

158 pointspradeepchhetri4y ago33 comments

33 comments

28 comments · 8 top-level

CSDude4y ago· 7 in thread

We have a use case for wide columnar data, used for mostly performance analytics. There are many types of events that share same columns, but mostly total unique columns are 20x of a typical events columns. My use case is filtering by some boolean logic and aggregating them. I use Elasticsearch with a mapping that does not tokenize text fields, (all strings are keyword type) and it works very well. I can add nodes as I want and adjust shards with ease with spot instances and make it cost effective.

I have not found a way/tool to replace this. Many of the tools fail at dynamic data with cardinality. Wanted to use Clickhouse like this, adding columns as they are discovered but it did not go well, but it has been a while. Also adding replicas is not as easy as Elasticsearch.

Does anyone have a similar use case implemented with Clickhouse when data is not known before hand and unfolds as time goes by?

niviksha4y ago

Thanks for sharing this. It is a very interesting problem that highlights some of the technical challenges of working with modern event data, which happens to 'prefer' being semi-structured (i.e JSON is the most natural serialization format while creating events).

It's also something we're working on! Shameless plug - I happen to work at Sneller (sneller.io, open source at https://github.com/SnellerInc/sneller) that might be interesting to you.

A couple of key ideas - first, we bypass the need for any sort of 'semi-structured to relational' ETL/ELT overhead by running vectorized SQL on a (compressed) binary form of the JSON data which preserves its original structure. So we're schema-on-read first and foremost - you don't need to worry about adding new fields in the source JSON as long as your queries know of these new fields.

Second, we completely separate storage from compute. Unlike CH we don't use local disk as any sort of storage tier, and use cloud object stores as our _primary_ storage tier. So all your data (including the compressed binary version of your source JSON) lives in s3 buckets in your control.

Feel free to check us out and let us know what you think!

1. Github - https://github.com/SnellerInc/sneller

2. Intro blog - https://github.com/SnellerInc/blogs/blob/main/introducing-sn...

LunaSea4y ago

I think you should give PostgreSQL's intarray extension a try, especially using the "@@" QUERY_INT boolean logic queries over GIN indexed columns.

gauravphoenix4y ago

We at Dassana[1] use clickhouse underneath and solve the exact same problem. Have a look and feel free to reach out to me if you have any questions. You can find more details on my ShowHN post[2].

[1] https://lake.dassana.io

[2] https://news.ycombinator.com/item?id=31111432

datalopers4y ago

CH approach: Load the dynamic data into a JSON field and use materialized columns when needed for performance beyond calling JSONExtract().

gauravphoenix4y ago

the "Object" data type works very well and is quite performant. But do note that if you have lots of "schemas" you will need to create multiple such columns that represent such columns e.g. one column each for GitHub,GitLab etc.

2 more replies

nijave4y ago

Uber has a blog post about using a hybrid static column/array approach for log files that might be similar to your use case (they replaced Elasticsearch installations) https://eng.uber.com/logging/

pqyzwbq4y ago

you may check: https://github.com/jitsucom/jitsu. "Jitsu is an open-source Segment alternative. Fully-scriptable data ingestion engine for modern data teams. Set-up a real-time data pipeline in minutes, not days"

You can create an API endpoint, and send those JSON to it. In the "destination" part, it can sync to clickhouse (one of many choices, like redshift, snowflake,besides clickhouse) very quickly, and flatten the JSON into columns. If there is new key found in JSON, it will create a new column in clickhouse.

zX41ZdbW4y ago· 4 in thread

I recommend to write a follow-up with the topics:

1. clickhouse-git-import tool, what is it and how to use it. 2. ClickHouse as a monitoring agent with the system.asynchronous_log table. 3. Hosting static MergeTree tables on any web server.

pradeepchhetriOP4y ago

Thank you for your response and going through the blog. I am curious about how to implement

2. ClickHouse as a monitoring agent with the system.asynchronous_log table.

Can you share some links around it.

zX41ZdbW4y ago

About the system.asynchronous_metric_log table.

It's very simple. Every ClickHouse instance, even empty, is collecting system-wide monitoring information. You can query this table directly from Grafana dashboard (for example).

1 more reply

gauravphoenix4y ago

>3. Hosting static MergeTree tables on any web server.

Can you share some docs on that? AFAIK CH can use web server hosted data sources but I don't recall CH able to host merge tree tables on HTTP servers.

js24y ago

(1) is answered here https://github.com/apache/incubator-devlake/issues/1771

datalopers4y ago· 4 in thread

The best feature of Clickhouse is replacing entire data engineering and analytics teams and their “modern data platform” with one person and one application.

polskibus4y ago

How’s that? How does clickhouse replace ETLs? How does it alleviate data quality problems? It is just a data warehouse , very fast but with limited data modelling and permissioning compared to MS OLAP or tabular models.

benjaminwootton4y ago

Im not really disagreeing with him, but you can use DBT to materialise 10 daisy chained views, or hit Clickhouse with a query and allow it to crunch data.

I am a proponent of modern data stack but it’s always at the back of my mind that an OLAP database on a big box negates the need for a lot of it.

1 more reply

datalopers4y ago

ETL: Clickhouse can probably directly read any format or data source you have.

Quality: A diligent human like anywhere.

Data Modeling: I disagree, what do you think it is missing? I’ve never encountered anything I couldn’t achieve, and the Clickhouse specific aggregates, lambda support, and array functionality really simplifies many problems.

Permissioning: It has very granular permissions.

gauravphoenix4y ago

If you use the "Object" data type (json), you can have a schema-less datastore and that can alleviate the need of complex ETL pipelines. But yes, if you need to decorate the data before ingestion, you need some ETL pipeline. In other words, if the ETL pipeline is just storing the data in columnar format by reading some JSON fields, you don't need that anymore with the new CH json data type

collegeburner4y ago· 3 in thread

Kind of related question: we really like clkckhouse but want a grafana alternative or at least an alternative to the vertamedia adapter. The adapter burns way more resources then clickhouse itself.

pradeepchhetriOP4y ago

Grafana Labs recently wrote a new clickhouse adapter.

https://github.com/grafana/clickhouse-datasource

https://grafana.com/blog/2022/05/05/introducing-the-official...

hodgesrm4y ago

We maintain the Vertamedia adapter at Altinity and have hundreds of dashboards ourselves plus OSS users have countless thousands more. Can you log an issue so we can look at it? I've never heard of this problem but would love to learn more about it.

https://github.com/Altinity/clickhouse-grafana/issues

pradeepchhetriOP4y ago

The vertamedia clickhouse plugin works perfectly for me. Have you tried opening an issue upstream.

gynther4y ago· 2 in thread

A very good article. Can anyone recommend more like this to learn more about clickhouse?

nemo44x4y ago

https://clickhouse.com/learn/

hodgesrm4y ago

https://altinity.com/blog

(Disclaimer: I work at Altinity.)

slotrans4y ago

Here's one not mentioned in the article (currently an experimental feature gated by a toggle): ClickHouse can consume a Postgres logical replication stream. Instead of paying for Stitch/Fivetran/etc, you can get OLTP->OLAP replication for free, with no additional software. It's not as flexible as the commercial stuff, and has several sharp edges, but still it's very useful and just an amazing thing to be built-in.

zX41ZdbW4y ago

In clickhouse-local example, you can omit the --structure argument and it will continue to work with automatic schema inference.

ballenf4y ago

I don't know what click house is, but that personal site is really awesome.

j / k navigate · click thread line to collapse

33 comments

28 comments · 8 top-level

CSDude4y ago· 7 in thread

Does anyone have a similar use case implemented with Clickhouse when data is not known before hand and unfolds as time goes by?

niviksha4y ago

It's also something we're working on! Shameless plug - I happen to work at Sneller (sneller.io, open source at https://github.com/SnellerInc/sneller) that might be interesting to you.

Feel free to check us out and let us know what you think!

1. Github - https://github.com/SnellerInc/sneller

2. Intro blog - https://github.com/SnellerInc/blogs/blob/main/introducing-sn...

LunaSea4y ago

I think you should give PostgreSQL's intarray extension a try, especially using the "@@" QUERY_INT boolean logic queries over GIN indexed columns.

gauravphoenix4y ago

We at Dassana[1] use clickhouse underneath and solve the exact same problem. Have a look and feel free to reach out to me if you have any questions. You can find more details on my ShowHN post[2].

[1] https://lake.dassana.io

[2] https://news.ycombinator.com/item?id=31111432

datalopers4y ago

CH approach: Load the dynamic data into a JSON field and use materialized columns when needed for performance beyond calling JSONExtract().

gauravphoenix4y ago

2 more replies

nijave4y ago

Uber has a blog post about using a hybrid static column/array approach for log files that might be similar to your use case (they replaced Elasticsearch installations) https://eng.uber.com/logging/

pqyzwbq4y ago

zX41ZdbW4y ago· 4 in thread

I recommend to write a follow-up with the topics:

1. clickhouse-git-import tool, what is it and how to use it. 2. ClickHouse as a monitoring agent with the system.asynchronous_log table. 3. Hosting static MergeTree tables on any web server.

pradeepchhetriOP4y ago

Thank you for your response and going through the blog. I am curious about how to implement

2. ClickHouse as a monitoring agent with the system.asynchronous_log table.

Can you share some links around it.

zX41ZdbW4y ago

About the system.asynchronous_metric_log table.

It's very simple. Every ClickHouse instance, even empty, is collecting system-wide monitoring information. You can query this table directly from Grafana dashboard (for example).

1 more reply

gauravphoenix4y ago

>3. Hosting static MergeTree tables on any web server.

Can you share some docs on that? AFAIK CH can use web server hosted data sources but I don't recall CH able to host merge tree tables on HTTP servers.

js24y ago

(1) is answered here https://github.com/apache/incubator-devlake/issues/1771

datalopers4y ago· 4 in thread

The best feature of Clickhouse is replacing entire data engineering and analytics teams and their “modern data platform” with one person and one application.

polskibus4y ago

benjaminwootton4y ago

Im not really disagreeing with him, but you can use DBT to materialise 10 daisy chained views, or hit Clickhouse with a query and allow it to crunch data.

I am a proponent of modern data stack but it’s always at the back of my mind that an OLAP database on a big box negates the need for a lot of it.

1 more reply

datalopers4y ago

ETL: Clickhouse can probably directly read any format or data source you have.

Quality: A diligent human like anywhere.

Permissioning: It has very granular permissions.

gauravphoenix4y ago

collegeburner4y ago· 3 in thread

Kind of related question: we really like clkckhouse but want a grafana alternative or at least an alternative to the vertamedia adapter. The adapter burns way more resources then clickhouse itself.

pradeepchhetriOP4y ago

Grafana Labs recently wrote a new clickhouse adapter.

https://github.com/grafana/clickhouse-datasource

https://grafana.com/blog/2022/05/05/introducing-the-official...

hodgesrm4y ago

https://github.com/Altinity/clickhouse-grafana/issues

pradeepchhetriOP4y ago

The vertamedia clickhouse plugin works perfectly for me. Have you tried opening an issue upstream.

gynther4y ago· 2 in thread

A very good article. Can anyone recommend more like this to learn more about clickhouse?

nemo44x4y ago

https://clickhouse.com/learn/