Parquet files store row-group metadata in the file footer; Delta Lake additionally records file-level metadata in its transaction log. So Delta Lake can skip entire files before even opening any of the Parquet files to read their row-group metadata.
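A toy sketch of the idea: the transaction log keeps per-file min/max column statistics, so a reader can discard files whose value range cannot match the query predicate without touching them. The file names, the `min_ts`/`max_ts` fields, and the `files_to_scan` helper below are all invented for illustration, not Delta Lake's actual log format.

```python
# Hypothetical per-file statistics, as a planner might read them from
# the transaction log (structure invented for this sketch).
files = [
    {"path": "part-0.parquet", "min_ts": "2023-01-01", "max_ts": "2023-01-31"},
    {"path": "part-1.parquet", "min_ts": "2023-02-01", "max_ts": "2023-02-28"},
    {"path": "part-2.parquet", "min_ts": "2023-03-01", "max_ts": "2023-03-31"},
]

def files_to_scan(files, lo, hi):
    """Keep only files whose [min_ts, max_ts] range overlaps [lo, hi].

    ISO-8601 date strings compare correctly as plain strings, so no
    parsing is needed for this sketch.
    """
    return [f["path"] for f in files if f["max_ts"] >= lo and f["min_ts"] <= hi]

# A query filtered to mid-February only needs to open one file.
print(files_to_scan(files, "2023-02-10", "2023-02-20"))
```

The point is that this pruning happens purely on log metadata: the two skipped files are never opened at all, whereas plain Parquet would need at least a footer read per file.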
Delta Lake lets you rearrange your data to improve file skipping. For time-series analyses, for example, you can Z-Order by timestamp.
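To see why Z-Ordering helps, here is a toy sketch of the underlying idea: interleaving the bits of two columns produces a single sort key (a Morton code) that keeps rows close in both dimensions at once, so per-file min/max statistics stay tight for either column. This illustrates the concept only; it is not Delta Lake's actual implementation.

```python
def z_value(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of x and y into one Morton (Z-order) key."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # x takes the even bit positions
        z |= ((y >> i) & 1) << (2 * i + 1)  # y takes the odd bit positions
    return z

# Sorting rows by the interleaved key clusters them in both dimensions,
# instead of perfectly in one dimension and randomly in the other.
rows = [(3, 7), (0, 0), (5, 2), (1, 6)]
rows.sort(key=lambda r: z_value(*r))
```

Files written from data sorted this way get narrow min/max ranges on both columns, so the file-level skipping described above works for filters on either one.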
Delta Lake also supports schema evolution, so a table's schema can change over time.
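Conceptually, evolving a schema means widening it to the union of the old and new columns, with old rows reading the added columns as nulls. The dict-based schemas and the `evolve` helper below are a made-up sketch of that merge rule, not Delta Lake's API.

```python
# Current table schema and the schema of newly arriving data
# (column names and types invented for this sketch).
table_schema = {"id": "long", "name": "string"}
incoming = {"id": "long", "name": "string", "email": "string"}

def evolve(current, new):
    """Widen `current` to include any columns only present in `new`.

    Matching columns must keep the same type; a type change is rejected
    rather than silently coerced in this sketch.
    """
    merged = dict(current)
    for col, typ in new.items():
        if col in merged and merged[col] != typ:
            raise TypeError(f"type conflict on column {col!r}")
        merged.setdefault(col, typ)
    return merged

print(evolve(table_schema, incoming))
```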
This company may have a cool file format, but is it closed source? It seems like enterprises don't want to be locked into closed formats anymore.
Schema evolution is something that came up in a water-cooler conversation on my team the other day.
Thanks, but I'll stick with Parquet for now.
On reading, the reverse happens.
This becomes the compute/space conundrum: space is reduced by the regularity within each column, but time is increased by the extra overhead of columnar compression and decompression.
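The space side of that trade-off is easy to demonstrate with a toy experiment: the same records compress better laid out column-by-column than row-by-row, because grouping a low-cardinality column creates long repetitive runs, while the extra encode/decode passes are what cost CPU time. The data and layout below are made up purely for illustration.

```python
import zlib

# Synthetic table: an id column and a low-cardinality category column.
n = 10_000
ids = [str(i) for i in range(n)]
categories = [["red", "green", "blue"][i % 3] for i in range(n)]

# Row-major layout interleaves the two columns record by record.
row_major = "".join(f"{i},{c}\n" for i, c in zip(ids, categories)).encode()

# Column-major layout stores each column contiguously.
col_major = (",".join(ids) + "|" + ",".join(categories)).encode()

# The category column becomes a near-perfectly repeating run when
# stored contiguously, so the columnar layout compresses smaller.
print(len(zlib.compress(row_major)), len(zlib.compress(col_major)))
```

The columnar bytes come out smaller, but note both layouts paid a compression pass to get there, which is exactly the compute half of the conundrum.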