Using Parquet's Bloom Filters (opens in new tab)

(influxdata.com)

53 pointspauldix2y ago7 comments

7 comments

6 comments · 2 top-level

darkflame912y ago· 4 in thread

With large datasets, wouldn't partitioning the data on low cardinality columns give the same benefit without the space overhead?

ithkuil2y ago

Define "same benefits".

Bloom filters allow you to prune the number of files you even have to look at, which matters a lot when there is a coat associated with scanning the files.

Partitioning the data can be advantageous both for pruning (on its sort keys) or for parallel query fan-out (you independently scan and apply the predicate to the high cardinality column in each partition concurrently).

In the use case that underpins the article, they want to minimize unnecessary access to parquet file data because it lives on a high latency storage system and the compute infrastructure is not meant to be scaled up to match the number of partitions. So they just want an index to help find the data in the high cardinality column

darkflame912y ago

Partitioning also prunes the number of files to be looked at. Only the directory structure of the files (or prefixes of the objects in S3) need to be checked.

1 more reply

alamb2y ago

In general, if you can partition your datasets on your predicate column, sorting is likely the best option

For example when you have a predicate like, `where id = 'fdhah-4311-ddsdd-222aa'` sorting on the `id` column will help

However, if you have predicates on multiple different sets of columns, such as another query on `state = 'MA'`, you can't pick an ideal sort order for all of them.

People often partition (sort) on the low cardinality columns first as that tends to improve compression signficantly

kentm2y ago

Yes. The bloom filters tend to be useful on queries that are doing lookups on specific values for high-cardinality columns. For example, "SELECT * FROM ... WHERE ids IN ('guid1', 'guid2', 'guid3', ...)". You could hash partition on the guid here but its likely that your filter criteria will cover a lot/all of the potential hashes, as opposed to the bloom filter.

appplication2y ago

One thing I have wondered: would it make sense to reduce file size? Generally advice I’ve seen is to keep files to around 250mb-1gb, but if you’re leaning heavily on bloom filters it feels like it could make sense to reduce the number of files to reduce the amount that would trigger the per-file filter.

j / k navigate · click thread line to collapse