As a performance-driven project, it's important for us to understand which operations and use cases are slowest/buggiest for our users so that we can focus on them. We tried to be very intentional in scoping the telemetry we collect and take this very seriously (telemetry is top-level on both our docs and README).
Happy to hear any feedback on this - we understand it's an important topic.
[Edit: parent link was fixed, thanks! :)]
"...consent options structured as an opt-out selected by default is a violation of the GDPR..."
https://en.wikipedia.org/wiki/General_Data_Protection_Regula...
I don't have strong opinions either way in this. But it's usually something that sparks a big flaming thread.
Setting aside the distributed part, what's the "killer feature" of Daft when competing with Polars (and Pandas)? Are they API-compatible? How's the memory consumption benchmark? (TBH, this is the only interesting metric. Timing and latency are not really important when your most important competitor is Spark.)
I am not focused on complex data types though.
We did actually start by using Polars as our underlying execution engine, but eventually moved off it to our own Rust Table abstraction to better suit our needs (e.g. custom datatypes and kernels). We still share the arrow2 dependency with Polars for the in-memory representation of our data.
> what's the "killer feature" of Daft for trying to compete with Polars (and Pandas)
We don't specifically try to compete on a local machine, since there is so much good new tooling available these days (DuckDB, Polars, etc.). That being said, we do try our best to make the local experience as seamless as possible, because we've all felt the pain of developing locally in PySpark. Aside from the ability to go distributed, I like using Daft for:
* Working on more "complex" datatypes such as URLs, images etc
* Working with many [Parquet/CSV/JSON] files in cloud storage (S3), which we've found to be quite common across workloads. We already have some intelligent optimizations here, such as column pruning and predicate pushdowns to reduce I/O from the cloud, and are building more - see the sketch after this list.
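To make the second point concrete, here's a rough sketch (bucket and paths are made up, and the method names are from our docs at the time of writing, so treat them as approximate):

```python
import daft

# Read many Parquet files straight out of S3 via a glob
df = daft.read_parquet("s3://my-bucket/logs/*.parquet")  # hypothetical bucket

# Column pruning + predicate pushdown: ideally only the referenced columns
# and matching row groups ever get read from storage
df = df.where(daft.col("status") == 500).select("url", "latency_ms")
df.show()
```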
As you've pointed out, one of our main responsibilities is to handle memory very very well. This is something we're actively working on and I'm thinking this will be a big reason to use us locally as well!
> Are they API-compatible?
We are not API-compatible with Pandas/Polars, but our API is quite inspired by Polars. We found that building out the core set of dataframe functionality was much more tractable than attempting to go API-compatible from the get-go.
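For a rough flavor of what "inspired by Polars" means in practice (made-up file and column names; our exact method names may differ by version):

```python
import polars as pl
import daft

# Polars
pl_out = (
    pl.read_parquet("data.parquet")
    .filter(pl.col("price") > 100)
    .select("name", "price")
)

# Daft -- same expression-oriented shape, slightly different verbs
daft_out = (
    daft.read_parquet("data.parquet")
    .where(daft.col("price") > 100)
    .select("name", "price")
)
```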
> How's the memory consumption benchmark? (TBH, this is the only interesting metric. Timing and latency are not really important when your most important competitor is Spark.)
We think throughput is still important when comparing against Spark, since this can save a lot of money when running some potentially very expensive queries!
That being said, you're spot-on about memory usage being a key metric here. One of the key advantages of having native types for multimodal data (e.g. tensors and images) is that we can much more tightly account for memory requirements when working with these types, beyond the usual black-box Python UDF which often results in a ton of out-of-memory issues.
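As a hand-wavy example of what native multimodal types look like in use (the expression namespaces are from our docs and may differ by version; the URL is made up):

```python
import daft
from daft import col

df = daft.from_pydict({"image_urls": ["https://example.com/a.jpg"]})

# Because Daft knows the result is an image column rather than an opaque
# Python object, it can account for its memory footprint much more tightly
df = df.with_column("image", col("image_urls").url.download().image.decode())
```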
Our current approach is to rely on Ray's excellent object spilling for working with out-of-core data. We recognize that there are many situations in which this is insufficient.
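For reference, Ray's object spilling can be pointed at local disk via its documented config (the directory path here is just an example):

```python
import json
import ray

# Spill objects that don't fit in the object store to local disk
ray.init(
    _system_config={
        "object_spilling_config": json.dumps(
            {"type": "filesystem", "params": {"directory_path": "/tmp/ray_spill"}}
        )
    }
)
```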
The team is working on many advanced features here (e.g. microbatching) that will give Daft a big boost, and we will release benchmarks as soon as we have them!
[Edit: typos!]
I'm going to be very blunt here, because you need to hear this to go forward:
You HAVE TO be at least API-compatible with Polars or Pandas to exist. Being backend-compatible with Arrow is not enough.
There is no technical reason why you would not pick one and go with it, apart from it being a very difficult task.
As of today, I have two major pains: Pandas being a giant memory hog, and Polars not being a drop-in replacement for Pandas. I am pushing Polars as hard as I can in all the projects I can touch, and it's a very long way from being the default DataFrame library. Data Scientists will continue to use Pandas for the foreseeable future, and that saddens me greatly, because I will also have to work with OOMKilled pipelines for the foreseeable future.
There is no place for a third alternative, so either you become a "distribution bridge for Polars", which would be absolutely amazing, or you go your own way, in which case I'll put a small star on GitHub, drop a "Noice!" in the comment section, and never come back.
It's tough, but sadly real.
* I started poking around the docs and I'm most excited about a Ray backend runner. I'm hoping this allows more ergonomic distributed data frame computation on an existing Ray cluster.
* Is this based on Apache Arrow? I would assume so, but it's important that it be zero-copy from other tools. Would like to see this mentioned prominently somewhere.
* I really like the Polars expression API. I haven't dug into the docs enough to know how it relates. I do see a reference to pinning the embedded Polars version, so fingers crossed there's compatibility. It would be AMAZING to be able to take Polars code and run it in a distributed cluster with minimal changes. Can anyone chime in?
Yes, give it a whirl and let us know what you think! Ray is amazing and has actually gotten a lot better post their 2.0 release :)
> Is this based on Apache Arrow?
Indeed it is, and thanks for the feedback. We'll make this a little more visible. We use the arrow2 Rust crate (same one that Polars uses) for our in-memory data representation.
Our data representation makes it such that converting Daft into a Ray dataset (`df.to_ray_dataset()`) is actually zero-copy. So you can go from data transformations into downstream ML stuff in Ray really easily.
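Roughly like this (the path is made up; `to_ray_dataset()` is the real method):

```python
import daft
import ray

ray.init()

df = daft.read_parquet("s3://my-bucket/features/*.parquet")  # hypothetical path
ds = df.to_ray_dataset()  # zero-copy: the Arrow buffers are shared, not copied
ds.show(5)                # hand off to downstream Ray ML work from here
```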
> It would be AMAZING to be able to take Polars code and run it in a distributed cluster with minimal changes.
Unfortunately we don't have Polars API compatibility. This seems to be a recurring theme in this thread though. The problem is that certain Polars expressions are non-trivial to do in a distributed setting, and Polars itself as a project is so young and moves so quickly it's hard for us to maintain 100% API-compatibility.
That being said, you are correct that a lot of the API is very much inspired by Polars, which should hopefully make it easy to move between the two.
2. I admit to not understanding data lakes at all. I thought it was like a failure case for like, "we can't figure out how to get this data into a database", because isn't updating it a huge chore? You have to make sure that if you're updating you're not also generating new analytics, which it seems like you're always doing because it's very slow. Don't databases solve this pretty elegantly? Why are there all these tools for dealing with data in flat files?
1. Thanks! We think so too :)
2. Here's my 2c in favor of flat files:
- Ingestion: ingesting things into a data lake is much easier than writing to a database (all you have to do is drop some JSON, CSVs or protobufs into a bucket). This makes integrating with other systems, especially 3rd-party or vendors, much easier since there's an open language-agnostic format to communicate with.
- Multimodal data: Certain datatypes (e.g. images, tensors) may not make sense in a traditional SQL database. In a datalake though, data is usually "schema-on-read", so you can at least ingest it and now the responsibility is on the downstream application to make use of it if it can/wants to - super flexible!
- "Always on": with databases, you pay for uptime which likely scales with the size of your data. If your requirements are infrequent accesses of your data then a datalake could save you a lot of money! A common example of this: once-a-day data cleanups and ETL of an aggregated subset of your data into downstream (clean!) databases for cheaper consumption.
On "isn't updating it a huge chore?": many data lakes are partitioned by ingestion time, and applications usually consume a subset of these partitions (e.g. all data over the past week). In practice this means that you can lifecycle your data and put old data into cold-storage so that it costs you less money.
I’m no data scientist, and have only worked with data lakes a couple times, but I can see why data science tends to be done with very predictable (if inefficient) data formats such as CSV, JSON, and JSONL.
Edit: SQLite is the best of both worlds. It’s a database, but it’s also “just a file.” It’s easy to work with, and many languages & frameworks are getting good support for it. SQLite’s reliability-first approach means many of the kinks that arise from involving databases (so much complexity!!) are ironed out and don’t arise as issues. (Things like auto-indexing, auto-vacuuming, avoiding & dealing with corruption, backwards & forwards compatibility, …)
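For example, with nothing but the standard library:

```python
import sqlite3

con = sqlite3.connect("analytics.db")  # the whole database is this one file
con.execute("CREATE TABLE IF NOT EXISTS events (user_id INTEGER, event TEXT)")
con.execute("INSERT INTO events VALUES (?, ?)", (1, "click"))
con.commit()
print(con.execute("SELECT COUNT(*) FROM events").fetchone())  # (1,) on first run
con.close()
```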
I have users demanding Iceberg writes and Hudi reads/writes. I don't know what to tell them, since I don't have the resources to add a reader/writer myself for those projects.
Hopefully as DuckDB becomes more popular we will see Python bindings for these popular data lake formats this year.
The other main points are usually:
* Data size
* Data access patterns
* Data formats
The more you're looking at "I want to pull 400G of data out of my 30TB set of images from a bunch of machines running a custom python script, then shut it down in twenty minutes and not start anything else until tomorrow" then the more a data lake makes sense vs a database.
> because isn't updating it a huge chore?
Not with the right tools, which can also give you things like a git-like commit experience with branching.
> You have to make sure that if you're updating you're not also generating new analytics, which it seems like you're always doing because it's very slow.
Why would you be generating new analytics? I feel I've missed something there.
Not yet, it’s on our todo list to integrate with the ecosystem of data catalogs (Iceberg/Delta/Hudi etc). Join our Slack/get in touch with us if you’re keen on this though, we’d love to learn more about your use-case!
> Any plans to support sql queries?
We do eventually want to support SQL as well, but haven’t had the bandwidth to build and maintain it. Really we’d just need to compile the SQL down to our logical plan - we could pretty easily integrate UDFs so that they can be registered as SQL functions too!
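To sketch what the parsing side of that could look like (sqlglot is just one option here, not something we've committed to):

```python
import sqlglot
from sqlglot import exp

# Parse SQL into an AST that a planner could lower onto dataframe operations
tree = sqlglot.parse_one("SELECT url, latency_ms FROM logs WHERE status = 500")

table = tree.find(exp.Table).name                      # "logs"
columns = [e.alias_or_name for e in tree.expressions]  # ["url", "latency_ms"]
where = tree.args.get("where")                         # the filter predicate

print(table, columns, where.sql())
```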
I work on Quokka (https://github.com/marsupialtail/quokka). I support Iceberg reads. Recently we've been adding SQL support by parsing the DuckDB logical plan, though that is very challenging as well.
The Python world lacks a standard for a plug-and-play SQL query optimizer. Apache Calcite is good for the JVM world, but not great if you are trying to cut out the JVM.