As a performance-driven project, it's important for us to understand which operations and use cases are slowest/buggiest for our users so that we can focus on them. We tried to be very intentional in scoping the telemetry we collect and take this very seriously (telemetry is top-level on both our docs and README).
Happy to hear any feedback on this - we understand it's an important topic.
[Edit: parent link was fixed, thanks! :)]
"...consent options structured as an opt-out selected by default is a violation of the GDPR..."
https://en.wikipedia.org/wiki/General_Data_Protection_Regula...
I don't have strong opinions either way in this. But it's usually something that sparks a big flaming thread.
Setting aside the distributed part, what's the "killer feature" of Daft when competing with Polars (and Pandas)? Are they API-compatible? How's the memory consumption benchmark? (TBH, this is the only interesting metric. Timing and latency are not really important when your most important competitor is Spark.)
I am not focused on complex data types though.
We did actually start by using Polars as our underlying execution engine, but eventually moved off it to our own Rust Table abstraction to better suit our needs (e.g. custom datatypes and kernels). We still share the arrow2 dependency with Polars for the in-memory representation of our data.
> what's the "killer feature" of Daft for trying to compete with Polars (and Pandas)
We don't specifically try to compete on a local machine, since there is so much good new tooling available these days (DuckDB, Polars, etc.). That being said, we do try our best to make the local experience as seamless as possible, because we've all felt the pain of developing locally in PySpark. Aside from the ability to go distributed, I like using Daft for:
* Working on more "complex" datatypes such as URLs, images etc
* Working with many [Parquet/CSV/JSON] files in cloud storage (S3), which we've found to be quite common across workloads. We already have some intelligent optimizations here, such as column pruning and predicate pushdowns to reduce I/O from the cloud, and are building more - see the sketch after this list.
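To make the second point concrete, here's a rough sketch (bucket and paths are made up, and the method names are from our docs at the time of writing, so treat them as approximate):

```python
import daft

# Read many Parquet files straight out of S3 via a glob
df = daft.read_parquet("s3://my-bucket/logs/*.parquet")  # hypothetical bucket

# Column pruning + predicate pushdown: ideally only the referenced columns
# and matching row groups ever get read from storage
df = df.where(daft.col("status") == 500).select("url", "latency_ms")
df.show()
```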
As you've pointed out, one of our main responsibilities is to handle memory very very well. This is something we're actively working on and I'm thinking this will be a big reason to use us locally as well!
> Are they API-compatible?
We are not API-compatible with Pandas/Polars, but our API is quite inspired by Polars. We found that building out the core set of dataframe functionality was much more tractable than attempting to go API-compatible from the get-go.
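For a rough flavor of what "inspired by Polars" means in practice (made-up file and column names; our exact method names may differ by version):

```python
import polars as pl
import daft

# Polars
pl_out = (
    pl.read_parquet("data.parquet")
    .filter(pl.col("price") > 100)
    .select("name", "price")
)

# Daft -- same expression-oriented shape, slightly different verbs
daft_out = (
    daft.read_parquet("data.parquet")
    .where(daft.col("price") > 100)
    .select("name", "price")
)
```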
> How's the memory consumption benchmark? (TBH, this is the only interesting metric. Timing and latency are not really important when your most important competitor is Spark.)
We think throughput is still important when comparing against Spark, since this can save a lot of money when running some potentially very expensive queries!
That being said, you're spot-on about memory usage being a key metric here. One of the key advantages of having native types for multimodal data (e.g. tensors and images) is that we can much more tightly account for memory requirements when working with these types, beyond the usual black-box Python UDF which often results in a ton of out-of-memory issues.
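As a hand-wavy example of what native multimodal types look like in use (the expression namespaces are from our docs and may differ by version; the URL is made up):

```python
import daft
from daft import col

df = daft.from_pydict({"image_urls": ["https://example.com/a.jpg"]})

# Because Daft knows the result is an image column rather than an opaque
# Python object, it can account for its memory footprint much more tightly
df = df.with_column("image", col("image_urls").url.download().image.decode())
```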
Our current approach is to rely on Ray's excellent object spilling for working with out-of-core data. We recognize that there are many situations in which this is insufficient.
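For reference, Ray's object spilling can be pointed at local disk via its documented config (the directory path here is just an example):

```python
import json
import ray

# Spill objects that don't fit in the object store to local disk
ray.init(
    _system_config={
        "object_spilling_config": json.dumps(
            {"type": "filesystem", "params": {"directory_path": "/tmp/ray_spill"}}
        )
    }
)
```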
The team is working on many advanced features here (e.g. microbatching) that will give Daft a big boost, and we will release benchmarks as soon as we have them!
[Edit: typos!]
I'm going to be very blunt here, because you need to hear this to go forward:
You HAVE TO be at least API-compatible with Polars or Pandas to exist. Being backend-compatible with Arrow is not enough.
There is no technical reason why you would not pick one and go with it, apart from it being a very difficult task.
As of today, I have two major pains: Pandas being a giant memory hog, and Polars not being a drop-in replacement for Pandas. I am pushing Polars as hard as I can in all the projects I can touch, and it's a very long way from being the default DataFrame library. Data Scientists will continue to use Pandas for the foreseeable future, and that saddens me greatly, because I will also have to work with OOMKilled pipelines for the foreseeable future.
There is no place for a third alternative, so either you become a "distribution bridge for Polars", which would be absolutely amazing, or you go your own way, in which case I'll put a small star on GitHub, drop a "Noice!" in the comment section, and never come back.
It's tough, but sadly real.
* I started poking around the docs and I'm most excited about a Ray backend runner. I'm hoping this allows more ergonomic distributed data frame computation on an existing Ray cluster.
* Is this based on Apache Arrow? I would assume so, but it's important that it be zero-copy from other tools. Would like to see this mentioned prominently somewhere.
* I really like the Polars expression API. I haven't dug into the docs enough to know how it relates. I do see a reference to pinning the embedded Polars version, so fingers crossed there's compatibility. It would be AMAZING to be able to take Polars code and run it in a distributed cluster with minimal changes. Can anyone chime in?
Yes, give it a whirl and let us know what you think! Ray is amazing and has actually gotten a lot better post their 2.0 release :)
> Is this based on Apache Arrow?
Indeed it is, and thanks for the feedback. We'll make this a little more visible. We use the arrow2 Rust crate (same one that Polars uses) for our in-memory data representation.
Our data representation makes it such that converting Daft into a Ray dataset (`df.to_ray_dataset()`) is actually zero-copy. So you can go from data transformations into downstream ML stuff in Ray really easily.
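Roughly like this (the path is made up; `to_ray_dataset()` is the real method):

```python
import daft
import ray

ray.init()

df = daft.read_parquet("s3://my-bucket/features/*.parquet")  # hypothetical path
ds = df.to_ray_dataset()  # zero-copy: the Arrow buffers are shared, not copied
ds.show(5)                # hand off to downstream Ray ML work from here
```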
> It would be AMAZING to be able to take Polars code and run it in a distributed cluster with minimal changes.
Unfortunately we don't have Polars API compatibility. This seems to be a recurring theme in this thread though. The problem is that certain Polars expressions are non-trivial to do in a distributed setting, and Polars itself as a project is so young and moves so quickly it's hard for us to maintain 100% API-compatibility.
That being said, you are correct that a lot of the API is very much inspired by Polars, which should hopefully make it easy to move between the two.
2. I admit to not understanding data lakes at all. I thought it was like a failure case for like, "we can't figure out how to get this data into a database", because isn't updating it a huge chore? You have to make sure that if you're updating you're not also generating new analytics, which it seems like you're always doing because it's very slow. Don't databases solve this pretty elegantly? Why are there all these tools for dealing with data in flat files?
1. Thanks! We think so too :)
2. Here's my 2c in favor of flat files:
- Ingestion: ingesting things into a data lake is much easier than writing to a database (all you have to do is drop some JSON, CSVs or protobufs into a bucket). This makes integrating with other systems, especially 3rd-party or vendors, much easier since there's an open language-agnostic format to communicate with.
- Multimodal data: Certain datatypes (e.g. images, tensors) may not make sense in a traditional SQL database. In a datalake though, data is usually "schema-on-read", so you can at least ingest it and now the responsibility is on the downstream application to make use of it if it can/wants to - super flexible!
- "Always on": with databases, you pay for uptime which likely scales with the size of your data. If your requirements are infrequent accesses of your data then a datalake could save you a lot of money! A common example of this: once-a-day data cleanups and ETL of an aggregated subset of your data into downstream (clean!) databases for cheaper consumption.
On "isn't updating it a huge chore?": many data lakes are partitioned by ingestion time, and applications usually consume a subset of these partitions (e.g. all data over the past week). In practice this means that you can lifecycle your data and put old data into cold-storage so that it costs you less money.
I’m no data scientist, and have only worked with data lakes a couple times, but I can see why data science tends to be done with very predictable (if inefficient) data formats such as CSV, JSON, and JSONL.
Edit: SQLite is the best of both worlds. It’s a database, but it’s also “just a file.” It’s easy to work with, and many languages & frameworks are getting good support for it. SQLite’s reliability-first approach means many of the kinks that arise from involving databases (so much complexity!!) are ironed out and don’t arise as issues. (Things like auto-indexing, auto-vacuuming, avoiding & dealing with corruption, backwards & forwards compatibility, …)
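For example, with nothing but the standard library:

```python
import sqlite3

con = sqlite3.connect("analytics.db")  # the whole database is this one file
con.execute("CREATE TABLE IF NOT EXISTS events (user_id INTEGER, event TEXT)")
con.execute("INSERT INTO events VALUES (?, ?)", (1, "click"))
con.commit()
print(con.execute("SELECT COUNT(*) FROM events").fetchone())  # (1,) on first run
con.close()
```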
I have users demanding Iceberg writes and Hudi reads/writes. I don't know what to tell them, since I don't have the resources to add a reader/writer myself for those projects.
Hopefully as DuckDB becomes more popular we will see Python bindings for these popular data lake formats this year.
The other main points are usually:
* Data size
* Data access patterns
* Data formats
The more you're looking at "I want to pull 400G of data out of my 30TB set of images from a bunch of machines running a custom python script, then shut it down in twenty minutes and not start anything else until tomorrow" then the more a data lake makes sense vs a database.
> because isn't updating it a huge chore?
Not with the right tools, which can also give you things like a git-like commit experience with branching.
> You have to make sure that if you're updating you're not also generating new analytics, which it seems like you're always doing because it's very slow.
Why would you be generating new analytics? I feel I've missed something there.
Not yet, it’s on our todo list to integrate with the ecosystem of data catalogs (Iceberg/Delta/Hudi etc). Join our Slack/get in touch with us if you’re keen on this though, we’d love to learn more about your use-case!
> Any plans to support sql queries?
We do eventually want to support SQL as well, but haven’t had the bandwidth to build and maintain it. Really we’d just need to compile the SQL down to our logical plan - we could pretty easily integrate UDFs so that they can be registered as SQL functions too!
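To sketch what the parsing side of that could look like (sqlglot is just one option here, not something we've committed to):

```python
import sqlglot
from sqlglot import exp

# Parse SQL into an AST that a planner could lower onto dataframe operations
tree = sqlglot.parse_one("SELECT url, latency_ms FROM logs WHERE status = 500")

table = tree.find(exp.Table).name                      # "logs"
columns = [e.alias_or_name for e in tree.expressions]  # ["url", "latency_ms"]
where = tree.args.get("where")                         # the filter predicate

print(table, columns, where.sql())
```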
I work on Quokka (https://github.com/marsupialtail/quokka). I support Iceberg reads. Recently we've been adding SQL support by parsing the DuckDB logical plan, though that is very challenging as well.
The Python world lacks a standard for a plug-and-play SQL query optimizer. Apache Calcite is good for the JVM world, but not great if you are trying to cut out the JVM.