https://docs.pola.rs/api/python/stable/reference/sql/python_...
https://docs.pola.rs/api/python/stable/reference/sql/python_...
To be fair, it can nearly always be done in SQL also (unless it’s ML or some Python-specific thing like that), but the SQL with nested queries and numerous CTEs is harder for me to wrap my brain around.
If I were betting, I’d pick DuckDB, because DuckDB seems more able to implement something Polars-like, than Polars is to implement something DuckDB-like.
https://duckdb.org/docs/api/python/spark_api
And while on the subject of syntax, duckdb also has function chaining
https://duckdb.org/docs/sql/functions/overview.html#function...
To clarify Clickhouse will likely match this performance as well, but doing things on a single machines look sexier to me than it ever did in decades.
I had some spare time and tinkered with duckdb with a 70GB dataset, but just getting the 70GB on to the EC2 took hours. Would be pretty rocking if duckdb team could somehow set up a ~1TB sized demo that anyone can setup and try for themselves in, say, under an hour.
Reading the data off s3 will mean you will be slower than offerings like snowflake. Snowflake has optimized the crap out of doing analytics in s3, so you can’t beat it with something as simple as duckdb.
Importantly you need the data in some distributed format like parquet or split csv. Otherwise duckdb can’t read it in parallel.
For my cases with polars and function piping, certain aspects of that workflow are hard to represent in SQL, and additionally it's easier for iteration/testing on a given aggregation to add/remove a given function pipe, and to relate to existing tables (e.g. filter a table to only IDs present in a different table, which is more algorithmically efficient than a join-then-filter). To do the ETL I tend to do for my data science workin pandas/polars in SQL/DuckDB, it would require chains of CTEs or other shenanigans, which eliminates similicity and efficincy.
https://duckdb.org/docs/sql/query_syntax/with#cte-materializ...
Obviously, the materialization is gone after the query has ended, but still a very powerful and useful directive to add to some queries.
There are also a few DuckDB extensions for pipeline SQL languages.
https://duckdb.org/community_extensions/extensions/prql.html
https://duckdb.org/community_extensions/extensions/psql.html
And of course dbt-duckdb https://github.com/duckdb/dbt-duckdb
If your workloads require fast writes and reads, SQLite will probably work fine.
If you're looking to run analytic, columnar queries (which tend to involve a lot of aggregation and joins on a few columns (say less than 50) at a time), then DuckDB is way more optimized.
Oversimplifying, Sqlite is more OLTP and DuckDB is more OLAP.
https://duckdb.org/docs/extensions/sqlite.html https://duckdb.org/docs/guides/database_integration/sqlite.h...
Thus potentially making duckdb an HTAP-like option.
https://duckdb.org/docs/sql/statements/create_macro#overload...
There are also other ways to use both.
https://duckdb.org/docs/guides/python/polars.html
All of this dataframe compatibility is awesome. (much thanks to Arrow and others)
https://motherduck.com/blog/differential-storage-building-bl...
Hopefully it finds its way into duckdb's repo some day.
They didn't write it to be some novel research, some canonical tutorial about the tech, or to teach/amuse each and every random reader.
It kinda did and it kinda didn't. Author got lucky that Transaction.csv contained a date where the day was after the 12th in a given month. Had there not been such a date, DuckDB would have gotten the dates wrong and read it as dd/mm/yyyy.
I think a warning from DuckDB would have been in order.