Another thing I like is that conceptually it seems like it would be simple to switch the underlying query engine (right now it’s Polars) in the future. Seems like a pretty general distributed system.
Quokka supports checkpointing but does not enable that by default to prevent this common problem from killing normal operation performance.
In principle, the programming language should not be the greatest consideration because developers can learn and use different languages for different applications. In practice, being able to draw on familiar syntax and libraries can make a real difference in usability.
That's like saying "a library to parse and optimize computer programs", except probably even harder, since a compiler and runtime library can't make any assumptions about the programs they need to make run, so they're limited in the potential of utilizing all that context information.
Countless person-years have been spent on this and it's still a very active field of research and engineering.
> 2x SparkSQL performance
Ah, ok, so it can be slow. Never mind then, carry on :-P
SELECT * FROM (SELECT * FROM x) WHERE z = 1
can be optimized into
SELECT * FROM x WHERE z = 1
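To make the equivalence concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table x, its columns y and z, and the sample rows are made up for illustration; it only checks that the nested and flattened queries return the same rows, and says nothing about how Quokka or any other engine performs the rewrite.

```python
import sqlite3

# Hypothetical table and data, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE x (y INTEGER, z INTEGER)")
conn.executemany(
    "INSERT INTO x VALUES (?, ?)",
    [(1, 1), (2, 0), (3, 1), (4, 2)],
)

nested = "SELECT * FROM (SELECT * FROM x) WHERE z = 1"
flattened = "SELECT * FROM x WHERE z = 1"

# Sort both results so the comparison does not depend on row order.
nested_rows = sorted(conn.execute(nested).fetchall())
flat_rows = sorted(conn.execute(flattened).fetchall())

print(nested_rows == flat_rows)
```

Running EXPLAIN QUERY PLAN on both statements in SQLite is one way to see whether the engine you are using already flattens the subquery on its own.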
Well, some of these transformations are useful, like the one you presented. But any non-trivial transformation may be either beneficial or detrimental, depending on a myriad of factors including memory layout, compression, distribution of data, computing hardware etc.
> Very fast kernels for SQL primitives like joins, filtering and aggregations. Quokka uses Polars to implement these. (I sponsor Polars on Github and you should too.) I am also exploring DuckDB, but I have found Polars to be faster so far.
I agree that it’s a little disingenuous to call it “pure Python” when the two libraries doing the heavy lifting are non-Python; but it’s not a lie that the entirety of the Quokka-specific codebase is Python.
Personally, what I would be more interested in (and what I thought this would be from the title) is a full SQL engine wholesale coded in Python, a la SQLite. Even if it wasn’t super performant or functional.
Here you go!
Not sure what the process looks like through the Python API. Maybe @ritchie46 can chime in?
Here are some examples of how to use it in python:
https://github.com/pola-rs/polars/blob/91a419acaf024e64410e7...
However, full sql support is on the roadmap. It's just a matter of hours in a day...
> After all, most ML in industry today seems to be lightweight models applied to heavily engineered features
I assume "lightweight models" are those that don't have too many parameters, and "heavily engineered features" mean that the data fed into the model has undergone significant pre-processing via potentially complicated UDFs -- hence the motivation for the project. Is that right?
> Quokka is an open-source push-based vectorized query engine ... it is meant to be much more performant than blocking-shuffle based alternatives like SparkSQL
Does anyone have pointers to what push-based vs blocking-shuffle engines are? Any good papers?
> It should work on local machine no problem (and should be a lot faster than Pandas!)
So I understand why Quokka is faster than Spark, but I'm a bit uncertain as to why the author is also making a comparison with Pandas on a single machine. Is it because the streaming pipeline design means that Quokka can better take advantage of multiple cores?
For push vs. pull, I'd recommend: https://news.ycombinator.com/item?id=27006476.
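As a rough illustration of the push vs. pull distinction only (it does not model shuffles, and it is not how Quokka or Spark are implemented), here is a toy sketch. In the pull version the consumer asks each operator for the next row; in the push version the source drives execution and hands batches to downstream operators. All names (scan, pull_filter, PushFilter, Collect, push_scan) are hypothetical.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]


# Pull-based: operators are iterators, and the consumer at the top
# of the plan drives execution by asking for the next row.
def scan(rows: Iterable[Row]) -> Iterator[Row]:
    for row in rows:
        yield row


def pull_filter(child: Iterator[Row], pred: Callable[[Row], bool]) -> Iterator[Row]:
    for row in child:
        if pred(row):
            yield row


# Push-based: the source drives execution and pushes batches of rows
# into whatever operator sits downstream of it.
class Collect:
    def __init__(self) -> None:
        self.rows: List[Row] = []

    def push(self, batch: List[Row]) -> None:
        self.rows.extend(batch)


class PushFilter:
    def __init__(self, pred: Callable[[Row], bool], downstream) -> None:
        self.pred = pred
        self.downstream = downstream

    def push(self, batch: List[Row]) -> None:
        kept = [row for row in batch if self.pred(row)]
        if kept:
            self.downstream.push(kept)


def push_scan(rows: List[Row], downstream, batch_size: int = 2) -> None:
    for start in range(0, len(rows), batch_size):
        downstream.push(rows[start:start + batch_size])


data = [{"z": 1, "v": "a"}, {"z": 0, "v": "b"}, {"z": 1, "v": "c"}]

pulled = list(pull_filter(scan(data), lambda r: r["z"] == 1))

sink = Collect()
push_scan(data, PushFilter(lambda r: r["z"] == 1, sink))
pushed = sink.rows

print(pulled == pushed)
```

Both pipelines produce the same rows; the difference is which side controls the flow, which matters once operators run on different machines and data arrives asynchronously.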
On single machine, you really should just use Polars. Quokka is faster than Pandas because it can take advantage of multiple cores, but so can Polars -- and it is likely to be faster.
The SQL optimizations like predicate pushdown and early projection are all there already in the dataframe API, similar to Polars.
This reminds me of many similar projects, like Lightworks:
> Editor's note: The intent for Lightworks to go open source seems to have been abandoned.
[0] https://opensource.com/business/12/10/lightworks-linux-devel...
I've seen several variants of "next-gen" spark, but nowhere have I really seen the different tradeoffs/advantages/disadvantages between them.
Secondly quokka tries to be fault tolerant. I.e. it can handle worker failures intra query and not have to start over. This is quite important in real world spark deployments with thousands of nodes running many hours. AFAIK this is not well supported by most spark alternatives.
Finally quokka has a much stronger focus on time series data analytics. It is meant to excel at workloads like range joins and asof/PIT joins used for feature engineering. (This part isn't too stable yet so is not open source)
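For readers unfamiliar with asof/PIT joins, here is a minimal single-machine sketch of the backward as-of variant: each left row is matched with the most recent right row whose timestamp is not later than its own. The function name asof_join and the trades/quotes data are hypothetical, and this says nothing about how Quokka implements the operation.

```python
from bisect import bisect_right


def asof_join(left, right):
    """Backward as-of join on (timestamp, value) pairs.

    Both inputs must be sorted by timestamp. For each left row, attach the
    value of the latest right row with right_ts <= left_ts, or None if no
    such row exists.
    """
    right_ts = [ts for ts, _ in right]
    out = []
    for ts, left_value in left:
        # Index of the first right timestamp strictly greater than ts.
        i = bisect_right(right_ts, ts)
        right_value = right[i - 1][1] if i > 0 else None
        out.append((ts, left_value, right_value))
    return out


# Hypothetical data: trades matched against the latest known quote.
trades = [(1, "t1"), (5, "t2"), (10, "t3")]
quotes = [(2, 100.0), (4, 101.0), (9, 102.5)]

joined = asof_join(trades, quotes)
print(joined)
```

The point-in-time property is what makes this useful for feature engineering: a row never sees a value from its own future.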
This means quokka optimizes on a different point in the UDF/performance/fault tolerance tradeoff space than something like arrow ballista or starrocks, which I think are pure performance plays
I really do think a distributed db with compute/storage separation and optimized for feature engineering/dataloading (for training NNs) is underserved.
I'd be very interested in the time series aspects of what you're building.
I have a SQL Engine in Python too (https://github.com/mabel-dev/opteryx). I focused my initial effort on supporting SQL statements and making the usage feel like a database - that probably reflects the problem I had in front of me when I set out - only handling handfuls of gigabytes in a batch environment for ETLs with a group of new-to-data-engineering engineers. I have recently started looking more at real-time performance, such as distributing work. I'm interested in how you've approached it.
It might be worth running your benchmarks against Trino with fault tolerant execution mode enabled. Check the documentation here: https://trino.io/docs/current/admin/fault-tolerant-execution...
Adding fault-tolerant execution to Trino was a big and complicated project. For anyone interested in more details, check here: https://trino.io/blog/2022/05/05/tardigrade-launch.html
~One suggestion, a Scala/Java Spark run of those benchmarks should be a valid baseline to compare against as well instead of PySpark.~ Ah it's SparkSQL so the execution probably wouldn't have much of py4j involvement, except for the collect.
https://spark.apache.org/docs/3.0.0/sql-pyspark-pandas-with-...
Something like building a toy dynamodb variant. Would pay good money for this.
It's one thing to read about and guesstimate implementation choices from a white paper. It's totally another if a databases/distributed systems expert walks you through propagating write-ahead logs to replicas. Hehe
And what it means to have eventually consistent writes vs strong writes in practice. A way to teach by doing.
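In that teach-by-doing spirit, here is a toy in-memory sketch of the idea, under heavy simplifying assumptions: one leader, no failures, no networking, and hypothetical class names (Replica, Leader). A "strong" write ships the log to every follower before returning; an "eventual" write returns after the leader's own log append, and followers catch up on a later replicate() call. It is not modelled on DynamoDB or any real system.

```python
class Replica:
    def __init__(self):
        self.log = []      # write-ahead log of (key, value) entries
        self.applied = 0   # number of log entries already applied to data
        self.data = {}

    def append(self, entry):
        self.log.append(entry)

    def apply_pending(self):
        # Replay log entries that have not yet been applied to the store.
        while self.applied < len(self.log):
            key, value = self.log[self.applied]
            self.data[key] = value
            self.applied += 1


class Leader(Replica):
    def __init__(self, followers):
        super().__init__()
        self.followers = followers
        # Per-follower count of log entries already shipped.
        self.shipped = [0] * len(followers)

    def write(self, key, value, strong):
        # Log first, then apply locally.
        self.append((key, value))
        self.apply_pending()
        if strong:
            # Strong write: replicate before acknowledging (returning).
            self.replicate()

    def replicate(self):
        for idx, follower in enumerate(self.followers):
            for entry in self.log[self.shipped[idx]:]:
                follower.append(entry)
            self.shipped[idx] = len(self.log)
            follower.apply_pending()


follower = Replica()
leader = Leader([follower])

leader.write("a", 1, strong=True)
strong_read = follower.data.get("a")      # follower already has the write

leader.write("a", 2, strong=False)
stale_read = follower.data.get("a")       # follower still has the old value

leader.replicate()                        # background catch-up, done by hand here
converged_read = follower.data.get("a")

print(strong_read, stale_read, converged_read)
```

The stale read in the middle is the practical meaning of eventual consistency: the write was acknowledged, but a read from the follower can still return the old value until replication catches up.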
I prefer reading a description of an algorithm to reading its code. From the description I can work out how to implement the code myself. I might look at someone's code for ideas or to compare how I solved a problem.
I too would enjoy a whitepaper on database design. I document all the whitepapers I've read on my GitHub profile (see my HN profile).
I have a simple btree here https://GitHub.com/samsquire/btree
Farley Knight added an AVL tree to the repository; I still need to change the unbalanced tree to use it.
I would like to know how the SQLite VM works so I can write a database VM. I learned some things from the Crafting Interpreters book and started writing my own programming language here
https://GitHub.com/samsquire/multiversion-concurrency-contro...
Casual README slip of the year...
> When I set out, I had several objectives:
> Easy to install and run, especially for distributed deployments.
> [...]
> The first two objectives strongly scream Python as the language of choice for Quokka.
Python is probably one of the last languages I'd consider if ease of deployment is a priority. Packaging has historically been a mess, and deploying standalone binaries across platforms is a pain. State-of-the-art solutions are third-party and involve bundling the interpreter for each platform. It's been a few years since I last used it for anything serious, but I believe this is still the case.
Whereas something like Go actually makes this infinitely easier, for both the developer and the user. One native Go command builds a standalone binary for each platform. It couldn't be simpler.
The other objective of supporting Python UDFs necessarily ties you to Python. And since this is solving a data science problem, it makes sense for it to be written in Python.