I'm tempted to benchmark this against better-known tech. Does anyone have some insight to share?
I'd expect a 4-node Spark cluster with Parquet or Arrow files and a Scala job to compare favorably to this, because Spark is also a lazy evaluator and the benchmark problems here are embarrassingly parallel.
PS this is a very cool project!
I think they solve a different problem: smaller data, data/relational integrity, data mutations. Dataframes can take shortcuts here.
If you want to do a serious benchmark, feel free to contact us. GitHub: https://github.com/vaexio/vaex/issues Email: (I'm easy to google).
It would indeed be interesting to see how this approach with Vaex compares to Postgres. Though, I would be quite sad giving up SQL in favour of Pandas DataFrame indexing and Python looping :)
We are also working on GraphQL support, with a Hasura-like API: https://docs.vaex.io/en/latest/example_graphql.html
I think GraphQL is easier in combination with front-end development, and you can tab-complete your way out. Early days for this sub-project, but I think it's very promising.
It feels like growing wings and a jetpack. Almost everything is waaay easier and faster.
Is there any way around that?
Yes, anything with one function call per item is going to be slow. That, along with the memory inefficiency of lists, is the reason why numpy exists.
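To make that concrete, here's a rough sketch (my own toy example, not taken from the article): the same sum of squares computed with one Python function call per item and with a single numpy expression, plus a crude memory comparison. The names `square`, `values` and `arr` are just for illustration, and the byte counts assume CPython and 8-byte integers.

```python
import sys
import numpy as np

n = 100_000

def square(x):
    return x * x

# Per-item approach: one Python-level function call for every element.
values = list(range(n))
total_loop = sum(square(v) for v in values)

# Vectorized approach: one expression, the loop runs in compiled code.
arr = np.arange(n, dtype=np.int64)
total_vec = int(np.sum(arr * arr))

# Memory: a list holds pointers to separately allocated int objects,
# while the array holds raw 8-byte values in one contiguous buffer.
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)
array_bytes = arr.nbytes

print(total_loop == total_vec)
print(list_bytes, array_bytes)
```

Wrap the two computations in `timeit` if you want numbers for your own machine; the gap depends on the workload, so I won't quote one here.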
Correction - 99.97% of cash tips go unrecorded. Why pay tax if they never saw it happen?
Compare this to:
> 3.00% of the passengers that pay by card do not leave tips.
(The author of the article).
Xarray is more about nd-arrays, less about ~tabular data.
Vaex is built from the ground up with the idea that you can never copy the dataset (1 TB datasets were quite common). We also never needed distributed computing, because it was always fast enough, and thus never had to use dask (although we're eager to support it fully).
Also, vaex is lazy but tries to hide it from you. For instance, if you add a new column to your dataframe, it will only be computed when needed (taking up 0 memory). However, in practice, you're not really aware of that. This means it feels more like pandas (immediate results) than dask.dataframe (no .compute()/.persist() needed).
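If the lazy-column idea sounds abstract, here is a toy sketch in plain Python. To be clear, this is not how vaex is implemented and not its API: `ToyLazyFrame` and `add_lazy_column` are hypothetical names made up for illustration. The point is only that adding a column stores an expression, and nothing is computed until the column is read.

```python
class ToyLazyFrame:
    """Toy dataframe with lazily evaluated columns (hypothetical, not vaex)."""

    def __init__(self, columns):
        self.columns = dict(columns)  # real data, held in memory
        self.lazy = {}                # column name -> function of the frame
        self.evaluations = 0          # how many times a lazy column was computed

    def add_lazy_column(self, name, func):
        # Only the recipe is stored; no values are computed or allocated here.
        self.lazy[name] = func

    def __getitem__(self, name):
        if name in self.columns:
            return self.columns[name]
        # The lazy column is computed on access, every time it is read.
        self.evaluations += 1
        return self.lazy[name](self)


df = ToyLazyFrame({"x": [1.0, 2.0, 3.0], "y": [4.0, 5.0, 6.0]})
df.add_lazy_column(
    "r2", lambda f: [a * a + b * b for a, b in zip(f["x"], f["y"])]
)

evaluations_before = df.evaluations  # still 0: nothing computed yet
r2 = df["r2"]                        # computed now, on first read
print(evaluations_before, r2, df.evaluations)
```

From the caller's side, `df["r2"]` looks like any other column lookup, which is roughly the "feels like pandas" experience described above, with no explicit compute step.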
I would say they are all complementary, with a small amount of overlap. Small data: use Pandas. Out-of-memory error: move to vaex. Crazy amount of data (100TB?) that will never fit onto 1 computer: dask.dataframe, or help us implement full dask support.