> [Arrow] adoption will take time, and most people are probably more comfortable seeing NumPy arrays. Therefore, in Vaex version 4, a DataFrame can hold both NumPy arrays and Apache Arrow arrays to make the transition period easier.
There seems to be agreement that Apache Arrow is the future of dataframes across ML ecosystems. I didn't realize this transition impacted NumPy arrays in addition to Pandas dataframes in Python.
[1] https://vaex.io/blog/a-hybrid-apache-arrow-numpy-dataframe-w...
Not until there is proper multidimensional array support.
But as the person who needs to load up some data and do some transformations, this article gives me very little information about why I should switch from pandas.
But I am excited to hear about new solutions in the data frame space!
Vaex is stupid fast at all the data operations it supports, to the point where I've used it in place of a database for an API.
Indeed, for small data there is not much to gain; at least that is not the focus of this article. Although even with small amounts of data, the automatic pipelines are useful: https://vaex.io/blog/ml-impossible-train-a-1-billion-sample-...
Does it make things hard that were easy in pandas, or does it make things easy that are hard in pandas?
As part of developing Modin, we identified a low-level algebra and data model that both generalizes and encompasses all of the pandas and R dataframe functionality. Modin is an implementation of this data model and algebra[1]. Based on our studies, Vaex's architecture can support somewhere in the range of 35-40% of the pandas DataFrame API, in part because it excludes support for row indexes. Compare this to Dask, currently at 44% of the pandas API, and Modin, currently at 90%.
Vaex is great if you're already working with a compatible memory-mapped file format; it'll be exceptionally fast in that case. That is the use case I believe they are (successfully) targeting.
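The performance win from memory mapping is easy to see even with plain NumPy: a minimal sketch (file path and sizes are made up) of why "opening" a huge on-disk column costs almost nothing.

```python
import os
import tempfile
import numpy as np

# Write a large column to disk once.
path = os.path.join(tempfile.mkdtemp(), "column.npy")
np.save(path, np.arange(10_000_000, dtype=np.float64))

# mmap_mode="r" maps the file instead of reading it: no data is
# loaded until it is touched, and then only the needed pages.
col = np.load(path, mmap_mode="r")
chunk_mean = col[:1000].mean()  # pulls in only the first pages
```

Vaex applies the same idea to whole DataFrames, which is why it is exceptionally fast once your data is already in a mappable format.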
Has anyone tried this, or know if it's possible?