> [Arrow] adoption will take time, and most people are probably more comfortable seeing NumPy arrays. Therefore, in Vaex version 4, a DataFrame can hold both NumPy arrays and Apache Arrow arrays to make the transition period easier.
There seems to be agreement that Apache Arrow is the future of dataframes across ML ecosystems. I didn't realize this transition impacted NumPy arrays in addition to Pandas dataframes in Python.
[1] https://vaex.io/blog/a-hybrid-apache-arrow-numpy-dataframe-w...
Not until there is proper multidimensional array support.
But as the person who needs to load up some data and do some transformations, this article gives me very little information about why I should switch from pandas.
But I am excited to hear about new solutions in the data frame space!
Vaex is stupid fast at all the data operations it supports, to the point where I've used it in place of a database for an API.
Indeed, for small data there is not much to gain; at least that is not the focus of this article. Although even with small amounts of data, the automatic pipelines are useful: https://vaex.io/blog/ml-impossible-train-a-1-billion-sample-...
Does it make things hard that were easy in pandas, or does it make things easy that are hard in pandas?
As part of developing Modin, we identified a low-level algebra and data model that both generalizes and encompasses all of the pandas and R dataframe functionality. Modin is an implementation of this data model and algebra[1]. Based on our studies, Vaex's architecture can support somewhere in the range of 35-40% of the pandas DataFrame API, in part because it excludes support for row indexes. Compare this to Dask, currently at 44% of the pandas API, and Modin, currently at 90%.
Vaex is great if you're already working with a compatible memory-mapped file format; it'll be exceptionally fast in that case. That is the use case I believe they are (successfully) targeting.
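The performance win from memory mapping is easy to see even with plain NumPy: a minimal sketch (file path and sizes are made up) of why "opening" a huge on-disk column costs almost nothing.

```python
import os
import tempfile
import numpy as np

# Write a large column to disk once.
path = os.path.join(tempfile.mkdtemp(), "column.npy")
np.save(path, np.arange(10_000_000, dtype=np.float64))

# mmap_mode="r" maps the file instead of reading it: no data is
# loaded until it is touched, and then only the needed pages.
col = np.load(path, mmap_mode="r")
chunk_mean = col[:1000].mean()  # pulls in only the first pages
```

Vaex applies the same idea to whole DataFrames, which is why it is exceptionally fast once your data is already in a mappable format.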
Has anyone tried this, or know if it's possible?