Show HN: Vaex - Out of Core Dataframes for Python and Fast Visualization

(medium.com)

126 points · maartenbreddels · 7y ago · 32 comments

27 comments · 10 top-level

angelmass · 7y ago · 3 in thread

Very interesting! I will share it with my DS friends.

One thing I have struggled with optimizing is visualization and coordinate calculation of network graphs with tens of millions of edges and nodes using NetworkX and most visualization tools. Have you looked into this use case for Vaex? Reading your article, it sounds like it would be well suited for it.

bayesian_horse · 7y ago

The bigger question is what you want to achieve by visualizing so many nodes. If you want a map that can be zoomed in to view individual nodes, you mainly need to compute coordinates for every node. Finding the arrangement of the nodes is probably what gets you in trouble, so you probably need a custom algorithm which scales better (and probably produces a poorer layout).

More interesting may be to identify clusters and either group them together or visualize these clusters as nodes themselves.

maartenbreddels (OP) · 7y ago

I have not looked into it. Maybe datashader can do this; it is a package focussing purely on viz, while vaex is more all-round (although there is overlap). If you think vaex can be useful here, feel free to ask questions or open issues at https://github.com/vaexio/vaex

blattimwind · 7y ago

Gephi?

aw3c2 · 7y ago · 3 in thread

> For example, it takes about a second to calculate the mean of a column in regular bins even when the dataset contains a billion rows (yes, 1 billion rows per second!).

A billion 32 bit floating point numbers are 4 Gigabytes. How can that be processed in one second unless there was any preprocessing?

Desktop PCs have about 35 GB/s of memory bandwidth and can compute at ~200 Gflops, so this is just ~10% of peak bandwidth and leaves you a budget of 200 flops of computation per float value. If all 4 columns are accessed, there is still enough bandwidth (no idea if the data here was in a columnar layout or not).

The relevance to big data or out-of-core computation is left hazy; wouldn't that make this I/O bound in most cases? 4 GB fits easily in memory and is just mmap'ed from the OS disk cache if the data was recently touched. I guess with 4 columns you get to 16 GB, which might be pushing it on a laptop.
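The back-of-envelope budget in the comment above can be replayed as a short sketch. The hardware figures (35 GB/s, 200 Gflops) are the commenter's rough numbers, and the variable names are only illustrative.

```python
# Rough figures quoted in the comment above; treat them as assumptions.
rows = 1_000_000_000        # one billion values
bytes_per_value = 4         # float32
mem_bandwidth = 35e9        # bytes per second ("about 35 GB/s")
peak_flops = 200e9          # floating point operations per second ("~200 Gflops")
elapsed = 1.0               # seconds claimed for one pass over the column

data_bytes = rows * bytes_per_value
required_bandwidth = data_bytes / elapsed            # bytes per second actually needed
bandwidth_fraction = required_bandwidth / mem_bandwidth
flops_per_value = peak_flops * elapsed / rows        # compute budget per value

print(f"data size: {data_bytes / 1e9:.1f} GB")
print(f"fraction of peak bandwidth: {bandwidth_fraction:.0%}")
print(f"flop budget per value: {flops_per_value:.0f}")
```

With these inputs the scan needs about 11% of the assumed peak bandwidth, in line with the "~10%" in the comment.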

maartenbreddels (OP) · 7y ago

You are right, I'm actually underselling it. 1 second is the typical performance for doing a 2d histogram (or other binned statistics) since it involves writing to memory as well.

I just ran a quick benchmark:

    In [7]: %timeit -r3 -n3 df.mean(df.ra)
    330 ms +- 5.46 ms per loop (mean +- std. dev. of 3 runs, 3 loops each)

    In [11]: f'{len(df):,}'
    Out[11]: '1,692,919,135'

    In [12]: 330/len(df)*1e9
    Out[12]: 194.92957057278463

so it is 0.2 second for 1.7 billion rows, which is:

    In [15]: (len(df)*8/1024**3)/0.2
    Out[15]: 63.066152296960354

63 GB/s. (This is a high-end machine; on my laptop I get ~12 GB/s.)

We do not use float32 much in science, since you really need to know what you are doing not to screw up. It does give some extra performance boost (not much though), and it is also easier on memory and cache.
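For readers who want to replay the arithmetic in the benchmark above, here is a small sketch that uses only the numbers quoted in the session (row count, 330 ms, 8 bytes per float64 value). The variable names are illustrative. It reports throughput both for the measured 0.33 s and for the 0.2 s figure used in the comment, since the two give different answers.

```python
rows = 1_692_919_135      # len(df) from the session above
elapsed_s = 0.330         # %timeit result for df.mean(df.ra)
bytes_per_value = 8       # float64

# Time normalised to one billion rows (what In [12] computes).
ms_per_billion_rows = elapsed_s * 1e3 / rows * 1e9

# Size of the scanned column in GiB.
column_gib = rows * bytes_per_value / 1024**3

# Throughput using the measured wall time for all rows.
gib_per_s_measured = column_gib / elapsed_s

# Throughput using the 0.2 s figure, as in In [15].
gib_per_s_with_0_2s = column_gib / 0.2

print(f"{ms_per_billion_rows:.1f} ms per billion rows")
print(f"{column_gib:.2f} GiB scanned")
print(f"{gib_per_s_measured:.1f} GiB/s at 0.33 s, {gib_per_s_with_0_2s:.1f} GiB/s at 0.2 s")
```

Note that the roughly 195 ms value is a per-billion-row time, while the 330 ms covers all 1.69 billion rows. Dividing the full column size by 0.2 s reproduces the quoted 63 figure; dividing by the measured 0.33 s gives roughly 38 GiB/s.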

aw3c2 · 7y ago

My thought was on first access or out-of-memory sizes. This would always be bound by I/O which means it is kind of a meaningless statistic.

Don't get me wrong, this seems like a project I will use but that marketing speak is weird.

themmes · 7y ago · 2 in thread

First of all, great to see more power tools to choose from for my DS workflow!

However, I am surprised to see no mention of Dask in the article. How do these libraries compare?

maartenbreddels (OP) · 7y ago

Dask and vaex are not 'competing'; they are orthogonal. Vaex could use dask to do the computations, but when this part of vaex was built, dask didn't exist. I recently tried using dask instead of vaex's internal computation model, but it gave a serious performance hit.

There is some overlap with dask.dataframe; I think they are closer to pandas than vaex is. Vaex has a strong focus on large datasets, statistics on N-d grids, and visualization as well. For instance, calculating a 2d histogram for a billion rows can be done in < 1 second, which can be used for visualization or exploration. The expression system is really nice: it allows you to store the computations themselves, calculate gradients, do Just-In-Time compilation, and it will be the backbone for our automatic pipelines for machine learning. So vaex feels like Pandas for the basics, but adds new ideas that are useful for really large datasets.
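To make "statistics on N-d grids" concrete, here is a minimal pure-Python sketch of a mean computed in regular bins, one chunk at a time, so that only per-bin sums and counts need to stay in memory. It illustrates the general technique only; it is not vaex's implementation and is far slower, and the function name `binned_mean` and its parameters are hypothetical.

```python
from array import array


def binned_mean(x_chunks, value_chunks, lo, hi, n_bins):
    """Mean of the values in n_bins regular bins of x over [lo, hi).

    The data arrives as an iterable of chunks, so the whole column never
    has to be held in memory at once.
    """
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    scale = n_bins / (hi - lo)
    for xs, vs in zip(x_chunks, value_chunks):
        for x, v in zip(xs, vs):
            if lo <= x < hi:
                # Clamp to guard against floating point rounding at the upper edge.
                i = min(int((x - lo) * scale), n_bins - 1)
                sums[i] += v
                counts[i] += 1
    # Empty bins have no defined mean; report NaN for them.
    return [s / c if c else float("nan") for s, c in zip(sums, counts)]


# Two chunks of a toy column pair.
x_chunks = [array("d", [0.5, 1.5]), array("d", [2.5, 3.5])]
v_chunks = [array("d", [1.0, 2.0]), array("d", [3.0, 4.0])]
means = binned_mean(x_chunks, v_chunks, lo=0.0, hi=4.0, n_bins=2)
```

The same accumulate-then-divide pattern extends to 2d histograms by keeping a grid of counters instead of a list.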

How could I have missed you being the author? Thanks for your extensive answer, I will definitely try the library! And thanks again for ipyvolume; it has been very useful so far.

JPKab · 7y ago · 2 in thread

Such phenomenal work.

BTW, for anyone on a Windows machine, getting this to work is very trivial.

There is a Unix-only library for locking files (fcntl) which prevents it from working on Windows. I mocked it in the path and made a function that returns 0 to test it.

Obviously, adding a check for the OS and switching to a cross-platform file locker would be a great contribution. I'll see if I can make that happen in the next week.
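The workaround described above (guarding the `fcntl` import and falling back to a no-op) might look roughly like the sketch below. The helper names `lock_exclusive` and `unlock` are hypothetical and this is not vaex's code; the fallback provides no protection against concurrent writers, so it is only suitable for single-process experiments.

```python
try:
    import fcntl  # available on Unix-like systems only
except ImportError:  # for example on Windows
    fcntl = None


def lock_exclusive(fileobj):
    """Take an exclusive advisory lock where fcntl exists, else do nothing.

    Returns True if a real lock was taken and False for the no-op fallback.
    """
    if fcntl is None:
        return False
    fcntl.flock(fileobj.fileno(), fcntl.LOCK_EX)
    return True


def unlock(fileobj):
    """Release the lock taken by lock_exclusive, if any."""
    if fcntl is not None:
        fcntl.flock(fileobj.fileno(), fcntl.LOCK_UN)
```

A proper cross-platform fix would replace the no-op branch with a real Windows locking call or a dedicated locking library.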

maartenbreddels (OP) · 7y ago

There is an issue open for this: https://github.com/vaexio/vaex/issues/93 It should have been fixed; a more detailed report (including the installed version numbers) would be helpful.

maartenbreddels (OP) · 7y ago

Oh, and thanks for the kind words!

rax · 7y ago · 2 in thread

It looks quite nice, and I will have to explore the performance comparisons with Dask more.

I have recently started using Xarray for some projects, and really appreciate the usability of multidimensional labelled data. Are the memory mapping techniques used for speedup here only applicable to tabular data?

The support for Apache arrow is quite nice. Have you considered any other formats, such as Zarr?

maartenbreddels (OP) · 7y ago

Thank you. Memory mapping could be used for other data as well, and I have looked into zarr (even opened an issue for that https://github.com/zarr-developers/zarr/issues ). Memory mapping of contiguous data makes life much easier (for the application as well as the OS); chunked data could be supported, but requires more bookkeeping.
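As a rough illustration of why contiguous data is easy to memory-map, the sketch below writes a column of float64 values to a raw binary file and then reads it back through `mmap` without copying it into a Python list first. It uses only the standard library; the file layout (a bare array of native doubles) and the helper names are assumptions made for the example, not vaex's on-disk format.

```python
import mmap
from array import array


def write_column(path, values):
    """Store values as a contiguous run of native float64 numbers."""
    with open(path, "wb") as f:
        array("d", values).tofile(f)


def mapped_mean(path):
    """Mean of the column, read through a read-only memory map."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Reinterpret the mapped bytes as doubles; no data is copied here.
        with memoryview(mm) as raw, raw.cast("d") as column:
            return sum(column) / len(column)
```

Because the values sit back to back, value `i` is simply at byte offset `8 * i`. With chunked storage each lookup would first have to find the right chunk, which is the extra bookkeeping the comment mentions.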

ah- · 7y ago

I'll need to have a closer look later, but would vaex fit in with somewhat indexed mapped files?

E.g. parquet supports column indexes now: https://issues.apache.org/jira/browse/PARQUET-1201

ah- · 7y ago · 2 in thread

Great to see that you're supporting Apache Arrow! That makes it so much easier to gradually switch over.

wesm · 7y ago

Note: Vaex has its own memory model. If you input Arrow, it converts to the Vaex data representation. Details here:

https://github.com/vaexio/vaex/blob/master/packages/vaex-arr...

One of the primary objectives of Apache Arrow is to have a common data representation for computational systems, and avoid serialization / conversions altogether.

maartenbreddels (OP) · 7y ago

That is not correct: I just reference the buffers/memory, with zero copying going on. Vaex is not really opinionated about the memory model, actually. The only exception is the bitmasks, which are being copied for now because of an incompatibility with numpy. But if I get a 50GB Arrow dataset, vaex leaves the structure intact. Thanks for your work on Arrow, I hope to support and contribute more to it in the future.
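The comment does not say what the bitmask incompatibility is. One plausible reading, stated here as an assumption, is the difference in layout: Arrow stores validity as a bit-packed bitmap (least-significant bit first, 1 meaning valid), while NumPy boolean masks use one byte per element, and NumPy masked arrays mark the invalid entries instead. Converting between the two therefore needs an unpacking pass, sketched below with hypothetical helper names and plain Python lists standing in for arrays.

```python
def unpack_validity_bitmap(bitmap, length):
    """Expand a bit-packed validity bitmap (LSB first, 1 = valid) to one bool per value."""
    return [bool((bitmap[i // 8] >> (i % 8)) & 1) for i in range(length)]


def to_masked_array_style(valid):
    """Masked arrays flag the *invalid* entries, so invert the validity flags."""
    return [not v for v in valid]


# Three values, the middle one null: bits 0 and 2 are set.
valid = unpack_validity_bitmap(bytes([0b00000101]), 3)
mask = to_masked_array_style(valid)
```

The value buffers themselves have the same layout on both sides, which is consistent with the claim that only the masks need to be copied.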

wenc · 7y ago · 1 in thread

Nice work. This looks like it could add a lot of value to a DS's toolbox.

Exploratory data analysis of large (but not huge) datasets has always been a slow and frustrating experience.

In the enterprise, we have plenty of datasets that are 100s of millions to a few billion rows (and many columns), so big enough to make conventional tools sluggish but not quite big enough for distributed computing. It sounds like vaex can help with EDA of these types of datasets on a single machine. I'd be interested in exploring the out-of-core functionality, which I hope means it will continue chugging along without throwing "out of memory" errors.

maartenbreddels (OP) · 7y ago

That is exactly the sweet spot for vaex, and with a familiar DataFrame API (read: pandas-like) the transition does not hurt so much. It may sound cool to set up a cluster, but in many cases it is overkill, and vaex can get these kinds of jobs done.

stestagg · 7y ago · 1 in thread

This is big news.

I've used similar proprietary libraries before, and virtual operations can be really powerful.

maartenbreddels (OP) · 7y ago

Thank you, yes they give much more flexibility: optimization (JIT), derivatives, checking your calculations afterwards, sending them to a remote server etc. Glad you like that :)
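The idea behind virtual operations, recording a computation instead of running it immediately, can be sketched with a tiny expression tree. This is a generic illustration with hypothetical class names, not vaex's expression system: the stored tree can be printed (for inspection or for sending elsewhere as text) and evaluated later against any row.

```python
import operator


class Expr:
    """Base class: arithmetic builds a tree instead of computing a value."""

    def __add__(self, other):
        return BinOp("+", self, _wrap(other))

    def __sub__(self, other):
        return BinOp("-", self, _wrap(other))

    def __mul__(self, other):
        return BinOp("*", self, _wrap(other))


class Column(Expr):
    """A named input; its value is looked up only at evaluation time."""

    def __init__(self, name):
        self.name = name

    def evaluate(self, row):
        return row[self.name]

    def __str__(self):
        return self.name


class Const(Expr):
    """A literal number embedded in an expression."""

    def __init__(self, value):
        self.value = value

    def evaluate(self, row):
        return self.value

    def __str__(self):
        return repr(self.value)


class BinOp(Expr):
    """A recorded binary operation on two sub-expressions."""

    _ops = {"+": operator.add, "-": operator.sub, "*": operator.mul}

    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right

    def evaluate(self, row):
        return self._ops[self.op](self.left.evaluate(row), self.right.evaluate(row))

    def __str__(self):
        return f"({self.left} {self.op} {self.right})"


def _wrap(value):
    """Turn plain numbers into Const nodes; leave expressions untouched."""
    return value if isinstance(value, Expr) else Const(value)


x, y = Column("x"), Column("y")
r2 = x * x + y * y                            # nothing is computed yet
text = str(r2)                                # the stored computation, as text
value = r2.evaluate({"x": 3.0, "y": 4.0})     # computed on demand
```

Because the computation is kept as data, it can be inspected after the fact, transformed (for example differentiated or compiled), or shipped to another process, which are the kinds of benefits listed in the comment above.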

colobas · 7y ago · 1 in thread

Does it have python3 support? Tried installing it on a python3.7 environment and it failed

EDIT: I then tried a python3.6 environment and it worked. I guess it answers my question

maartenbreddels (OP) · 7y ago

Absolutely, I think nowadays the question should be: 'does it still support Python2?' (it does btw)

My question to you is: would you be so kind as to open an issue describing the failure at https://github.com/vaexio/vaex/issues ? Please share which OS, which Python distribution (anaconda maybe), and/or the installation steps and error message.

Uses HDF5, which itself is a great file format, well suited for big tables of numbers. Good for similar reasons as SQLite3, but for different applications. Not a relational database, columns are more strongly typed. Better suited when you have hundreds or thousands of columns, worse when you're trying to query a particular row.
