One thing I have struggled to optimize is the visualization and coordinate calculation of network graphs with tens of millions of edges and nodes using NetworkX and most visualization tools. Have you looked into whether Vaex could be used for this? From your article, it sounds like it would be well suited.
It may be more interesting to identify clusters and either group them together or visualize the clusters themselves as nodes.
A billion 32-bit floating-point numbers take up 4 gigabytes. How can that be processed in one second without some preprocessing?
The relevance to big data or out-of-core computation is left hazy; wouldn't that make this I/O bound in most cases? 4 GB fits easily in memory and is just mmap'ed from the OS disk cache if the data was recently touched. I guess with 4 columns you get to 16 GB, which might be pushing it on a laptop.
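For reference, the 4 GB and 16 GB figures above are just rows times bytes per value. A quick sketch of that arithmetic (the helper name is made up for illustration):

```python
def column_bytes(rows, bytes_per_value):
    """Raw size of one densely packed numeric column, in bytes."""
    return rows * bytes_per_value


ROWS = 1_000_000_000                 # one billion rows, as in the comment
one_col = column_bytes(ROWS, 4)      # float32 takes 4 bytes per value
four_cols = 4 * one_col              # four such columns

print(one_col / 1e9, "GB for one float32 column")
print(four_cols / 1e9, "GB for four float32 columns")
print(one_col / 2**30, "GiB for one float32 column")
```

With float64 columns (8 bytes per value) both totals double.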
I just ran a quick benchmark:

  In [7]: %timeit -r3 -n3 df.mean(df.ra)
  330 ms +- 5.46 ms per loop (mean +- std. dev. of 3 runs, 3 loops each)

  In [11]: f'{len(df):,}'
  Out[11]: '1,692,919,135'

  In [12]: 330/len(df)*1e9
  Out[12]: 194.92957057278463

so it is 0.2 seconds for 1.7 billion rows, which is:

  In [15]: (len(df)*8/1024**3)/0.2
  Out[15]: 63.066152296960354

63 GB/s. (this is a high end machine, on my laptop I get ~12GB/s)
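Redoing the arithmetic from that session as a small script, assuming df.ra is stored as 8-byte float64 (the factor 8 in In [15] suggests this, but it is an assumption). Note the two different denominators: the full column divided by the measured 0.33 s gives roughly 38 GiB/s, while the 63 figure comes from dividing the full column by the rounded per-billion-row time of 0.2 s.

```python
ROWS = 1_692_919_135     # len(df) from the session
ELAPSED_S = 0.330        # mean wall time reported by %timeit
BYTES_PER_VALUE = 8      # assumption: the column is float64

# Seconds needed per one billion rows (what In [12] computes, in ms).
per_billion_s = ELAPSED_S / ROWS * 1e9

# Size of the whole column in GiB.
column_gib = ROWS * BYTES_PER_VALUE / 1024**3

# Scan rate using the time actually measured for the whole column.
rate_measured = column_gib / ELAPSED_S

# Scan rate as computed in In [15]: whole column over the per-billion time.
rate_in_15 = column_gib / 0.2

print(f"{per_billion_s:.3f} s per billion rows")
print(f"{column_gib:.2f} GiB column")
print(f"{rate_measured:.1f} GiB/s using 0.33 s")
print(f"{rate_in_15:.1f} GiB/s using 0.2 s")
```

Either way it is tens of GiB per second, which only makes sense if the data is already in the OS cache.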
We do not use float32 much in science, since you really need to know what you are doing to not screw up. It does give some extra performance (not much, though) and also reduces memory and cache usage.
Don't get me wrong, this seems like a project I will use, but the marketing speak is weird.
However, I am surprised to see no mention of Dask in the article. How do these libraries compare?
There is some overlap with dask.dataframe, which I think is closer to pandas than vaex is. Vaex has a strong focus on large datasets, statistics on N-d grids, and visualization as well. For instance, calculating a 2d histogram for a billion rows can be done in < 1 second, which can be used for visualization or exploration. The expression system is really nice: it allows you to store the computations themselves, calculate gradients, do Just-In-Time compilation, and it will be the backbone for our automatic pipelines for machine learning. So vaex feels like Pandas for the basics, but adds new ideas that are useful for really large datasets.
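To make "statistics on a grid" concrete, here is a deliberately naive pure-Python sketch of a 2d histogram: each row is mapped to a cell of a regular grid and counted. This is not Vaex's implementation or API (the function name and signature are made up here), and it will be far slower than the numbers quoted above; it only shows the idea.

```python
def histogram2d(xs, ys, limits, shape):
    """Count (x, y) pairs on a regular grid.

    limits is ((xmin, xmax), (ymin, ymax)); shape is (nx, ny).
    Points outside the half-open limits are ignored.
    """
    (xmin, xmax), (ymin, ymax) = limits
    nx, ny = shape
    grid = [[0] * ny for _ in range(nx)]
    for x, y in zip(xs, ys):
        if not (xmin <= x < xmax and ymin <= y < ymax):
            continue
        # Scale each coordinate to a bin index; clamp against rounding.
        i = min(int((x - xmin) / (xmax - xmin) * nx), nx - 1)
        j = min(int((y - ymin) / (ymax - ymin) * ny), ny - 1)
        grid[i][j] += 1
    return grid


xs = [0.1, 0.1, 0.9, 1.5]
ys = [0.1, 0.1, 0.9, 0.5]
grid = histogram2d(xs, ys, limits=((0.0, 1.0), (0.0, 1.0)), shape=(2, 2))
print(grid)
```

The result is a small fixed-size array regardless of how many rows go in, which is why it works well as input for plotting.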
BTW, for anyone on a Windows machine, getting this to work is very trivial.
There is a Unix-only library for locking files (fcntl) which prevents it from working on Windows. I mocked it in the path and made a function that returns 0 to test it.
Obviously, adding a check for the OS and switching to a cross-platform file locker would be a great contribution. I'll see if I can make that happen in the next week.
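For anyone who wants to try the same workaround, this is roughly what such a mock could look like. It is a sketch, not the exact patch described above: which fcntl functions the package actually calls is an assumption, so the stub covers the common ones, and the helper names are made up. It disables file locking entirely, so it is only suitable for testing.

```python
import sys
import types


def make_fcntl_stub():
    """Build a fake 'fcntl' module whose functions do nothing and return 0."""
    stub = types.ModuleType("fcntl")
    # Lock constants; these match the usual Linux values, but the stub
    # functions ignore their arguments anyway.
    stub.LOCK_SH, stub.LOCK_EX, stub.LOCK_NB, stub.LOCK_UN = 1, 2, 4, 8

    def _noop(*args, **kwargs):
        return 0

    # Assumption: these are the entry points a caller might use.
    for name in ("fcntl", "flock", "lockf", "ioctl"):
        setattr(stub, name, _noop)
    return stub


def ensure_fcntl():
    """Return True if the real fcntl exists, else install the stub."""
    try:
        import fcntl  # available on Unix-like systems only
    except ImportError:
        sys.modules["fcntl"] = make_fcntl_stub()
        return False
    return True


# Call this before importing the package that needs fcntl.
ensure_fcntl()
```

On Linux or macOS this leaves the real module alone, so the same script runs everywhere.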
I have recently started using Xarray for some projects, and really appreciate the usability of multidimensional labelled data. Are the memory mapping techniques used for speedup here only applicable to tabular data?
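As background for the question: memory mapping itself operates on raw bytes in a file, not on tables, so the shape you interpret those bytes as is up to the reader. A minimal stdlib-only sketch (file layout and function names are assumptions for illustration, not Vaex's format):

```python
import mmap
import os
import tempfile
from array import array


def write_column(path, values):
    """Write values to disk as raw native-endian float64, with no header."""
    with open(path, "wb") as f:
        array("d", values).tofile(f)


def mapped_mean(path):
    """Mean of a raw float64 file, read through a read-only memory map."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Reinterpret the mapped bytes as float64 without copying them.
            with memoryview(mm) as raw, raw.cast("d") as view:
                return sum(view) / len(view)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "column.f64")
    write_column(path, [1.0, 2.0, 3.0, 4.0])
    result = mapped_mean(path)

print(result)
```

The OS pages the file in on demand, so nothing is loaded up front. Whether Vaex exposes this for labelled N-d arrays is a separate question for the authors.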
The support for Apache Arrow is quite nice. Have you considered any other formats, such as Zarr?
E.g. parquet supports column indexes now: https://issues.apache.org/jira/browse/PARQUET-1201
https://github.com/vaexio/vaex/blob/master/packages/vaex-arr...
One of the primary objectives of Apache Arrow is to have a common data representation for computational systems, and avoid serialization / conversions altogether.
Exploratory data analysis of large (but not huge) datasets has always been a slow and frustrating experience.
In the enterprise, we have plenty of datasets that are 100s of millions to a few billion rows (and many columns), so big enough to make conventional tools sluggish but not quite big enough for distributed computing. It sounds like vaex can help with EDA of these types of datasets on a single machine. I'd be interested in exploring the out-of-core functionality, which I hope means it will continue chugging along without throwing "out of memory" errors.
I've used similar proprietary libraries before, and virtual operations can be really powerful.
EDIT: I then tried a Python 3.6 environment and it worked. I guess that answers my question.
My question to you is: would you be so kind as to open an issue describing the failure at https://github.com/vaexio/vaex/issues ? Please share which OS, which Python distribution (Anaconda maybe), and/or the installation steps and error message.