undefined | Better HN

0 pointsaw3c27y ago0 comments

My thought was on first access or out-of-memory sizes. This would always be bound by I/O which means it is kind of a meaningless statistic.

Don't get me wrong, this seems like a project I will use but that marketing speak is weird.

0 comments

1 comments · 1 top-level

maartenbreddels7y ago

Don't expect this performance on a 4GB machine. Most machines now would have 16GB, or more. Let us assume you have 32GB and take a 24GB dataset. Most libraries load this into memory (allocating 24GB, leaves max 8 GB for the OS, including disk cache). The next process that wants to do the same cannot without a memory error. Also, when you restart your program, the OS will not have it in the cache, it will be as fast as your hard drive.

Vaex is much smarter with memory, it memory maps the data, nothing is allocated, all the memory is left to the OS for disk cache. This means you can have 10 users regularly restarting/running their program, without any (or minimal) harddrive activity. So I think it is a fair statement, and actually very conservative (see my other comments). An important message is to make people aware that working with a dataset this large is a pip install away, no need to spin up a cluster (yet).

j / k navigate · click thread line to collapse