Show HN: How to analyse 100 GB of data on your laptop with Python

(towardsdatascience.com)

264 points · maartenbreddels · 6y ago · 26 comments

21 comments · 8 top-level

isoprophlex · 6y ago · 5 in thread

How would this compare to ingesting the data into a locally running RDBMS, like Postgres, and applying some judicious indexing? Or maybe a local Spark cluster?

I'm tempted to benchmark this against better-known tech. Maybe someone has some insight to share?

kthejoker2 · 6y ago

Think you'd want to compare with a columnar database since this is an analytic workload, not a transactional one.

And a 4 node Spark cluster with Parquet or Arrow files and a Scala job will compare favorably to this because it's also a lazy evaluator and the benchmark problems here are embarrassingly parallel.

PS this is a very cool project!

maartenbreddels (OP) · 6y ago

I've never benchmarked against Postgres, but would be interested in the results. I once tried MonetDB, and it was orders of magnitude slower for simple calculations, so I stopped looking at RDBMSes after that.

I think they solve a different problem: smaller data, data/relational integrity, data mutations. Dataframes can take shortcuts here.

If you want to do a serious benchmark, feel free to contact us. Github: https://github.com/vaexio/vaex/issues Email: (I'm easy to google).

drblah · 6y ago

I am currently doing an internship as part of my master's degree where I am analyzing ~30 GB of data. I'm using Postgres + Python and it is working quite well, even on my 2014 MacBook Air.

It would indeed be interesting to see how this approach with Vaex compares to Postgres. Though, I would be quite sad giving up SQL in favour of Pandas DataFrame indexing and Python looping :)

maartenbreddels (OP) · 6y ago

No Python looping happening in Vaex :), otherwise, we wouldn't get this performance.

We are also working on GraphQL support, with a Hasura-like API: https://docs.vaex.io/en/latest/example_graphql.html

I think GraphQL is easier in combination with front-end development, and you can tab-complete your way out. Early days for this sub-project, but I think it is very promising.

mooneater · 6y ago

I've been switching a reporting system that did analysis in Postgres to analysis in pandas (mostly business-stats type summaries).

It feels like growing wings and a jetpack. Almost everything is waaay easier and faster.

Aardwolf · 6y ago · 3 in thread

I found that Python is very slow when analyzing billions of entries if you do a function call per entry, because the overhead of a function call is so large in Python (the call itself may be much slower than what's actually inside the function). Even JavaScript can do this much faster.

Is there any way around that?

jofer · 6y ago

Use the scientific stack in python for that type of analysis.

Yes, anything with one function call per item is going to be slow. That, along with memory inefficiency of lists, is the reason why numpy exists.
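To make the point about per-item function calls concrete, here is a minimal sketch. It is a toy example, not code from the article: the `fare_per_mile` function and the random arrays are made up purely for illustration. Both versions compute the same values, but the second makes one numpy call for the whole array instead of one Python call per entry.

```python
import time

import numpy as np


def fare_per_mile(fare, miles):
    # Hypothetical per-entry function; the work inside is trivial,
    # so the cost of calling it dominates.
    return fare / miles


def per_entry(fares, miles):
    # One Python function call per entry.
    return [fare_per_mile(f, m) for f, m in zip(fares, miles)]


def vectorized(fares, miles):
    # One call for the whole array; the loop runs in compiled code inside numpy.
    return fares / miles


# Made-up data, only to have something to divide.
rng = np.random.default_rng(0)
fares = rng.uniform(5.0, 50.0, size=100_000)
miles = rng.uniform(0.5, 20.0, size=100_000)

t0 = time.perf_counter()
slow = per_entry(fares, miles)
t1 = time.perf_counter()
fast = vectorized(fares, miles)
t2 = time.perf_counter()

print(f"per-entry calls: {t1 - t0:.4f} s, vectorized: {t2 - t1:.4f} s")
```

The exact timings depend on the machine, so none are quoted here. The takeaway is only that the results match while the vectorized version avoids the per-call overhead.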

uoaei · 6y ago

For simply vectorizable analysis, sure, but in my work I often have to apply a nontrivial transformation to the data en masse, and I'd rather define a single function that describes the transformation on each row than try to wrestle the problem into one of matrix multiplications and additions.

maartenbreddels (OP) · 6y ago

Indeed, numpy for numerical calculations. For strings, we have our own data structure based on Apache Arrow, but we plan/hope to move to Apache Arrow (in combination with numpy), since that's kind of the numpy++ for data science work.

slowenough · 6y ago · 3 in thread

What sort of person hours went into this?

maartenbreddels (OP) · 6y ago

What do you mean by that?

make3 · 6y ago

stylised way of asking how long it took

Rotten194 · 6y ago

number of people * hours taken per person

darkstar999 · 6y ago · 1 in thread

> 99.97% of the passengers that pay by cash do not leave tips.

Correction: 99.97% of cash tips go unrecorded. Why pay tax if nobody saw it happen?

Compare this to:

> 3.00% of the passengers that pay by card do not leave tips.

jovan31 · 6y ago

Yes, you are absolutely correct! The intention was just to show the data as it is. But I agree with your interpretation :)

(The author of the article).

jononor · 6y ago · 1 in thread

Vaex seems to be very similar to Dask and Xarray. Which one to choose?

maartenbreddels (OP) · 6y ago

It is not similar to Dask, but similar to dask.dataframe. Dask.dataframe is built on top of Pandas, but that also means it inherits its issues, like memory usage and performance. (BTW, totally a fan of Pandas.)

Xarray is more about nd-arrays, less about ~tabular data.

Vaex is built from the ground up with the idea that you can never copy the dataset (1TB dataset was quite common). We also never needed distributed computing, because it was always fast enough, and thus never had to use dask (although we're eager to support it fully).

Also, vaex is lazy but tries to hide it from you. For instance, if you add a new column to your dataframe, it will only be computed when needed (taking up 0 memory). However, in practice, you're not really aware of that. This means it feels more like pandas (immediate results) than dask.dataframe (no .compute()/.persist() needed).

I would say they are all complementary, with a small amount of overlap. Small data: use Pandas. Out of memory error: move to vaex. Crazy amount of data (100TB?) that will never fit onto 1 computer: dask.dataframe, or help us implement full dask support.
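A rough sketch of the "lazy column that takes up 0 memory" idea described above. This is a toy model in plain numpy, not Vaex's implementation or API: the `LazyFrame` class, its methods, and the `chunk_size` parameter are all hypothetical names invented for this example. A derived column is stored as a recipe and evaluated chunk by chunk only when a result is requested.

```python
import numpy as np


class LazyFrame:
    """Toy dataframe whose derived columns are stored as recipes, not arrays."""

    def __init__(self, **columns):
        self.columns = {name: np.asarray(values) for name, values in columns.items()}
        self.virtual = {}  # column name -> function computing it from a chunk

    def add_virtual_column(self, name, func):
        # Only the recipe is stored; no array is allocated here.
        self.virtual[name] = func

    def mean(self, name, chunk_size=1_000):
        # Evaluate chunk by chunk, so the full derived column never exists in memory.
        n = len(next(iter(self.columns.values())))
        total = 0.0
        for start in range(0, n, chunk_size):
            # Slicing numpy arrays gives views, so the real columns are not copied.
            chunk = {k: v[start:start + chunk_size] for k, v in self.columns.items()}
            values = self.virtual[name](chunk) if name in self.virtual else chunk[name]
            total += float(values.sum())
        return total / n


df = LazyFrame(x=np.arange(10_000, dtype=np.float64), y=np.ones(10_000))
df.add_virtual_column("r", lambda c: np.sqrt(c["x"] ** 2 + c["y"] ** 2))

# The computation happens only here, when a result is actually needed.
result = df.mean("r")
print(result)
```

The caller never writes anything like `.compute()`: asking for the mean is what triggers the work, which is the "lazy but hidden" feel described in the comment.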

Jackie1122 · 6y ago

Thanks for sharing...

deepakkhealani1 · 6y ago

Sorry, I don't have any information related to it, but I appreciate your nice question.

floki999 · 6y ago

Dask's compatibility with Pandas makes it ideal. Very easy to use.
