Because our implementation does not explicitly depend on the Python runtime, we can avoid many of its shortcomings: we run without the GIL and use real threads to dispatch custom Numba kernels at near-C speed, free of Python's performance limitations.
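To make that concrete, here's the general shape of the pattern (a sketch with illustrative names, not Blaze's actual API; the fallback keeps it runnable even without Numba installed): a kernel compiled with nogil=True releases the GIL, so plain threading threads can run it in parallel over chunks of an array.

```python
import threading

import numpy as np

try:
    from numba import njit

    def compiled(fn):
        # release the GIL inside the compiled kernel so threads run in parallel
        return njit(nogil=True)(fn)
except ImportError:
    def compiled(fn):
        # fallback so the sketch still runs without Numba installed
        return fn

@compiled
def partial_sum_sq(a, start, stop):
    # the hot inner loop: compiled to near-C machine code by Numba
    s = 0.0
    for i in range(start, stop):
        s += a[i] * a[i]
    return s

def threaded_sum_sq(a, n_threads=4):
    # dispatch the kernel over chunks of the array on real OS threads
    bounds = np.linspace(0, len(a), n_threads + 1).astype(np.int64)
    results = [0.0] * n_threads

    def worker(k):
        results[k] = partial_sum_sq(a, bounds[k], bounds[k + 1])

    threads = [threading.Thread(target=worker, args=(k,)) for k in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results)
```

With the GIL released inside the kernel, the threads actually overlap on multiple cores instead of serializing, which is the point of the design.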
In case people don't read to the bottom, here are the links to the repo and the docs. The project is still in its early stages, but it is public and released under a BSD license.
I'm starting to run into some performance bottlenecks with Python, and so I'm just now looking at Cython, PyPy, Psyco, and... gasp... C.
From what little I've read, Cython is supposed to be as easy as adding some typing and modifying a few loops here and there, and you are in business.
http://ianozsvald.com/2012/03/18/high-performance-python-1-f...
Before you go that far, I'd recommend making sure you know all the Python gotchas (for example, maybe you have some inner loop that does for x in range(100000) all the time) and that your algorithms are in order. Sometimes even silly micro-optimization can make a difference if a small function accounts for a significant share of your runtime. Using multiple processes with e.g. the multiprocessing module can be an option too.
Depending on what data types you operate on, numpy (and now this new thing) can do some amazing things.
PS: check things like http://packages.python.org/line_profiler/ beyond the ordinary profiling.
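To illustrate the numpy point: here is the kind of innocuous inner loop a profiler often flags, next to the same computation pushed down into numpy's C loops (a sketch; the function names are mine):

```python
import numpy as np

def slow_sum_squares(n):
    # pure-Python inner loop: each iteration pays interpreter overhead
    total = 0
    for x in range(n):
        total += x * x
    return total

def fast_sum_squares(n):
    # same computation, but the loop runs inside numpy's compiled code
    a = np.arange(n, dtype=np.int64)
    return int(np.dot(a, a))
```

On largish n the vectorized version is typically one to two orders of magnitude faster, and a line profiler will show you exactly which loops are worth converting.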
http://benchmarksgame.alioth.debian.org/u32/which-programs-a...
http://en.wikipedia.org/wiki/Lua_(programming_language)#C_AP...
[1] On the other hand, it comes with a tool showing exactly how each of your Cython lines looks in the resulting C, with color-coding for a high-level overview of which pieces translated smoothly.
I can narrow down performance problems in C/C++ quite quickly, but neither I nor anybody I know has done much of this for Python. Many people I work with consider a Python implementation a prototype, while Fortran/C/C++ is mature, real code worthy of attention.
The only real downside is that C/C++ requires a little knowledge of POSIX/Linux or Windows. This represents a learning curve, but once you are over it, you have quite durable, long-lasting skills.
Just be prepared for Drew Houston, Paul Graham et al. to come after you cracking their whips... (tongue in cheek)
I wonder whether Blaze is a step towards that direction.
Also, I believe most people would consider Java an alternative to C++, hence all the Java-based Apache projects, such as Mahout, Solr, etc.
http://deeplearning.net/software/theano/
in general, i like (ie i don't see a better solution than) the idea of having an AST constructed via an embedded language that is implemented by a library. but it does have downsides - integration with other python features is going to be much more limited (it gives the illusion of a python solution, but in practice you're off in some other world that only looks like python).
are there more details? i guess the AST is fed to something that does the work. and that something will have an API and be replaceable. but is that something also composable? does it have, say, a part related to moving data and another to evaluating data? so that you can combine "distributed across local machines" with "evaluate on GPU"?
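for what it's worth, the usual trick behind such embeddings is operator overloading: ordinary python syntax builds an expression tree instead of computing a value. a minimal sketch (my own toy classes, not blaze's actual ones):

```python
class Expr:
    # overloaded operators build tree nodes instead of evaluating
    def __add__(self, other):
        return BinOp('+', self, wrap(other))
    def __mul__(self, other):
        return BinOp('*', self, wrap(other))

class Var(Expr):
    def __init__(self, name):
        self.name = name

class Const(Expr):
    def __init__(self, value):
        self.value = value

class BinOp(Expr):
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right

def wrap(x):
    # lift plain Python values into the expression language
    return x if isinstance(x, Expr) else Const(x)

def to_sexpr(e):
    # walk the tree to show what was captured
    if isinstance(e, Var):
        return e.name
    if isinstance(e, Const):
        return str(e.value)
    return f"({e.op} {to_sexpr(e.left)} {to_sexpr(e.right)})"

x, y = Var('x'), Var('y')
tree = x * 2 + y  # looks like python, but builds an AST
```

this is also exactly where the "only looks like python" problem bites: anything the overloaded operators can't intercept (if statements, for loops, calls into ordinary libraries) falls outside the captured tree.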
It's quite similar; we just take some of the ideas further and try to generalize the data storage to include storage backends that data scientists use more frequently (e.g. SQL, CSV, S3, etc.). We're very friendly with the Theano developers and hope to bridge the projects with a compatibility layer at some point.
> (it gives the illusion of a python solution, but in practice you're off in some other world that only looks like python).
I would argue that's what makes Python a great numeric language, and NumPy so successful. You get a high-level language where you can express domain knowledge, but also a near 1:1 mapping to fast code execution at the C level. Blaze is the continuation of that vision.
> i guess the AST is fed to something that does the work. and that something will have an API and be replaceable.
Precisely. We build up an intermediate form called ATerm out of the constructed expression objects, do type inference and graph rewriting, and then pattern-match our layout, metadata, and type information against a number of backends to find the best one to perform the execution. Or, if needed, we build a custom kernel with Numba informed by all the type and data-layout information we've inferred from the graph.
We don't aim to solve all the subproblems in this area (expression optimization passes, distributed scheduling), but I think we have a robust enough system that others can build extensions to Blaze to do expression evaluation in whatever fashion they like.
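The replaceable-backend idea can be sketched like this (a toy registry with hypothetical names, not Blaze's real API): each backend registers a predicate over the graph's metadata, and dispatch picks the first backend whose predicate matches.

```python
# toy backend registry -- illustrative only, not Blaze's actual machinery
BACKENDS = []

def backend(predicate):
    # register an execution backend guarded by a metadata predicate
    def register(fn):
        BACKENDS.append((predicate, fn))
        return fn
    return register

@backend(lambda meta: meta.get('device') == 'gpu')
def gpu_kernel(graph, meta):
    return 'dispatched to a GPU kernel'

@backend(lambda meta: meta.get('distributed', False))
def cluster_evaluator(graph, meta):
    return 'evaluated across local machines'

@backend(lambda meta: True)  # always-applicable fallback
def interpreter(graph, meta):
    return 'evaluated by the default interpreter'

def execute(graph, meta):
    # first backend whose predicate matches the metadata wins
    for predicate, fn in BACKENDS:
        if predicate(meta):
            return fn(graph, meta)
```

Because backends are just entries in a list guarded by predicates, a third party can register a new one (say, a distributed GPU evaluator) without touching the core, which is what makes the "combine distributed with GPU evaluation" composition question answerable at all.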
> are there more details?
Yes! See: http://blaze.pydata.org/