The target image processing algorithms are low level. Think "How can I write the fastest 3x3 box blur for Sandy Bridge and later architectures", not "How can I speed up my face detector".
Examples of scheduling decisions that Halide deals with: process the image in 4x4, 8x8, or 16x16 chunks? Use a temporary buffer for an intermediate processing stage, accepting worse cache behavior in exchange for reusing results that are needed more than once? Use a sliding window the size of a couple of rows as a compromise?
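To make the tradeoff concrete, here is a toy sketch in plain C++ (not Halide) of two "schedules" for the same separable 3x3 box blur: one materializes the horizontal pass in a full-size temporary buffer, the other fuses the two passes and recomputes values. Both produce identical output; they differ only in memory use and redundant work.

```cpp
#include <cassert>
#include <vector>

// Toy separable 3x3 box blur (horizontal pass, then vertical pass) on a
// W x H image with clamp-to-edge borders. Plain C++, not Halide; it only
// illustrates the scheduling tradeoff described above.
using Image = std::vector<std::vector<int>>;

static int at(const Image& im, int x, int y) {
    int W = (int)im[0].size(), H = (int)im.size();
    x = x < 0 ? 0 : (x >= W ? W - 1 : x);   // clamp to edge
    y = y < 0 ? 0 : (y >= H ? H - 1 : y);
    return im[y][x];
}

static int hblur(const Image& im, int x, int y) {
    return (at(im, x - 1, y) + at(im, x, y) + at(im, x + 1, y)) / 3;
}

// Schedule A: materialize the whole horizontal pass in a temporary buffer.
// Every hblur value is computed exactly once, but the temporary is as big
// as the image, which hurts cache behavior on large images.
Image blur_full_temp(const Image& in) {
    int H = (int)in.size(), W = (int)in[0].size();
    Image tmp(H, std::vector<int>(W)), out(H, std::vector<int>(W));
    for (int y = 0; y < H; ++y)
        for (int x = 0; x < W; ++x)
            tmp[y][x] = hblur(in, x, y);
    for (int y = 0; y < H; ++y)
        for (int x = 0; x < W; ++x)
            out[y][x] = (at(tmp, x, y - 1) + at(tmp, x, y) + at(tmp, x, y + 1)) / 3;
    return out;
}

// Schedule B: fuse the passes and recompute the three hblur values each
// output pixel needs. No temporary buffer and great locality, but each
// hblur value is recomputed up to 3x. (A sliding window a few rows tall
// would be the compromise between these two extremes.)
Image blur_fused(const Image& in) {
    int H = (int)in.size(), W = (int)in[0].size();
    Image out(H, std::vector<int>(W));
    for (int y = 0; y < H; ++y)
        for (int x = 0; x < W; ++x)
            out[y][x] = (hblur(in, x, y - 1) + hblur(in, x, y) + hblur(in, x, y + 1)) / 3;
    return out;
}
```

In C++ you have to pick one of these structures and bake it into the loop nest; Halide's point is that the choice becomes a one-line schedule change.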
This kind of work is the difference between a laggy and an interactive Gaussian Blur radius slider in Photoshop.
The Halide code shown at the talk was replacing really hard-to-read code with lots of SIMD compiler intrinsics. Dozens of lines of code doing something that would be 5 lines in a naive implementation. With Halide, it's almost as readable as the naive version because the scheduling stuff is separated from the algorithm.
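For reference, the canonical blur example from the Halide paper and website shows what that separation looks like (this requires the Halide library and `Halide.h`; it is not standalone C++):

```cpp
Func blur_3x3(Func input) {
    Func blur_x, blur_y;
    Var x, y, xi, yi;

    // The algorithm: what is computed. No loop order, no tiling,
    // no storage decisions, no vectors.
    blur_x(x, y) = (input(x-1, y) + input(x, y) + input(x+1, y)) / 3;
    blur_y(x, y) = (blur_x(x, y-1) + blur_x(x, y) + blur_x(x, y+1)) / 3;

    // The schedule: how it is computed. Vectorized 256x32 tiles
    // across threads, with just the slice of blur_x each tile needs
    // computed inside the tile loop.
    blur_y.tile(x, y, xi, yi, 256, 32).vectorize(xi, 8).parallel(y);
    blur_x.compute_at(blur_y, x).vectorize(x, 8);

    return blur_y;
}
```

Retargeting to a different architecture means rewriting the two schedule lines, not the algorithm.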
For an application like Photoshop, this is a big win because they will never choose code readability over performance. Performance wins every time. If they can get the same performance with readable code, they are very happy.
GPU code generation falls naturally out of the scheduling intelligence and restricted problem domain.
I have never used Halide, so I do not intend to endorse it, but this line of inquiry is absolutely useful for a certain niche of programming.
The long-term plan is basically to write an optimizer that can intuit good schedules, with the caveat that they admit they don't yet have a good idea of how to do this. I believe they can currently run some ML algorithms that churn through a bunch of different schedules and find the one best suited to a problem, but it's rather brute-force and slow at the moment.
That said, the conceptual separation of the algorithm from the scheduling also lets you tune schedules by hand much more easily than would otherwise be possible.
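The brute-force flavor of autotuning described above can be sketched in a few lines of plain C++ (this is a toy illustration, not Halide's actual tuner): run each candidate schedule, time it, keep the fastest. Here the "schedule knob" is the tile size of a blocked matrix transpose.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <vector>

// Tiled transpose of an n x n matrix stored row-major; `tile` is the
// schedule parameter being tuned.
void transpose_tiled(const std::vector<int>& in, std::vector<int>& out,
                     int n, int tile) {
    for (int by = 0; by < n; by += tile)
        for (int bx = 0; bx < n; bx += tile)
            for (int y = by; y < std::min(by + tile, n); ++y)
                for (int x = bx; x < std::min(bx + tile, n); ++x)
                    out[x * n + y] = in[y * n + x];
}

// Brute-force tuner: time each candidate tile size and keep the fastest.
// Real autotuners search a vastly larger schedule space and amortize the
// cost; this just illustrates the "churn through schedules and measure"
// idea from the comment above.
int pick_best_tile(int n, const std::vector<int>& candidates) {
    std::vector<int> in(n * n), out(n * n);
    for (int i = 0; i < n * n; ++i) in[i] = i;

    int best_tile = candidates[0];
    double best_time = 1e300;
    for (int tile : candidates) {
        auto t0 = std::chrono::steady_clock::now();
        for (int rep = 0; rep < 5; ++rep)          // a few reps to smooth noise
            transpose_tiled(in, out, n, tile);
        double dt = std::chrono::duration<double>(
                        std::chrono::steady_clock::now() - t0).count();
        if (dt < best_time) { best_time = dt; best_tile = tile; }
    }
    return best_tile;
}
```

The expensive part is that every candidate has to be compiled and run on real inputs, which is why this approach is slow compared to an optimizer that could predict good schedules directly.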
This approach makes a lot of sense: abstract away the annoying low-level architecture details. They have a lot of targets, which is fantastic: x86/SSE, ARM v7/NEON, CUDA, Native Client, and OpenCL. Let the architecture specialist worry about the architecture specifics. The disadvantage: the achieved performance then depends on the quality and wisdom of the compiler. But once something is optimized for a specific architecture, every user benefits.
How they do it on the compiler end of things, I'm not sure. There are a number of techniques; among the simpler is auto-tuning. There is also a newer term, "copious parallelism" [0], which acknowledges that to achieve performance portability across platforms, algorithms must offer explicit ways of parametrizing and tuning them for different platforms. I think this is the right concept, but I believe it could be implemented within the compiler. The domain specialist should not have to think about those things.
[0] http://www.hpcwire.com/2014/01/09/future-accelerator-program...
I have the print version of the book in my collection. It is interesting.
Now, superoptimization is not a particularly new topic either (see e.g. [1]), but Halide demonstrated that improvements can be significant in this domain. Far more impressive than, e.g., superoptimization applied to generic assembly code (hasn't really worked so far), query optimization, etc.
Now, it's certainly possible to bolt general-purpose programming features onto a DSL (e.g. MATLAB or R), but the approach (also used by Halide) of embedding the DSL into a general-purpose language (here: C++) is, in my eyes, far better suited to scaling up from experimental code to something that can become part of a larger system (think OpenCV in robotics, etc.).
A fair bit of work has recently been done on making Halide work well on Windows. We are leaning toward requiring newer versions of Visual Studio, however. (E.g. there are advantages to being able to rely on C++11 support when generating Halide code.)
I do not see any issues filed on things not working with Intel's compiler. Feel free to file one. Halide is an open source project and both bug reports and patches help a great deal.
(1) Halide can operate as either a Just In Time (JIT) compiler or an Ahead Of Time (AOT) compiler. Generally AOT is much more convenient for deploying in production. The end result of compiling a single filter in Halide is an object file and a header file declaring a single C-linkage function, with all of its parameters passed in on the call. The minimal runtime is included in the object file, and dependencies on libraries are also quite minimal: things like libc and pthreads. Functionality provided by the runtime can easily be overridden by providing one's own definitions for various functions, since weak linkage is used.
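Roughly, the generated header for a hypothetical AOT-compiled filter named `brighten` looks like this (a sketch based on Halide's documented AOT workflow; `brighten` and its parameters are made-up names, and the exact buffer struct has changed across Halide versions):

```cpp
// Generated header (sketch): one plain C-linkage entry point per filter.
// Buffers are passed as plain structs, scalar parameters by value, and a
// nonzero return value signals an error reported through the runtime.
extern "C" int brighten(halide_buffer_t *input,   // input image
                        uint8_t offset,           // scalar parameter
                        halide_buffer_t *output); // output image
```

Because the interface is just a C function plus the object file, the filter can be called from C, C++, or anything with a C FFI, with no Halide installation on the deployment machine.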
There is also a set of Python bindings, which can be used to write Halide code which is then AOT compiled. The AOT compiled object file can be linked into pretty much anything that can handle C functions either via direct linkage or a foreign function interface.
Here's the most famous Halide in Turkey: http://en.wikipedia.org/wiki/Halide_Edip_Ad%C4%B1var
I want to write in OpenCV, but I also want the compiler to fix everything across function boundaries :-)
If you look at Google RenderScript, Microsoft C++ AMP, NVIDIA CUDA, OpenCL and whatnot, you'll realize that they all do essentially the same thing; the differences are only at the syntax level (with a few caveats, but those are relatively minor). Some are more cumbersome to use than others.
Halide looks neat to use syntax-wise, but it doesn't seem to make the actual algorithms any easier to write. You still have to do the same mental work at the same level as you do with OpenCL.
All of these are to each other what Fortran, Pascal and C are to each other: the same basic idea in a different package. I'm waiting for the first system that is the equivalent of C++, something that really brings new concepts to the table instead of just a different syntax.
The end result is not like OpenCL, CUDA/PTX, or RenderScript. (All of which we regard as suitable target languages for Halide. PTX (CUDA's low-level assembly language) and OpenCL are currently supported.) As a concrete example, there is no explicit memory allocation in Halide and loops are often implicit. The overall character of Halide is that of a functional programming language.
As another example, segmentation violations due to errors in Halide code are impossible, barring bugs in the compiler implementation or errors in user-provided C++ code, which Halide has no control over. Some bounds errors are reported at compile time, and the rest turn into (efficient) assertion checks at runtime.
Schedules do get pretty low-level and writing them is difficult. Algorithms are generally a fair bit easier to write and while there is plenty to be done in improving the language, it is already a step up in productivity over e.g. OpenCL or even straight C++ for a lot of tasks within the domain of data parallel processing.
> I'm waiting for the first system that is equivalent of C++.
Not gonna happen anytime soon. C++ is much more than just a programming language; it is an entire "ecosystem". You have toolchains, libraries, drivers, APIs, and existing software stacks, all written in C++ and able to interface with C/C++ directly, i.e. no translation layer needed. You're not just going to replace that with a new language.

Sure, Halide may be seen as just syntactic sugar (much like quite a few bits and pieces of C++11), but it actually provides a different level of abstraction than, say, OpenCL, MPI or OpenMP, where you are very specific about e.g. the level of concurrency (which heavily impacts the design of your algorithm), while Halide tries to almost completely separate algorithm and scheduling.
It's nice that Halide attempts to separate the algorithm and the scheduling, but as of now that's possible in OpenCL too. The implementations are just so bad at it that it's not really useful.
Obligatory HN reaction: I'd wait it out for the first Lisp to appear.
This is about optimizing memory locality and parallelism. It's not about getting access to the underlying hardware, as with Microsoft C++ AMP, NVIDIA CUDA, or OpenCL.
If you write a kernel which uses neither local memory nor the local ID, you get a kernel that is effectively a Halide pipeline stage: the points can be evaluated in arbitrary order (the spec says it is implementation-defined).
If we look at the blur example in the video, the OpenCL implementation is also free to effectively merge the stages as Halide does, because the spec only requires that whatever the previous kernel invocation has written must be visible to the next kernel. Nothing more and nothing less.
Sure, OpenCL allows you to fiddle around with low-level details, but it also allows you to write completely platform-neutral code that it is then the platform's responsibility to optimize.
I do agree that Halide makes it easy to explore the different scheduling options; that's something OpenCL is not capable of. In OpenCL you either do it manually or leave it entirely up to the implementation.