
n00b101 12y ago

Maybe I should have just said that I think GPU support would be a great addition.

Distributed memory and GPUs are not mutually exclusive. Multi-GPU clusters are extremely common. In fact the latest devices (e.g. Tesla K10) have multiple GPU processors packaged in a single card, so it is necessary for applications to target multiple GPUs. There is explicit support for distributed-memory applications in GPUs through the "GPUDirect" technology that allows peer-to-peer DMA and RDMA transfers between GPUs.

Given that reports of 30-50x GPU performance gains (versus CPUs) are common, the issue is important because it means solving a problem with (say) $10,000 of kit instead of $500,000.


5 comments · 1 top-level

jedbrown 12y ago · 4 in thread

30x to 50x claims are the result of a terrible CPU implementation. When normalized by power consumption (e.g., TDP) or by acquisition cost (assume high-end rather than consumer versions, as with all supercomputing centers), you'll find about a 2x advantage for operations like DGEMM and much less for other operations. Strong scaling is important for many supercomputing applications, and accelerators like GPUs and Intel MIC are much slower for that purpose.

For examples, look at the MKL benchmarks for Xeon Phi (http://software.intel.com/en-us/intel-mkl/) and normalize as 3 Sandy Bridge sockets per Xeon Phi (common configurations pair one SB socket with one Xeon Phi, the Phi has more than twice the TDP and costs more than twice as much). Don't forget to look at QR and Cholesky, for which the Phi at best breaks even, but only for enormous matrix sizes.
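The normalization described above can be sketched in a few lines of Python. This is only an illustration: the divisor of 3 sockets per Phi comes from the comment, while the raw speedup figures and the function name are hypothetical placeholders, not measured data.

```python
def normalized_speedup(raw_speedup, cpu_sockets_equivalent):
    """Speedup of one accelerator over the CPU resources it is equivalent to.

    raw_speedup: accelerator throughput / single-socket CPU throughput.
    cpu_sockets_equivalent: number of CPU sockets that match the accelerator
        in TDP or acquisition cost (3 in the comment above).
    """
    if cpu_sockets_equivalent <= 0:
        raise ValueError("cpu_sockets_equivalent must be positive")
    return raw_speedup / cpu_sockets_equivalent


# Hypothetical raw figures for illustration only (not measured data).
hypothetical_raw = {"DGEMM": 6.0, "QR": 3.0}
for name, raw in hypothetical_raw.items():
    norm = normalized_speedup(raw, 3)
    print(f"{name}: raw {raw:.1f}x -> normalized {norm:.1f}x")
```

With these made-up inputs, a headline 6x becomes 2x and a headline 3x becomes break-even, which is the shape of the argument being made here.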

m_mueller 12y ago

It may sound strange, but one of the advantages of GPUs is actually the programming model, hence why many applications suddenly perform 20-50x faster instead of only the theoretical 5-7x.

Let me explain: One of the most common issues with x86 HPC applications coming from the scientific crowd is a lack of vector optimization such as loop unrolling. Even having the right compiler flags is rather difficult for this kind of thing. Another reason is a lack of understanding of how to program for memory bandwidth optimization. GPU programming on the other hand, especially with CUDA, is hard to get into at first, but once you have the right formula you can apply it pretty easily to most common tasks. Getting to, say, 70% of model performance on GPU is much easier than on x86. One reason is the implicitly bandwidth-optimized idiomatic way of writing CUDA/OpenCL programs as a set of scalar kernels applied over a whole data region. This allows the programmer to think of block dimensions in an abstract way, with no need to fiddle around manually with loops to achieve this. There is also no need to use any intrinsics; just plain C in idiomatic CUDA is enough.
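To illustrate the "scalar kernel applied over a whole data region" idea, here is a minimal sketch that emulates the model serially in plain Python. It is not CUDA and says nothing about performance; `saxpy_kernel` and `launch` are hypothetical names, and the nested loops stand in for what the GPU runtime would schedule in parallel.

```python
def saxpy_kernel(block_idx, thread_idx, block_dim, n, a, x, y):
    """Scalar kernel: each logical thread handles at most one element."""
    i = block_idx * block_dim + thread_idx  # global index, as in CUDA
    if i < n:  # guard: n need not be a multiple of block_dim
        y[i] = a * x[i] + y[i]


def launch(kernel, n, block_dim, *args):
    """Serial stand-in for a kernel launch over a 1-D grid (assumption)."""
    grid_dim = (n + block_dim - 1) // block_dim  # ceil(n / block_dim)
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, n, *args)


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [10.0] * 5
launch(saxpy_kernel, len(x), 4, 2.0, x, y)  # 5 elements, blocks of 4
print(y)
```

The kernel body contains no loop over the data; the iteration space lives entirely in the launch configuration, and the bounds check covers array sizes that are not a multiple of the block size.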

So, to wrap it up, there is more to GPU programming than just the hardware itself: the software model actually makes a lot more sense than traditional OOP/procedural programming for HPC, often resulting in higher than expected speedups when going from idiomatic x86 to idiomatic GPGPU (since there is no such thing as easy-to-learn idioms for HPC x86 programming).

And btw, Xeon Phi is the result of Intel not understanding exactly this interrelation, since it doesn't even have OpenCL support as of now.

jedbrown 12y ago

This is actually very subtle. Intel's OpenCL stack is currently atrocious, but if they ever make it decent, it will be able to vectorize. The problem is that vectorization is only part of the equation and many (likely most) applications today are limited in some way by memory and/or communication rather than by vectorization. Meanwhile, vectorization is generally at odds with memory bandwidth. CPU cache, shared memory on a GPU, and registers on a GPU are limited, and vectorization usually leads to less effective use of these precious resources. Additionally, many kernels have inconvenient dependencies (e.g., a 30-term equation of state, a 70-species reaction mechanism) that force extremely low occupancy for typical GPU implementations (e.g., one thread per quadrature point). If you try to make the parallelism finer grained to reduce the register demands (break the quadrature point into several parts), you either need physics-dependent decomposition within a thread block or multiple kernel invocations (higher latency, and data needs to be reloaded from memory). All of these things are counter-productive for strong scaling. Recall that a lot of applications are run near the limit of strong scaling and that CPU network latencies are in the microsecond range while GPU kernel launch overheads start at 20 microseconds (and you likely need several of them per inter-node communication).
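A back-of-the-envelope version of that latency argument, as a hedged Python sketch. The 20 microsecond launch overhead is the figure quoted above; the 2 microsecond network latency and the count of 3 launches per exchange are assumptions chosen only to make the arithmetic concrete, and `step_overhead_us` is a hypothetical helper.

```python
def step_overhead_us(network_latency_us, launches_per_exchange, launch_overhead_us):
    """Fixed overhead per inter-node exchange, in microseconds.

    One network message plus the kernel launches needed around it.
    """
    return network_latency_us + launches_per_exchange * launch_overhead_us


# CPU-only: just the network latency (assumed 2 us).
cpu = step_overhead_us(network_latency_us=2.0,
                       launches_per_exchange=0,
                       launch_overhead_us=0.0)

# GPU: same network, plus an assumed 3 launches at 20 us each.
gpu = step_overhead_us(network_latency_us=2.0,
                       launches_per_exchange=3,
                       launch_overhead_us=20.0)

print(f"CPU: {cpu:.0f} us, GPU: {gpu:.0f} us, ratio: {gpu / cpu:.0f}x")
```

Near the strong-scaling limit, where the useful work per step shrinks toward these fixed costs, a gap of this size dominates the runtime regardless of how fast the kernels themselves are.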

While it's true that there are kernels that are easier to optimize for a GPU, I still think most claims of huge speedup are due to neglecting the CPU implementation. See "Debunking the 100x GPU vs CPU Myth" and other papers on this. And amidst all of that, there are a lot of applications that utterly fail on GPUs, despite having lots of parallelism, just not at quite the right granularity.

jeff_science 12y ago

CPU programming on the other hand, especially with OpenMP, is hard to get into at first, but once you have the right formula you can apply it pretty easily to most common tasks.

My guess is that you've spent a lot more time trying to understand CUDA and NVIDIA hardware. Or maybe all of your array dimensions are multiples of 16.

poulson 12y ago

Agreed. Accelerators can certainly lead to speedups, but it's hard to overemphasize how important it is to also have an optimized vanilla pure-MPI implementation (if nothing else, to compare the 'accelerated' version to).
