I'm also happy to answer general questions about parallel linear algebra and to point people to the appropriate literature.
With that said, the main advantage of the library is its high level of software engineering, which tends to encourage the rapid development of new features. It is actively used within a large number of research projects.
Also, not all supercomputers have accelerators (consider Blue Gene/Q), and often simply having access to more memory is more of a concern than solving the problem at the absolute fastest rate.
I imagine the authors could make a single node of the cluster use a GPU, if they wanted.
Distributed memory and GPUs are not mutually exclusive. Multi-GPU clusters are extremely common. In fact, the latest devices (e.g. Tesla K10) package multiple GPU processors on a single card, so applications need to target multiple GPUs. There is explicit support for distributed-memory applications on GPUs through the "GPUDirect" technology, which allows peer-to-peer DMA and RDMA transfers between GPUs.
Given that reports of 30-50x GPU performance gains (versus CPUs) are common, the issue is important because it means solving a problem with (say) $10,000 of kit instead of $500,000.
boost::uBLAS
eigen
armadillo
a dozen other...
Why not contribute to an existing project? The recurring bifurcation of talent and resources in the open source community is really disheartening. Can't we focus on one or two libraries and make them actually good? Or at least fork something that already exists and add your own features. I look at benchmarks of the existing tools and one library will do one operation very efficiently, while another will work well with something else. Often the differences in speed are huge (more than a factor of 10). So I end up having to flip a coin in choosing which library to use.