The main issue is that they are based on a model of continuously modifying a very large, monolithic body of state representing fine details about what the next draw should do. At any moment a draw call may be issued to enact the current state and produce a result.
In the past, that state was represented in hardware mostly using a large collection of physical registers. Nothing else could possibly be fast enough. The API model of "set BlendStateSourceOp, set BlendStateDestOp, etc." mapped very well to the hardware. You literally were continuously mutating a large block of registers.
In the present, programmable hardware has become capable of largely taking over for fixed-function hardware. Modern GPUs have been increasingly cutting out special-purpose silicon to make room for more multi-purpose ALUs. These general-purpose ALUs are told how to draw using fairly large, allocated structures instead of single-purpose registers. These structures are not trivial to construct and modifying them continuously is not advised. However, switching between them is as trivial as moving a pointer from one to the other.
Fortunately, most games don't actually use a continuum of states when drawing. In practice, they switch repeatedly between a small number of states with very little variation between frames. Therefore, modern drivers do a lot of work to implicitly infer what state setups are heavily repeated within each run of each application. These states are baked into structures under the hood on the fly. Odd variants are expensive in this mode. But they are also rare, so they are lower priority.
Mantle, Metal and DX12 all seek to reboot the idea of graphics APIs from scratch based on how hardware actually works today. You set up an explicit set of draw state structures at init time. You switch between them explicitly and trivially at run time.
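To make that concrete, here is a toy model in Python. It is not any real API: `Device`, `PipelineState`, `create_pipeline_state` and the field names are all made up for illustration. The only point is the shape of the work: the expensive "bake" happens a fixed number of times at init, and the per-frame work is just repointing to an existing object.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineState:
    """Immutable, pre-baked draw state (hypothetical fields)."""
    blend_src: str
    blend_dst: str
    depth_test: bool


class Device:
    """Toy device that counts how often a state has to be baked."""

    def __init__(self):
        self.bake_count = 0
        self.current = None

    def create_pipeline_state(self, blend_src, blend_dst, depth_test):
        # The expensive step, done once per state at init time.
        self.bake_count += 1
        return PipelineState(blend_src, blend_dst, depth_test)

    def set_pipeline_state(self, pso):
        # The cheap step at run time: point at an existing object.
        self.current = pso

    def draw(self, mesh):
        return (mesh, self.current)


def render_frames(device, opaque, transparent, frames):
    """Switch between two pre-built states every frame."""
    draws = []
    for _ in range(frames):
        device.set_pipeline_state(opaque)
        draws.append(device.draw("terrain"))
        device.set_pipeline_state(transparent)
        draws.append(device.draw("water"))
    return draws
```

However many frames you render, `bake_count` stays at the number of states you created up front.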
A second issue baked into OGL/D3D is that, in the past, the monolithic draw state was stratified into quite nicely orthogonal chunks dealing with separate issues such as: how to load a vertex from memory vs. how to operate on a vertex vs. how to pass data from the vertex shader to the fragment shader vs. how to operate on a fragment (sample) vs. how to blend the fragment into the framebuffer. This model made the APIs quite nice to learn and to use.
Unfortunately, it is simply not representative of how the hardware actually operates today. Today, most of those operations are actually handled by general purpose ALUs. These ALUs are running the vertex and fragment programs you wrote. But, they are also running more code to handle what used to be done in fixed-function silicon. Actually, it's worse than that. What used to be a register flip that was completely orthogonal to your vertex/fragment programs is now actually implemented by modifying code interleaved into the guts of the programs you compiled back at init time. These changes are done under the hood and on the fly.
Modifying the code under the hood is expensive. Worse, the draw state is so large and complicated that it is easy to accidentally request an invalid state. Validating each given state is expensive. Because the classic model lets you make draw state changes at any time preceding a draw and the state changes are no longer stratified, the state validation can no longer be done incrementally. Instead, every time you draw, a significant amount of work is done just to make sure the request makes sense.
Again, by declaring draw states up front, compilation and validation can be done once. Switching between pre-compiled, pre-validated states is trivial.
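A rough sketch of the difference in where validation lands, again as a toy Python model. The classes and the single validation rule ("blending needs a color target") are invented for the example and do not come from any real driver.

```python
from types import MappingProxyType


def validate(state):
    """Hypothetical rule: blending requires a color target."""
    if state["blend_enabled"] and not state["has_color_target"]:
        raise ValueError("blending enabled without a color target")


class LegacyContext:
    """Classic model: any field can change at any time before a draw."""

    def __init__(self):
        self.state = {"blend_enabled": False, "has_color_target": True}
        self.validations = 0

    def set(self, key, value):
        self.state[key] = value

    def draw(self):
        # The whole state has to be rechecked on every draw.
        self.validations += 1
        validate(self.state)


class ExplicitContext:
    """Explicit model: states are validated once, when they are created."""

    def __init__(self):
        self.validations = 0
        self.bound = None

    def create_state(self, blend_enabled, has_color_target):
        # Read-only view, so the state cannot drift after validation.
        state = MappingProxyType(
            {"blend_enabled": blend_enabled, "has_color_target": has_color_target}
        )
        self.validations += 1
        validate(state)
        return state

    def bind(self, state):
        self.bound = state

    def draw(self):
        # Nothing left to check: the bound state is already known good.
        return self.bound
```

With 100 draws, the legacy context validates 100 times and the explicit one validates once. An invalid combination fails at `create_state`, not in the middle of a frame.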
A third issue is that OGL/D3D have the genuinely great goal of preventing and/or detecting synchronization errors in the usage of the API. In other words, you really shouldn't try to have the CPU modify a given block of memory while the GPU is simultaneously reading that same memory in an uncoordinated fashion. OGL and D3D have an interface and implementation designed to prevent/detect/allow-at-a-huge-cost these usage errors as much as possible. In practice, serious programs cannot ship with these errors. That means that in practice, all serious, shipping programs do not have these errors to any significant degree, but the driver is still always doing a large amount of work checking for them all of the time.
The new-style APIs seem more inclined to declare this category of usage errors to be undefined behavior rather than pay the cost to handle them. "Here's how to avoid them. So... avoid them."
A fourth issue is that multi-core computing is much more common and important than it was in the past. OpenGL has never had an interface to issue draw commands from multiple threads of a single process. D3D11 had an interface to record commands on multiple threads and dispatch them on a primary thread, but the consensus is that D3D11's implementation did not work as well as was expected in practice.
Mantle, Metal and DX12 all have new, multi-threaded interfaces that they are quite confident will work well in practice.
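The general shape of "record on many threads, submit from one place" looks something like the sketch below. This is a structural illustration only: `record_command_list` and `build_frame` are hypothetical names, the "commands" are plain tuples, and Python threads will not give you real parallel speedup here. The idea it shows is that each thread fills its own private list, so nothing is shared while recording, and the lists are handed over in a fixed order afterwards.

```python
from concurrent.futures import ThreadPoolExecutor


def record_command_list(objects):
    """Record draw commands for one slice of the scene into a private list."""
    return [("draw", name) for name in objects]


def build_frame(scene, workers=4):
    # Split the scene so each worker records an independent command list.
    chunks = [scene[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in chunk order, whichever thread finishes first.
        command_lists = list(pool.map(record_command_list, chunks))
    # Single submission point: append the recorded lists to one queue.
    queue = []
    for commands in command_lists:
        queue.extend(commands)
    return queue
```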
Much of what I'm describing here is covered in this presentation from Microsoft, "DirectX 12 API Preview": https://www.youtube.com/watch?v=m0QkjKGZQzI
An alternative approach has been proposed by a multi-vendor group of OpenGL driver developers. It was presented in the "Approaching Zero Driver Overhead" (AZDO) talk at GDC 2014. http://gdcvault.com/play/1020791/ and https://www.khronos.org/assets/uploads/developers/library/20...
In the AZDO approach, instead of tossing out the legacy state machine of OpenGL, they demonstrate how some current (fairly cutting edge) features that have recently been added allow a draw state to be set up that is so expressive and so extensive that it can pretty effectively represent a whole, fairly complicated scene of a modern game in a single draw state. Once you set this up, you can pretty much issue a single request to draw much-if-not-all of the current frame as an atomic operation. Further, common frame-to-frame modifications (such as moving objects around) are very cheap in this setup.
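Very loosely, the idea looks like the toy below. This is my own simplification, not code from the talk: `Scene`, `add`, `move` and `multi_draw_indirect` are invented names, and plain Python lists stand in for GPU buffers. Per-object data lives in one big buffer, a second buffer holds one command per object, and a single submission walks the whole command buffer. Moving an object is just a buffer write, with no state change and no extra draw call.

```python
class Scene:
    """Toy scene: one per-object data buffer plus one command buffer."""

    def __init__(self):
        self.transforms = []  # stands in for a large per-object buffer
        self.commands = []    # stands in for an indirect-command buffer

    def add(self, mesh, position):
        index = len(self.transforms)
        self.transforms.append(position)
        self.commands.append({"mesh": mesh, "object_index": index})
        return index

    def move(self, index, position):
        # Frame-to-frame change: overwrite one entry in the data buffer.
        self.transforms[index] = position


def multi_draw_indirect(scene):
    """One submission that draws everything in the command buffer."""
    return [
        (cmd["mesh"], scene.transforms[cmd["object_index"]])
        for cmd in scene.commands
    ]
```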
AZDO is an interesting and perfectly workable approach. I am less of a fan of that approach than I am of the DX12 approach.
I should make this into a blog post... I should start a blog...