I think you're missing the reason why the GPU kicks you out of the fast path when you need that special function. The special function evaluation is fundamentally more expensive in energy, whether that cost is paid in area or time. Evaluation of the special functions with throughput similar to the arithmetic throughput would require much more area for the special functions, which for most computation isn't a good tradeoff. That's why the GPU's designers chose to make your exp2 slow.
Replacing everything with dozens of cascaded special functions makes everything uniform, but it's uniformly much worse. I feel like you're assuming that by parallelizing your "EML tree" in dedicated hardware that problem goes away; but area isn't free in either dollars or power, so it doesn't.