Usually, offloading to the GPU will get you higher throughput but also higher latency. Of course, this assumes a sane software renderer; X11 API calls will take forever.
Really? It's counterintuitive considering the GPU draws the screen anyway. Is it because the shader pipeline has some inherent latency that a software renderer doesn't?