The implementation you use is the textbook "toy" one, though, which is a bit oversimplified and probably biases the perceived speedup. Simple, brute-force algorithms often favor a GPU implementation, whereas communication overhead can really hurt you in a more realistic algorithm.
It would be interesting to see what your comparison looks like with at least spatial subdivision and connectivity analysis (the most obvious algorithmic speedup), proper anti-aliasing, and a functional zoom. Even better if you did interior/exterior distance estimation...
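
To make the last suggestion concrete, here is a rough sketch of an exterior distance estimate. It assumes the fractal in question is the Mandelbrot set (z -> z*z + c), which the thread does not state outright, and it is not taken from the implementation under discussion. The names `exterior_distance`, `max_iter`, and `escape_radius` are made up for illustration. It covers only the exterior case; the interior estimate needs the period of the attracting cycle and is more involved.

```python
import math


def exterior_distance(c, max_iter=256, escape_radius=1e6):
    """Estimate the distance from c to the Mandelbrot set (exterior only).

    Assumption: the iteration is z -> z*z + c starting from z = 0.
    Returns 0.0 when the orbit has not escaped after max_iter steps,
    i.e. the point is treated as inside (or too close to tell).
    """
    z = 0j
    dz = 0j  # derivative of z with respect to c
    for _ in range(max_iter):
        # Update the derivative first, since it needs the previous z.
        dz = 2.0 * z * dz + 1.0
        z = z * z + c
        r = abs(z)
        if r > escape_radius:
            # Standard estimate: 2 * |z| * ln|z| / |dz/dc|
            return 2.0 * r * math.log(r) / abs(dz)
    return 0.0


if __name__ == "__main__":
    for point in (0j, 0.3 + 0j, 3 + 0j):
        print(point, exterior_distance(point))
```

A pixel can then be shaded by comparing the estimate to the pixel spacing, which gives thin filaments some coverage without supersampling. Note that each point needs the extra derivative update per iteration, so the per-pixel cost is higher than the plain escape-time loop.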
If you have a specific problem in mind, I'd be happy to work on adding it as an example.
I'm in search of my first "customer," so to speak. :)
Do any of those modifications appeal? I understand it was student work, but everything I suggested is fairly straightforward.