For reference, the latest Titan X offers 12 TFLOPS [1] and the upcoming AMD card for deep learning [2] offers 13. Though it's not clear whether the TPU performance figure is calculated at fp16 or fp32 [2]. The best GPUs currently available on AWS offer a mere 2 TFLOPS per GPU [3].
[1] https://blogs.nvidia.com/blog/2017/04/06/titan-xp/
[2] http://pro.radeon.com/en-us/vega-frontier-edition/
[3] http://images.nvidia.com/content/pdf/tesla/NVIDIA-Kepler-GK1...
The Tesla V100 is the thing to compare against, as it's the first chip optimized for training via the Tensor Core operation (a 4x4 matrix multiply-and-accumulate with mixed fp16/fp32 precision: the inputs being multiplied are fp16, the accumulation is fp32). The V100 reaches roughly 100 TFLOPS measured this way.
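To make the mixed-precision pattern concrete, here's a minimal CUDA sketch using the WMMA API (CUDA 9+, compiled for sm_70). Note the API exposes the operation at warp level as 16x16x16 tiles rather than the hardware's individual 4x4 ops; the kernel name and dimensions are illustrative, not anything from the comments above.

    #include <mma.h>
    #include <cuda_fp16.h>
    using namespace nvcuda;

    // One warp computes D = A * B + C on a single 16x16 tile.
    // A and B are fp16 (the multiply inputs); C and D are fp32 (the
    // accumulator), matching the mixed-precision scheme described above.
    __global__ void tensor_core_mma(const half *a, const half *b,
                                    const float *c, float *d) {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

        wmma::load_matrix_sync(a_frag, a, 16);  // fp16 multiply inputs
        wmma::load_matrix_sync(b_frag, b, 16);
        wmma::load_matrix_sync(acc_frag, c, 16, wmma::mem_row_major);  // fp32 accumulator

        wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);  // D = A*B + C in one op

        wmma::store_matrix_sync(d, acc_frag, 16, wmma::mem_row_major);  // fp32 result
    }

The kernel has to be launched with at least one full warp, e.g. tensor_core_mma<<<1, 32>>>(a, b, c, d), since all 32 threads cooperate on the tile.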
In fact, to an extent even NVIDIA has realized that there is more money in building a GPU cloud from scratch than in selling GPUs.
I think the net losers are Apple, Amazon/AWS (I believe NVIDIA is responsible for their lackluster GPU offerings), and Intel (who are still hoping for multi-core to work, and are on track to be disappointed, just as they lost the mobile market to ARM while hoping Atom would eventually be adopted).
I feel like that should be "we have designed" or "we've designed". It seems like someone was in the middle of rewriting it and only got halfway there.