Yes, that might be the case. In my case I mostly trained big (tens to hundreds of millions of parameters) networks made mostly of 3x3 convolutions, and I think the V100 has dedicated hardware (Tensor Cores) for that. Then, as I mentioned, you can get a further 2x speedup by using half precision.
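For context, here's roughly what I mean by "using half precision" (a minimal sketch, assuming PyTorch, which I didn't name above; the tiny model and shapes are made up for illustration). On a V100, autocast routes the conv/matmul work through Tensor Cores in fp16, which is where the ~2x comes from; the gradient scaler guards against fp16 underflow.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy network of 3x3 convolutions, just for illustration.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 3, kernel_size=3, padding=1),
)
opt = torch.optim.SGD(model.parameters(), lr=1e-3)

use_cuda = torch.cuda.is_available()
device_type = "cuda" if use_cuda else "cpu"
# GradScaler is a no-op pass-through when disabled (e.g. on CPU).
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)

x = torch.randn(4, 3, 32, 32)  # dummy batch
y = torch.randn(4, 3, 32, 32)

# Autocast runs eligible ops in half precision (fp16 on CUDA;
# bf16 is the supported low-precision dtype on CPU).
with torch.autocast(device_type=device_type,
                    dtype=torch.float16 if use_cuda else torch.bfloat16):
    loss = F.mse_loss(model(x), y)

scaler.scale(loss).backward()
scaler.step(opt)
scaler.update()
```

The key point is that only the forward/backward math drops to half precision; the master weights and optimizer state stay in fp32, so you get the speed without (usually) losing accuracy.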
If you train smaller models, or RNNs, you probably lose most of the gains of the dedicated hardware. But I guess that for this same reason the experiments in the article are little more than a provocation; I don't know if you could train a big network in finite time on M1 chips...
That said, of course, if the budget were mine, I wouldn't buy a V100 :-)