I believe Hinton objects to this gross loss of spatial information for two reasons: 1) Humans don't lose nearly so much spatial information, and Hinton would like his models to ultimately capture a neurologically plausible computation. 2) That spatial information may not be necessary for object detection (ImageNet), but it would likely be important for more sophisticated tasks.
I think he is saying the same thing about max pooling. Just my guess.
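To make the "loss of spatial information" point concrete, here is a minimal sketch in plain Python. It assumes non-overlapping 2x2 max pooling on a toy 4x4 grid; the function name `max_pool_2x2` and the example grids are made up for illustration and are not taken from any of Hinton's work.

```python
def max_pool_2x2(grid):
    """Non-overlapping 2x2 max pooling over a list-of-lists grid.

    Assumes the grid has an even number of rows and columns.
    """
    pooled = []
    for r in range(0, len(grid), 2):
        row = []
        for c in range(0, len(grid[0]), 2):
            # Keep only the largest value in each 2x2 window;
            # where it sat inside the window is discarded.
            row.append(max(grid[r][c], grid[r][c + 1],
                           grid[r + 1][c], grid[r + 1][c + 1]))
        pooled.append(row)
    return pooled


# Two hypothetical inputs with the same activations in different positions.
a = [[9, 0, 0, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 7]]

b = [[0, 0, 0, 0],
     [0, 9, 0, 0],
     [0, 0, 7, 0],
     [0, 0, 0, 0]]

# Both inputs pool to [[9, 0], [0, 7]].
print(max_pool_2x2(a))
print(max_pool_2x2(b))
```

In `a` the two active cells sit in opposite corners, in `b` they are diagonal neighbours, yet the pooled outputs are identical. Anything downstream of the pooling layer cannot tell the two apart, which is the kind of loss I take the objection to be about.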
I wonder if he knew about the Stanford paper that demonstrates this? Or if he just guessed this would happen.
http://googleresearch.blogspot.ca/2014/11/a-picture-is-worth...
Or the Toronto one? http://deeplearning.cs.toronto.edu/i2t
Two institutions he is affiliated with.