My first thought is that it makes sense for averaging a bunch of local predictions to work well on ImageNet: the classes tend to have clearly different local textures, and class-relevant information usually takes up a large part of the image. I would be very curious to see whether the technique is as competitive on other tasks.
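
To make concrete what I mean by averaging local predictions, here is a minimal NumPy sketch. It is not the authors' implementation: `extract_patches`, `classify_by_local_average`, and `toy_local_logits` are names I made up for illustration, the patch size and stride are arbitrary, and the toy scorer stands in for whatever learned patch classifier would actually be used.

```python
import numpy as np

def extract_patches(image, patch, stride):
    """Return every patch x patch crop of an H x W x C image (row-major order)."""
    h, w = image.shape[:2]
    return np.stack([
        image[i:i + patch, j:j + patch]
        for i in range(0, h - patch + 1, stride)
        for j in range(0, w - patch + 1, stride)
    ])

def classify_by_local_average(image, local_logits, patch=8, stride=4):
    """Score each patch independently, then average the class logits over patches.

    `local_logits` is a hypothetical callable mapping one patch to a vector of
    class scores. Patch positions are discarded, so the image-level score
    depends only on which local evidence is present, not on where it is.
    """
    patches = extract_patches(image, patch, stride)
    logits = np.stack([local_logits(p) for p in patches])  # (num_patches, num_classes)
    return logits.mean(axis=0)

def toy_local_logits(p):
    """Toy stand-in (an assumption): class 0 = brightness, class 1 = local contrast."""
    return np.array([p.mean(), p.std()])

image = np.full((16, 16, 1), 0.5)  # flat grey image, no texture at all
scores = classify_by_local_average(image, toy_local_logits)
```

Because the final score is a plain mean over patches, shuffling the patches changes nothing. That is why I would expect this to do well when texture is discriminative and the object fills most of the frame, and why I am less sure about tasks where it does not.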