This is a very poor way to compare operating systems. Each has different optimizations by default. That, and many, many other factors, make this a terrible project. It's unfortunate that so much work was put into testing so few things when many more need to be considered. The set of data (machine variations, software tuning, etc.) is far too small. It doesn't matter how many times one runs the same test on one machine if what is really being tested is the operating system, not the machine.
Now, if it were only a matter of which operating system works best on this one machine without tuning, then this might just be legitimate.