It does not. Both systems being compared are in violation in the same way, so when the results are normalized to just this setting, the comparison is still meaningful. That is, while microbenchmarks should not replace task-relevant testing, they do have utility as a coarse indicator. The results should be taken with a large helping of uncertainty, but they generally point in the right direction as to the relative ordering of the things compared.
In reality, other things often dominate: computational complexity, the appropriateness and optimization of the data structures in use, I/O bounds, cache locality, and specific details of the problem. When things are done properly, these tend to reduce rather than magnify the differences between languages that sit near each other in a relative ordering. When things are not done properly, they slow everything down majorly. This holds especially if idiomatic code is no more expensive to write in any of the compared languages, as is the case here.
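To make the point about complexity concrete, here is a rough sketch in Python. It counts probes instead of measuring time, so it is deterministic. The `LANGUAGE_SLOWDOWN` factor of 3 is purely a made-up assumption standing in for "the slower of two nearby languages", and the function names are mine, not from any benchmark discussed here. Counting one probe per loop iteration is also a simplification.

```python
def linear_search(items, target):
    """Scan left to right; return (index, probes made)."""
    probes = 0
    for i, value in enumerate(items):
        probes += 1
        if value == target:
            return i, probes
    return -1, probes


def binary_search(items, target):
    """Binary search over a sorted list; return (index, probes made)."""
    probes = 0
    lo, hi = 0, len(items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1
        if items[mid] == target:
            return mid, probes
        if items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1, probes


# Hypothetical per-probe cost ratio between two languages (an assumption,
# not a measured figure).
LANGUAGE_SLOWDOWN = 3.0

data = list(range(1_000_000))   # already sorted
target = data[-1]               # worst case for the linear scan

linear_index, linear_probes = linear_search(data, target)
binary_index, binary_probes = binary_search(data, target)

# Modelled cost: "fast" language with the poor algorithm versus
# "slow" language with the appropriate one.
fast_language_linear = linear_probes * 1.0
slow_language_binary = binary_probes * LANGUAGE_SLOWDOWN

print(f"linear probes: {linear_probes}, binary probes: {binary_probes}")
print(f"modelled cost, fast language + linear: {fast_language_linear}")
print(f"modelled cost, slow language + binary: {slow_language_binary}")
```

Under these assumptions the constant factor between the languages is swamped by the choice of algorithm, which is the sense in which a microbenchmark ordering is only a coarse indicator.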