Ah, never mind: (a) even the "non-optimized" C version uses -O3, and (b) the "C" and "optimized C" programs differ not only in compiler flags but also in their actual
source code. Specifically, the "optimized C" version doesn't use the faster random number generator.
If you fix that, on my machine it's 17.3 seconds for the base C version and 13.4 seconds for the optimized one, i.e., roughly 22% less run time from turning on the extra optimizations (-march=native and -ffast-math).
And for whatever it's worth, because some people love hating on GCC in favor of Clang, my Clang timings are 19.5 and 16.5 seconds, respectively.
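
For anyone who wants to check the arithmetic, here is a minimal sketch that turns the four timings above into percentages. The function name `time_reduction_percent` is just something I made up for illustration, and "improvement" is taken to mean the reduction in run time relative to the base version, which is an assumption about how to read the numbers.

```python
def time_reduction_percent(base_seconds, optimized_seconds):
    """Percent of run time saved relative to the base version (hypothetical helper)."""
    return 100.0 * (base_seconds - optimized_seconds) / base_seconds


# Timings quoted above, in seconds: (base C, optimized C).
timings = {
    "GCC": (17.3, 13.4),
    "Clang": (19.5, 16.5),
}

for compiler, (base, optimized) in timings.items():
    saved = time_reduction_percent(base, optimized)
    print(f"{compiler}: {saved:.1f}% less run time")
```

This gives about 22.5% for GCC and about 15.4% for Clang. If you prefer to quote a speedup ratio instead (base time divided by optimized time), the same timings look bigger, so it is worth saying which definition you mean.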