EDIT: Looking at the assembly for the C code, the compiler generates mostly SIMD instructions and uses xmm registers, which is why it's faster. The Go compiler still does not implement autovectorization, which is why it's so much slower in this case.
EDIT2: It seems the Go version also uses SSE here, which is nice. So the unnecessary allocation in my original post was probably the reason for the slowdown.
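
To check whether an allocation is involved, something like the following rough sketch may help. It is not the code from my original post: `sumAlloc`, `sumReuse`, and the buffer size are hypothetical stand-ins, and it only assumes that the slow path builds a fresh slice on every call. It uses `testing.AllocsPerRun` to count heap allocations per call.

```go
package main

import (
	"fmt"
	"testing"
)

// sink keeps the compiler from discarding the results.
var sink float64

// sumAlloc is a hypothetical stand-in for the slow path:
// it allocates a new slice on every call before summing it.
func sumAlloc(n int) float64 {
	buf := make([]float64, n)
	for i := range buf {
		buf[i] = float64(i)
	}
	var s float64
	for _, v := range buf {
		s += v
	}
	return s
}

// sumReuse does the same work on a buffer supplied by the caller,
// so no allocation happens inside the call.
func sumReuse(buf []float64) float64 {
	for i := range buf {
		buf[i] = float64(i)
	}
	var s float64
	for _, v := range buf {
		s += v
	}
	return s
}

func main() {
	const n = 1 << 16 // assumed size, large enough to force a heap allocation
	buf := make([]float64, n)

	// AllocsPerRun reports the average number of heap allocations per call.
	withAlloc := testing.AllocsPerRun(100, func() { sink = sumAlloc(n) })
	reused := testing.AllocsPerRun(100, func() { sink = sumReuse(buf) })

	fmt.Println("allocs per call (fresh slice): ", withAlloc)
	fmt.Println("allocs per call (reused slice):", reused)
}
```

If the first number is non-zero and the second is zero, the allocation is a plausible suspect, though a proper benchmark with `go test -bench . -benchmem` would be needed to confirm how much time it actually costs.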