That's still unreasonably high, isn't it? Even a Go sync.Mutex, not exactly a hot-rod implementation, can be acquired and released in < 50ns on the garbage hardware I have before me.
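For anyone who wants to check that figure on their own machine, here is a rough sketch of the measurement. The function name `measureUncontended` and the iteration count are made up for illustration, and a plain timing loop like this includes loop overhead, so treat the result as a ballpark rather than a proper benchmark.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// measureUncontended (hypothetical helper) returns the average time in
// nanoseconds of one Lock/Unlock pair when a single goroutine is the only
// user of the mutex, so the lock is never contended.
func measureUncontended(iterations int) float64 {
	var mu sync.Mutex
	start := time.Now()
	for i := 0; i < iterations; i++ {
		mu.Lock()
		mu.Unlock()
	}
	elapsed := time.Since(start)
	return float64(elapsed.Nanoseconds()) / float64(iterations)
}

func main() {
	const iterations = 10_000_000 // arbitrary; large enough to amortize timer cost
	fmt.Printf("avg uncontended Lock+Unlock: %.1f ns\n", measureUncontended(iterations))
}
```

The number depends heavily on the CPU, the Go version, and whether the mutex's cache line stays resident, so it is only comparable to other runs on the same machine.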
On Intel (and probably much the same on AMD), a completely uncontended, cache-hit acquisition of a simple spin lock costs ~20 clock cycles, while the release is almost free.