The COW and short string optimizations are not mutually exclusive. If we assume short string optimization is implemented both before and after, then we are back to comparing the atomic increment to allocation. And different allocation approaches can make the cost of heap allocation differ quite substantially. I'd fully expect that some allocation approaches are cheaper than the cache line invalidation from atomic increment, but some others that tend involve a lot of pointer chasing can be rather costly.
Certainly plenty of widely copied strings are short strings, so a COW implementation that lacks the short-string optimization could very easily be a bad bottleneck for multi-core compute.