Computers are shockingly complex. I can't tell you how many times I've reasoned about a system, run the profiler, and discovered I was completely wrong.
When I was working on an interpreter for a Lisp, I implemented my first cut of scopes (all the variables within a scope and their values) as a naive unsorted list of key/value pairs, thinking I'd optimize later. When I came back to optimize, I reimplemented this as a hashmap, but when I ran my test programs, to my horror, they were all 10x slower. I plugged in a hashmap library used in lots of production systems and got a significant 2x performance gain, but it was still slower than looping over an unsorted list of key/value pairs. The fact is, most scopes have fewer than 10 variables, and at that size, looping over a list is faster than a hashmap lookup, constant time or not. I can reason about why this is, but that's just fitting my reasons to the facts ex post facto. Reasoning didn't lead me to the correct answer; observation did.
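For anyone who wants to run this kind of comparison themselves, here is a minimal sketch in Python. It is not the interpreter's actual code: the names make_scope, lookup_list, lookup_dict and measure are made up for illustration, and the scope sizes are arbitrary. Which side wins depends heavily on the language and runtime, so treat the output as something to observe rather than predict.

```python
import timeit


def make_scope(n):
    """Build one scope two ways: unsorted (name, value) pairs and a dict."""
    pairs = [(f"var{i}", i) for i in range(n)]
    return pairs, dict(pairs)


def lookup_list(pairs, name):
    """Linear scan over the unsorted key/value pairs."""
    for key, value in pairs:
        if key == name:
            return value
    raise KeyError(name)


def lookup_dict(table, name):
    """Hash-table lookup."""
    return table[name]


def measure(n, number=2_000):
    """Time looking up every variable in a scope of size n.

    Returns (list_seconds, dict_seconds), each the best of 3 runs.
    """
    pairs, table = make_scope(n)
    names = [key for key, _ in pairs]

    def scan_all():
        for name in names:
            lookup_list(pairs, name)

    def hash_all():
        for name in names:
            lookup_dict(table, name)

    list_time = min(timeit.repeat(scan_all, number=number, repeat=3))
    dict_time = min(timeit.repeat(hash_all, number=number, repeat=3))
    return list_time, dict_time


if __name__ == "__main__":
    # Hypothetical scope sizes; pick ones that match your own workload.
    for size in (2, 8, 32):
        list_time, dict_time = measure(size)
        print(f"{size:>3} vars  list {list_time:.4f}s  dict {dict_time:.4f}s")
```

A micro-benchmark like this only tells you where to look. The numbers that matter are the ones from the real program under its real workload.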
Returning to parallel data structures: the fact is, I don't know why lock-free structures are faster than mutex-based structures. I just know that they are in every situation where I've profiled them.
Reasoning isn't completely useless; it's how you intuit what you should be profiling. But if you're just reasoning about how two alternatives will perform and not profiling them in real-life production systems, you're wasting everyone's time.