GPUs are parallel processors. So, yes, synchronization primitives are the highest priority.
We focused on things that require /different/ implementations in host and device code.
The way you implement `std::binary_search` is the same in host and device code. Sure, we can stick `__host__ __device__` on it for you, but that's not really high-value.
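To make that concrete, here's a rough sketch of a binary search whose body doesn't care which side it runs on. This is not the library's actual code. `HD` and `sorted_contains` are names I made up for illustration, and the macro fallback assumes you might also build the file with a plain C++ compiler.

```cpp
#include <cstddef>

// Under nvcc, __CUDACC__ is defined and the annotation makes the function
// callable from both host and device code. With a plain C++ compiler the
// macro expands to nothing. HD is a hypothetical name, not a library macro.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Returns true if `value` occurs in the sorted range [first, first + n).
// Nothing in the body is specific to the host or to the device.
template <typename T>
HD bool sorted_contains(const T* first, std::size_t n, const T& value) {
    std::size_t lo = 0;
    std::size_t hi = n;
    while (lo < hi) {
        std::size_t mid = lo + (hi - lo) / 2;  // avoids overflow of lo + hi
        if (first[mid] < value) {
            lo = mid + 1;  // value, if present, is to the right of mid
        } else {
            hi = mid;      // value, if present, is at mid or to the left
        }
    }
    // lo is now the first position whose element is not less than value.
    return lo < n && !(value < first[lo]);
}
```

The only thing that changes between the two worlds is the annotation, which is why it's easy to add but doesn't buy you much.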
Synchronization primitives? Clocks? They are completely different. In fact, the machinery that we use to implement both the synchronization primitives and clocks has not previously been exposed in CUDA C++.