CPUs can execute many, many, many instructions in parallel. If all your data fits inside of L1 cache (about 4 clocks of latency), it's actually pretty easy to achieve 2 instructions per clock or more!
Furthermore, modern CPUs are out-of-order processors. So the processor will automatically execute independent instructions to "fill up your latency", at least to some extent.
CPUs even have enough space to handle main memory fetches (Skylake has a reorder buffer with 200+ entries, to help cover the 200+ clocks of latency on a DDR4 memory read or write). As long as you have "enough independent work to do," it's not too bad. Compilers usually figure out independent chunks of work as they unroll loops, for example.
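To make "independent work" concrete, here's a rough sketch in C (my own illustration, function names made up). The first loop is one long dependency chain: every add has to wait for the previous add. The second splits the sum into four chains that don't depend on each other, so the out-of-order core can overlap them. I'm using doubles because compilers typically won't reorder floating-point adds on their own without something like -ffast-math, so here you'd have to do the split by hand.

```c
#include <stddef.h>

/* One accumulator: every add depends on the result of the previous add,
   so the loop is limited by the latency of a floating-point add. */
double sum_one_chain(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four accumulators: four dependency chains that are independent of
   each other, which an out-of-order core is free to overlap. */
double sum_four_chains(const double *a, size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)   /* leftover elements when n is not a multiple of 4 */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}
```

Note the two versions add in a different order, so with real floating-point data the results can differ in the last bits. How much faster the second one is (if at all) depends on the CPU and compiler, so measure before trusting it.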
In my experience, the loop accounting (for (int i = 0; i < 100; i++)) will all execute inside of that latency, in parallel with the work inside of the loop. So there's almost always work to do, at least at the scale of the ~5 to 10 clocks' worth of "misc" functions in any bit of code.
The hard part is coming up with work to do for ~50ns of latency (e.g. DDR4 reads or writes).
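One way to "come up with work" at that scale, sketched in C under my own assumptions (the node struct and function names are hypothetical): if you're chasing pointers, each load's address comes from the previous load, so a cache miss stalls the whole chain. If you happen to have two independent lists to walk, interleaving them gives the CPU a second miss it can have in flight while the first is still waiting on memory.

```c
#include <stddef.h>

struct node {
    struct node *next;
    long value;
};

/* Single chase: the address of each load comes from the previous load,
   so misses are served one after another. */
long sum_list(const struct node *p) {
    long s = 0;
    while (p) {
        s += p->value;
        p = p->next;
    }
    return s;
}

/* Two lists walked together: the p-chain and q-chain are independent,
   so loads from both can be outstanding at the same time. */
long sum_two_lists(const struct node *p, const struct node *q) {
    long s = 0;
    while (p && q) {
        s += p->value + q->value;
        p = p->next;
        q = q->next;
    }
    /* One list may be longer; finish whichever one is left. */
    return s + sum_list(p) + sum_list(q);
}
```

This only helps when the nodes actually miss cache and the two walks really are independent. Whether it pays off on your hardware is something to benchmark, not assume.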