That doesn't mean that dmitrygr is correct. It means that everyone trying to answer him is arguing about the wrong thing.
The hard limit on instruction size is 15 bytes, so a 64-byte cache line will always be able to store at least 4 of them. (Or 3 plus the tail of an instruction from a previous line.) Meanwhile, on the other end, Intel cores can only retire up to 4 μops per cycle. Since each instruction takes at least 1 μop (except for macro-fusion, which only works on short instructions), retirement will always form a bottleneck before decoding can.
And in realistic code where you'd actually see these long instructions, i.e., hot SIMD loops, all the decoded instructions would stay warm and toasty in the μop cache (allegedly holding 6 fixed-size μops per cache line) after the first iteration.
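For anyone who wants to check the worst-case arithmetic above, here is a minimal sketch. The function names are made up for illustration. The retire width of 4 is just the figure quoted in the comment and varies by microarchitecture. It only counts instructions that fit entirely inside one 64-byte line, and it says nothing about real fetch or decode behaviour.

```python
MAX_INSN_BYTES = 15    # architectural x86 limit on instruction length
CACHE_LINE_BYTES = 64  # line size assumed in the comment
RETIRE_WIDTH = 4       # uops/cycle figure quoted above (assumption, not measured)


def whole_instructions(line_bytes=CACHE_LINE_BYTES, insn_bytes=MAX_INSN_BYTES):
    """Minimum number of maximum-length instructions that fit entirely in one line."""
    return line_bytes // insn_bytes


def whole_instructions_after_tail(tail_bytes,
                                  line_bytes=CACHE_LINE_BYTES,
                                  insn_bytes=MAX_INSN_BYTES):
    """Same count when the first tail_bytes of the line hold the end of an
    instruction that started in the previous line."""
    return (line_bytes - tail_bytes) // insn_bytes


if __name__ == "__main__":
    aligned = whole_instructions()
    # A spilled-over tail can be 1..14 bytes long (a 15-byte instruction
    # must have at least one byte in the previous line).
    worst_with_tail = min(whole_instructions_after_tail(t) for t in range(1, MAX_INSN_BYTES))
    print("aligned worst case:", aligned)
    print("worst case with a tail:", worst_with_tail)
    print("aligned case covers retire width:", aligned >= RETIRE_WIDTH)
```

So even with every instruction at the 15-byte limit, an aligned line holds 4 of them, or 3 plus a tail, which is the claim being made.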
I believe that in chip design this doesn't really happen (often). You can optimize away a bottleneck by allocating it more space and power.
I interpret Keller's statement indirectly: the fact that modern x86 CPUs dedicate only a small part of their circuitry to decoding logic suggests that decoding is not a bottleneck (otherwise there would be more circuitry for it).
The decode difficulty may make a 5% difference, but add in the other things people have mentioned and maybe it adds up to 30%. (numbers pulled out of my arse)