Firstly, thanks for the question. As mentioned, I'm not a CPU designer, nor am I trying to teach Intel what to do; I'm mostly relying on the hive mind to check whether I have the right idea.
A second instruction in the pipeline would read from the above-mentioned L0 cache (let's call it a load buffer), much as it would read tentative memory stores from the store buffer.
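To make the idea concrete, here is a toy sketch of what I mean (all names and cycle counts are made up by me, not taken from any real CPU): a load buffer that holds recently fetched values so a second in-flight load to the same address can be serviced from it instead of going to memory, analogous to store-to-load forwarding from a store buffer.

```python
MEMORY_LATENCY = 100  # made-up cycle cost of a real memory fetch
BUFFER_LATENCY = 1    # made-up cycle cost of hitting the load buffer

class LoadBuffer:
    """Toy model: caches the most recently fetched values by address."""

    def __init__(self):
        self.entries = {}  # address -> value of recently fetched loads

    def load(self, memory, addr):
        """Return (value, cycles): buffer hit if possible, else memory."""
        if addr in self.entries:
            return self.entries[addr], BUFFER_LATENCY
        value = memory[addr]
        self.entries[addr] = value  # fill the buffer on a miss
        return value, MEMORY_LATENCY

memory = {0x40: 7}
lb = LoadBuffer()
v1, c1 = lb.load(memory, 0x40)  # first load pays the full memory latency
v2, c2 = lb.load(memory, 0x40)  # second load is served from the buffer
```

Of course, a real design would also have to invalidate or update these entries when a store to the same address retires, which is exactly where I suspect the complications start.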
Also, two memory fetches in parallel would not take twice as long as a single fetch, if that were the solution (which I suspect it is not, since I imagine race conditions would appear).