Basically we tried that and it sort of works, but performance degrades pretty fast with each added frame of delay. The likely issue is that the delay makes credit assignment much harder. Instead of seeing an immediate change in the state (which your critic can interpret as good or bad), you have to wait several frames for an action to land, and during those frames your earlier actions are still taking effect and interfering with the reward signal.
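
To make the delay concrete, here is a rough sketch of how you could simulate it. This is not our actual setup: `ToyEnv`, `DelayedActionEnv`, and the `noop` parameter are made-up names for illustration, and the environment is a trivial 1-D toy. The wrapper just queues each chosen action and executes it `delay` frames later, so the state and reward the agent sees at a given step come from an older action.

```python
from collections import deque


class ToyEnv:
    """Hypothetical 1-D environment: state is a position, reward is -|position|."""

    def __init__(self):
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        # action is expected to be -1, 0, or +1
        self.pos += action
        return self.pos, -abs(self.pos)


class DelayedActionEnv:
    """Hypothetical wrapper: each action is applied `delay` frames after it is chosen."""

    def __init__(self, env, delay, noop=0):
        self.env = env
        self.delay = delay
        self.noop = noop
        self.queue = deque()

    def reset(self):
        # Pre-fill with no-ops so the first `delay` steps have something to execute.
        self.queue = deque([self.noop] * self.delay)
        return self.env.reset()

    def step(self, action):
        # Enqueue the new action, then execute the oldest pending one.
        self.queue.append(action)
        applied = self.queue.popleft()
        return self.env.step(applied)


env = DelayedActionEnv(ToyEnv(), delay=2)
env.reset()
states = [env.step(a)[0] for a in [1, 1, 0, 0]]
print(states)  # [0, 0, 1, 2]
```

With `delay=2`, the two `+1` actions have no visible effect until the third and fourth steps, which is when the agent is already issuing `0`. A critic looking only at the current state and action would credit the movement to the wrong action.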