We can take fMRI scans when people are looking at images and generate blurry blobs that do indeed resemble the images spatially.
We can predict a text label of the image the person is looking at using another technique.
If you run Stable Diffusion (SD) on just the text labels to generate an image, you get the semantic content but not the spatial content.
If you combine the blurry image with the text label and run both through a latent diffusion model (LDM), you get pictures that more closely match both the semantic and the spatial characteristics of the images shown to the person.
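
Here is a rough sketch of why starting from the blurry image can preserve layout. It assumes the combination works like a typical image-to-image diffusion setup, which the summary above does not spell out: the coarse latent is partially noised, then denoised under the text condition. The code shows only the noising half, using the standard DDPM forward formula with a plain linear beta schedule (an assumption, not necessarily the schedule any particular model uses). `z_coarse` is a random stand-in for a latent decoded from fMRI, and `layout_correlation` is a hypothetical helper for illustration.

```python
import numpy as np

def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    # Plain linear DDPM schedule; assumed here for illustration only.
    return np.linspace(beta_start, beta_end, num_steps)

def noise_latent(z0, t, alpha_bar, rng):
    # Forward diffusion: z_t = sqrt(alpha_bar_t) * z0 + sqrt(1 - alpha_bar_t) * eps
    eps = rng.standard_normal(z0.shape)
    return np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps

rng = np.random.default_rng(0)
alpha_bar = np.cumprod(1.0 - linear_beta_schedule())

# Random stand-in for the coarse latent decoded from brain activity (hypothetical).
z_coarse = rng.standard_normal((4, 64, 64))

def layout_correlation(strength):
    # strength in [0, 1]: fraction of the noise schedule applied before denoising.
    t = int(strength * (len(alpha_bar) - 1))
    z_t = noise_latent(z_coarse, t, alpha_bar, rng)
    return float(np.corrcoef(z_coarse.ravel(), z_t.ravel())[0, 1])

corr_low = layout_correlation(0.2)   # light noising: most of the layout survives
corr_high = layout_correlation(0.9)  # heavy noising: layout is mostly erased

print(f"strength 0.2 -> correlation {corr_low:.2f}")
print(f"strength 0.9 -> correlation {corr_high:.2f}")
```

At a low strength the noised latent still correlates strongly with the coarse one, so a text-conditioned denoiser has spatial structure to work from. At a high strength almost nothing survives, which is roughly the text-only case.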