I think you must be right. I think this is the complete sequence of events for the trolley problem neighborhood of latent space:
Somewhere in the network is encoded the concept of a trolley problem, with pulling the lever encoded as the proper response. It also has structures encoded in there that help it take the input text and rephrase it to match a grammar it has predicted to be the proper output. It works backwards from there to construct a response where it correctly describes the outcome of pulling the lever as presented in the prompt - because it's just your own words coming back to you. It explains itself and its reasoning as much as possible, so in its conclusion it regurgitates some ethical principle, like that it's better to sacrifice 2 to save 1.
This is why it will contradict itself by telling you to pull the lever to sacrifice 2 to save 1, and then telling you it's best to save as many lives as possible. From ChatGPT's perspective, it has followed the rules in its books to a T, and as far as it's concerned, the notes it's sliding under the door are fluent Chinese.
I think people more or less said this in other places in the thread, but it didn't click with me until you highlighted it.
It's really just a high-tech version of Pepper's Ghost, with regression in place of the mirrors and staging.