Perhaps a probabilistic FSM describes the actual computational process better, since transformers don’t have a concept equivalent to superposition (I think?). But the FSM framework alone doesn’t seem to capture where the model/machine comes from (what I’m calling the Hamiltonian), nor how a given context window (the subsystem) relates to it. The change of basis involving the attention mechanism (to achieve context-awareness) seems to align better with existing concepts in QM.
One might model the human brain as an FSM as well, but I’m not sure I’d call the predictive ability of the brain an implementation detail.
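
To make the probabilistic-FSM reading above a bit more concrete, here is a minimal Python sketch. It assumes a toy two-token vocabulary and a made-up transition rule; `VOCAB`, `WINDOW`, and `next_token_probs` are all hypothetical stand-ins, not taken from any real model. The state is the context window itself, and each step samples a next token and slides the window by one.

```python
import random
from typing import Dict, List, Tuple

State = Tuple[str, ...]  # the FSM state is the current context window

VOCAB = ("a", "b")  # hypothetical toy vocabulary
WINDOW = 2          # hypothetical context length


def next_token_probs(state: State) -> Dict[str, float]:
    """Toy stand-in for a trained model: P(next token | context window).

    Assumed rule for illustration only: prefer the token that differs
    from the most recent one.
    """
    last = state[-1]
    return {tok: (0.2 if tok == last else 0.8) for tok in VOCAB}


def step(state: State, rng: random.Random) -> State:
    """One probabilistic transition: sample a token, then slide the window."""
    probs = next_token_probs(state)
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    tok = rng.choices(tokens, weights=weights, k=1)[0]
    return state[1:] + (tok,)


def run(start: State, n_steps: int, seed: int = 0) -> List[State]:
    """Return the sequence of visited states, including the start state."""
    rng = random.Random(seed)
    states = [start]
    for _ in range(n_steps):
        states.append(step(states[-1], rng))
    return states


if __name__ == "__main__":
    for s in run(("a", "b"), 5):
        print(s)
```

Note what the sketch leaves out: the transition rule is simply handed to the machine, so nothing here says where it comes from or how one window relates to the model as a whole, which is the gap I’m pointing at.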