Something symbol-adjacent definitely needs to happen inside the model; I assume the same happens in the brain. But it is not purely symbolic.
For instance, consider voicing the end of a letter: “I will definitely not be stabbed in the bac…” (where the word "back" quickly devolves into a line that crosses through the rest of the letter). The text goes from symbolic to contextual: the trailing line implies that the author was stabbed midway through writing, so the voicing must end with a yell of playful agony.
The same goes for calligraphic art such as the Al Jazeera logo, which is intended to be understood as both a sequence of Arabic letters and a depiction of a fire. A model seeing this image for the first time needs to see it both ways at the same time.
But it’s true that we can’t just throw a transformer at the problem, train it from scratch with video inputs, audio outputs, and a sporadic reward, and suddenly have it solve scans of civil engineering exams. The brain can do it, but silicon can't (yet). It is easier to take models trained with simpler losses (tokenized cross-entropy) on simpler problems (next-token prediction) and combine them. That is not true AGI learning, but eventually it will fool people into believing it is.