Cf. Plato’s Cave
The word to explain this is emergence. This is indeed not quite intuitive, but neural networks exhibit many emergent phenomena. It is tied to their ability to perform effective/efficient computation -- after all, the "goal" of nature with our brain design (from which abstractions and knowledge emerge) was also effective cognition. For example, when you feed a large convolutional neural (classifier) network diverse objects and human faces, you can verify experimentally that the convolutional filters resemble "concepts", subdivisions used to assemble a larger whole (nose, eyes, mouth, etc. are the components of a face). That's the strategy of divide and conquer, a basic aspect of efficient cognition/computation. The network has enough neurons, and a good enough prior structure[1], that this effective architecture emerges from gradient descent training. It really is wonderful. You can see it as a primitive/rough, but powerful, form of algorithm search (or algorithm optimization). The best algorithms tend to employ abstractions.
Natural internal representations emerge.
Emotions probably don't fully emerge (in the whole breadth of emotions), although they may exist as internal representations when dealing with human bodies of work. That's because emotions are tied to our motivational system: they compose qualities (qualia) that propel us to do various activities, generally (but not always) tied to straightforward evolutionarily beneficial goals: enjoying eating, craving sleep, having sex, engaging with the community (mammals rely heavily on the group for survival), etc.
Without agency, it's unlikely (but I can't say with certainty) that those emotional qualia would emerge with accurate fidelity, simply because a non-agent model wouldn't employ them to function and wouldn't optimize for the same functions. The extent of emergence is limited to understanding and reproducing human output, not accurately replicating its exact (internal) quality, which is derived from its computational structure and relationship with motivation. It (in this case, GPT-3) only needs to understand those human emotions insofar as that helps predict human behavior to a reasonable accuracy. As the corpus goes to infinity, with a sufficiently diverse expressive[2] literature, you could conjecture emergence is guaranteed[3] (just how large a corpus would we need, though? Who knows). But I find it likely that in practice you really need to set up the network with agency (and train it adequately to exhibit effective, motivated behavior) before it starts reproducing those qualities well, i.e. deriving some of its understanding from practical situations that are too sparse in (or absent from) the training corpus.
[1] In fully connected networks, usually a funnel or hourglass shape; in convolutional networks, a decreasing size in the 2D image domain and an increasing size across it (more channels), as if images were transforming into concepts; this structure is usually baked in (although you can do hyperparameter optimization, etc).
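To make the shape in [1] concrete, here is a rough sketch of how tensor shapes evolve through a convolutional stack. It assumes each stage halves the spatial resolution and doubles the channel count, which is a common convention but not the only one; the function name and the numbers are hypothetical, chosen only for illustration.

```python
def conv_stack_shapes(height, width, channels, num_stages):
    """Return the (height, width, channels) shape after each stage.

    Assumption: every stage halves the spatial size (e.g. stride-2
    convolution or pooling) and doubles the number of channels.
    """
    shapes = [(height, width, channels)]
    for _ in range(num_stages):
        height, width, channels = height // 2, width // 2, channels * 2
        shapes.append((height, width, channels))
    return shapes


# Hypothetical input: a 224x224 feature map with 16 channels, 4 stages.
shapes = conv_stack_shapes(224, 224, 16, 4)
for h, w, c in shapes:
    print(f"{h:>3} x {w:>3} spatial, {c:>3} channels")
```

The spatial grid shrinks while the channel dimension grows, which is the "images transforming into concepts" picture: fewer positions, more features per position.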
Finally, agency essentially cannot emerge (unless your training code has serious bugs, and even then I can't find a plausible way) from a (purely) predictive/generative neural network. There is simply no concept of itself, much less of its own goals, anywhere in its structure (only the concept of other persons/characters/things, or even some understanding of the agency of other things) or in its training objective. Worse, it never has the opportunity to exercise this goal-oriented behavior in a comprehensive setting (again, this depends on the training corpus).
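As a rough illustration of the training-objective point, here is a minimal sketch of a next-token prediction loss, assuming the standard average cross-entropy over observed tokens. The function name and the toy probabilities are made up for the example. Note that the only thing scored is how well the observed next token was predicted; no term refers to the model's own goals or to the consequences of its outputs.

```python
import math


def next_token_loss(predicted_probs, target_ids):
    """Average negative log-likelihood of the tokens that actually came next.

    predicted_probs: one dict per position, mapping token id -> probability.
    target_ids: the observed next token id at each position.
    """
    total = 0.0
    for probs, target in zip(predicted_probs, target_ids):
        total += -math.log(probs[target])
    return total / len(target_ids)


# Toy example with a 3-token vocabulary (hypothetical values).
predicted = [
    {0: 0.5, 1: 0.25, 2: 0.25},
    {0: 0.1, 1: 0.8, 2: 0.1},
]
targets = [0, 1]
loss = next_token_loss(predicted, targets)
print(loss)
```

Gradient descent on this quantity can only push the model toward better predictions of the corpus; anything like goal-directed behavior would have to be introduced by a different objective or training setup.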
[2] In terms of expressing internal states, and accurately describing our actions.
[3] Then there's the question of whether the representational structure we have is a unique solution -- i.e. whether there are other ways of feeling the same feelings while acting exactly the same.
Note: I intend to flesh this argument out into an article and post it here later -- I find this a quite recurrent question about neural network behavior.
[1] https://www.lesswrong.com/posts/8QzZKw9WHRxjR4948/the-futili...
[0] Assuming you have some utility function you can maximize.
Essentially, the missing pieces in the picture come down to input and output modules: "How do you formulate any given problem into a form that a language model can answer?"
Language is only part of it, and you can't get complete understanding without integrating spatial information. Take a look at Josh Tenenbaum's work for an explanation of why.