This seems like the part that's furthest away to me.
Put ChatGPT into a robot with a body, restrict its computations to just the hardware in that brain, set up that narrative, give the body the ability to interact with the world like a human body does, and you probably get something much more like agency than the prompt/response way we use it today.
But I wonder how it would separate "its memories" from what it was trained on, especially when it comes to having a coherent internal motivation and an individually created set of goals vs. just constantly re-creating new output based primarily on what was in the training data.