You would probably need a multimodal AI that could handle natural language, vision, and control of the robot. At that point it's probably as smart as a dog or any animal other than humans. And maybe you could argue there isn't any fundamental difference between dog-level AI and human-level AI; the latter is just a "scaled up" version.
I think what you mention, though, is more like gluing an LLM onto the current stack. But I doubt that will ever be enough; you probably need a multimodal model.