I have to disagree with the objections to the term; my usage here has a precise meaning: when you train large (and efficient) neural networks, abstract internal representations emerge. "Emerge" simply means they weren't programmed explicitly, but the training procedure and other properties are such that they arise anyway.
I agree that's a plausible way to build an agent. My question would be whether it could really perform in the real world, given the limitations of the corpus and the lack of any real training as an agent (and not just as a text predictor). This could be addressed using simulated (and possibly multi-agent) training environments and the like.