The prompts have to read like well-written requirements for something, so they have some degree of specificity.
But the fact that it can follow instructions and carry them out could almost certainly be considered some form of thinking, especially on novel text that isn't on the internet.
See the Rome example on this page: https://oneusefulthing.substack.com/p/feats-to-astonish-and-... This is essentially a completely novel answer to an /r/AskHistorians-style question, which I would consider one of the most difficult types of internet text to model, given how much understanding you need and how many concept webs you have to tie together.
Here's another example of GPT-4 doing non-trivial world modelling: How would three philosophers review the TV show Severance? https://i.imgur.com/FBi31Qw.png
(I'm not the person who wrote the grandparent of the present comment.)