The part where they talk about sampling multiple runs is interesting. It suggests to me that in the next few years, as the reasoning process improves, the models may be able to do that autonomously.
My mind really goes to using a dedicated object detection model fine-tuned with nutrition information, but I don't think there's a fundamental reason LLMs can't eventually manage this use case, except perhaps that the weights needed would be prohibitively large.