How do you envision the correctness of these solutions being judged? If by other LLMs, then we run into a problem of infinite regress: something has to judge the correctness of the judges. If by humans, then you'd need some way to motivate expert or semi-expert humans (so that their ratings are themselves correct) to participate in a massive project of evaluating a constant stream of content from generators that never sleep.