1. Algorithmically generate an underdrawing (e.g. place numbers and shapes randomly in the underdrawing).
2. Algorithmically generate a description of the underdrawing (e.g. for each shape, output text like "there is a square with the number three in the top left corner"). You might fuzz this by having an LLM rewrite the descriptions in a variety of ways.
3. Generate a "ground truth" image using the underdrawing and an image+text-to-image model.
4. Use the generated description and the generated "ground truth" image as training data for a text-to-image model.
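
Steps 1 and 2 could be sketched roughly as follows. This is only an illustration under my own assumptions: the 3x3 grid, the shape list, the sentence template, and the names `generate_underdrawing`, `region_name`, and `describe` are all hypothetical. Rasterizing the underdrawing, the image+text-to-image call in step 3, and the training in step 4 are not shown.

```python
import random

# Hypothetical vocabulary; any set of shapes and labels would do.
SHAPES = ["square", "circle", "triangle"]
NUMBER_WORDS = ["zero", "one", "two", "three", "four",
                "five", "six", "seven", "eight", "nine"]
ROWS = ["top", "middle", "bottom"]
COLS = ["left", "center", "right"]


def generate_underdrawing(n_items, rng):
    """Step 1: put n_items numbered shapes into distinct cells of a 3x3 grid."""
    cells = [(row, col) for row in range(3) for col in range(3)]
    chosen = rng.sample(cells, n_items)  # sampling without replacement, so no overlaps
    return [
        {
            "shape": rng.choice(SHAPES),
            "number": rng.randrange(10),
            "row": row,
            "col": col,
        }
        for row, col in chosen
    ]


def region_name(row, col):
    """Map a grid cell to a phrase such as 'top left' or 'center'."""
    if row == 1 and col == 1:
        return "center"
    return f"{ROWS[row]} {COLS[col]}"


def describe(items):
    """Step 2: emit one templated sentence per shape in the underdrawing."""
    return " ".join(
        f"There is a {item['shape']} with the number "
        f"{NUMBER_WORDS[item['number']]} in the "
        f"{region_name(item['row'], item['col'])} of the image."
        for item in items
    )


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so a dataset can be regenerated
    underdrawing = generate_underdrawing(3, rng)
    print(underdrawing)
    print(describe(underdrawing))
```

Because the description is derived from the same data structure as the underdrawing, the text is correct by construction; the LLM rewrite mentioned in step 2 would then only need to vary the wording, not the facts.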