There's similarity here with, for example, defining the architecture of software, but letting an LLM write the functions. Or asking an LLM to write you the SQL query for your data analysis, rather than asking it to do your data analysis for you.
What I'd really like to see is a better-defined taxonomy of work, and studies on which bits work well with LLMs and which don't. I understand some of this intuitively, but am still building my intuition, and I see people tripping up on this all the time.
Not so long ago, this was what early adopters of LLM coding assistants claimed was the right way to use them in coding tasks: prompt to draft the outline, and then prompt to implement each function. There were even a few posts on HN about blog posts showing off this approach, with terms inspired by animation work.
People keep throwing this phrase around in relation to LLMs, when not a single “fundamental limitation” has been rigorously demonstrated to exist, and many tasks that were claimed to be impossible for LLMs two years ago supposedly due to “fundamental limitations” (e.g. character counting or phonetics) are non-issues for them today even without tools.
Are you using only frontier models that are gated behind openai/anthropic/google APIs? Those use tools to help them out behind the scenes. It remains no less impressive, but I think we should be clear.
Side note: whenever I read prompts for image generation, I notice very specific details which the model obviously ignored. Here the chocolates / candies in the last two images look anything but artisanal. They look very "sterile" and mass-produced. The viewing angle is also not accurate.
Why do we even bother writing such elaborate prompts, when the model ignores most of it anyway?
I’m surprised the image models aren’t already doing this, so I wanted to share, since I’m finding this so useful.
Example: In the past I'd use a similar approach to lay out architectural visualizations. If you wanted a couch, chair, or other furniture in a very specific location, you could use a tool like Poser to build a simple scene as an approximation of where you wanted the major "set pieces". From there, you could generate a depth map and feed that into the generative model, at the time SDXL, to guide where objects should be placed.
It’s a useful trick to have in one’s toolbox, and I’m grateful to the author for sharing it.
So much LLM/GenAI craze. An entire article to show that it's nearly there, yet it's not, despite a convoluted effort to make it work on a very, very niche example.
But I foresee the opposite. This kind of tool use will soon be integrated and hidden, such that people will eventually say "see, we solved the problem that AI can't do 123+456, now we are really, really close to AGI". Yeah, no: with an AGI, it would have been the AGI itself that came up with needing a tool, built the tool, and then used the tool. But that's not what LLMs are. They are statistical machines to predict tokens. They are very good at it, but that's not an AGI.
It seems to be a very effective pattern. Curious if there are other examples out there. Or other names for this?
1. Algorithmically generate an underdrawing (e.g. place numbers and shapes randomly in the underdrawing)
2. Algorithmically generate a description of the underdrawing (e.g. for each shape, output text like "there is a square with the number three in the top left corner"). You might fuzz this by having an LLM rewrite the descriptions in a variety of ways.
3. Generate a "ground truth" image using the underdrawing and an image+text-to-image model.
4. Use the generated description and the generated "ground truth" image as training data for a text-to-image model.
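For what it's worth, steps 1 and 2 of the list above are easy to sketch. This is only a rough illustration under my own assumptions: the item format (shape, number, x, y), the three-by-three region naming, and every function name here are made up for the example, and steps 3 and 4 (the image models) are left out entirely.

```python
import random

SHAPES = ["square", "circle", "triangle"]
NUMBER_WORDS = ["zero", "one", "two", "three", "four",
                "five", "six", "seven", "eight", "nine"]


def region_name(x, y, width, height):
    """Map a point to a coarse region of the canvas, e.g. 'top left'."""
    row = ["top", "middle", "bottom"][min(2, int(3 * y / height))]
    col = ["left", "center", "right"][min(2, int(3 * x / width))]
    if (row, col) == ("middle", "center"):
        return "center"
    return f"{row} {col}"


def make_underdrawing(n_items, width=512, height=512, seed=None):
    """Step 1: place random shapes carrying random digits on the canvas."""
    rng = random.Random(seed)
    return [
        {
            "shape": rng.choice(SHAPES),
            "number": rng.randrange(10),
            "x": rng.randrange(width),
            "y": rng.randrange(height),
        }
        for _ in range(n_items)
    ]


def describe(items, width=512, height=512):
    """Step 2: turn the underdrawing into a plain-text description."""
    sentences = []
    for item in items:
        where = region_name(item["x"], item["y"], width, height)
        word = NUMBER_WORDS[item["number"]]
        sentences.append(
            f"There is a {item['shape']} with the number {word} "
            f"in the {where} of the image."
        )
    return " ".join(sentences)


if __name__ == "__main__":
    drawing = make_underdrawing(3, seed=0)
    print(describe(drawing))
```

Because the description is derived from the same data as the underdrawing, the text and the layout cannot disagree, which is the whole point of generating both algorithmically.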
But if you use the LLM to generate code, such as the markup an SVG graphic is made of, and then use a rasterized image of that SVG as an input to the diffusion model, that image takes the place of the raw noise input and guides the denoising process of the diffusion model to put the numerical parts in the right spots.
The LLM is putting the SVG in the right order because the code that drives the SVG is just that - code - and the numerical order is easily defined there, even if it has to follow something like a spiral.
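To make the "it's just code" point concrete, here is a minimal sketch of what such a spiral might look like when written by hand. It is not what any particular LLM emits; the function name, the Archimedean spiral, and the default sizes are all my own assumptions. The thing to notice is that the order is simply the loop index, so it can't come out wrong.

```python
import math


def spiral_number_svg(count, size=512, turns=3.0, font_size=20):
    """Return an SVG string with the numbers 1..count laid out on a spiral.

    The first number sits at the center and the last one near the edge.
    """
    cx = cy = size / 2
    max_radius = size / 2 - font_size
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{size}" '
        f'height="{size}" viewBox="0 0 {size} {size}">',
        f'<rect width="{size}" height="{size}" fill="white"/>',
    ]
    for i in range(count):
        # t runs from 0 (center) to 1 (outer end of the spiral).
        t = i / max(count - 1, 1)
        angle = 2 * math.pi * turns * t
        radius = max_radius * t
        x = cx + radius * math.cos(angle)
        y = cy + radius * math.sin(angle)
        parts.append(
            f'<text x="{x:.1f}" y="{y:.1f}" font-size="{font_size}" '
            f'text-anchor="middle" dominant-baseline="middle">{i + 1}</text>'
        )
    parts.append("</svg>")
    return "\n".join(parts)


if __name__ == "__main__":
    with open("spiral.svg", "w", encoding="utf-8") as f:
        f.write(spiral_number_svg(30))
```

You would then rasterize the resulting file with whatever tool you like and hand that image to the diffusion model as the guide.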
Edit: LLMs may now also be using thinking modes, with feedback during generation, to help with complex positioning when drawing something like an SVG. I just asked Claude to generate one such spiral number SVG, and it did so interactively via thinking; the generated code is incredibly explicit about positions, so that must help. But the underlying idea of a two-step SVG-to-diffusion pipeline is the real key here.
At the end of the day we can get so much done just by breaking down a problem into smaller problems.
It should be fairly trivial to fix any logic errors in the structured output, too.
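As a rough idea of what "fixing logic errors" could mean in practice, here is a small check you could run on the structured output before rasterizing it. The layout format (a list of dicts with number, x, y) is a hypothetical one I picked for illustration, not something from the article.

```python
def check_layout(items, expected_count, width, height):
    """Return a list of human-readable problems found in a number layout.

    Each item is assumed to be a dict with "number", "x" and "y" keys.
    An empty list means no problems were found.
    """
    problems = []
    seen = set()
    for item in items:
        n, x, y = item["number"], item["x"], item["y"]
        if n in seen:
            problems.append(f"number {n} appears more than once")
        seen.add(n)
        if not (0 <= x <= width and 0 <= y <= height):
            problems.append(f"number {n} is outside the canvas at ({x}, {y})")
    for n in range(1, expected_count + 1):
        if n not in seen:
            problems.append(f"number {n} is missing")
    for n in sorted(seen):
        if not (1 <= n <= expected_count):
            problems.append(f"number {n} is not in 1..{expected_count}")
    return problems


if __name__ == "__main__":
    layout = [
        {"number": 1, "x": 10, "y": 10},
        {"number": 1, "x": 50, "y": 50},
        {"number": 3, "x": 900, "y": 20},
    ]
    for problem in check_layout(layout, expected_count=3, width=512, height=512):
        print(problem)
```

Any problems it reports can be fed straight back to the LLM as a correction prompt, or fixed by hand.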
LLMs are evolving so fast I wouldn’t be surprised if this technique was not needed in <6 months.
Still emotionally unrelatable, but it was definitely providing something that matched the specifications, as long as they are explicit and systematically enforced through deterministic means. For now I maintain that LLMs' limitations are that they can't grasp the ineffable, and that they are so untrustworthy they can only be employed under very clear and inescapable constraints, or they will go awry as surely as water is wet.