I'm using "CLIP" here generically to refer to the families of models that generate captions with CLIP as the encoder, of which there are plenty on "The Hub".
Have you actually tried the approach I think you're suggesting on anything more complex than "this is a yellow cat"? Not trying to be snarky, I'm genuinely curious. I've done a few of these projects, and this approach never comes close to meeting user expectations in the real world.