Probably should try more base models, given that the weaknesses sound characteristically mode-collapsed, like ChatGPTese. (Prompt engineering tweets with base models is near-trivial - just include a bunch of example tweets in the prompt and let the model continue in kind.)
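That few-shot setup can be sketched in a few lines; the example tweets and the `Tweet:` delimiter here are placeholders, not anything from the original discussion:

```python
# Hypothetical few-shot prompt for a base model: prepend example tweets so
# the model continues in the same style. Example tweets are placeholders.
EXAMPLE_TWEETS = [
    "Tweet: The best code is the code you never had to write.",
    "Tweet: Compilers are just very opinionated proofreaders.",
]

def few_shot_prompt(examples: list[str]) -> str:
    """Join example tweets and end on an open 'Tweet:' for the model to complete."""
    return "\n".join(examples) + "\nTweet:"
```

A base model completing this prompt will tend to emit another line in the same format, which sidesteps most of the instruction-tuned house style.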
It would also be more relevant to measure win rates after generating a large batch of candidates and keeping the best-rated one. There is no point optimizing for mean win rate when what you actually care about is the best tweet you can get out: you can't tweet every tweet an LLM generates to begin with, and you want the single best candidate, not reliable mediocrity.
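A best-of-n selection loop along those lines might look like this sketch, with hypothetical stand-ins for the generator and the rating model (a real version would call an LLM API and a trained preference/reward model):

```python
import random

def generate_candidates(prompt: str, n: int, rng: random.Random) -> list[str]:
    """Stub generator: sample n candidate tweets (random variations here)."""
    return [f"{prompt} (draft {rng.randint(0, 10_000)})" for _ in range(n)]

def rate(tweet: str) -> float:
    """Stub rating function; a real one would be a preference/reward model."""
    return len(tweet)  # placeholder score, illustrative only

def best_of_n(prompt: str, n: int = 32, seed: int = 0) -> str:
    """Generate a batch of candidates and keep only the top-rated one."""
    rng = random.Random(seed)
    candidates = generate_candidates(prompt, n, rng)
    return max(candidates, key=rate)
```

Evaluating `best_of_n` output against the baseline, rather than comparing mean single-sample quality, is what matches the actual use case: one posted tweet drawn from many drafts.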