Maybe 1 is actually what you just suggested - an RL approach to select the strategy for 2. Thank you for implementing optillm and working out all the various strategy options; it’s a really neat reference for understanding this space.
One thing I’m very curious about is how they get a score to use in the RL. In well-defined games it’s easy to understand, but in this LLM output context, how does one rate the output for use in an RL setup?