HOWEVER, creating a QA benchmark is labor-intensive! So what I have created is a procedure that runs programmatically, with no QA benchmark set needed at all. The procedure is, in essence, to probe the LLM many times with an opinion question in the desired domain and measure how consistent or inconsistent its responses are. The fracturing of the answers is quantified by "response dispersion", and the preprint I linked shows a strong inverse correlation between response dispersion and accuracy on QA benchmark datasets. The point, of course, is not for people to run that comparison themselves, but to use response dispersion alone to get roughly the same result they would have gotten by going through the entire QA benchmarking process.
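To make the idea concrete, here is a minimal sketch of the probe-and-measure loop. Note the assumptions: `llm_answer` is a hypothetical stand-in for whatever client you use to query the model, and dispersion is measured here as mean pairwise Jaccard dissimilarity over word sets purely for illustration; the paper's actual dispersion metric may differ (check the preprint for the real definition).

```python
import itertools

def jaccard_dissimilarity(a: str, b: str) -> float:
    """1 minus Jaccard similarity of the two responses' lowercase word sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)

def response_dispersion(responses: list[str]) -> float:
    """Mean pairwise dissimilarity across all sampled responses.

    Higher dispersion = more fractured answers; per the preprint, this
    inversely correlates with QA benchmark accuracy.
    """
    pairs = list(itertools.combinations(responses, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_dissimilarity(a, b) for a, b in pairs) / len(pairs)

# Usage sketch: sample the same opinion question many times, then score.
# (llm_answer is a placeholder for your own API call.)
# responses = [llm_answer("What matters most in domain X?") for _ in range(20)]
# dispersion = response_dispersion(responses)  # lower = more consistent model
```

You would run this once per candidate LLM on the same opinion question(s) and prefer the model with the lowest dispersion, rather than building a benchmark.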
I'm posting it here because I am requesting constructive criticism on my preprint before submitting it to a journal. The paper itself is geared a little more toward the NLP research community than the average developer, but one of its main products is meant to benefit the average developer who is building AI-powered applications (by which I mean dropping an LLM in at some point in the pipeline) and wants a quick and cheap (nearly free) way to compare LLMs for their application domain.
I will reply to every response here; my replies will be early drafts of improvements I intend to make to the paper, so please criticize those as well. Thank you!