I ran a simple experiment to understand whether an LLM's self-rated confidence in an answer reflects the actual probability of the LLM generating that answer.
I've always been skeptical of prompting techniques that ask the LLM to output a numeric score or confidence level. The results from this experiment suggest that LLMs tend to understate their own confidence, and that "self-rated" scores may reflect what the LLM thinks is a "safe" answer more than an accurate representation of its actual confidence.
I'm curious about this area because the startup I'm building does AI-powered E2E testing, and I'd like a more objective way to figure out when a decision made by the agent is low-confidence so that it can be re-assessed.
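
To make the comparison concrete, here is a minimal sketch of one way to put a self-rated confidence next to the probability the model actually assigned to its answer. It assumes the API returns per-token log-probabilities for the generated answer, which not every provider does. The function names, the 0.8 threshold, and the numbers are hypothetical illustrations, not results from the experiment.

```python
import math


def answer_probability(token_logprobs):
    """Probability of the full answer: exp of the summed per-token log-probs."""
    return math.exp(sum(token_logprobs))


def confidence_gap(self_rated, token_logprobs):
    """Actual probability minus self-rated confidence (both on a 0-1 scale).

    A positive gap means the model understated its confidence.
    """
    return answer_probability(token_logprobs) - self_rated


def needs_review(token_logprobs, threshold=0.8):
    """Flag a decision for re-assessment when its probability is below threshold.

    The threshold is an arbitrary placeholder and would need tuning.
    """
    return answer_probability(token_logprobs) < threshold


# Made-up values for illustration only.
example_logprobs = [-0.01, -0.02, -0.005]  # log-probs of the answer's tokens
example_self_rated = 0.7                   # the model said "70% confident"

print(f"actual probability: {answer_probability(example_logprobs):.3f}")
print(f"gap: {confidence_gap(example_self_rated, example_logprobs):+.3f}")
print(f"needs review: {needs_review(example_logprobs)}")
```

Multiplying token probabilities penalizes longer answers, so this kind of comparison is probably most meaningful for short, constrained outputs such as a single choice or label.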