simple answer is our reporting was pass@5. feel like you'd need 50+ runs to get reasonable confidence intervals, which somehow i don't see other people do, so i also didn't insist on it.
hoping to work with <prominent third party evals shop> to get this on their infra and evaluated along with whatever the industry standard is.
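
for context on the pass@5 vs 50+ runs point, here's a rough sketch, not what was actually run. it assumes the commonly used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k) for n runs with c passes, plus a plain percentile bootstrap over problems for the interval. the function names (`pass_at_k`, `bootstrap_ci`) and the pass counts in the example are made up for illustration.

```python
import math
import random


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n runs of one problem, c of which passed.

    Hypothetical helper: assumes the standard unbiased estimator,
    not necessarily what the reported numbers used.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Chance that a random size-k subset of the n runs contains no pass.
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def bootstrap_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-problem scores."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_boot):
        # Resample problems with replacement, keeping the sample size fixed.
        resample = rng.choices(scores, k=len(scores))
        means.append(sum(resample) / len(resample))
    means.sort()
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


if __name__ == "__main__":
    # Made-up pass counts for 5 problems at 50 runs each (illustration only).
    n_runs, k = 50, 5
    passes = [0, 3, 50, 12, 1]
    scores = [pass_at_k(n_runs, c, k) for c in passes]
    mean = sum(scores) / len(scores)
    lo, hi = bootstrap_ci(scores)
    print(f"pass@{k} = {mean:.3f}, 95% CI ~ [{lo:.3f}, {hi:.3f}]")
```

with only 5 runs per problem, pass@5 collapses to "did any run pass", so each problem contributes just a 0 or 1. more runs per problem give a smoother per-problem estimate. note the bootstrap here only captures variation across problems, not run-to-run noise within a problem.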