
vunderba · 1y ago

It's hard to nail down a good objective metric for something that is always going to be at least partly qualitative, but it's a good callout - I should probably add a FAQ to the site.

To clarify, this test is purely PASS/FAIL - unsuccessful means that the model NEVER managed to generate an image adhering to the prompt. As an example, Midjourney 7 did not manage to generate the correct vertical stack of translucent cubes ordered by color in 64 generation attempts.
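For a rough sense of what 64 straight failures implies, here is a minimal sketch. It assumes every generation attempt is independent and has the same success probability, which is a simplification and not something the site measures; the function name `zero_success_upper_bound` is made up for illustration.

```python
def zero_success_upper_bound(attempts: int, confidence: float = 0.95) -> float:
    """Largest per-attempt success rate still plausible after zero successes.

    Assumes independent attempts with a fixed success probability p.
    Solves (1 - p) ** attempts = 1 - confidence for p.
    """
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return 1.0 - (1.0 - confidence) ** (1.0 / attempts)


# 64 failed attempts, as in the Midjourney 7 example above.
bound = zero_success_upper_bound(64)
print(f"95% upper bound on per-attempt success rate: {bound:.3f}")
```

Under those assumptions, 64 failures in a row only rules out success rates above roughly 4.6% at the 95% level, so "never in 64 tries" is strong evidence of a low rate but not proof of zero.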

It's a little beyond the scope of my site but I do like the idea of maintaining a more granular metric for the models that were successful to see how often they were successful.

bigmadshoe · 1y ago

Makes sense. It just set off some statistical alarm bells in my head to see a model marked as passing with 1 trial, and some models marked as failing with 5. What if the probability of success is 5% for both models? How confident are we that our grading of the models is correct? It's an interesting problem.
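To put numbers on the 5% hypothetical, here is a small sketch. It assumes attempts are independent with a fixed per-attempt success probability; the helper name `p_at_least_one_success` is hypothetical.

```python
def p_at_least_one_success(p: float, attempts: int) -> float:
    """Chance of passing a PASS/FAIL test given `attempts` independent tries."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    return 1.0 - (1.0 - p) ** attempts


p = 0.05  # hypothetical per-attempt success rate shared by both models

pass_with_1 = p_at_least_one_success(p, 1)        # model marked PASS after 1 trial
fail_with_5 = 1.0 - p_at_least_one_success(p, 5)  # model marked FAIL after 5 trials

print(f"P(pass within 1 attempt)  = {pass_with_1:.3f}")
print(f"P(fail all of 5 attempts) = {fail_with_5:.3f}")
```

With p = 5% for both, a model that gets five attempts still fails about 77% of the time, while a single lucky attempt passes 5% of the time. Two equally capable models can easily land on opposite sides of the PASS/FAIL line at these trial counts.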

Cool site btw! Thanks for sharing.

npinsker · 1y ago

The current metric is actually quite strong -- it mirrors the real-world use case of people trying a few times and being satisfied if any of them's what they're looking for. It rewards diversity of responses.

Actually, search engines do this too: search Google for something with many possible meanings -- like "egg" -- and you'll get a set of intentionally diversified results. I get Wikipedia; then a restaurant; then YouTube cooking videos; Big Green Egg's homepage; news stories about egg shortages. Each individual link is very unlike the others to maximize the chance that one of them's the one you want.

Taek · 1y ago

It's made a little bit better by the fact that there's something like a dozen different prompts. Across all of the prompts, each model had a fair number of opportunities to show off.
