
0 points · baby · 2y ago

How do people compare different models? I've been trying ChatGPT 3.5, Bing Chat (GPT-4, I believe?), Bard, and now this one, and I'm not sure there's a noticeable difference in terms of "this is better."

9 comments · 4 top-level

jimmySixDOF · 2y ago · 4 in thread

Try the Chatbot Arena, which has Elo ratings based on blind side-by-side tests by end users. It is run out of UC Berkeley by LMSYS, the same team that released Vicuna.

https://arena.lmsys.org/

baby (OP) · 2y ago

This is awesome! So basically GPT-4 is the winner, far ahead of the alternatives. I don't see Bard in the ranking, though.

netsec_burn · 2y ago

It's outdated.

stavros · 2y ago

That's a terrible system; it doesn't represent gaps in performance. If the first model is orders of magnitude better than the second, that system still says "99% as good" or whatever.

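sebzim4500 · 2y ago

The relative difference (the ratio) between Elo ratings is meaningless; you need to look at the absolute difference. In Elo, the point gap is what sets the expected win rate: a 200-point gap means the higher-rated model is expected to win about 76% of head-to-head comparisons, whatever the two ratings themselves are.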
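
To make the point about absolute differences concrete, here is a minimal sketch of the textbook Elo formulas. The 400-point scale and the K-factor of 32 are the conventional chess defaults and are assumptions here; the arena's leaderboard may use different constants, and the function names are purely illustrative.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score (roughly, win probability) of A against B under standard Elo.

    Only the difference rating_a - rating_b enters the formula, so the
    ratio between the two ratings carries no information.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


def update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return both ratings after one comparison.

    score_a is 1.0 if A was preferred, 0.0 if B was preferred, 0.5 for a tie.
    k (assumed value) controls how far a single result moves the ratings.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    # Whatever A gains, B loses, so the total number of points is conserved.
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # Same 200-point gap, very different ratios, identical expected score.
    print(expected_score(1200, 1000))
    print(expected_score(2200, 2000))
```

So a leaderboard where the top two models sit at 1200 and 1000 describes the same head-to-head gap as one where they sit at 2200 and 2000, even though "percent as good" would look very different.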

dotancohen · 2y ago · 1 in thread

Depends on the task. For code, ask it to implement a not-difficult but not-trivial feature. "Please add hooks to the AnkiDroid source code so that addons would be viable" might be a good start, for something that is on my mind. Then compare implementations.

For checking hallucinations, ask it about events and trivia from eons ago, and also from within the last decade. Try some things that it cannot possibly know, like how much celery Brad Pitt likes in his salad.

rajko_rad · 2y ago

This is an emerging space with lots of interesting tools coming out. There are many established benchmarks out there (e.g., those included on the front page of the Llama 2 release), but most product builders have their own sets of evals that are more relevant to them.

Here is a thread exploring differences between Llama 2 and GPT-3.5: https://twitter.com/rajko_rad/status/1681344850510376960

losteric · 2y ago

Develop a set of queries for the use case, with human review of outputs. My team has an internal (corporate) tool where we drop in an S3 file, complete the text with K models, then evaluate the completions with appropriate human labor pools. Each evaluator gets a pair of outputs for the same prompt and picks the better one.
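
The pairwise workflow described above can be sketched roughly as follows. This is not the internal tool, just a minimal illustration under stated assumptions: completions are already collected in a plain dict (the S3 and model-calling steps are left out), and `make_pairs` and `win_rates` are hypothetical helper names.

```python
import itertools
import random
from collections import Counter


def make_pairs(completions, rng):
    """Build blind comparison tasks from {prompt: {model_name: completion_text}}.

    For every prompt, every pair of models is compared once. Which model
    appears on the left is randomized so evaluators cannot learn positions.
    """
    tasks = []
    for prompt, by_model in completions.items():
        for m1, m2 in itertools.combinations(sorted(by_model), 2):
            left, right = (m1, m2) if rng.random() < 0.5 else (m2, m1)
            tasks.append({
                "prompt": prompt,
                "left": left,            # model names stay hidden from the evaluator
                "right": right,
                "left_text": by_model[left],
                "right_text": by_model[right],
            })
    rng.shuffle(tasks)
    return tasks


def win_rates(tasks, judgments):
    """Turn evaluator picks into a per-model win rate.

    judgments is a list of "left" or "right", parallel to tasks.
    """
    wins, games = Counter(), Counter()
    for task, choice in zip(tasks, judgments):
        wins[task[choice]] += 1
        games[task["left"]] += 1
        games[task["right"]] += 1
    return {model: wins[model] / games[model] for model in games}
```

An evaluator would be shown only `prompt`, `left_text`, and `right_text` for each task; the model names are joined back in afterwards by `win_rates`. With enough judgments, the same picks could also feed a rating system instead of raw win rates.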

kcorbitt · 2y ago

It depends -- do you mean as a general end-user of a chat platform or do you mean to include a model as part of an app or service?

As an end user, what I've found works in practice is to use one of the models until it gives me an answer I'm unhappy with. At that point I'll try another model and see whether the response is better. Do this for long enough and you'll get a sense of the various models' strengths and weaknesses (although the tl;dr is that, if you're willing to pay, GPT-4 is better than anything else across most use cases right now).

For evaluating models for app integrations, I can plug an open source combined playground + eval harness I'm currently developing: https://github.com/openpipe/openpipe

We're working on integrating Llama 2 so users can test it against other models for their own workloads head to head. (We're also working on a hosted SaaS version so people don't have to download/install Postgres and Node!)
