1. The price is halved for a better-performing model. A 1000x1000 image costs about $0.003 (rough math in the sketch below).
2. Cognitive ability on visuals went up sharply. https://github.com/kagisearch/llm-chess-puzzles
It solves twice as many puzzles despite being only a minor update. It could just be better trained on chess, but it would be amazing if this carried over to the medical field as well. I might use it as a budget art director too - it's better at telling apart subtle changes in color and dealing with highlights.
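On the per-image figure in point 1: here's a back-of-the-envelope sketch, assuming OpenAI's documented tile-based vision pricing (a flat base token count plus a fixed count per 512x512 tile after resizing) and an input price of $5 per million tokens. The constants are my assumptions pulled from that documentation, not something stated in this thread.

```python
import math

BASE_TOKENS = 85          # assumed flat cost per image
TOKENS_PER_TILE = 170     # assumed cost per 512x512 tile
PRICE_PER_MTOK = 5.00     # assumed USD per 1M input tokens

def image_cost(width: int, height: int) -> float:
    # Fit within 2048x2048, then scale the shortest side down to 768 (no upscaling).
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    tokens = BASE_TOKENS + TOKENS_PER_TILE * tiles
    return tokens * PRICE_PER_MTOK / 1_000_000

print(f"${image_cost(1000, 1000):.4f}")  # ~$0.0038 - a fraction of a cent per image
```

Under those assumptions a 1000x1000 image lands at 765 tokens, i.e. roughly the $0.003 ballpark quoted above.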
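And for anyone unsure what the linked puzzle benchmark actually measures: a minimal sketch of the idea, assuming a Lichess-style FEN-plus-solution puzzle format. This is my reading of the concept, not the repo's actual code; `ask_model` is a hypothetical stand-in for whatever API call it uses.

```python
import chess

def ask_model(fen: str) -> str:
    """Hypothetical: prompt the LLM with the FEN and return its answer as a UCI move."""
    raise NotImplementedError

def score_puzzle(fen: str, solution_uci: str) -> bool:
    # Give the model the position, then check its move against the known best move.
    board = chess.Board(fen)
    try:
        move = chess.Move.from_uci(ask_model(fen))
    except ValueError:
        return False  # an unparseable answer counts as a miss
    return move in board.legal_moves and move.uci() == solution_uci
```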
(Providing the history to GPT-4 Turbo results in it answering the MCQ just fine).
These benchmarks are really missing the mark, and I hope people here are smart enough to do their own testing or rely on tests with a much bigger variety of tasks if they want to measure overall performance, because we're currently at a point where the big 3 (GPT, Claude, Gemini) each have tasks where they beat the other two.
They're best tested on the kinds of tasks you would give humans. GPT-4 is still the best contender on AP Biology, which is a legitimately difficult benchmark.
GPT tends to work with whatever you throw at it, while Gemini just hides behind arbitrary benchmarks. If there are tasks that some models are better at than others, then by all means let's highlight them rather than acting defensive when another model does much better at a certain task.
It just works.
Just like how the iPhone had nothing new in it - all the tech had been demoed years ago.