1. The price is halved for a better-performing model. A 1000x1000 image costs about $0.003 (rough math in the sketch below).
2. Cognitive ability on visuals went up sharply. https://github.com/kagisearch/llm-chess-puzzles
It solves twice as many puzzles despite being only a minor update. It could just be better trained on chess, but it would be amazing if this carried over to the medical field as well. I might use it as a budget art director too - it's better at telling apart subtle changes in color and dealing with highlights.
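On the per-image figure in point 1: here's a back-of-the-envelope sketch, assuming OpenAI's documented tile-based vision pricing (a flat base token count plus a fixed count per 512x512 tile after resizing) and an input price of $5 per million tokens. The constants are my assumptions pulled from that documentation, not something stated in this thread.

```python
import math

BASE_TOKENS = 85          # assumed flat cost per image
TOKENS_PER_TILE = 170     # assumed cost per 512x512 tile
PRICE_PER_MTOK = 5.00     # assumed USD per 1M input tokens

def image_cost(width: int, height: int) -> float:
    # Fit within 2048x2048, then scale the shortest side down to 768 (no upscaling).
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    tokens = BASE_TOKENS + TOKENS_PER_TILE * tiles
    return tokens * PRICE_PER_MTOK / 1_000_000

print(f"${image_cost(1000, 1000):.4f}")  # ~$0.0038 - a fraction of a cent per image
```

Under those assumptions a 1000x1000 image lands at 765 tokens, i.e. roughly the $0.003 ballpark quoted above.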
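And for anyone unsure what the linked puzzle benchmark actually measures: a minimal sketch of the idea, assuming a Lichess-style FEN-plus-solution puzzle format. This is my reading of the concept, not the repo's actual code; `ask_model` is a hypothetical stand-in for whatever API call it uses.

```python
import chess

def ask_model(fen: str) -> str:
    """Hypothetical: prompt the LLM with the FEN and return its answer as a UCI move."""
    raise NotImplementedError

def score_puzzle(fen: str, solution_uci: str) -> bool:
    # Give the model the position, then check its move against the known best move.
    board = chess.Board(fen)
    try:
        move = chess.Move.from_uci(ask_model(fen))
    except ValueError:
        return False  # an unparseable answer counts as a miss
    return move in board.legal_moves and move.uci() == solution_uci
```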
(Providing the history to GPT-4 Turbo results in it answering the MCQ just fine).
These benchmarks are really missing the mark, and I hope people here are smart enough to do their own testing or rely on tests with a much bigger variety of tasks if they want to measure overall performance, because we're currently at a point where the big 3 (GPT, Claude, Gemini) each have tasks where they beat the other two.
They're best tested on the kinds of tasks you would give humans. GPT-4 is still the best contender on AP Biology, which is a legitimately difficult benchmark.
GPT tends to work with whatever you throw at it, while Gemini just hides behind arbitrary benchmarks. If there are tasks that some models are better at than others, then by all means let's highlight them rather than acting defensive when another model does much better at a certain task.
It just works.
Just like how the iPhone had nothing new in it - all the tech had been demoed years ago.