undefined | Better HN

0 pointslambdaba2y ago0 comments

Claude 3 has *decisively* beat GPT-4, I wonder how all their attributes compare.

0 comments

Has it, though? LMSys Arena Leaderboard (blind ranking by humans) [0] positions Opus just below GPT-4 with a negligible ELO gap.

[0] https://chat.lmsys.org/

espadrine2y ago

A number of AI companies have a naming/reproducibility issue.

GPT4 Turbo, released last November, is a separate version that is much better than GPT-4 (winning 70% of human preferences in blind tests), released in March 2023.

Claude 3 Opus beats release-day GPT-4 (winning 60% of human preferences), but not GPT-4 Turbo.

In the LMSys leaderboard, release-day GPT-4 is labeled gpt-4-0314, and GPT4 Turbo is labeled gpt-4-1106-preview.

staticman22y ago

That "blind ranking" is limited to about 2,000 tokens of context. So it's certainly not evaluating how good the models are at complex assignments.

BoorishBears2y ago

Chatbot Arena is not a blind ranking.

Many, if not most, users intentionally ask the models questions to tease out their canned disclaimers: so they know exactly which model is answering.

On one hand it's fair to say disclaimers affect the usefulness of the model, but on the other I don't think most people are solely asking these LLMs to produce meth or say "fuck", and that has an outsized effect on the usefulness of Chatbot Arena as a general benchmark.

I personally recommend people use it at most as a way to directly test specific LLMs and ignore it as a benchmark.

swalsh2y ago

I don't know if Claude is "smarter" in any significant way. But its harder working. I can ask it for some code, and I never get a placeholder. It dutifully gives me the code I need.

lambdabaOP2y ago

It understands instructions better, it's rarer to have it misunderstand, and I have to be less careful with prompting.

stainablesteel2y ago

i like some of claudes answers better, but it doesnt seem to be a better coder imo

simonw2y ago

I've found it to be significantly better for code than GPT-4 - I've had multiple examples where the GPT-4 solution contained bugs but the Claude 3 Opus solution was exactly what I wanted. One recent example: https://fedi.simonwillison.net/@simon/112057299607427949

How well models work varies wildly according to your personal prompting style though - it's possible I just have a prompting style which happens to work better with Claude 3.

asciii2y ago

> according to your personal prompting style though

I like the notion of someone’s personal prompting style (seems like a proxy for those that can prepare a question with context about the other’s knowledge) - that’s interesting for these systems in future job interviews

bugglebeetle2y ago

What is your code prompting style for Claude? I’ve tried to repurpose some of my GPT-4 ones for Claude and have noticed some degradation. I use the “Act as a software developer/write a spec/implement step-by-step” CoT style.

2 more replies

furyofantares2y ago

I've found it significantly better than GPT4 for code and it's become my go-to for coding.

That's actually saying something, because there's also serious drawbacks.

- Feels a little slower. Might just be UI

- I have a lot of experience prompting GPT4

- I don't like using it for non-code because it gives me to much "safety" pushback

- No custom instructions. ChatGPT knows I use macos and zsh and a few other preferences that I'd rather not have to type into my queries frequently

I find all of the above kind of annoying and I don't like having two different LLMs I go to daily. But I mention it because it's a fairly significant hurdle it had to overcome to become the main thing I use for coding! There were a number of things where I gave up on GPT then went to Claude and it did great; never had the reverse experience so far and overall just feels like I've had noticeably better responses.

htrp2y ago

citation needed (other than 'vibes')

1 more reply

j / k navigate · click thread line to collapse