famouswaffles 2y ago · 0 points

GPT-4's zero-shot HumanEval score was 67%

4 comments · 1 top-level

lhl 2y ago · 3 in thread

That's what the Technical Report (https://arxiv.org/pdf/2303.08774v3.pdf) says, but GPT-4's (reproducible) performance out in the wild appears to be much higher now. Testing from 3/15 (presumably on the 0314 model) put it at 85.36% (https://twitter.com/amanrsanger/status/1635751764577361921). And the linked paper from my post (https://doi.org/10.48550/arXiv.2305.01210) got a pass@1 of 88.4 from GPT-4 recently (May? June?).
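For anyone comparing these numbers, here is a minimal sketch of the unbiased pass@k estimator commonly used for HumanEval-style benchmarks. It assumes you have already generated `n` samples for a problem and counted how many (`c`) pass the unit tests; the function name and argument names are my own, not taken from any of the linked harnesses, and different harnesses may sample or score differently.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: samples generated for the problem
    c: samples that passed all tests
    k: budget of attempts being scored
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # If fewer than k samples failed, every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # 1 minus the probability that a random size-k subset is all failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds (n, c) per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With `k=1` this reduces to the fraction of samples that pass, so a single greedy sample per problem gives a per-problem score of 0 or 1, and the benchmark score is just the share of problems solved.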

Out of curiosity, I tried gpt-4-0613 and claude-v2 with https://github.com/getcursor/eval, but sadly both runs hang at 3% (maybe hitting rate limits?).
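If rate limits are the cause (a guess, not a diagnosis), one generic workaround is to wrap each API call in a retry with exponential backoff and jitter. This is only a sketch: `call_with_backoff` and its parameters are hypothetical and not part of getcursor/eval or any vendor SDK, and a real harness would catch the client's specific rate-limit exception rather than a bare `Exception`.

```python
import random
import time


def call_with_backoff(fn, max_retries=5, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep):
    """Call fn(), retrying with exponential backoff when it raises.

    Hypothetical helper: fn is any zero-argument callable that performs
    one API request. The sleep parameter is injectable for testing.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:  # assumption: narrow this to the rate-limit error
            if attempt == max_retries - 1:
                raise  # out of retries, surface the last error
            # Delay doubles each attempt, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads out retries from concurrent workers.
            sleep(delay * random.uniform(0.5, 1.0))
```

Pairing this with a per-request timeout in the HTTP client would turn a silent hang into an exception that the retry loop can actually see.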

gcr 2y ago

Do we have evidence that OpenAI is making new versions of GPT-4 available? The training data presumably hasn't changed since 2021 and the model is absurdly expensive to train; there's little incentive for them to keep touching it up.

lhl 2y ago

Well, there's OpenAI's release notes for one: https://help.openai.com/en/articles/6825453-chatgpt-release-...

Pre-training a foundation model is the "absurdly expensive" part you're thinking of, but fine-tunes are extremely cheap and are undoubtedly being done constantly. (You can see just how cheap by looking at the papers for Alpaca, Vicuna, Koala, etc.) Fine-tuning costs for smaller models dropped from about $600 to $10. Guanaco, using QLoRA, fine-tuned llama-65b in about 1 day on a single GPU.

Another way to test this empirically, btw, is to search for all the articles pointing out what ChatGPT (3 or 4) gets wrong. I recently tested those when looking for evals, and it now answers the large majority of them correctly (maybe 80-90%).

jiggawatts 2y ago

The issue with all of the chat-optimised LLMs is that they can’t be incrementally updated.

After the base training there are three separate sets of additional training: to align the model, to convince it to do question-response, and to improve quality via feedback.

If you update the original base model, then all the tuning steps need to be repeated.

For a model the size of GPT-4 this is expensive and slow, which is why OpenAI hasn’t bothered.
