GPT-4-turbo preliminary benchmark results on code-editing (opens in new tab)

(aider.chat)

74 pointsheliophobicdude2y ago91 comments

91 comments

55 comments · 10 top-level

exo-pla-net2y ago· 10 in thread

So it appears that GPT-4-Turbo is indeed (at least marginally) smarter than the previous GPT-4, just as Altman claimed. Also, it's faster and cheaper, with a massive context window. Exciting!

berkut2y ago

I haven't tried it yet, but people in the /r/chatgpt subreddit are claiming GPT-4-Turbo seems to have issues with understanding/remembering longer (say 100 lines) of code, whereas 3.5 and 4.0 seem to have handled things a bit better, implying that the context-window size isn't (currently) as large as claimed.

Anyone else seeing any evidence of this?

ramblerman2y ago

Some specialized subreddits can be incredibly useful. /r/chatgpt due to its popularity is not one of those.

It's full of memes and people complaining its not as "good" as it was yesterday when it fails at completing their homework.

I would take anything said there with a big grain of salt, and stick to benchmarks.

bbahn2y ago

The context window IS longer, but it's less powerful. Obviously, they can't afford to have full transformer context over the entire context. That would be an impossibly large amount of ram. They're using some combination of sliding window/cyclical/or some other adjusted attention mechanism likely with some degree of summarization in some manner.

japhyr2y ago

The increased context size will have the most significant impact on my work. That's where I run into limitations, when reviewing written work and code. I've been feeding written work into GPT in chunks, and I'm really happy to be able to feed in whole pieces. (I don't have it revise anything for me, I just have a specific prompt for exactly the kind of feedback I want on written work.)

I tried Claude because of the larger context size, but I've been disappointed so far. I find Claude much more likely to just compliment my writing, whereas GPT will identify strengths and areas that could be improved.

pests2y ago

Have you tried aider-chat? It does some interesting things with tree-sitter so it can give the LLM a context (files, classes, functions, parameters, etc) as well as certain full files. That way it has your entire codebase in API form and it can focus on the actual code you are looking at or editing.

doctoboggan2y ago

Yes I was really worried this was going to be "better" as in faster and cheaper but not "better" as in smarter. I've been playing with it for the past day and haven't noticed it any smarter per se, but also haven't noticed it dummer either.

kridsdale32y ago

It's a big surprise. I assumed something would have to have been "lobotomized" to make this speed increase. I'd love to know what they did.

bigyikes2y ago

Well, the knowledge cutoff is much more recent (I think sama said April 2023?), so having more, newer data might be a significant contributing factor.

Racing04612y ago

If this is the case, why use the word turbo in the name which has the baggage of faster but worse reasoning.

jquery2y ago

Perhaps it may have worse reasoning for some tasks, so having "turbo" lets people know they can still try the old version if turbo doesn't work for them? Kind of like 3.5 vs 3.5 Turbo.

vouaobrasil2y ago· 10 in thread

Programmers here seem excited about the potential of this new version...but I can't help but wonder at how naive this attitude really is. Even if AI never becomes intelligent like us, if it can emulate this intelligence in enough domains, then it has a serious chance of being dangerous. It's already pretty much guaranteed that it will put almost everyone out of a job, turning the vast majority of humans into content-consuming sloths.

Does it really make sense to play with this kind of power?

p1necone2y ago

It bothers me a little bit as a programmer, but the rational part of my mind is aware that there's literally never been a major technological advancement in human history that didn't ultimately result in more people being employed, not less. Improvements in productivity just result in people realizing there's a whole new world of stuff they can now build that wasn't feasible before, and the market will eat that up.

The shape of my career might change, but I doubt I'll be unable to find a job.

TerrifiedMouse2y ago

> there's literally never been a major technological advancement in human history that didn't ultimately result in more people being employed, not less.

In the past, new automation technologies often open up new possibilities in production capabilities in turn creating new jobs - specifically jobs that have not been automated yet.

AI though promises to be the universal automation, i.e. it can do any job. Thus even if new jobs show up, they will be taken over by AI too.

Then what?

> The shape of my career might change, but I doubt I'll be unable to find a job.

Question you should ask is why would anyone hire you when they can get AI to do the same job.

2 more replies

jquery2y ago

Ask horses how they're doing these days... just because we've always found a use for humans because of their unique cognitive abilities doesn't mean we always will find a use.

I think it's critical to be thinking about how to make sure wealth isn't simply funneled to a few capitalists who own everything simply by virtue of them being first, because it seems that's the future we're heading for if we aren't careful.

I think you, and I, and everyone on HN will be fine (more or less...) but I am worried about a wide swath of people who will get "left behind."

1 more reply

jorgemf2y ago

You are assuming that the whole existence of humanity is to work? because, without working, they would be sloths? What about expending more time having healthy habits like working out, meeting more often with family and friends, discovering the world, learning new stuff? So retired people are just sloths?

TerrifiedMouse2y ago

I’m more worried about people not being able to feed themselves because their labor became worthless. They will effectively be frozen out of the economy as they have nothing to trade with.

2 more replies

ketzo2y ago

It’s way too late to pretend AI won’t have an impact on programming as a profession.

Better to be excited and learn the tool as it develops than to stick your head in the sand.

A world where AI has put literally everyone “out of a job,” meanwhile, is still so far from our current reality that IMO, it’s not worth making practical day-to-day decisions on unless you are directly involved in the development or regulation of AI.

TerrifiedMouse2y ago

> meanwhile, is still so far from our current reality

2 years ago something like ChatGPT (as limited as it is) was “far from our current reality”.

I think it’s worthwhile to think ahead.

1 more reply

vouaobrasil2y ago

That is exactly the prisoner's dilemma. Besides, I already quit being a programmer this year :)

1 more reply

Turing_Machine2y ago

Analogous viewpoint in the Eighteenth Century: "90% of the human race consists of agricultural laborers, mostly unfree (slaves or serfs). Steam-powered machinery will put almost all of those people out of work, turning the vast majority of humans into food-consuming sloths".

Somehow that didn't happen, though.

an_aparallel2y ago

it didnt happen?? not sure if you missed the /s ??

1 more reply

xeckr2y ago· 6 in thread

Back in April it would only generate a handful of tokens per second. The speed improvements for GPT-4 are staggering. I wonder how much of it is because Microsoft is making GPUs rain on OpenAI, and how much of it is due to improvements to the model and its scaffolding.

kridsdale32y ago

Perhaps we owe it to Nvidia. Perhaps a huge batch of H100 arrived and the premium models can run there instead of the A100.

jiggawatts2y ago

H100 + quantisation + algorithmic improvements would be sufficient to explain the speed boost.

If you "have enough compute" available -- which OpenAI definitely does -- the best current technique is to use mixed precision with post-quantisation fine tuning to restore performance. That's most probably how all of the "turbo" models work. Take a model that was initially 16 or 32 bits per parameter during training, quantise it down to a mixture of 4, 8, and 16 bits, and then fix it up with an additional training pass that uses the original full-fat model's predictions as the loss function. With access to the raw parameters, it's possible to do this training such that all of the output weights are considered and adjusted during this phase instead of just the top word. Third parties fine-tuning against GPT4 chats can't do this, even with the collected samples, because they only have individual selected tokens/words instead of the full probability distribution.

1 more reply

smodad2y ago

Based on rumors, it seems that because the demand is so high and supply so short, that Nvidia is having to select who gets the cards. I bet they're likely thinking about who can do the most impactful work and OpenAI would definitely be in the running for the short list of companies who are actually shipping. So I bet OpenAI / Microsoft got a lot of the newest cards.

1 more reply

andy_xor_andrew2y ago

I think the general (completely speculative and unconfirmed) consensus is that "Turbo" models are somehow quantized, or otherwise modified for much faster and cheaper inference, at the expense of some quality (the definition of "some" is unknown here).

well, that's how it was for GPT-3.5 anyway. The "turbo" flavor was faster and cheaper, but seemed to have slightly worse output (again, this is all going by subjective measurement; it could entirely be the imagination of AI bros)

xeckr2y ago

I agree that this is the case for GPT-3.5, at least subjectively. However, with GPT-4 Turbo, it seems that performance has improved. If they got it to be faster using quantization, then they must have also found a way to offset any resulting performance losses.

letitgo123452y ago

They have tons of usage data by now to figure out which queries to devote model capacity to

Racing04612y ago· 6 in thread

reddit thread on the opposite experience - https://www.reddit.com/r/ChatGPT/comments/17prwlg/gpt4_turbo...

a_wild_dandan2y ago

We can always count on Reddit to provide the cold, hard anecdotes.

crooked-v2y ago

From the anecdotes there, it sounds a whole lot like something is going on where it's internally doing a summarizing step to fit into a much smaller actual context window than 128K.

vunderba2y ago

That Reddit user mentions something that I have noticed happening with infuriatingly increasing frequency in the last few weeks (at least on the chat GPT interface) - the "fill in the rest here" or "remainder of code" snipped sections even when you EXPLICITLY instruct ChatGPT to include the fully modified code in its responses.

It's like the AI equivalent of "the rest is left as an exercise to the reader" you'd find in old textbooks.

Racing04612y ago

Agreed, this is single handedly making me hate chatgpt for coding. I wish OpenAI would respect it's paying users enough to say what the changes and limitations are. (but i guess like google, i am not openai's customer. Their investors are. i am the product).

gpt4 should really be called DecartesGPT with this bs.

replwoacause2y ago

Same. It hasn’t blocked me from learning with it but it is a hindrance.

meiraleal2y ago

This is so annoying. And it is getting worse. And a great way for OpenAI to hinder competition and then copy the products being developed using their APIs. This is openai cheating on their userbase.

jpdus2y ago· 5 in thread

For other (non-code) benchmarks, people are having the opposite experience:

"I benchmarked on SAT reading, which is a nice human reference for reasoning ability. Took 3 sections (67 questions) from an official 2008-2009 test (2400 scale) and got the following results, here a SAT-like test:

- GPT3.5 - 690 (10 wrong) - GPT4 - 770 (3 wrong) - GPT4-turbo (one section at time) - 740 (5 wrong) - GPT4-turbo (3 sections at once, 9K tokens) - 730 (6 wrong)"

Source: https://twitter.com/wangzjeff/status/1721934560919994823?t=P...

dazzaji2y ago

Does anybody know if 2008-2009 SAT is in the training set for these models? Assuming so, I’d be especially interested in head-to-head evals on this type of non-code benchmark for problem sets not already in the training data, to see how it performs on fresh situations.

rafaelero2y ago

Probably not a statistically significant difference there.

exo-pla-net2y ago

N=1

Terretta2y ago

What did you mean by "opposite"?

You seem to be suggesting it got a bit worse, and the aider article seems to suggest gpt4 got a bit worse, although much faster at being a bit worse, while gpt3.5 got worse, then better, while faster.

reitzensteinm2y ago

The Aider article has been updated with the complete results. Previously Turbo was leading slightly. So far any difference is in the noise.

However, in my opinion the first attempt score is more important, and Turbo does genuinely seem to lead there. There's still a possibility the updated training data has tainted the results.

cloudking2y ago· 5 in thread

Has anyone been able to access the 128k context window? I'm not seeing that option in the API playground

throw031720192y ago

Yes, it’s fantastic. It is called gpt-4-1106-preview

infecto2y ago

You might want to double check. The latest turbo models for 4 and 3.5 don’t differentiate. There is only one model and it has 128k context.

kridsdale32y ago

I wonder if 3.5 / 4 is really just "agents are on" or not.

thequadehunter2y ago

The reason he's not seeing it is because it's only available as an API endpoint. It doesn't show up in playground.

3 more replies

johnxie2y ago

Yes, it should be under gpt-4-1106-preview.

Racing04612y ago· 3 in thread

Is this just the api or does it work on chatgpt also?

adamsmark2y ago

You can try it out in the playground but not the consumer ChatGPT website. You will be limited to 10K tokens per minute if you are not in any spending tier e.g $150 per month will give you bigger context window.

dmm2y ago

I haven't seen any changes in the chatgpt plus interface yet.

replwoacause2y ago

Me either

meiraleal2y ago

The past days ChatGPT went from a great pair programming helper to a useless antipathetic intern, the quality of generated code dropped visibly. The context seems to be bigger in the chatgpt plus version too but it got dumber.

ttul2y ago

The progress here is remarkable. A year ago, we didn’t even have ChatGPT. LLM completions were cool but so hard to use and definitely there was nothing accessible to non-nerds.

kristianp2y ago

Aider sounds like a cool tool, I'll have to try it out. I'm assuming it makes use of your local files and edits them for you?

Are there any other programming assistant packages that use the chatgpt api like this?

Regarding rate limits, it might be an idea to have configurable delays built in to the testing code to prevent hitting limits.

j / k navigate · click thread line to collapse

91 comments

55 comments · 10 top-level

exo-pla-net2y ago· 10 in thread

So it appears that GPT-4-Turbo is indeed (at least marginally) smarter than the previous GPT-4, just as Altman claimed. Also, it's faster and cheaper, with a massive context window. Exciting!

berkut2y ago

Anyone else seeing any evidence of this?

ramblerman2y ago

Some specialized subreddits can be incredibly useful. /r/chatgpt due to its popularity is not one of those.

It's full of memes and people complaining its not as "good" as it was yesterday when it fails at completing their homework.

I would take anything said there with a big grain of salt, and stick to benchmarks.

bbahn2y ago

japhyr2y ago

pests2y ago

doctoboggan2y ago

kridsdale32y ago

It's a big surprise. I assumed something would have to have been "lobotomized" to make this speed increase. I'd love to know what they did.

bigyikes2y ago

Well, the knowledge cutoff is much more recent (I think sama said April 2023?), so having more, newer data might be a significant contributing factor.

Racing04612y ago

If this is the case, why use the word turbo in the name which has the baggage of faster but worse reasoning.

jquery2y ago

Perhaps it may have worse reasoning for some tasks, so having "turbo" lets people know they can still try the old version if turbo doesn't work for them? Kind of like 3.5 vs 3.5 Turbo.

vouaobrasil2y ago· 10 in thread

Does it really make sense to play with this kind of power?

p1necone2y ago

The shape of my career might change, but I doubt I'll be unable to find a job.

TerrifiedMouse2y ago

> there's literally never been a major technological advancement in human history that didn't ultimately result in more people being employed, not less.

In the past, new automation technologies often open up new possibilities in production capabilities in turn creating new jobs - specifically jobs that have not been automated yet.

AI though promises to be the universal automation, i.e. it can do any job. Thus even if new jobs show up, they will be taken over by AI too.

Then what?

> The shape of my career might change, but I doubt I'll be unable to find a job.

Question you should ask is why would anyone hire you when they can get AI to do the same job.

2 more replies

jquery2y ago

Ask horses how they're doing these days... just because we've always found a use for humans because of their unique cognitive abilities doesn't mean we always will find a use.

I think you, and I, and everyone on HN will be fine (more or less...) but I am worried about a wide swath of people who will get "left behind."

1 more reply

jorgemf2y ago

TerrifiedMouse2y ago

I’m more worried about people not being able to feed themselves because their labor became worthless. They will effectively be frozen out of the economy as they have nothing to trade with.

2 more replies

ketzo2y ago

It’s way too late to pretend AI won’t have an impact on programming as a profession.

Better to be excited and learn the tool as it develops than to stick your head in the sand.

TerrifiedMouse2y ago

> meanwhile, is still so far from our current reality

2 years ago something like ChatGPT (as limited as it is) was “far from our current reality”.

I think it’s worthwhile to think ahead.

1 more reply

vouaobrasil2y ago

That is exactly the prisoner's dilemma. Besides, I already quit being a programmer this year :)

1 more reply

Turing_Machine2y ago

Somehow that didn't happen, though.

an_aparallel2y ago

it didnt happen?? not sure if you missed the /s ??

1 more reply

xeckr2y ago· 6 in thread

kridsdale32y ago

Perhaps we owe it to Nvidia. Perhaps a huge batch of H100 arrived and the premium models can run there instead of the A100.

jiggawatts2y ago

H100 + quantisation + algorithmic improvements would be sufficient to explain the speed boost.

1 more reply

smodad2y ago

1 more reply

andy_xor_andrew2y ago

xeckr2y ago

letitgo123452y ago

They have tons of usage data by now to figure out which queries to devote model capacity to

Racing04612y ago· 6 in thread

reddit thread on the opposite experience - https://www.reddit.com/r/ChatGPT/comments/17prwlg/gpt4_turbo...

a_wild_dandan2y ago

We can always count on Reddit to provide the cold, hard anecdotes.

crooked-v2y ago

From the anecdotes there, it sounds a whole lot like something is going on where it's internally doing a summarizing step to fit into a much smaller actual context window than 128K.

vunderba2y ago

It's like the AI equivalent of "the rest is left as an exercise to the reader" you'd find in old textbooks.

Racing04612y ago

gpt4 should really be called DecartesGPT with this bs.

replwoacause2y ago

Same. It hasn’t blocked me from learning with it but it is a hindrance.

meiraleal2y ago

This is so annoying. And it is getting worse. And a great way for OpenAI to hinder competition and then copy the products being developed using their APIs. This is openai cheating on their userbase.

jpdus2y ago· 5 in thread

For other (non-code) benchmarks, people are having the opposite experience:

- GPT3.5 - 690 (10 wrong) - GPT4 - 770 (3 wrong) - GPT4-turbo (one section at time) - 740 (5 wrong) - GPT4-turbo (3 sections at once, 9K tokens) - 730 (6 wrong)"

Source: https://twitter.com/wangzjeff/status/1721934560919994823?t=P...

dazzaji2y ago

rafaelero2y ago

Probably not a statistically significant difference there.

exo-pla-net2y ago

N=1

Terretta2y ago

What did you mean by "opposite"?

reitzensteinm2y ago

The Aider article has been updated with the complete results. Previously Turbo was leading slightly. So far any difference is in the noise.

However, in my opinion the first attempt score is more important, and Turbo does genuinely seem to lead there. There's still a possibility the updated training data has tainted the results.

cloudking2y ago· 5 in thread

Has anyone been able to access the 128k context window? I'm not seeing that option in the API playground

throw031720192y ago

Yes, it’s fantastic. It is called gpt-4-1106-preview

infecto2y ago

You might want to double check. The latest turbo models for 4 and 3.5 don’t differentiate. There is only one model and it has 128k context.

kridsdale32y ago

I wonder if 3.5 / 4 is really just "agents are on" or not.

thequadehunter2y ago

The reason he's not seeing it is because it's only available as an API endpoint. It doesn't show up in playground.

3 more replies

johnxie2y ago

Yes, it should be under gpt-4-1106-preview.

Racing04612y ago· 3 in thread

Is this just the api or does it work on chatgpt also?

adamsmark2y ago

dmm2y ago

I haven't seen any changes in the chatgpt plus interface yet.

replwoacause2y ago

Me either

meiraleal2y ago

ttul2y ago

The progress here is remarkable. A year ago, we didn’t even have ChatGPT. LLM completions were cool but so hard to use and definitely there was nothing accessible to non-nerds.

kristianp2y ago

Aider sounds like a cool tool, I'll have to try it out. I'm assuming it makes use of your local files and edits them for you?

Are there any other programming assistant packages that use the chatgpt api like this?

Regarding rate limits, it might be an idea to have configurable delays built in to the testing code to prevent hitting limits.

j / k navigate · click thread line to collapse