Better HN

0 points · rybosworld · 2mo ago · 0 comments

No, it wouldn't be enough to falsify.

This isn't an experiment a consumer of the models can actually run. If you get a chance to read the article I linked, you'll see that it is difficult even for the model maintainers (OpenAI, Anthropic, etc.) to look into the model and see what it actually used in its reasoning process. The models will purposefully hide information about how they reasoned. And they will ignore instructions without telling you.

The problem really isn't that LLMs can't get math/arithmetic right sometimes. They certainly can. The problem is that there's a very high probability that they will get the math wrong. Python and similar tools were the answer to that inconsistency.


simianwords · 2mo ago

What do you mean? You can explicitly restrict access to the tools. You are factually incorrect here.

rybosworld (OP) · 2mo ago

I believe you're referring to the tools array? https://developers.openai.com/api/docs/guides/tools/

These are external tools that you are allowing the model to access. There is a suite of internal tools that the model has access to regardless.

The external python tool is there so it can provide the user with python code that they can see.

You can read a bit more about the distinction between the internal and external tool capabilities here: https://community.openai.com/t/fun-with-gpt-5-code-interpret...

"I should explain that both the “python” and “python_user_visible” tools execute Python code and are stateful. The “python” tool is for internal calculations and won’t show outputs to the user, while “python_user_visible” is meant for code that users can see, like file generation and plots."

But really, the most important thing is that we as end-users cannot with any certainty know if the model used python, or didn't. That's what the alignment faking article describes.

simianwords · 2mo ago

> To avoid timeouts, try using background mode. As our most advanced reasoning model, GPT-5 pro defaults to (and only supports) reasoning.effort: high. GPT-5 pro does not support code interpreter.

You are wrong, going by the link you shared: it was about ChatGPT, not the API. The documentation makes it unambiguously clear that GPT-5 pro does not support code interpreter. Unless you think they secretly run it which is a conspiracy, is it enough to falsify?
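
For readers following the "restrict the tools" side of the argument, here is a rough sketch of what that experiment looks like from the caller's seat. It makes no network call and works on plain dicts. The request fields (`model`, `input`, `tools`), the `output` list, and the item type names in `TOOL_CALL_TYPES` are assumptions for illustration, not a verified description of any provider's API, and `example-model` is a made-up name.

```python
from typing import Any

# Assumed item types that would mark a *reported* tool call.
# These names are illustrative, not a verified list.
TOOL_CALL_TYPES = {"code_interpreter_call", "function_call"}


def build_request(model: str, prompt: str) -> dict[str, Any]:
    """Build a request body that grants the model no caller-visible tools."""
    return {"model": model, "input": prompt, "tools": []}


def reported_tool_calls(response: dict[str, Any]) -> list[dict[str, Any]]:
    """Return the output items that the response itself labels as tool calls."""
    return [
        item
        for item in response.get("output", [])
        if item.get("type") in TOOL_CALL_TYPES
    ]


# Hand-written sample response, not real API output.
sample_response = {
    "output": [
        {"type": "reasoning"},
        {"type": "message", "text": "The answer is 84."},
    ]
}

request = build_request("example-model", "What is 12 * (3 + 4)?")
calls = reported_tool_calls(sample_response)
print(request["tools"], len(calls))
```

An empty result here only says that the response reported no tool calls. Whether that counts as falsification is exactly what the rest of the thread disputes.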

theowaway213456 · 2mo ago

> Unless you think they secretly run it which is a conspiracy

tbh this doesn't sound like a conspiracy to me at all. There's no reason why they couldn't have an internal subsystem in their product which detects math problems and hands off the token generation to an intermediate, more optimized Rust program or something, which does math on the cheap instead of burning massive amounts of GPU resources. This would just be a basic cost optimization that would make their models both more effective and cheaper. And there's no reason why they would need to document this in their API docs, because they don't document any other internal details of the model.

I'm not saying they actually do this, but I think it's totally reasonable to think that they would, and it would not surprise me at all if they did.

Let's not get hung up on the "conspiracy" thing though - the whole point is that these models are closed source and therefore we don't know what we are actually testing when we run these "experiments". It could be a pure LLM or it could be a hybrid LLM + classical reasoning system. We don't know.
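
To make the "hybrid" possibility concrete, here is a toy sketch of such a hand-off, written in Python rather than Rust for brevity. Everything in it is hypothetical: `fake_llm` stands in for a model, and `answer` is an invented router, not a description of how any vendor's product works. The point is only that the caller receives a string either way.

```python
import ast
import operator
import re

# Binary operators the toy evaluator accepts.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

# A prompt made only of digits, whitespace, arithmetic symbols and parentheses.
_ARITHMETIC = re.compile(r"[\d\s+\-*/().]+")


def evaluate(expr: str):
    """Evaluate a plain arithmetic expression by walking its syntax tree."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError("unsupported expression")

    return walk(ast.parse(expr, mode="eval"))


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for token generation by a model."""
    return f"[generated text for: {prompt}]"


def answer(prompt: str) -> str:
    """Route arithmetic to the evaluator and everything else to the 'model'."""
    text = prompt.strip()
    if _ARITHMETIC.fullmatch(text) and any(ch.isdigit() for ch in text):
        try:
            return str(evaluate(text))
        except (ValueError, SyntaxError, ZeroDivisionError):
            pass  # Not something the evaluator handles; fall through.
    return fake_llm(prompt)


print(answer("12 * (3 + 4)"))
print(answer("Why is the sky blue?"))
```

Nothing in the return value of `answer` says which path produced it, which is the sense in which an outside observer can't tell a pure LLM from a hybrid system.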


rybosworld (OP) · 2mo ago

Alright let's say I'm wrong about the details/nuances. That's still really not the point.

The point is this:

> we as end-users cannot with any certainty know if the model used python, or didn't

These tools can and do operate in ways opposite to their specific instructions all the time. I've had models make edits to files when I wasn't in agent mode (just chat mode). Chat mode is supposedly a sandboxed environment. So how does that happen? And I am sure we've all seen models plainly disregard an instruction for one reason or another.

The models, like any other software tool, have undocumented features.

You as an end-user cannot falsify the use of a python tool regardless of what the API docs say.

TLDR: Is this enough to falsify: NO
