undefined | Better HN

0 pointsXCSme1mo ago0 comments

The main reason is that Claude models tend to ignore instructions. There is a failure example on the Methodology page.

0 comments

> It is not my fault if Claude outputs something like "*1*, *1*", adding markdown highlighting, when most other models respect the required format correctly.

Yuck. At that point don't publish a benchmark, explains why their results are useless too.

Edit since I'm not able to reply to the below comment:

"I want structured output from a model that supports structured output but will not enable structured output, nor ask for an existing format like XML or JSON" is not really an interesting thing to benchmark, and that's reflected in how you have Gemini 2.5 Flash beating GPT-5.4.

I really hope no one reads that list and thinks it's an AI leaderboard in any generalizable sense.

XCSmeOP1mo ago

Why not? I described this in more detail in other comments.

Even when using structured output, sometimes you want to define how the data should be displayed or formatted, especially for cases like chat bots, article writing, tool usage, calling external api's, parsing documents, etc.

Most models get this right. Also, this is just one failure mode of Claude.

BoorishBears1mo ago

Like I said in the edit, when people want specific formatting they ask for well known formats: Markdown, XML, JSON

I don't even need to debate if the benchmark is useful, it doesn't pass a sniff test: GPT-5.4 is not worse than Gemini 2.5 Flash in any way that matters to most users. In your benchmark it's meaningfully worse.

XCSmeOP1mo ago

The questions do ask specifically to respond with the answer only, with an example format given in many cases.

Note that all reasoning models are tested with "medium" reasoning.

The benchmarks are questions/data processing tasks that an average user will likely ask, not coding questions (I didn't add any coding tests yet).

Gemini models also tend to be very consistent. Asking the same question will likely give the same result.

The two models you mention scored the same, the only difference is that Gemini was better at domain-specific questions (i.e. you ask something quite technical/niche).

1 more reply

j / k navigate · click thread line to collapse

0 comments

BoorishBears1mo ago

> It is not my fault if Claude outputs something like "*1*, *1*", adding markdown highlighting, when most other models respect the required format correctly.

Yuck. At that point don't publish a benchmark, explains why their results are useless too.

Edit since I'm not able to reply to the below comment:

I really hope no one reads that list and thinks it's an AI leaderboard in any generalizable sense.

XCSmeOP1mo ago

Why not? I described this in more detail in other comments.

Most models get this right. Also, this is just one failure mode of Claude.

BoorishBears1mo ago

Like I said in the edit, when people want specific formatting they ask for well known formats: Markdown, XML, JSON

XCSmeOP1mo ago

The questions do ask specifically to respond with the answer only, with an example format given in many cases.

Note that all reasoning models are tested with "medium" reasoning.

The benchmarks are questions/data processing tasks that an average user will likely ask, not coding questions (I didn't add any coding tests yet).

Gemini models also tend to be very consistent. Asking the same question will likely give the same result.

The two models you mention scored the same, the only difference is that Gemini was better at domain-specific questions (i.e. you ask something quite technical/niche).

1 more reply

j / k navigate · click thread line to collapse