bisonbear on Hacker News

1

I evaluated GLM 5.2 against the frontier on tasks from real repos

(stet.sh)

2 points by bisonbear 4d ago | 2 comments

2

I benchmarked Opus 4.8 vs. GPT 5.5 on 2 open source repos

(stet.sh)

3 points by bisonbear 21d ago | 0 comments

3

I used autoresearch to improve my AGENTS.md, measured against real tasks

(stet.sh)

8 points by bisonbear 28d ago | 7 comments

4

A brief investigation into the GPT-5.5 regression claims

(stet.sh)

1 point by bisonbear 1mo ago | 0 comments

5

The Opus 4.7 reasoning curve - Medium is the best default?

(stet.sh)

1 point by bisonbear 1mo ago | 0 comments

6

GPT-5.5 low vs. medium vs. high vs. xhigh: the reasoning curve on 26 real tasks

(stet.sh)

2 points by bisonbear 1mo ago | 0 comments

7

GPT-5.5 vs. GPT-5.4 vs. Opus 4.7 on 56 real coding tasks from 2 open source repo

(stet.sh)

4 points by bisonbear 1mo ago | 0 comments

8

I ran Opus 4.7 vs. Old Opus 4.6 vs. New Opus 4.6 on 28 Zod tasks

(stet.sh)

2 points by bisonbear 2mo ago | 0 comments

9

Coding evals are broken. CI is green while AI code quality goes unmeasured

(stet.sh)

1 point by bisonbear 2mo ago | 0 comments

10

Agents.md is the highest-leverage code you're not testing

(stet.sh)

1 point by bisonbear 2mo ago | 0 comments

11

Your AI coding benchmark is hiding a 2x quality gap

(stet.sh)

3 points by bisonbear 3mo ago | 0 comments

12

Things I Learned at the Claude Code NYC Meetup

(benr.build)

2 points by bisonbear 5mo ago | 0 comments

13

Claude vs. Codex in the Messy Middle

(benr.build)

1 point by bisonbear 5mo ago | 0 comments

14

Spacetime as a Neural Network

(benr.build)

11 points by bisonbear 5mo ago | 5 comments

