1. I evaluated GLM 5.2 against the frontier on tasks from real repos (stet.sh) | 2 | bisonbear | 4d ago | 2
2. I benchmarked Opus 4.8 vs. GPT 5.5 on 2 open source repos (stet.sh) | 3 | bisonbear | 21d ago | 0
3. I used autoresearch to improve my AGENTS.md, measured against real tasks (stet.sh) | 8 | bisonbear | 28d ago | 7
4. A brief investigation into the GPT-5.5 regression claims (stet.sh) | 1 | bisonbear | 1mo ago | 0
5. The Opus 4.7 reasoning curve - Medium is the best default? (stet.sh) | 1 | bisonbear | 1mo ago | 0
6. GPT-5.5 low vs. medium vs. high vs. xhigh: the reasoning curve on 26 real tasks (stet.sh) | 2 | bisonbear | 1mo ago | 0
7. GPT-5.5 vs. GPT-5.4 vs. Opus 4.7 on 56 real coding tasks from 2 open source repo (stet.sh) | 4 | bisonbear | 1mo ago | 0
8. I ran Opus 4.7 vs. Old Opus 4.6 vs. New Opus 4.6 on 28 Zod tasks (stet.sh) | 2 | bisonbear | 2mo ago | 0
9. Coding evals are broken. CI is green while AI code quality goes unmeasured (stet.sh) | 1 | bisonbear | 2mo ago | 0
10. Agents.md is the highest-leverage code you're not testing (stet.sh) | 1 | bisonbear | 2mo ago | 0
11. Your AI coding benchmark is hiding a 2x quality gap (stet.sh) | 3 | bisonbear | 3mo ago | 0
12. Things I Learned at the Claude Code NYC Meetup (benr.build) | 2 | bisonbear | 5mo ago | 0