1. GPT-5.5 low vs. medium vs. high vs. xhigh: the reasoning curve on 26 real tasks (stet.sh) | 2 | bisonbear | 2d ago | 0
2. GPT-5.5 vs. GPT-5.4 vs. Opus 4.7 on 56 real coding tasks from 2 open source repos (stet.sh) | 4 | bisonbear | 9d ago | 0
3. I ran Opus 4.7 vs. Old Opus 4.6 vs. New Opus 4.6 on 28 Zod tasks (stet.sh) | 2 | bisonbear | 23d ago | 0
4. Coding evals are broken. CI is green while AI code quality goes unmeasured (stet.sh) | 1 | bisonbear | 25d ago | 0
5. Agents.md is the highest-leverage code you're not testing (stet.sh) | 1 | bisonbear | 1mo ago | 0