DeepSWE Measuring frontier coding agents

(deepswe.datacurve.ai)

2 points · e2e4 · 25d ago · 1 comment



e2e4 (OP) · 25d ago

gpt-5.5xhigh leading the benchmark coincides with my recent experience. I've been an Opus 4.7 user, but it burns through tokens so quickly that I gave gpt-5.5xhigh (via Codex) a try. Quality was similar (if not better), and the tokens lasted a lot longer.
