2. ClawsBench shows GPT-5.4 tries to reward hack 80% of the time (arxiv.org) | 3 points by xdotli 2mo ago | 1 comment
4. Native CLI scaffolds consistently outperform OpenCode when using the same model (arxiv.org) | 1 point by xdotli 3mo ago | 1 comment
6. Automatically Learning Skills for Coding Agents (gepa-ai.github.io) | 4 points by xdotli 4mo ago | 0 comments
7. We Reached 74.8% on terminal-bench with Terminus-KIRA (krafton-ai.github.io) | 2 points by xdotli 4mo ago | 0 comments
8. Self-generated skills don't do much for AI agents, but human-curated skills do (theregister.com) | 2 points by xdotli 4mo ago | 3 comments
9. First Agent Skills Hackathon by the Authors of SkillsBench (skillathon.ai) | 2 points by xdotli 4mo ago | 1 comment
11. GPT-5.2 got worse on Terminal Bench 2.0, so is GPT-5.2 Pro (twitter.com) | 1 point by xdotli 6mo ago | 1 comment
13. Show HN: Chat with Claude Code on iMessage with Instaline (twitter.com) | 2 points by xdotli 9mo ago | 4 comments
14. Show HN: PokemonGym – 387 milestones designed to test agents and LLMs (twitter.com) | 1 point by xdotli 1y ago | 0 comments
15. Show HN: BenchFlow – run AI benchmarks as an API (github.com) | 24 points by xdotli 1y ago | 1 comment