xdotli on Hacker News

1

Frontier Model Training Methodologies

(djdumpling.github.io)

2 | xdotli | 1mo ago | 1

2

ClawsBench shows GPT-5.4 tries to reward hack 80% of the time

(arxiv.org)

3 | xdotli | 2mo ago | 1

3

Chaos of Agent

(agentsofchaos.baulab.info)

1 | xdotli | 3mo ago | 1

4

Native CLI scaffolds consistently outperform OpenCode when using the same model

(arxiv.org)

1 | xdotli | 3mo ago | 1

5

We compare model quality in Cursor

(cursor.com)

2 | xdotli | 3mo ago | 0

6

Automatically Learning Skills for Coding Agents

(gepa-ai.github.io)

4 | xdotli | 4mo ago | 0

7

We Reached 74.8% on terminal-bench with Terminus-KIRA

(krafton-ai.github.io)

2 | xdotli | 4mo ago | 0

8

Self-generated skills don't do much for AI agents, but human-curated skills do

(theregister.com)

2 | xdotli | 4mo ago | 3

9

First Agent Skills Hackathon by the Authors of SkillsBench

(skillathon.ai)

2 | xdotli | 4mo ago | 1

10

The First Agent Skills Benchmark

(huggingface.co)

1 | xdotli | 4mo ago | 1

11

GPT-5.2 got worse on Terminal Bench 2.0, so is GPT-5.2 Pro

(twitter.com)

1 | xdotli | 6mo ago | 1

12

Claude Skills as a Meta Tool

(leehanchung.github.io)

2 | xdotli | 7mo ago | 0

13

Show HN: Chat with Claude Code on iMessage with Instaline

(twitter.com)

2 | xdotli | 9mo ago | 4

14

Show HN: PokemonGym – 387 milestones designed to test agents and LLMs

(twitter.com)

1 | xdotli | 1y ago | 0

15

Show HN: BenchFlow – run AI benchmarks as an API

(github.com)

24 | xdotli | 1y ago | 1

