—
v0.5.0 was about figuring out why models weren’t using tilth tools consistently — even when they were available.
Results vs baseline (built-in tools only):
Sonnet 4.6: -44% $/correct (84% → 94% accuracy, 31% fewer turns)
Opus 4.6: -39% $/correct (91% → 92% accuracy, 37% fewer turns)
Haiku 4.5: -38% $/correct (54% → 73% accuracy, 7% fewer turns)
—
https://github.com/jahala/tilth/
Full results: https://github.com/jahala/tilth/blob/main/benchmark/README.m...
— PS: I don't have the budget to run the benchmark a lot (especially with Opus), so if any token whales have capacity to run some benchmarks, please feel free to PR results.
v0.4.4: Added adaptive 2nd-hop impact analysis to callers search — when a function has ≤10 unique callers, tilth automatically traces callers-of-callers in a single scan. First full 26-task Opus baseline (previously 5 hard tasks only). Haiku adoption improved from 42% to 78%, flipping Haiku from a cost regression to -38% $/correct.
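For the curious, the adaptive second hop amounts to something like the sketch below. It assumes a callers index mapping each function to its set of direct callers; the names and the exact cutoff handling are illustrative, not tilth's actual internals.

    # Sketch of adaptive 2nd-hop impact analysis (illustrative, not tilth's real code).
    # callers_index maps a function name to the set of functions that call it directly.
    SECOND_HOP_LIMIT = 10  # only fan out when the direct caller set is small

    def impact(callers_index: dict[str, set[str]], target: str) -> dict[str, set[str]]:
        direct = callers_index.get(target, set())
        result = {target: direct}
        if len(direct) <= SECOND_HOP_LIMIT:
            # Small blast radius: trace callers-of-callers in the same pass.
            for caller in direct:
                result[caller] = callers_index.get(caller, set())
        return result

    if __name__ == "__main__":
        index = {
            "parse_config": {"load_settings", "reload"},
            "load_settings": {"main"},
            "reload": {"signal_handler"},
        }
        print(impact(index, "parse_config"))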
v0.4.5: Bumped TOKEN_THRESHOLD from 3500 to 6000 estimated tokens (~24KB), so mid-sized files return full content instead of an outline that agents then read back via 5–7 sequential --section calls. Fixed two major regressions: gin_radix_tree (+35% → ~tie) and rg_search_dispatch (+90% → -26% win). Sonnet hit 100% accuracy (52/52) and -34% $/correct overall.
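Roughly how the new threshold plays out in practice (the ~4 bytes/token estimate is my assumption to match the ~24KB figure; the real heuristic may differ):

    # Sketch of the outline-vs-full-content cutoff (approximate, not tilth's exact logic).
    TOKEN_THRESHOLD = 6000   # was 3500 before v0.4.5
    BYTES_PER_TOKEN = 4      # rough estimate, so 6000 tokens is about 24KB

    def read_mode(file_size_bytes: int) -> str:
        estimated_tokens = file_size_bytes / BYTES_PER_TOKEN
        # Mid-sized files now come back whole instead of forcing 5-7 --section round trips.
        return "full" if estimated_tokens <= TOKEN_THRESHOLD else "outline"

    print(read_mode(20_000))  # "full" (~5000 tokens; an outline under the old 3500 cap)
    print(read_mode(40_000))  # "outline"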
--
https://github.com/jahala/tilth/
Full results: https://github.com/jahala/tilth/blob/main/benchmark/README.m...
-- PS: I don't have the budget to run the benchmark a lot (especially with Opus), so if any token whales have capacity to run some benchmarks, please feel free to PR results.
New stuff: files sync to each other. Edit the CSS in any file, run --sync css, and every sibling file gets the update. No build tool, no shared imports. Just files copying sections between themselves using markers.
Dark mode, responsive layout, search — the usual. Still zero deps beyond bash and Claude Code.
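If you're curious what the marker mechanism amounts to, here's a minimal sketch. The marker format and function names are made up for illustration; check the repo for the real ones.

    # Minimal sketch of marker-based section sync (marker format is illustrative).
    import re
    from pathlib import Path

    def sync_section(name: str, source: Path, siblings: list[Path]) -> None:
        # A synced section sits between "<!-- sync:NAME -->" and "<!-- /sync:NAME -->" markers (assumed format).
        pattern = re.compile(
            rf"<!-- sync:{re.escape(name)} -->.*?<!-- /sync:{re.escape(name)} -->", re.DOTALL
        )
        block = pattern.search(source.read_text()).group(0)
        for path in siblings:
            text = path.read_text()
            if pattern.search(text):
                # Replace the sibling's copy of the section with the freshly edited one.
                path.write_text(pattern.sub(lambda _: block, text))

    # Usage (hypothetical file names):
    # sync_section("css", Path("index.html"), [Path("about.html"), Path("blog.html")])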
--
v0.4.0 added search ranking, sibling surfacing, transitive callees, cognitive load stripping, smart truncation, and bloom filters. Got -17% $/correct on Sonnet, -20% on Opus.
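The bloom-filter piece is the usual trick: a cheap membership pre-check so most files can be skipped without being opened. A standard-construction sketch, not tilth's actual implementation:

    # Standard bloom-filter pre-check: skip files that definitely don't contain a token.
    import hashlib

    class Bloom:
        def __init__(self, bits: int = 1 << 16, hashes: int = 3):
            self.bits, self.hashes = bits, hashes
            self.array = bytearray(bits // 8)

        def _positions(self, token: str):
            for i in range(self.hashes):
                h = hashlib.blake2b(f"{i}:{token}".encode()).digest()
                yield int.from_bytes(h[:8], "big") % self.bits

        def add(self, token: str) -> None:
            for p in self._positions(token):
                self.array[p // 8] |= 1 << (p % 8)

        def might_contain(self, token: str) -> bool:
            # False means "definitely absent"; True means "maybe", so fall back to a real scan.
            return all(self.array[p // 8] & (1 << (p % 8)) for p in self._positions(token))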
v0.4.1 was pure instruction tuning with zero code changes, and that alone jumped Sonnet adoption from 89% to 98% and moved cost per correct answer from -17% to -29%.
The instruction tuning result surprised me. The model already knew tilth tools existed — it just wasn’t choosing them consistently. Making the replacement relationship explicit in the tool description was worth more than all the search ranking work in v0.4.0.
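To make "explicit replacement relationship" concrete, here's the flavor of the change as a hypothetical before/after tool description. This is paraphrased for illustration, not tilth's actual wording.

    # Hypothetical illustration of making the replacement relationship explicit
    # in an MCP tool description (not quoted from tilth).
    DESCRIPTION_BEFORE = "Searches the codebase and returns ranked results."

    DESCRIPTION_AFTER = (
        "Use this INSTEAD OF the built-in Grep/Read tools for code search: "
        "it returns ranked, pre-trimmed results in fewer tokens."
    )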
Haiku remains the outlier — only 42% tilth adoption despite instruction tuning.
--
https://github.com/jahala/tilth/
Full results: https://github.com/jahala/tilth/blob/main/benchmark/README.m...
-- PS: I don't have the budget to run the benchmark a lot (especially with Opus), so if any token whales have capacity to run some benchmarks, please feel free to PR results.
-> https://github.com/jahala/tilth
Results: Sonnet 4.5 — 26% cheaper per correct answer (79% → 86% accuracy). Opus 4.6 — 14% cheaper (and the only model+mode combo to crack the hardest task). Haiku 4.5 — 82% cheaper when forced to use tilth (69% → 100% accuracy at $0.04/answer).
We measure “cost per correct answer” — what you’d expect to spend before getting a usable answer under retry. A wrong answer isn’t a cheap success.
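Concretely, the metric is expected spend under retry-until-correct, i.e. cost per attempt divided by accuracy. The numbers below are illustrative, not from the benchmark:

    # "Cost per correct answer" = expected spend if you retry until you get a usable answer.
    def cost_per_correct(cost_per_attempt: float, accuracy: float) -> float:
        return cost_per_attempt / accuracy

    print(cost_per_correct(0.10, 0.80))  # 0.125 -> a $0.10 run at 80% accuracy costs $0.125 per usable answer
    print(cost_per_correct(0.10, 0.50))  # 0.20  -> at 50% accuracy the same run costs $0.20 per usable answer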
Interesting finding: smarter models adopt MCP tools voluntarily (Sonnet 95%, Opus 94%), but Haiku ignores them (9%). Instruction tuning didn’t help. Removing the overlapping built-in tools did.
https://github.com/jahala/tilth/blob/main/benchmark/README.m...
PS: I don't have the budget to run the benchmark a lot with Opus, so if any token whales have capacity to run some benchmarks, please feel free to PR results.