7. Systematically generating tests that would have caught Anthropic's top‑K bug (theorem.dev) | 2 points by ag8 | 8mo ago | 0 comments
9. Training Qwen to answer briefly yet intelligently using feedback control (runrl.com) | 4 points by ag8 | 9mo ago | 0 comments
10. Launch HN: RunRL (YC X25) – Reinforcement learning as a service (runrl.com) | 71 points by ag8 | 9mo ago | 22 comments
15. Rebus: A Robust Evaluation Benchmark of Understanding Symbols (arxiv.org) [arXiv] | 1 point by ag8 | 2y ago | 0 comments