I built vibeCoach: a voice AI where you actually practice these conversations out loud, and the AI responds like a real woman would.
She starts guarded. One-word answers, a little skeptical. If you escalate too fast or try something cheesy, she gets MORE guarded. If you're genuine and read the moment right, she opens up. Just like real life.
Under the hood it's a multi-agent system: multiple AI agents per conversation that hand off to each other as her emotional state shifts. The transitions are seamless. You just hear her tone change.
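To make the handoff idea concrete, here is a minimal sketch of one way it could work. This is not the actual vibeCoach code: the state names, the 0-10 openness score, the thresholds, the signal labels, and the `PersonaAgent` and `Conversation` classes are all assumptions made up for illustration. It assumes something upstream has already classified each user turn.

```python
from dataclasses import dataclass

# Hypothetical per-turn signals and how much each moves her openness score.
SIGNAL_DELTAS = {"cheesy": -2, "too_fast": -2, "neutral": 0, "genuine": 2}


def state_for(score: int) -> str:
    """Map an openness score (0-10) to a hypothetical emotional state."""
    if score < 4:
        return "guarded"
    if score < 7:
        return "neutral"
    return "open"


@dataclass
class PersonaAgent:
    """Stand-in for one agent; a real one would wrap a voice model and prompt."""
    state: str
    style: str

    def reply(self, user_text: str) -> str:
        return f"[{self.state}: {self.style}] responding to {user_text!r}"


class Conversation:
    def __init__(self) -> None:
        # One agent per emotional state; the active one depends on the score.
        self.agents = {
            "guarded": PersonaAgent("guarded", "one-word answers, skeptical"),
            "neutral": PersonaAgent("neutral", "polite, still cautious"),
            "open": PersonaAgent("open", "warm, asks questions back"),
        }
        self.score = 2  # she starts guarded

    @property
    def active(self) -> PersonaAgent:
        return self.agents[state_for(self.score)]

    def turn(self, user_text: str, signal: str) -> str:
        # Update the score first, clamped to 0-10, so the handoff happens
        # before the reply and the caller only ever sees the new tone.
        self.score = max(0, min(10, self.score + SIGNAL_DELTAS[signal]))
        return self.active.reply(user_text)
```

In this sketch, a cheesy opener drops the score from 2 to 0, so it takes more genuine turns to reach the "neutral" agent than it would have from a clean start. That is one simple way to model "she gets MORE guarded".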
Voice AI roleplay is a proven B2B category: sales teams use it for call training. I took the same approach and pointed it at the conversation most men actually struggle with.
There's a hard conversation scenario too: she's angry about something you did, she's not hearing logic, and you have to navigate her emotions before you can resolve anything. That one's humbling.
Live at tryvibecoach.com. Built solo. Happy to answer questions.
τ-Knowledge: agents must navigate ~700 interconnected policy documents to complete multi-step tasks. The best frontier model (GPT-5.2, high reasoning) hits ~25%. The surprising part: even when you hand the model the exact documents it needs, performance only reaches ~40%. We found that the bottleneck isn't retrieval; it's reasoning over complex, interlinked policies and executing the right actions in the right order.
τ-Voice: the same grounded tasks, but over live full-duplex voice with realistic audio: accents, background noise, interruptions, compressed phone lines. Voice agents score 31–51% in clean audio conditions and 26–38% in realistic ones. A consistent failure pattern across providers (OpenAI, Gemini, xAI): the agent mishears a name or email during authentication, and everything downstream fails.
We also incorporated 75+ task fixes to the original airline, retail, and telecom domains, many based on community audits and PRs (including contributions from Amazon and Anthropic). We believe a benchmark is only as good as its maintenance, and we're grateful for the community's help improving it.
Code and leaderboard are open, and we'd welcome community submissions and feedback.
Blog post (papers, code, leaderboard): https://sierra.ai/blog/bench-advancing-agent-benchmarking-to...