AI Agent Testing
As I work on AI agents, I find myself constantly thinking about how to effectively test them.
As we integrate more knowledge sources and expand our agents' capabilities, testing becomes increasingly complex. As a standard practice, we use evals to ensure quality is maintained. But honestly, I feel like something is missing.
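For context, an eval in its simplest form is just a set of input cases with automated grading. The sketch below is a hypothetical illustration, not our actual setup: the `agent` stub and the keyword-based grading are assumptions made for the example (real harnesses often use LLM-as-judge or expert-written rubrics instead).

```python
def agent(question: str) -> str:
    # Stand-in for the real agent under test (hypothetical).
    canned = {
        "What is our refund window?": "Refunds are accepted within 30 days of purchase.",
    }
    return canned.get(question, "I don't know.")

# Each case pairs an input with keywords a correct answer must contain.
EVAL_CASES = [
    {"input": "What is our refund window?", "must_contain": ["30 days"]},
]

def run_evals(cases):
    """Run each case through the agent and grade by keyword match."""
    results = []
    for case in cases:
        answer = agent(case["input"])
        passed = all(kw.lower() in answer.lower() for kw in case["must_contain"])
        results.append({"input": case["input"], "answer": answer, "passed": passed})
    return results

if __name__ == "__main__":
    for r in run_evals(EVAL_CASES):
        print(("PASS" if r["passed"] else "FAIL"), "-", r["input"])
```

Even a toy harness like this exposes the gap: the pass/fail signal is easy to chart, but judging whether the answer is actually *good* still requires someone who knows the domain to read it.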
The issue I’m seeing is that we, as engineers, sometimes lack the domain knowledge to accurately assess an agent's response. At the same time, current tooling makes it hard to bring domain experts into the testing loop: it prioritizes dashboards and aggregate metrics over the readability of the actual outputs.
This has been my experience so far—I would love to hear your thoughts on this.