I work in compliance, and we see this daily. "Do you have an incident response plan?" is trivially easy to answer. But actually finding and assembling the evidence behind that answer across AWS, Google Docs, Jira, and Slack? That's the hard part nobody benchmarks for.
Curious if BrowseComp accounts for domain-specific retrieval or if it's mostly general web search.