One of the sentences near the end that speaks to this is "...[this shows] a case where the type of medical knowledge reflected in common benchmarks is little help getting basic, fundamental questions about a patient right." Point being that you can train on every textbook under the sun, but if you can't say which hospital a record came from, or which date a visit happened as the patient thinks of it, you're toast -- and those seemingly throwaway questions are way harder to get right than people realize. NER can find the dates in a record no problem, but intuitively mapping out how dates are printed in EHR software and how they reflect the workflow of an institution is the critical step needed to pick the right one as the visit date -- that's a whole new world of knowledge that the LLM needs to know, which is not characterized when just comparing results on medical QA.
Giving examples of the crazy things we have to contend is something I can (and will!) gladly talk about for hours...