This is moderately surprising.
In question answering (QA) style tasks (SQuAD, SQuAD 2.0) we see state-of-the-art models approach human performance. QA is similar to KBC in the sense that the answers are usually extracted from text in a similar way.
I'd imagine there is potential for fairly rapid improvement in this (Knowledge Base Population) task.
Any benchmark which reflects a task that humans do is a good one, unless it has specific weaknesses that a computer exploits.
I'd use models that are written for this in my work, so I find it useful.
I feel like this is true of any new benchmark: give some smart folks a few months and they can beat the task.
In NLP work this used not to be the case. Five years ago we were stuck at a local maximum.
And that undervalues the task: it bridges the gap between unstructured and structured data. In many ways it is the holy grail for many tasks.
NLP is good enough that we can now explicitly measure how well a system reads text in terms of what knowledge is extracted from it. This task is called Knowledge Base Population, and we've released KnowledgeNet, the first reproducible dataset that measures it, along with an open-source state-of-the-art baseline.
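To make "measuring what knowledge is extracted" a bit more concrete, here is a minimal sketch of how extracted facts could be scored against a hand-annotated gold set using precision, recall, and F1. This is only an illustration and not KnowledgeNet's actual evaluator: the `score_facts` function, the (subject, property, object) triple format, the property names, and the toy data are all assumptions made up for this example.

```python
from typing import Set, Tuple

# A fact is modeled as a (subject, property, object) triple.
# This representation is an assumption for illustration only.
Fact = Tuple[str, str, str]


def score_facts(predicted: Set[Fact], gold: Set[Fact]) -> Tuple[float, float, float]:
    """Return (precision, recall, f1) for exact-match facts."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy gold annotations and toy system output (hypothetical property names).
gold = {
    ("Marie Curie", "DATE_OF_BIRTH", "1867"),
    ("Marie Curie", "PLACE_OF_BIRTH", "Warsaw"),
}
predicted = {
    ("Marie Curie", "PLACE_OF_BIRTH", "Warsaw"),      # matches gold
    ("Marie Curie", "SPOUSE", "Albert Einstein"),     # spurious extraction
}

precision, recall, f1 = score_facts(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

A real evaluation would also have to decide how to match entity mentions and text spans rather than relying on exact string equality, which this sketch ignores.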
Direct link to the GitHub repo: https://github.com/diffbot/knowledge-net
EMNLP paper: https://www.aclweb.org/anthology/D19-1069.pdf