If you think about it, once we try to do full reasoning, how should the data be represented? Embeddings and flat lists or matrices don't capture how objects interrelate, so it has to be some kind of graph. Here they used multiple attention heads instead, which work in a roughly similar way: the attention weights of each head act like soft, weighted links between objects.
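To make the attention-as-graph analogy concrete, here is a minimal sketch in plain Python. It is not the method from the work being discussed. The function name `attention_weights` and the toy embeddings are my own assumptions for illustration. It computes scaled dot-product attention weights and treats the resulting matrix as a weighted adjacency matrix, where `W[i][j]` is how strongly object `i` "links" to object `j`.

```python
import math

def attention_weights(queries, keys):
    """Row-wise softmax of scaled dot products.

    weights[i][j] can be read as the strength of a soft edge i -> j.
    (Hypothetical helper, single head, no learned projections.)
    """
    d = len(keys[0])
    weights = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        m = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Three toy "objects" with made-up 2-dimensional embeddings.
objects = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
W = attention_weights(objects, objects)

for row in W:
    print([round(w, 3) for w in row])
```

Each row sums to 1, so every object spreads a fixed budget of "link strength" over all the others. Unlike an explicit graph, no edge is ever exactly zero, which is one way the analogy is only approximate.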
Once the data is represented as a graph we can also do simulation: we apply each object's rules iteratively over the graph. The graph can be seen as an automaton, where each object updates its state by integrating information from its neighbors. Automata of this kind can be Turing-complete (some cellular automata are known to be), so in principle they can represent and simulate any computation. With simulation we can search for optimal solutions. It opens a lot of doors for AI.
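A minimal sketch of what "the graph as an automaton" could look like, in plain Python. The names `step`, `simulate` and `spread`, the toy graph, and the update rule are all assumptions I made up for illustration. Every node updates synchronously from its own state and its neighbors' states, and we just iterate.

```python
def step(graph, state, rule):
    """One synchronous update: every node reads the OLD state of its neighbours."""
    return {
        node: rule(state[node], [state[n] for n in graph[node]])
        for node in graph
    }

def simulate(graph, state, rule, steps):
    """Apply the rule repeatedly and keep the full history of states."""
    history = [state]
    for _ in range(steps):
        state = step(graph, state, rule)
        history.append(state)
    return history

def spread(own, neighbours):
    """Toy rule (assumption): a node turns on if it or any neighbour is on."""
    return 1 if own == 1 or any(n == 1 for n in neighbours) else 0

# A small chain a - b - c - d, stored as an adjacency list.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
initial = {"a": 1, "b": 0, "c": 0, "d": 0}

history = simulate(graph, initial, spread, steps=3)
for t, s in enumerate(history):
    print(t, s)
```

The rule is the only part that changes between problems. Swapping `spread` for something richer (or something learned) keeps the same simulate-and-inspect loop, which is what would make search over outcomes possible.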
My money is on simulation and graphs for the next level of AI.