TLDR: Transformer models (at GPT-2 scale) are great (near-optimal) at interpolating between the cases seen in (pre-)training, but they fail at extrapolation as soon as we leave the training domain. Impressive results may owe more to the breadth of the (pre-)training data than to generalization ability.