Like I said in my sibling reply, I don't see the backprop, but maybe I'm missing it. The article does use the word, though only in a generic way:
"For example, when doing the backpropagation (the technique through which the models learn), the gradients can become too large"
But I think that's more of a borrowed term: it isn't used again in the description, and it may just be a misconception. The term backprop doesn't appear in the original paper, nor do I see any stage of learning where output errors are run back through the whole network in a deep regression.
What I do see in Transformers is localized use of gradient descent (toy sketch below of what I mean by the GD step), and backprop in NNs also uses GD... but that seems to be the extent of the overlap.
Is there a deep regression somewhere? Maybe I'm missing it.
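To be concrete about the GD part: here's a toy sketch (mine, not from the article or the paper) of the bare gradient-descent update on a 1-D quadratic loss. The names (theta, lr, etc.) are just illustrative. Backprop is one particular way of computing that gradient across many layers; the update rule itself is the same either way.

```python
# Toy sketch of plain gradient descent on a 1-D quadratic loss.
# Not from the article or the paper; purely illustrative.

def loss(theta):
    return (theta - 3.0) ** 2      # minimum at theta = 3

def grad(theta):
    return 2.0 * (theta - 3.0)     # analytic derivative of the loss

theta, lr = 0.0, 0.1               # start far from the minimum
for step in range(50):
    theta -= lr * grad(theta)      # theta <- theta - lr * dL/dtheta

print(round(theta, 4))             # converges toward 3.0
```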