1. Transformers are an extension of the attention mechanism, a well known addition to LSTM encoder-decoders that worked well for machine translation (2014, https://arxiv.org/abs/1409.0473). A transformer model essentially builds a multi-headed attention module, analogous to a CNN layer, then stacks several of them, analogous to deep CNNs / stacked LSTMs (1998, http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf). A minimal attention sketch follows this list.
2. Transformers use residual blocks, which were introduced by the ResNet CNN architecture (2015, https://arxiv.org/abs/1512.03385). At the time, ResNet was topping the ImageNet benchmarks. This technique helps prevent the vanishing gradient problem during training.
3. Transformers use normalization extensively: layer normalization between sub-layers, plus the softmax normalization of the attention weights. This helps keep the internal vectors at a magnitude in the neighborhood of 1 and prevents the training collapse caused by vanishing gradients.
4. Correct initialization of the network weights also helps prevent the vanishing gradient problem (points 2-4 are illustrated in the second sketch after this list).
5. Unsupervised pretraining was one of the first tricks to make deep networks work (2010, https://www.jmlr.org/papers/volume11/erhan10a/erhan10a.pdf).
6. Pretraining was used extensively in the vision community for transfer learning, i.e. reusing the weights of a network trained on ImageNet and replacing the top layers / loss function to tackle a different problem.
7. Finally, language modelling, that is, predicting the next word in a sentence, was a well known technique for making machine translation work better. Researchers were looking for better language modelling techniques trained on large corpora (2013, https://arxiv.org/abs/1312.3005, https://www.kaggle.com/c/billion-word-imputation). The third sketch below shows the objective as a toy computation.
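To make point 1 concrete, here is a minimal NumPy sketch of scaled dot-product attention and the multi-headed stacking. The shapes, the head count, and the omission of learned Q/K/V projections are my own simplifications, not any paper's exact configuration:

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def attention(q, k, v):
        # Scaled dot-product attention: each query mixes the values,
        # weighted by its similarity to every key.
        scores = q @ k.T / np.sqrt(q.shape[-1])
        return softmax(scores) @ v

    def multi_head(x, n_heads=4):
        # Split the feature dimension into heads, attend in each head
        # independently, then concatenate -- the "multi-headed" part.
        # (Learned projection matrices are omitted for brevity.)
        d = x.shape[-1] // n_heads
        heads = [attention(x[:, i*d:(i+1)*d], x[:, i*d:(i+1)*d], x[:, i*d:(i+1)*d])
                 for i in range(n_heads)]
        return np.concatenate(heads, axis=-1)

    x = np.random.randn(10, 64)   # 10 tokens, 64-dim embeddings (made-up sizes)
    out = multi_head(x)           # same shape as x; a transformer stacks many such layers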
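And a toy version of the residual, normalization, and initialization tricks from points 2-4. Again this is my own simplified sketch; the layer size and the Xavier-style scaling are illustrative, not taken from a specific model:

    import numpy as np

    def layer_norm(x, eps=1e-5):
        # Normalize each token vector to zero mean / unit variance,
        # keeping activations in a well-behaved range (point 3).
        mu = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)
        return (x - mu) / np.sqrt(var + eps)

    d = 64
    # Xavier/Glorot-style initialization keeps the variance of activations
    # roughly constant across layers, which also fights vanishing gradients (point 4).
    W = np.random.randn(d, d) * np.sqrt(2.0 / (d + d))

    def residual_block(x):
        # y = x + f(norm(x)): the skip connection gives gradients a direct
        # path back through the stack (point 2, the ResNet idea).
        return x + np.maximum(layer_norm(x) @ W, 0.0)

    x = np.random.randn(10, d)
    y = residual_block(x)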
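Finally, point 7: language modelling as next-word prediction, scored with cross-entropy. The vocabulary and the "model" here are made-up placeholders standing in for a real predictor trained on a large corpus:

    import numpy as np

    vocab = ["the", "cat", "sat", "on", "mat"]
    sentence = ["the", "cat", "sat", "on", "the", "mat"]

    def predict_next(context):
        # Stand-in for a trained model: returns a probability distribution
        # over the vocabulary given the words seen so far (ignored here).
        logits = np.random.randn(len(vocab))
        e = np.exp(logits - logits.max())
        return e / e.sum()

    # Cross-entropy of each actual next word under the model's prediction,
    # averaged over the sentence -- lower means a better language model.
    loss = 0.0
    for i in range(1, len(sentence)):
        probs = predict_next(sentence[:i])
        loss -= np.log(probs[vocab.index(sentence[i])])
    print("per-word cross-entropy:", loss / (len(sentence) - 1))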
BERT was motivated by the discovery that neural nets will learn whatever functions you show them, plus some ideas about how to route information through bottlenecks that force the network to learn general representations. But there was no mature theory behind it, and judging by that review paper there still isn't.
That whole paper reads to me like an account of wandering in the dark. If I read a paper that long about liquid rocket fuels, I'd learn that 99% of the things I might want to use as a rocket fuel won't work, and that my real choice is Hydrogen, Methane, or RP-8, combined with Oxygen.
If you were out to "build a better BERT", even a slightly better BERT, that paper doesn't give clear guidelines about what you should do.
It's got all the trappings of a field which is preparadigmatic but could be mistaken for paradigmatic because of the sheer volume of researchers, conferences, papers, etc.
In your analogy, there are two roles: me, the application programmer, and the library's programmer. If the library behaves consistently, I only need to know how to use it; I don't need to know how it works internally. But the library's programmer needs to know how it works. He may not need to know how the building blocks he uses in his library work, but he does need to know how they interact in order to deliver new features and bugfixes. (I think) he can't add features and fix bugs by trial and error (at least not all the time :)).
So in your analogy, I think the researchers who came up with the BERT model are the library's programmers, not the application programmer.