I'm sorry, but this is a fundamentally incorrect view of machine learning (including, but not limited to, transformers).
From an information-theoretic perspective the two are essentially identical, with the exception that standard compression algorithms have no learned "loss" function beyond minimizing reconstruction error together with the size of the compressed output.
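To make the equivalence concrete, here's a minimal sketch in Python (a toy character-level unigram model; the smoothing constant and 256-symbol alphabet are assumptions, and any predictive model, a transformer included, could stand in for it): the model's cross-entropy on a string, measured in bits, is exactly the number of bits an ideal arithmetic coder using that same model would need to compress the string. Better prediction means a smaller compressed size, which is the sense in which training and compression optimize the same thing.

    import math
    from collections import Counter

    # Toy character-level model: unigram frequencies from a "training" text.
    train = "the quick brown fox jumps over the lazy dog"
    counts = Counter(train)
    total = sum(counts.values())

    def prob(ch):
        # Laplace smoothing over an assumed 256-symbol alphabet so unseen
        # characters still get nonzero probability.
        return (counts.get(ch, 0) + 1) / (total + 256)

    # Shannon code length for a symbol with probability p is -log2(p), so the
    # summed cross-entropy in bits equals the size an ideal arithmetic coder
    # driven by this model would need for the text.
    test = "the lazy fox"
    bits = sum(-math.log2(prob(ch)) for ch in test)
    print(f"cross-entropy: {bits / len(test):.2f} bits/char")
    print(f"ideal compressed size: {bits:.1f} bits ({math.ceil(bits / 8)} bytes)")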
Here's a link to the relevant section on Wikipedia if you'd like more information [0]. MacKay's Information Theory, Inference, and Learning Algorithms is the standard full-text treatment of this topic [1]. Ted Chiang's article "ChatGPT Is a Blurry JPEG of the Web" is a pretty good pop-sci exploration if you don't want to get too deep into the mathematics [2].
0. https://en.wikipedia.org/wiki/Data_compression#Machine_learn...
1. https://www.inference.org.uk/itprnn/book.pdf
2. https://www.newyorker.com/tech/annals-of-technology/chatgpt-...