The training process doesn't involve any copies being made, at least no more than viewing an image on the internet copies it into your RAM.
Transformers analyze images; they don't copy them. You might call this semantics, but you probably also wouldn't call an algorithm that counts the black pixels on website images a "copyright violation".
There is a lot of nuance here and a lot to consider. Transformers are not archives of images; they are archives of relationships. This is key because you don't have to copy an image to measure the relationships between its pixels.
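To make that concrete, here's a minimal sketch (not how any real model is trained, just an illustration of the principle): you can stream over an image's pixels, accumulate a summary statistic about how neighboring pixels relate, and throw the pixels away. What survives is a measurement, not a copy.

```python
def neighbor_stats(rows):
    """Mean absolute difference between horizontally adjacent pixels.

    Consumes rows of grayscale values one at a time and keeps only two
    running numbers -- nothing below retains a copy of the image.
    """
    total, count = 0, 0
    for row in rows:
        for a, b in zip(row, row[1:]):
            total += abs(a - b)
            count += 1
    return total / count if count else 0.0


# A tiny synthetic "image": two rows of grayscale values in [0, 255].
image_rows = [
    [0, 128, 128, 255],
    [255, 255, 0, 0],
]

# After this call, only a single statistic survives; the pixels are gone.
print(neighbor_stats(image_rows))
```

A model's weights are, loosely, an enormous pile of statistics like this one, aggregated across many images at once, which is why a single image's contribution isn't recoverable as a copy.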
Train a transformer on one image, and it will just output noisy garbage.