> not be copyright infringement to train on them either
Copyright is about reproduction. It does not cover uses. Once you've bought a copy, it's yours to use, as long as you don't reproduce it outside of fair use.
The problem with most language models is that they will often uncritically reproduce significant portions of copyrighted works.