You can read the data to train a thing. As long as that thing doesn't literally copy the data into itself, the training hasn't violated copyright.
When that thing later generates an output, the output isn't copyrightable because it's machine-generated (this is the current US position), and it isn't a copyright violation because it was generated, not copied.
You can launder copyrighted material through an LLM, basically.