I think there are ways around it. The simplest would be to generate replacement data, for example by paraphrasing the original, or summarising, or turning it into question-answer pairs. In this new format it can serve as training data for a clean LLM. Of course the public domain data would be used directly, no need to go synthetic there.
An important direction would be to train copyright attribution models, and diff-models to detect when a work is infringing on another, by direct comparison. They would be useful to filter both the training set and the model outputs.