Aye, this should be obvious even to non-technical folks. Much has been written about how LLMs regurgitate the data they were trained on. So if you're looking for data to train on, you can certainly extract it from them.
Plus, of course, for people within the tech bubble, there are plenty of research results on the value of synthetically augmented and expanded training data, which push the impact well beyond just regurgitating source data.
Most of all, this whole episode is a failure to report on what to expect next and to project running costs.