I have to imagine the valuable training data is domain-specific stuff, like sales call recordings for specific industries and company-owned technical materials on specific topics. Surely there is already enough public or copyright-free general-purpose material.