Most of it is high-definition audio these days, and that would just get replaced by a 10 GB training set, or maybe the training set becomes a shared resource on the console.
Generating quality voice is compute-intensive enough that it would actually increase the file size: they would still ship all the audio (instead of generating it locally), but there would just be so much more of it.