The dataset you need to train the model in the first place is indeed huge, but I think the idea is that once the model is trained, new "voices" can be acquired with much less data than the original training required. It's just like how you can instruct ChatGPT to talk about topics never seen on the internet, in a dialect you invent on the spot, and it can comply despite never having consumed an internet's worth of material on that topic or in that dialect.
Soon the role of the Indian call centers will change from running the scam directly to making spam calls to trusted contacts of the intended mark, in order to collect voice data for fine-tuning a TTS model.