The weak link was the available free/open datasets. You needed a single speaker with a pleasant voice, 20+ hours of material from varied sources, recorded in a good recording environment with a good mic, etc. For English, the go-to was LJSpeech, which doesn't fulfill all these requirements. I say 'was', as I haven't followed developments recently.
Last year we decided to make our own dataset with an Irish woman, Jenny. She has a soft Irish lilt.
Never got around to training the model, but I will upload the raw audio and prompts here in a few hours (need to pay my internet bill in town..):
https://github.com/dioco-group/jenny-tts-dataset/blob/main/R...