Here is something to look at:
https://en.wikipedia.org/wiki/Speech_recognition_software_fo...
>>the speech processing tool is not plug and play which creates a barrier to the fact that how many people actually end up using them<<
This does not have to be plug and pray. When a copyright holder converts one of their works to synthesized speech, the result is a master product that can be distributed through whatever mechanism they choose, with a great degree of amplification. Only one person has to use the tool, package the output in a portable media container that is in common use, and then provide it to the user base.
>>Thirdly, there are not many entities that offer this many voice choices based on accent, gender and texture<<
If someone knows how to use the available FOSS tools, what you have done can be replicated or exceeded. The FOSS tools run locally and do not require surrendering intellectual property before publication.
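To make the "local" point concrete, here is a minimal sketch that assumes eSpeak NG is the FOSS tool in question. I am picking it only as an example of a synthesizer that runs offline. The function names `espeak_command` and `synthesize` are made up for illustration, and the file names are placeholders.

```python
import shutil
import subprocess


def espeak_command(text_file, wav_file, voice="en-us", words_per_minute=160):
    """Build an espeak-ng command line that reads text_file and writes wav_file.

    -v selects the voice, -s the speed in words per minute,
    -f the input text file, -w the output WAV file.
    """
    return [
        "espeak-ng",
        "-v", voice,
        "-s", str(words_per_minute),
        "-f", text_file,
        "-w", wav_file,
    ]


def synthesize(text_file, wav_file, **options):
    """Run espeak-ng locally; nothing leaves the machine."""
    if shutil.which("espeak-ng") is None:
        raise RuntimeError("espeak-ng is not installed or not on PATH")
    subprocess.run(espeak_command(text_file, wav_file, **options), check=True)


if __name__ == "__main__":
    # Placeholder file names; replace with your own manuscript and output path.
    print(" ".join(espeak_command("book.txt", "book.wav")))
```

The resulting WAV can then be encoded into whatever common container you like. Other local synthesizers could be dropped in by changing only the command builder.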
>> but the end product will generally seem like a rushed, low-quality , not-so-natural robot narrating your favourite content<< This is not true!
>>the task of converting ebook to text on your own<< This is trivial.
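As a rough illustration of how little is involved, here is a sketch that assumes the ebook is a DRM-free EPUB. An EPUB is a zip archive of XHTML files plus a package file that lists the reading order, so the Python standard library is enough. The names `epub_to_text` and `_TextExtractor` are my own, and the tag lists are a simplification rather than a complete treatment of XHTML.

```python
import posixpath
import xml.etree.ElementTree as ET
import zipfile
from html.parser import HTMLParser
from urllib.parse import unquote

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping head, script, and style content."""

    SKIP = {"head", "script", "style"}
    BLOCK = {"p", "div", "br", "li", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in self.BLOCK and not self._skip_depth:
            # End of a block-level element: start a new line in the output.
            self.parts.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def epub_to_text(path):
    """Return the text of a DRM-free EPUB, chapters in spine (reading) order."""
    with zipfile.ZipFile(path) as z:
        # container.xml points at the package (OPF) file.
        container = ET.fromstring(z.read("META-INF/container.xml"))
        opf_path = container.find(".//c:rootfile", CONTAINER_NS).attrib["full-path"]
        opf = ET.fromstring(z.read(opf_path))

        # The manifest maps item ids to file paths relative to the OPF file.
        manifest = {
            item.attrib["id"]: item.attrib["href"]
            for item in opf.findall("opf:manifest/opf:item", OPF_NS)
        }
        base = posixpath.dirname(opf_path)

        chapters = []
        for itemref in opf.findall("opf:spine/opf:itemref", OPF_NS):
            href = unquote(manifest[itemref.attrib["idref"]])
            name = posixpath.normpath(posixpath.join(base, href))
            parser = _TextExtractor()
            parser.feed(z.read(name).decode("utf-8", errors="replace"))
            parser.close()
            chapters.append("".join(parser.parts).strip())
    return "\n\n".join(chapters)
```

Calling `epub_to_text("book.epub")` (placeholder file name) gives one string you can write to a `.txt` file and hand to a synthesizer. Other formats need a different reader, but the idea is the same.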