Hey!
The speed of production was definitely a consideration. The idea was to start with communities that read the same or similar content, so that a lot of people would be requesting the same articles for narration. Of course, the very first time an article is requested, it could take up to 24 hours. At scale, even with human narrators, that gap could be cut to an hour or less.
When an article that's already been recorded is requested, it's instant, obviously.
Also, considering the use case, most users should be fine with the audio not being available right away. Imagine: you request an article to be narrated, and by the time you hit the gym or start driving home or to work, the audio will be ready!