On the other, some of my favorite audio books all stood out because the narrator was interpreting the text really well, for example by changing the pacing during chaotic moments. Or those audiobooks with multiple narrators and different voices for each character. Not to mention that sometimes the only cue you get for who's speaking during dialogue is how the voice actor changes their tone. I have mixed feelings about using this and losing some of that quality.
I would totally use this over amateur ebooks or public domain audiobooks like the ones on Project Gutenberg. As cool as it is/was for someone to contribute to free books... as a listener it was always jarring to switch to a new chapter and hear a completely different voice and microphone quality for no reason.
This (and everything else with AI) isn't saying "you don't need good actors any more". It's saying "if you don't have an audiobook, you can make a mediocre one automatically".
AI (text, images, videos, whatever) doesn't replace the top end, it replaces the entire bottom-to-middle end.
IMGO (gut opinion), generative AI is a consumption aid, like a strong antacid. It lets us be done with $content quicker, for content = {book, art, noisy_email, coding_task}. There are obvious preconceptions forming among us all from the "generative" nomenclature, but lots of the surviving usages are rather reductive, in relevant and useful ways.
Even on the non-fiction side, the narration for Gleick's The Information adds something.
While I want this tool for all the stuff with no narration, NYT/New Yorker/etc replacing human narrators with AI ones has been so shitty. The human narrators sound good, not just average. They add something. The AI narrators are simply bad.
New authors, self-publishers, can't afford tens of thousands of dollars to get an audiobook recorded professionally... This can limit their distribution.
Authors might even choose not to make such a version (or lack the confidence to record themselves), so AI capable of making a decently passable version would be nice -- something more than reading text blandly. AI in theory could attempt to track the scene and adjust.
I wonder if a standardized markup exists to do so.
With LLMs proving to be very good at generating code, it may be reasonable to assume they can get good at generating SSML as well.
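For what it's worth, SSML (the W3C Speech Synthesis Markup Language) is that kind of standardized markup. Here is a minimal sketch of what generated markup could look like, built with Python's standard library. The `to_ssml` helper, the segment tuples, and the chosen rate/pitch values are made up for illustration, and how well a given TTS engine honors these tags varies.

```python
import xml.etree.ElementTree as ET

def to_ssml(segments):
    """Build an SSML string from (text, rate, pitch, pause_ms) tuples.

    Hypothetical helper: the tuple layout is an assumption for this sketch,
    not something any particular TTS engine requires.
    """
    speak = ET.Element("speak", {"version": "1.0"})
    for text, rate, pitch, pause_ms in segments:
        # <prosody> carries the pacing and pitch hints for this passage.
        prosody = ET.SubElement(speak, "prosody", {"rate": rate, "pitch": pitch})
        prosody.text = text
        if pause_ms:
            # <break> inserts an explicit pause after the passage.
            ET.SubElement(speak, "break", {"time": f"{pause_ms}ms"})
    return ET.tostring(speak, encoding="unicode")

segments = [
    ("The house was quiet.", "slow", "low", 600),
    ("Then the door burst open!", "fast", "high", 0),
]
ssml = to_ssml(segments)
print(ssml)
```

An LLM would only need to emit the per-passage rate, pitch, and pause choices; the markup itself is mechanical.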
Not sure if there is a more direct way to channel the interpretation of the tone/context/emotion etc from prose into generated voice qualities.
If we train some models on ebooks along with their professionally produced human-narrated audiobooks, with enough variety and volume of training data, the models might capture the essence of that human-interpretation of written text? Just maybe?
Amazon with its huge collection of Audible + Kindle library -- if it can do this without violating any rights -- has a huge corpus for this. They already have "whispersync" which is a feature that syncs text in a kindle ebook with words in corresponding audible audiobook.
TortoiseTTS has a few examples under prompt engineering on their demo site: https://nonint.com/static/tortoise_v2_examples.html
My example: I was never a Wheel of Time fan, but the new audio editions done by Rosamund Pike are quite the performance, and make me like the story. She brings all the characters to life in a way that's different from just reading. It's a true performance.
Computer chess took a long time to get better than the best players in the world, but it was better than most chess players for many years before that. We're seeing that a lot with these generative models.
Just imagine what this would do for writers. They can get instant feedback and adjust their book for the audiobook.
Anyway, even if in theory it might, in practice things may end even worse than doing it with a monotone voice.
He also narrates another scifi book series and honestly I dislike this a lot.
He became the voice of one particular character for me.
I would love variety
There's some contemporary discussion of what happened here: https://tidbits.com/2009/03/02/why-the-kindle-2-should-speak...
I think there is still integration with Audible, though. If you buy a book on the Kindle and on Audible, the position will sync, and you can switch between listening and reading without losing your place in the book.
I tried it while on a treadmill so it allowed me to follow the book with more focus without sacrificing much else.
It wasn't a good experience but it was nice to be able to keep 'reading' a book while I was exercising.
It worked for me for over a decade, until I broke the device. I don't know if I never updated the firmware or if the fact I used Calibre to convert books bypassed the feature gate.
It's more of an open problem how to create those epubs. I have some code that can do it using Elevenlabs audio, but I imagine it's way harder to have something similar for a human narrator.... who's going to do the sync? Maybe we need a sync AI.
For Android:
- Moon+ reader pro - some paid high-quality TTS voices (like Acapella)
For iOS:
- Kybook reader and internal iOS voices (no external TTS voices for the walled garden)
This works well enough to listen to a book while you walk and when you get back home read on the WC from the place you stopped.
Additionally, if you buy a tablet or an Android ebook reader, you install the app there and you can continue on your bigger/better device seamlessly.
Whisper-sync for the masses! Ahoy...
What surprised me in a good way was that my Kindle app was aware of this and asked if I wanted to download the Audible version of the book I was currently reading.
Been listening on the way to work and then reading on the way back. Enjoying it so far.
Not quite seamless but it works. It has a cursor that follows the words as they’re spoken, which allows you to read and hear at the same time (“immersive reading”), which I find to be extremely helpful for maintaining focus.
Edit: I'll wait to see if any recommendations get made here, if not I might give this one a go: https://github.com/coqui-ai/TTS
I also found DEMUCS + Whisper + pydub to be a super helpful combo for creating quality datasets.
Though according to the TTS leaderboard, Fish Speech https://github.com/fishaudio/fish-speech and Kokoro are higher.
https://jdsemrau.substack.com/p/teaching-your-agent-to-speak...
Years ago, when I was dating someone who spoke Russian as one of her native languages, we had to do a funny compromise when watching films together with her parents: they didn't speak a word of English, so we'd use the Russian dub with English subtitles.
I noticed that the Russian dub was just one man reading a translation in a flat voice over what was happening on the screen, no attempts at voice acting or matching the emotions. Usually the dub would have a split second delay to the actual lines, so you'd still hear the original voices for a moment (and also a little bit in the background).
At first I found it very jarring, but they explained that this flatness was a feature. You'll quickly learn to "filter out" the voice while still hearing the translation, and the faint presence of the original voices was enough to bring the emotional flavor back. The lack of voice acting helped with the filtering.
This turned out to apply to me as well, even though I don't speak Russian! My brain subconsciously would filter out the dub, and extract most of the original performance through the subtitles and faint presence of the original voices. Obviously the original version would have been a better experience for me, but it was still very enjoyable.
Of course a generated audiobook is not a dub, as there is no "original voice" to extract an emotional performance from. But some listeners might still be able to do something similar. The lack of understanding in the generated voice and its predictable monotony might allow them to filter out everything but the literal text, and then fill it in with their own emotional interpretations. Still not as great as having a proper storyteller who does understand the text and knows how to deliver dramatic lines, but perhaps not as bad as expected either.
When the foreign movies started to filter into the Soviet Union's illegal movie theatres, you would get 3 or 4 movies playing at once in one room. There would be a TV in each corner of the room and 4 or 5 rows of plastic chairs in front of it in an arch.
ALL of the movies were being revoiced by the same person. So, if you were sitting in the back of the 5th row, you were potentially getting the sound from an action movie, a comedy, a horror movie and a romance at the same time. In the same voice.
You learned to filter really well. So, if that's what they were trained on, watching a single movie must have been very relaxing.
To add on a slight tangent. Many books/audiobooks just don't exist in other languages at all. So even getting some monotone is a lot better than getting nothing.
I think this is where these models really shine. Cheaply creating cross language media and unlocking the knowledge/media to underprivileged parts of the world.
I dislike German and Russian style dubs as well; I'd rather learn a bit of the original language.
So, it was not just the voice, but the quality control pipeline that was missing as well.
Maybe it mostly works for old plain text books, but if nobody is checking.....
Here is a detailed comparison chart I have made that tracks over 100 features across most popular apps: https://speechcentral.net/speech-central-vs-voice-dream-read...
$80/yr.
Yaaaaaay.
Like you, though, I had that reaction to the subscription model for macOS and therefore decided not to "buy" it when it came out.
Might be because our brains try to 'feel' the speaker, the emotion, the pauses, the invisible smile, etc.
No doubt models will improve and will be harder to identify as AI generated, but for now, as with diffusion images, I still notice it and react by just moving on..
Take a moment here for a second though and think about it. Even if these voices got to be really good, indistinguishable almost... would I want to listen to it even then? If it was an NPC's generated voice and generated dialogue in a game to help enrich the world building, maybe in that context. On YouTube or with newscasters? Probably not. Audio books? Think I would still rather have it be a real person, because it's like they're reading a story to me and it feels better if it's coming from someone. There's also the unknown factor, where if it's ML generated it's so sterile that the unknowns are kind of gone.
Think about it like this, in the movie industry we had practical effects that were charming in a way. You could think about the physical things that had to occur to make that happen. Movie magic. Now, everything is so CG it's like the magic is gone. Even though you know people put serious hard work into it, there's a kind of inauthenticity and just lack of relevance to the real world that takes something away from it.
It's like a real magician has interesting tricks, while an artificial magician is most likely just a liar.
Still, I grant that it makes some cool things possible and there is potential if things are done right. Some positive mixture of real humans and machine generated stuff so it isn't devoid of anything connected to real life effort.
Future generations will never know a world where you don't watch a 2 hour AI generated orientation video about the wonders of working for Generic Corp when you start a new job.
> I never said she stole my money
It can have 7 different meanings based on which word you stress.
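To make the count concrete: the sentence has seven words, and stressing each one in turn gives a different reading. A small sketch that enumerates them, marking the stressed word with SSML's `<emphasis>` tag. The `stress_variants` name is made up here, and whether a given TTS engine actually honors `<emphasis>` is an assumption.

```python
def stress_variants(sentence):
    """Return one variant per word, with that word wrapped in <emphasis>."""
    words = sentence.split()
    variants = []
    for i, word in enumerate(words):
        # Rebuild the sentence with only the i-th word marked as stressed.
        marked = words[:i] + [f'<emphasis level="strong">{word}</emphasis>'] + words[i + 1:]
        variants.append(" ".join(marked))
    return variants

variants = stress_variants("I never said she stole my money")
for v in variants:
    print(v)
```

The text alone doesn't say which of these the author meant, which is exactly the part a narrator has to infer from context.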
The new AI voices sound very natural at a shallow level, but overall pronounce things in odd ways. Not quite wrong, but subtly unnatural which introduces some cognitive load.
Old TTS systems with their monotone voices are less confusing, but sound very robotic.
I mean, I do that because it's correlated with the content being garbage. If I'm intentionally using it on content I want to consume I expect it to be different, though I haven't gotten around to trying it properly yet so I guess we'll see. (OTOH I already listen to ebooks via pre-AI TTS, so I'm optimistic)
Doesn't mean the quality is bad. In fact I think Kokoro's quality is amazing.
But it is not the right tool for narration; the kind of training data they use makes the sound too flat, if that makes sense.
- take an ebook in any language
- AI translates it to German
- AI speaks it using the voice of their fav narrator
- a UI showing the text as it is being read
Now they can read Asimov, Kulansky, Bryson, regardless of whether a translation or audio version exists. :)
https://www.theverge.com/2024/5/20/24161253/scarlett-johanss...
That should actually be possible to do already with existing tech. I haven't seen if you can instruct Kokoro to read in a certain way, does anyone know if this is possible?
https://emosphere-tts.github.io/
We are getting there
How the hell was it trained on that little data?
https://k2-fsa-web-assembly-tts-sherpa-onnx-en.static.hf.spa...
The saddest thing is that people will still continue to participate in consuming these AI produced “goods”.
I once heard an American friend with so-so Japanese ability ask a Japanese woman who had recently had a heart operation how her kokoro was doing, and she looked surprised and taken aback.
Side note: After I started reading HN in 2019, I was struck by how many tech products mentioned here have Japanese names. I compiled a list for a few years and eventually posted it:
I'm not sure if that is related here.
I'm checking what the actual quality is (not a cherry-picked example), but:
Started at: 13:20:04
Total characters: 264,081
Total words: 41548
Reading chapter 1 (197,687 characters)...
That was 1h30 ago, and there's no progress notification of any kind, so I'm hoping it will finish sometime. It's using 100% of all available CPUs so it's quite a bother. (This is "A Tale of a Tub" by Swift; it's about half of a typical novel length.)
It did finish and the result is basically as good as the provided example, so I'd say quite good! I'll plan to process some book before going to bed next time!
Chapter 1 read in 6033.30 seconds (33 characters per second)
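A quick sanity check on those numbers, plus a rough estimate for the whole book. This assumes the rate stays constant across chapters, which the log doesn't confirm; the variable names are just for illustration.

```python
chapter_chars = 197_687      # from the log: characters in chapter 1
chapter_seconds = 6033.30    # from the log: time taken for chapter 1
total_chars = 264_081        # from the log: characters in the whole book

rate = chapter_chars / chapter_seconds      # characters per second
est_total_seconds = total_chars / rate      # assumes a constant rate
est_total_minutes = est_total_seconds / 60

print(f"{rate:.1f} chars/s")
print(f"estimated total: {est_total_minutes:.0f} minutes")
```

The rate rounds to the 33 characters per second reported in the log.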
Why does ElevenLabs have such a lead in this space? It sounds better than the OpenAI and Google models.
Guess it was just a matter of time till someone figured out how to use "AI" to resume encouraging illiteracy.
Guess it was just a matter of time till someone figured out how to use "cars" to resume encouraging being unable to do a basic farrier job.
If you haven't observed this in many other markets, you live an unusual (or unobservant) life.
The odd thing is that while they are releasing these great sounding models, they are not documenting the training process. What we want to know is what magic if any allowed them to create such wonderful voices...
But this one works pretty quickly, is easy to install, and has some passable voices. Finally I can start listening to those books that have no audio version.
I'm a slow reader, so don't read many books. If a book doesn't have an audiobook version, chances are I won't read it.
PS, I have used elevenlabs in the past for some small TTS projects, but for a full book, it's price prohibitive for personal use. (elevenlabs has some amazing voices)
Thank you to the dev/s who worked on this!
An example is The Hobbit and The Lord of the Rings: the narrator, Rob Inglis, delivers an amazing voice performance, giving depth to environments and characters. And of course the songs!
Depending on what that means, it might be more accurate to say it was trained on 100 hours of audio and with the aid of another, pre-trained model. The reader who thinks “only 100 hours?!” will know to look at the pretraining requirements of the other model, too.
I am curious, is there an equivalent light model for speech to text, that can run real-time on the MacBook? I'm just playing around with AI models and was looking into this (a fully locally running app that lets you talk to your computer).
Some audiobooks have this and I think it really makes the experience much more engaging.
(Also maybe some background sound effects but not sure about that, some books also have this and it's quite nice too)
It's one step above "normal" text-to-speech solutions, but not much above it. The epub has "Chapter 1" as the title on the page, and a lot of whitespace, and then "This was...." (actual text). The software somehow managed to ignore all the whitespace and read "chapter 1 this was.." as a single sentence, no pauses, no nothing.
Blind? A great tool. Will it replace actual audiobooks? Well.. not yet at least.
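One possible workaround for the run-together title, sketched under the assumption that the TTS engine pauses at sentence-ending punctuation and that you can preprocess the text yourself before synthesis. `join_title_and_body` is a made-up helper, not part of audiblez.

```python
def join_title_and_body(title, body):
    """Join a chapter title and body so the title reads as its own sentence.

    Assumption: the TTS engine pauses at sentence-ending punctuation.
    """
    title = " ".join(title.split())   # collapse stray whitespace in the title
    body = body.lstrip()              # drop the blank space before the text
    if title and title[-1] not in ".!?:":
        title += "."                  # force a sentence break after the title
    return f"{title}\n\n{body}" if title else body

text = join_title_and_body("Chapter 1", "\n\n\n   This was the beginning.")
print(text)
```

Whether the blank line on its own produces a pause depends on the engine; the added period is the part doing the work here.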
... audiblez book.epub -l en-gb -v af_sky
It does not; instead it installs a Python package with a CLI, and to run it you then have to prepend python and load the module like this:
python3 -m audiblez book.epub -l en-gb -v af_sky