It works really well for non-fiction long-form content (i.e. hours of audio).
It’s early days for AudiowaveAI, and I’m looking for feedback to improve the product. Try it out and share your thoughts: [AudiowaveAI](https://audiowaveai.com). Thanks!
There are plenty of blind folks who use traditional text-to-speech for navigating our devices. We prefer the robotic voice at ridiculously high speeds. We're humans too.
I would love the option to switch to a more natural voice for more literary text (or even a fan fic), so I'll definitely be checking this out.
I'm curious if it would be possible to do some kind of analysis to determine the number of individual characters in the text who are speaking, and then assign an appropriate voice to each of them. So if you had something like descriptive language interspersed with a conversation between two characters, you'd have three voices (a narrator, Character A, and Character B) that are consistent across the text.
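As a rough illustration of the first step only (nothing like this is confirmed to exist in the product): a naive regex pass can at least separate narration from quoted dialogue. Actually attributing each quote to the right character would need real coreference analysis or an LLM; here every quote is just tagged "dialogue".

```python
import re

def split_voices(paragraph):
    """Split a paragraph into (voice, text) spans: quoted spans are
    tagged "dialogue", everything else goes to the narrator voice.
    Naive heuristic -- assumes straight double quotes, no nesting."""
    spans, pos = [], 0
    for m in re.finditer(r'"[^"]*"', paragraph):
        narration = paragraph[pos:m.start()].strip()
        if narration:
            spans.append(("narrator", narration))
        spans.append(("dialogue", m.group(0).strip('"')))
        pos = m.end()
    tail = paragraph[pos:].strip()
    if tail:
        spans.append(("narrator", tail))
    return spans
```

For example, `split_voices('She frowned. "Go away," she said. He left.')` yields a narrator span, a dialogue span, then a narrator span — the part this sketch doesn't solve is deciding *which* character said the dialogue span.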
For more complex writing with many characters, you'd probably need a wide library of possible voices, and the analysis piece would need to be spot-on, since it would be very confusing to have one character's lines spoken by the wrong voice.
Regarding fanfics, many authors give (or withhold) permissions around creating derivative versions of their work via avenues like ficbinding. Before using a tool like this to create an audio version of their writing, I'd suggest reaching out to a fic's author to see if they'd be okay with that. For personal-only use, though, and especially if it's in context of accessibility for visually-impaired folks, I imagine that many of them would probably be okay with it.
What I do is split the book up into sentences, generate speech for each sentence, and at the same time turn that sentence into subtitles. Then I combine the two and stitch them all together into an mp4 container with an audio and a subtitle track using ffmpeg. mpv (and I think VLC) can display subtitles synced to audio playback even when there is no video track.
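The timing step above can be sketched roughly like this — a minimal sketch, not the commenter's actual code. It assumes you already know each sentence clip's duration (e.g. by probing the generated file) and that the clips are concatenated back to back:

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def build_srt(sentences_with_durations):
    """Build an SRT track from (sentence, duration_seconds) pairs,
    assuming the audio clips are stitched together in order."""
    cues, t = [], 0.0
    for i, (text, dur) in enumerate(sentences_with_durations, start=1):
        start, end = t, t + dur
        cues.append(f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n")
        t = end
    return "\n".join(cues)

def mux_command(audio_path, srt_path, out_path):
    """ffmpeg invocation muxing the stitched audio plus the SRT into
    one mp4 container (mov_text is the mp4-native subtitle codec)."""
    return ["ffmpeg", "-i", audio_path, "-i", srt_path,
            "-c:a", "copy", "-c:s", "mov_text", out_path]
```

Run the returned command with `subprocess.run(...)`; players like mpv then show each sentence as its audio plays.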
The issue I personally found with traditional TTS is the lack of emotional range and lack of thoughtful pauses. ML models are better at this and at picking up on small cues that are hard to program into a TTS otherwise.
I love that Safari on iPhone has built-in TTS now and was excited to use it. It actually didn't work on MAKE by Pieter Levels after I bought it. So I went to explore other options. After I started listening to AI-generated TTS, I just couldn't go back. It's like 270p vs 2160p (4K).
I've been looking for something that would let me synchronize Librivox recordings with Project Gutenberg epub files, but as much as I love the Librivox volunteers for their contributions, a lot of the recordings are such low audio quality that they're not fun to listen to. This would be a big step up, and there's no copyright worries for this use case because the works are in the public domain!
App is Listenly.io
So similar to my app. But I'm not a real programmer, so of course yours is more refined.
I almost launched the same exact online business.
Here's my version (my github version is a bit less refined than my local code):
I also experimented with a desktop app and tried to run open-source models locally.
Being a "real programmer" actually hurts you, I had a lot of things to unlearn to just ship fast and keep iterating. I was too stuck looking for the "best practices" or for it to be "just right" (code words for perfectionism).
So keep iterating and writing many projects. This is project 12 in 16 weeks. I've been doing this challenge of 52 startups in 52 weeks. It's been tremendously helpful. (more about it: http://52shipped.com)
I just converted my app from monolithic to modular, and switched from PySimpleGUI to Qt6.
I think the second biggest factor that kept me from launching the online business was reading up on all the big online marketplaces banning AI audiobooks, at least for now. So I just use it for myself.
Some of my initial versions even had a button to make the output compliant with part of the ACX requirements using complex ffmpeg commands.
I've also found that using the iZotope or Adobe Audition (best) algorithm to stretch the audio by 8-15% makes the listening experience better for difficult material, since OpenAI's slower speed settings don't sound good. So I tend to use tts-1-hd and a 12% stretch with Adobe Audition.
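For anyone without Audition or iZotope, ffmpeg's `atempo` filter is a rough free alternative for pitch-preserving time stretching — a sketch under that assumption, not a quality match for those tools. Note that making audio 12% *longer* means a tempo factor *below* 1:

```python
def stretch_tempo(stretch_pct):
    """Convert a desired length increase (e.g. 12 for +12% longer)
    into the factor ffmpeg's atempo filter expects. atempo < 1.0
    slows playback without changing pitch; single instances accept
    roughly 0.5-2.0 on older ffmpeg builds (chain filters beyond that)."""
    factor = 1.0 / (1.0 + stretch_pct / 100.0)
    if not (0.5 <= factor <= 2.0):
        raise ValueError("chain multiple atempo filters for this factor")
    return round(factor, 4)

def stretch_command(src, dst, stretch_pct=12):
    """ffmpeg command applying the stretch to the audio stream."""
    return ["ffmpeg", "-i", src,
            "-filter:a", f"atempo={stretch_tempo(stretch_pct)}", dst]
```

So a 12% stretch comes out to `atempo=0.8929`.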
What I would really like is an option to download the whole book as mp3 for offline playback, and different voices for each character.
I found a single file didn't make sense as you lose your place in it easily.
Splitting each chapter into its own file seemed to work well instead.
But you really do want a listening app. Otherwise it gets harder to share and listen to on your phone.
For now, I've created a listening app as a PWA.
The costs will go down in the future, I hope, and there are promising open-source projects coming up. Their quality is still pretty subpar and I have a comparison table I'll share soon on HN.
Otherwise it is Next.js on Vercel with Postgres DB.
Hope it helps
The videogame Final Fantasy XIV has a lot of text. A LOT of text.
Someone has made a plugin to pipe text to external tts services, or a websocket. You talk to characters in game and hear the dialog read by the tts.
https://github.com/karashiiro/TextToTalk
For whatever reason, Amazon Polly only exposes middling-quality voices to the plugin. And I'd rather not have an active AWS account for just this use case.
ElevenLabs is supported by the plugin, but their service isn't really about TTS, and I'd have to pay for the $220/yr tier to unlock further "pay as you go (per character)" usage with a budget of 100,000 characters per month. A bit steep for use in just this one game.
If someone could help plumb AudiowaveAI to this plugin, I'd gladly turn off AWS for this!
A couple of questions:
How do I delete projects?
I must have tapped three times after submitting a Wikipedia article and it created three projects that apparently cannot be deleted.
How do I delete my account?
And for $15 I get credits. How many credits do I get for $15? Is each credit a word converted? 1 credit == 1 word converted to audio?
> How do I delete projects?
Three dots on the side of the project, you can delete it
> I must have tapped three times after submitting a Wikipedia article and it created three projects that apparently cannot be deleted.
> How do I delete my account?
Just email me at support@audiowaveai.com from that email and I'll delete it for you. Still an MVP, no functionality for that yet.
> And for $15 I get credits. How many credits do I get for $15? Is each credit a word converted? 1 credit == 1 word converted to audio?
1 credit = 1 character. You're right, I need to be clearer about it. $15 would give you about 10 hrs of audio, or ~100 articles (5-6 mins each). ElevenLabs would cost you $99 for the same amount of audio.
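Back-of-envelope, those figures hang together if $15 buys 600,000 credits (the character limit mentioned elsewhere in this thread) and narration runs around 150 words per minute — all assumed numbers for illustration, not official ones:

```python
# Assumed: $15 buys 600,000 credits at 1 credit per character, and
# spoken narration runs ~150 words/min at ~6.7 characters per word.
CREDITS_PER_DOLLAR = 600_000 / 15      # 40,000 characters per dollar
CHARS_PER_MINUTE = 150 * 6.7           # ~1,005 characters per minute

def audio_hours(dollars):
    """Estimated hours of audio a given spend buys."""
    chars = dollars * CREDITS_PER_DOLLAR
    return chars / CHARS_PER_MINUTE / 60
```

Under those assumptions `audio_hours(15)` comes out just under 10 hours, consistent with the "about 10 hrs" claim.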
How have you got the costs so low? Also the GCP voices don't have as natural intonation. How did you do that?
I really didn't think there would be a market for this either.
What I'm actually interested in is your pricing model. Why do you have constraints on characters AND articles, versus just characters? Does each conversion have a fixed cost, such that you don't want someone making 10,000 requests a month? Or are the article count and hours of audio just estimates based on the 600,000-character limit?
If it's just an estimate of real usage of the actual 600,000 character limit, then I'd try and word it differently, otherwise I feel like I'm going to be heavily constrained by the platform.
That would be a lot clearer and would avoid any misunderstandings. I just need to make the changes in the app.
Thanks for the great feedback
So you can either copy and paste a markdown file there or upload it. PDF and epub would work too, but they're a bit more finicky.
I'm happy to help; a few authors have contacted me, and I'm releasing an HD quality tier for authors this week too (less static noise).
Ping me at michael@audiowaveai.com
Works best with .epub
The base AI model sounded like Whisper from OpenAI. Did you train the voice yourself, or is it one of the defaults?
I am always curious as to what copyright issues products like this run into. Also, what's the stack like to build something like this?
I would love to be able to run the models on the device, and it will come in the future. The OSS models are not quite there from a quality standpoint yet. But they will surely get there.
There is some promise in MyShell's models on Hugging Face, and I hope to see them keep evolving.