This must mean Hugging Face's bandwidth bill is crazy, or am I missing something (maybe they have a peering agreement, or cache things heavily)?
Bandwidth has always been crazy cheap.
In fact, locally I can get an unmetered 10 Gbps home internet connection for $300/mo.
I'm not sure how they'd react if I transferred 1 PB/mo though :)
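For what it's worth, 1 PB/mo is within what a 10 Gbps line can physically carry. A rough back-of-the-envelope sketch, assuming a 30-day month, decimal units (1 Gbps = 1e9 bits/s, 1 PB = 1e15 bytes) and no protocol overhead; the function names are made up for illustration:

```python
# Back-of-the-envelope: how much data can a saturated link move in a month?
# Assumptions (not from the thread): 30-day month, decimal units,
# zero protocol overhead, link saturated around the clock.

SECONDS_PER_DAY = 24 * 60 * 60
BITS_PER_BYTE = 8
BYTES_PER_PB = 10**15


def max_petabytes_per_month(link_gbps: float, days: int = 30) -> float:
    """Upper bound on PB transferred if the link is saturated for `days` days."""
    bytes_per_second = link_gbps * 1e9 / BITS_PER_BYTE
    return bytes_per_second * days * SECONDS_PER_DAY / BYTES_PER_PB


def utilization_for(target_pb: float, link_gbps: float, days: int = 30) -> float:
    """Fraction of the link that must stay busy to move `target_pb` in `days` days."""
    return target_pb / max_petabytes_per_month(link_gbps, days)


if __name__ == "__main__":
    print(f"10 Gbps ceiling: {max_petabytes_per_month(10):.2f} PB/month")
    print(f"1 PB/month needs {utilization_for(1, 10):.0%} average utilization")
```

Under those assumptions the ceiling is a bit over 3 PB/mo, so 1 PB/mo means keeping the line roughly a third full all month. Whether the ISP tolerates that is a separate question.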
1. AWS is far behind Azure and GCP in AI, so they gotta partner up to gain credibility.
2. Hugging Face probably does face insane bills compared to GitHub. But AWS can probably develop some optimizations to save bandwidth costs. There's 100% some sort of generalized differential storage method being developed for AI models.
Or better yet, how about asking me where I want to store my models?
Unlike musicians who can't be replaced without significant postprocessing, have enough money to not be impacted by competition, and have legal muscle, voice over artists:
- Can be reproduced with good-enough results from out-of-the-box voice cloning settings on ElevenLabs or an open source equivalent (Bark, VALL-E X)
- Are already underpaid for their work as-is
- Have no legal ownership of their voice since they are contractors, and their voicework is owned by their clients, who may not be as incentivised to protect the VO.
I want to write a blog post about it but I suspect most people on Hacker News won't be interested in a treatise on the cultural impacts of the voicework in Persona 5 and Genshin Impact.
The existing voice actors will just be out of work. There will be a small cadre of groups that want real voice. But for some projects that will not be that important.
It's going to get crazy.
Voice cloning is a special case, these models are equally good at making new voices.
Pick your book, pick your reader and away it goes. The Diary of Anne Frank read by Gilbert Gottfried.
I'd like to read it, in any case.
I'm not sure how to feel about that. I'm against the idea that some people "deserve" to be paid for being lucky enough to be born with an interesting voice.
On the other hand, the world has always worked like that. And, say, a hard-working farmer or doctor was also lucky to be born with the traits needed to make a living, while others weren't.
Singers didn't want software clones, but voice actors are fair game.
AI voice is cheaper, but it's also a more boring and generic performance. There is zero progress made towards any sort of creative AI that produces good unique work.
The market for this then is small businesses who can't afford a professional voice actor. AI is opening up new markets, not killing the jobs of the truly talented.
The real product would have a real voice over actor paid for with VC money.
It's only been a year. Give it some time and I'm sure AI will have much better results. Right now, you can get some of that unique work by finetuning the AI off of a person's existing portfolio.
Most likely you'd see a lot of people saying that somehow getting rid of voice actors is good for "progress". Whatever that means.
Random aside: someone really needs to make a Hacker News that focuses more on game development and other arts, so blog posts like the one you're talking about would have a proper community to discuss them.
* Create dynamic new voice lines at runtime, for example game characters reacting to new situations.
* Operate at a scale that's infeasible for humans, for example turning every ebook into an audiobook.
The work product produced by their voice for fulfilling the contract is owned. No corp owns someone else's voice.
It's one reason why VAs rarely take fan requests for a character they voice.
I am constantly amazed at how the new AI tech can be used.
Of course this would be illegal under most countries' copyright laws.
While Weird Al himself asks for permission, it's well established that parody is not copyright infringement. There should be room for parody performances by AI voices as well, especially if argued by a good lawyer.
Enjoy: https://youtu.be/gmNSFqyg_Z8
I'm curious though if some AI soon could in fact synthesize the Beach Boys' style with the actual chords and melody from the NIN song, possibly with some of the pathos of Johnny Cash as well.
The one that always comes to mind for me is this video of an Eminem interview done from scratch as a Talking Heads song: https://www.youtube.com/watch?v=Kfl3N9nesRg
This is potentially something that generative AI could be good at doing (at least recreating vocals), but this parody of the Talking Heads required a lot of very clever insight into what made a good Talking Heads song and returned a convincing and novel melody. And I think we are still a ways off.
It's always more fun when it's a real group of talented people being silly, but I'd listen to an album of weird mashups like this for sure.
Imagine half a million people out in the streets together. You’d definitely notice that. Meanwhile, we can have these massive online communities and you’d never know unless you accidentally stumbled across it or someone told you about it.
In the streets, sure. Meeting up at out of town conference centers a few times a year, probably not. Most real communities have always been "dark matter" to those outside them; Discord working the same way feels more authentic than most of the internet.
There was a generation that preferred mailing lists. There was a generation that preferred IRC and BBS, and "my" generation which likes forums and lengthy comment threads. One would be naive to think this style (the one we're engaging in here) would last forever.
There are definitely very real criticisms of Discord, searchability and discoverability being the most common, but at this point I think the die has been cast. Young people have made their choice.
These big teleconference apps are usually hit or miss, but Discord seems to be the current winner for actual "social networking", especially once you add in its popularity in the gaming community.
That being said, Discord does have some advantages over older forum-type communities - it's usually way better for cultivating smaller communities, and its no-effort-required chat systems means that you can always hop on and discuss things that are on the cutting edge. This is quite important in a field like AI, where it feels like something revolutionary happens every other week.
(Also, I don't know if that implication was intentional, but gen Z and "underaged" haven't meant the same thing for many years now)
Like, there's a whole lot of "classic song done by presently popular rapper," and I'll be the first to insist that there is nearly nothing vocally interesting at all coming from today's popular hip-hop artists (and I say this as an extreme long-time hip-hop aficionado).
https://arstechnica.com/information-technology/2022/09/james...
Watch Light My Fire on YouTube Music https://music.youtube.com/watch?v=lN3v3EfA6_A&si=_hcG3Wjakxd...
I can't figure out if this is an example of Godwin's Law or not.
I doubt it's currently actually "the best open source text to speech", but the answer I came up with when throwing a couple of hours at the problem some months ago was "ttsprech" [3].
Following the guide, it was pretty trivial to make the model render my sample text in about 100 English "voices" (many of which were similar to each other, and in varying quality). Sampling those, I got about 10 that were pretty "good". And maybe 6 that were the "best ones" (very natural, not annoying to listen to, actually sounded like a person by and large), and maybe 2 made the top (as in, a tossup for the most listenable, all factors considered).
IIRC, the license was free for noncommercial use only. I'm not sure exactly "how open source" they are, but it was simple to install the dependencies and write the basic Python to try it out; I had to write a for loop to try all the voices like I wanted. I ended up using something else for the project for other reasons, but this could still be a fairly good backup option for some use cases, IMO.
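The "for loop over all the voices" part looks roughly like the sketch below. To be clear, this is not the API of any particular library: `synthesize` is a hypothetical callable standing in for whatever the TTS package actually exposes, and the voice names and output layout are made up.

```python
from pathlib import Path
from typing import Callable, Iterable, List

# Hypothetical signature: (text, voice_name) -> WAV bytes.
# Wrap the real TTS call of your library in a function of this shape.
Synthesize = Callable[[str, str], bytes]


def render_all_voices(
    text: str,
    voices: Iterable[str],
    synthesize: Synthesize,
    out_dir: Path,
) -> List[Path]:
    """Render `text` once per voice and write one WAV file per voice."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    for voice in voices:
        try:
            audio = synthesize(text, voice)
        except Exception as exc:
            # One broken voice should not abort the whole comparison run.
            print(f"skipping {voice}: {exc}")
            continue
        path = out_dir / f"{voice}.wav"
        path.write_bytes(audio)
        written.append(path)
    return written
```

With one file per voice on disk, shortlisting is just a matter of listening through the folder and deleting the ones that grate.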
PRE-EDIT, ERRONEOUS ANSWER
Same as above, but I had said "Silero" [0, 1, 2] originally, which I started trying out too, before switching to a third (less open) option.
[0] https://github.com/snakers4/silero-models#text-to-speech
[1] https://silero.ai
[2] https://github.com/snakers4/silero-models#standalone-use
[3] https://github.com/Grumbel/ttsprech#usage
I'm still awaiting a StyleTTS2 implementation. The audio samples sound top notch: https://styletts2.github.io/
Looks promising, I'm going to check it out too! MIT license, even! If it's fast enough for real time, it could be the new best option. The paper claims faster inference than VITS...
How many audio books is 40 hours?
Also, while its voice cloning was truly amazing, every once in a while the voice would get a little nutty and sound like an insect just flew down their throat, or maybe they had an LSD flashback. Normal normal normal then it's some Bobcat Goldthwait skit. And if you dialed down that parameter (I think it's called stability?) then it goes monotone really quickly.
We're probably several years out from it being something people use personally for audio books.
All of these AI as a Service (AaaS?) API companies are going to race each other to razor thin margins. Immediately after ElevenLabs raised, five other TTS services raised nearly the same amount of money.
Are you reading War & Peace or Cat In The Hat?
I doubt they're better than Google's TTS though.
https://github.com/suno-ai/bark Demo at https://huggingface.co/spaces/suno/bark
In the couple of samples I tried, it was substantially better at picking up meaning compared to VALL-E X.
I haven't re-evaluated OSS TTS options for a few months but from my own experience earlier in the year I've been pleased with the results I've gotten from Piper:
* https://github.com/rhasspy/piper
I've primarily used it with the LibriTTS-based voices due to their license but if it's for personal local use you can probably use some of the other even higher quality voices.
The official samples are here: https://rhasspy.github.io/piper-samples/
Here's a small number of pre-rendered samples I've used that were generated from a WIP Piper port of my Dialogue Tool[0] project: https://rancidbacon.gitlab.io/piper-tts-demos/
While it's not perfect & output quality varies for a number of reasons, I've been using it because it's MIT licensed & there are multiple diverse voice options with licenses that suit my purposes.
(Piper and its predecessors Larynx & Mimic3 are significantly ahead of where other FLOSS options had been up until their existence in terms of quality.)
[0] https://rancidbacon.itch.io/dialogue-tool-for-larynx-text-to...
----
Edit to add links to some of my notes related to FLOSS TTS, in case they're of interest:
* https://gitlab.com/RancidBacon/notes_public/-/blob/main/note...
* https://gitlab.com/RancidBacon/notes_public/-/blob/main/note...
* https://gitlab.com/RancidBacon/notes_public/-/blob/main/note...
No, it sounds like someone doing an impression of Weird Al doing an impression of Michael Jackson. Someone whose mom told them they were special and they believed it.
These examples are standing on a ridge line, surveying the uncanny valley and looking for the best way to cross.
I have an accent. If not for that, I'd be a great presenter.
If I could translate my voice into a poor Neil deGrasse Tyson, a poor Patrick Stewart, a poor Carl Sagan, a poor Morgan Freeman, etc., my presentations would be... better.
This isn't autotune for the spoken word, though. It's not fixing pacing or vocabulary, and in the audio above it isn't even fixing intonation. Yes, a thick German accent will give you away as being of German extraction. But you're also using the word 'since' when Brits and Americans would use 'for', and it's not going to fix that. Any more than it'll fix my French when I make the exact same mistake going the other direction (for=duration vs for=purpose vs for=interval). If I hear 'since one month' you're likely German or Indian. If you ask how long I've been in Marseille you'll know I'm American in about half that time.
> No current artificial intelligence is powerful enough to hide the weirdness of Weird Al.