- https://huggingface.co/spaces/Xenova/whisper-web
- https://huggingface.co/spaces/Xenova/whisper-webgpu
- https://huggingface.co/spaces/Xenova/realtime-whisper-webgpu
- https://huggingface.co/spaces/webml-community/moonshine-web
How can I understand what's in the compiled JS though? Is there some source for that?
Here I'm talking about the model shared in this thread, which is text-to-speech (reading web content out loud)
I made https://app.readaloudto.me/ as a hobby thing and now it could be enhanced with a local tts option!
(I get the joke that, for some definition of real-time, this is real-time.)
The reason why I use an API is because time to first byte is the most important metric in the apps I'm working on.
That aside, kudos for the great work and I'm sure one day the latency on this will be super low as well.
Sounds great on Chrome with an Nvidia GTX 1650 Ti.
Sounds great on Chrome on a Pixel 6.
Sounds like it's being bitcrushed. Maybe a 64-bit vs. 32-bit sample-format error? Solid results when it's working.
Edit: Sorry, it was a problem of my specific audio setup, it works equally well on Chromium.
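On the "64 vs 32 bit" guess above: the Web Audio API expects `Float32Array` samples, so output in another width has to be narrowed before playback. A minimal sketch of that idea (the `toFloat32` helper name and the commented playback wiring are my own, for illustration):

```javascript
// Web Audio's AudioBuffer.copyToChannel() expects Float32Array samples.
// If a model hands back a Float64Array (or a plain number array), narrow
// it first; misinterpreting the sample width tends to produce garbled audio.
function toFloat32(samples) {
  return samples instanceof Float32Array
    ? samples                      // already the right width, no copy needed
    : Float32Array.from(samples);  // narrows 64-bit floats to 32-bit
}

// In a browser, assuming a hypothetical TTS result { audio, sampling_rate }:
// const ctx = new AudioContext({ sampleRate: result.sampling_rate });
// const buf = ctx.createBuffer(1, result.audio.length, result.sampling_rate);
// buf.copyToChannel(toFloat32(result.audio), 0);
```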
Is there source anywhere? It seems the assets/ folder is just bundled JS. In my opinion, there's a ton of opportunity for private, progressive web apps built on this while WebGPU support is still relatively new.
Would love to collaborate in some way if others are also interested in this.
[0] https://github.com/C-Loftus/QuickPiperAudiobook/ [1] https://github.com/rhasspy/piper/issues/352
But, on a more serious note: the story I hear about AMD GPUs is that they are, in fact, shittier because AMD themselves give fewer shits. GIGO.
this is astounding
https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_...
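For context, the built-in browser alternative linked above can be driven with a few lines of JavaScript. A minimal sketch, where `buildUtterance` is a hypothetical helper and the plain-object fallback exists only so the code runs outside a browser:

```javascript
// Browser-native TTS via the Web Speech API (speechSynthesis).
// buildUtterance() is an illustrative helper, not part of the API itself.
function buildUtterance(text, { rate = 1.0, pitch = 1.0, lang = "en-US" } = {}) {
  const u = typeof SpeechSynthesisUtterance !== "undefined"
    ? new SpeechSynthesisUtterance(text) // real browser utterance object
    : { text };                          // stub for non-browser environments
  u.rate = rate;   // 0.1 to 10; 1 is normal speed
  u.pitch = pitch; // 0 to 2; 1 is normal pitch
  u.lang = lang;
  return u;
}

// In a browser:
// speechSynthesis.speak(buildUtterance("Hello from the Web Speech API"));
```

The catch, relative to the models in this thread, is that voice quality depends entirely on what the OS/browser ships.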
Quality sounded good compared to a lot of other small TTS models I've tried.