Launch HN: RunAnywhere (YC W26) – Faster AI Inference on Apple Silicon

(github.com)

240 points · sanchitmonga22 · 3mo ago · 153 comments

Hi HN, we're Sanchit and Shubham (YC W26). We built a fast inference engine for Apple Silicon. LLMs, speech-to-text, text-to-speech – MetalRT beats llama.cpp, Apple's MLX, Ollama, and sherpa-onnx on every modality we tested. Custom Metal shaders, no framework overhead.

Also, we've open-sourced RCLI, the fastest end-to-end voice AI pipeline on Apple Silicon. Mic to spoken response, entirely on-device. No cloud, no API keys.

To get started:

  brew tap RunanywhereAI/rcli https://github.com/RunanywhereAI/RCLI.git
  brew install rcli
  rcli setup   # downloads ~1 GB of models
  rcli         # interactive mode with push-to-talk

Or:

  curl -fsSL https://raw.githubusercontent.com/RunanywhereAI/RCLI/main/install.sh | bash

The numbers (M4 Max, 64 GB, reproducible via `rcli bench`):

LLM decode – 1.67x faster than llama.cpp, 1.19x faster than Apple MLX (same model files):

- Qwen3-0.6B: 658 tok/s (vs mlx-lm 552, llama.cpp 295)
- Qwen3-4B: 186 tok/s (vs mlx-lm 170, llama.cpp 87)
- LFM2.5-1.2B: 570 tok/s (vs mlx-lm 509, llama.cpp 372)
- Time-to-first-token: 6.6 ms
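As a sanity check, the per-model ratios implied by the three throughput figures above can be computed directly. This is a minimal sketch using only the numbers quoted in this post; the dictionary layout and function name are made up for illustration, and the headline 1.67x and 1.19x multipliers may be aggregated over a different or wider set of runs described in the linked methodology.

```python
# Decode throughput in tokens/second, copied from the figures quoted above.
# The data layout and names are illustrative, not part of RCLI or MetalRT.
throughput = {
    "Qwen3-0.6B": {"metalrt": 658, "mlx-lm": 552, "llama.cpp": 295},
    "Qwen3-4B": {"metalrt": 186, "mlx-lm": 170, "llama.cpp": 87},
    "LFM2.5-1.2B": {"metalrt": 570, "mlx-lm": 509, "llama.cpp": 372},
}


def speedups(results, ours="metalrt"):
    """Return {model: {baseline: ours_tok_s / baseline_tok_s}}."""
    out = {}
    for model, engines in results.items():
        out[model] = {
            name: engines[ours] / tok_s
            for name, tok_s in engines.items()
            if name != ours
        }
    return out


for model, ratios in speedups(throughput).items():
    line = ", ".join(f"{name}: {ratio:.2f}x" for name, ratio in ratios.items())
    print(f"{model}: {line}")
```

Running `rcli bench` on your own hardware and feeding the results into the same function is one way to compare against these figures.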

STT – 70 seconds of audio transcribed in *101 ms*. That's 714x real-time. 4.6x faster than mlx-whisper.
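The real-time factor here is audio duration divided by processing time. A minimal sketch with the rounded figures quoted above: 70 s over 101 ms works out to roughly 693x, so the quoted 714x would correspond to a clip of about 72 s. Presumably the 70-second figure is rounded. The function name is illustrative.

```python
def real_time_factor(audio_seconds, processing_seconds):
    """How many seconds of audio are processed per second of compute."""
    return audio_seconds / processing_seconds


# Figures as quoted in the post (rounded).
rtf_from_rounded = real_time_factor(70.0, 0.101)

# Clip length that would produce the quoted 714x at 101 ms of processing.
implied_audio_seconds = 714 * 0.101

print(f"70 s in 101 ms -> {rtf_from_rounded:.0f}x real-time")
print(f"714x at 101 ms -> {implied_audio_seconds:.1f} s of audio")
```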

TTS – 178 ms synthesis. 2.8x faster than mlx-audio and sherpa-onnx.

We built this because demoing on-device AI is easy but shipping it is brutal. Voice is the hardest test: you're chaining STT, LLM, and TTS sequentially, and if any stage is slow, the user feels it. Most teams fall back to cloud APIs not because local models are bad, but because local inference infrastructure is.

The thing that's hard to solve is latency compounding. In a voice pipeline, you're stacking three models in sequence. If each adds 200ms, you're at 600ms before the user hears a word, and that feels broken. You can't optimize one stage and call it done. Every stage needs to be fast, on one device, with no network round-trip to hide behind.
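The compounding argument is just a sum over sequential stages. The sketch below shows the 3 x 200 ms case from the paragraph above next to the per-stage figures quoted earlier in this post. The second total is only a rough illustration, assuming those figures can be added: they were measured on different inputs, and a real pipeline needs more than the first LLM token before TTS can start. All names are hypothetical.

```python
def time_to_first_audio_ms(stage_latencies_ms):
    """Stages run one after another, so the user waits for their sum."""
    return sum(stage_latencies_ms.values())


# The hypothetical case from the text: each stage adds 200 ms.
naive = {"stt": 200, "llm": 200, "tts": 200}

# Per-stage figures quoted in this post (not a measured end-to-end number).
quoted = {"stt": 101, "llm_first_token": 6.6, "tts": 178}

print(f"naive pipeline:  {time_to_first_audio_ms(naive):.1f} ms")
print(f"quoted figures:  {time_to_first_audio_ms(quoted):.1f} ms")
```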

We went straight to Metal. Custom GPU compute shaders, all memory pre-allocated at init (zero allocations during inference), and one unified engine for all three modalities instead of stitching separate runtimes together.

MetalRT is the first engine to handle all three modalities natively on Apple Silicon. Full methodology:

LLM benchmarks: https://www.runanywhere.ai/blog/metalrt-fastest-llm-decode-e...

Speech benchmarks: https://www.runanywhere.ai/blog/metalrt-speech-fastest-stt-t...

How: Most inference engines add layers between you and the GPU: graph schedulers, runtime dispatchers, memory managers. MetalRT skips all of it. Custom Metal compute shaders for quantized matmul, attention, and activation, compiled ahead of time and dispatched directly.

Voice pipeline optimization details: https://www.runanywhere.ai/blog/fastvoice-on-device-voice-ai...

RAG optimizations: https://www.runanywhere.ai/blog/fastvoice-rag-on-device-retr...

RCLI is the open-source voice pipeline (MIT) built on MetalRT: three concurrent threads with lock-free ring buffers, double-buffered TTS, 38 macOS actions by voice, local RAG (~4 ms over 5K+ chunks), 20 hot-swappable models, and a full-screen TUI with per-op latency readouts. Falls back to llama.cpp when MetalRT isn't installed.
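For readers unfamiliar with the ring buffers mentioned above, here is a minimal sketch of the single-producer, single-consumer index scheme with storage allocated once up front. It is not RCLI's implementation, and it is not lock-free in any meaningful sense: a real version would use atomic loads and stores with explicit memory ordering in C++ or similar, which Python cannot express. The class and method names are hypothetical.

```python
class SpscRingBuffer:
    """Fixed-capacity queue: one writer advances head, one reader advances tail."""

    def __init__(self, capacity):
        # All storage is allocated once here; push/pop never grow the list.
        self._slots = [None] * capacity
        self._capacity = capacity
        self._head = 0  # next slot to write (touched only by the producer)
        self._tail = 0  # next slot to read (touched only by the consumer)

    def try_push(self, item):
        """Return False instead of blocking when the buffer is full."""
        next_head = (self._head + 1) % self._capacity
        if next_head == self._tail:
            return False  # one slot stays empty to tell full from empty
        self._slots[self._head] = item
        self._head = next_head
        return True

    def try_pop(self):
        """Return (True, item), or (False, None) when the buffer is empty."""
        if self._tail == self._head:
            return False, None
        item = self._slots[self._tail]
        self._slots[self._tail] = None
        self._tail = (self._tail + 1) % self._capacity
        return True, item


# Example: an STT stage hands text chunks to an LLM stage.
chunks = SpscRingBuffer(capacity=4)  # holds at most 3 items
for text in ["turn", "on", "the"]:
    chunks.try_push(text)
print(chunks.try_push("lights"))  # buffer is full, so this push is rejected
print(chunks.try_pop())
```

Keeping one slot empty means the full and empty states can be distinguished from the two indices alone, without a shared counter that both threads would have to update.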

Source: https://github.com/RunanywhereAI/RCLI (MIT)

Demo: https://www.youtube.com/watch?v=eTYwkgNoaKg

What would you build if on-device AI were genuinely as fast as cloud?

alfanick · 3mo ago · 13 in thread

I'm not looking for STT->AI->TTS, I'm looking for a truly good voice-to-text experience on Linux (and others). Siri/iOS-Dictation is truly good when it comes to understanding the speech. Something at this level on Linux (and others) would be great. Yes, always listening, maybe sending the data somewhere, but give me the UX: hidden latency, optimizing for first characters recognized, a good (virtual) input device.

coder543 · 3mo ago

> Siri/iOS-Dictation is truly good when it comes to understanding the speech.

What...? It is terrible, even compared to Whisper Tiny, which was released years ago under an Apache 2.0 license so Apple could have adopted it instantly and integrated it into their devices. The bigger Whisper models are far better, and Parakeet TDT V2 (English) / V3 (Multilingual) are quite impressive and very fast.

I have no idea what would make someone say that iOS dictation is good at understanding speech... it is so bad.

For a company that talks so much about accessibility, it is baffling to me that Apple continues to ship such poor quality speech to text with their devices.

derefr · 3mo ago

Maybe they have exactly the accent iOS dictation was trained to recognize.

solarkraft · 3mo ago

Its quality isn’t great, but it is damn fast and that matters a lot! Whisper doesn’t even work live without hacks.

coder543 · 3mo ago

Parakeet is insanely fast and much more accurate, and it doesn't really matter that Whisper requires hacks to work live when those hacks have existed for years and work great. (The Hello Transcribe app on iOS is a great example of how well Whisper can work with live streaming on an iPhone. The smaller models are extremely fast, even with the "hacks".)

Parakeet TDT's architecture is actually a really cool way to boost both the speed and efficiency of real time STT compared to traditional approaches.

fragmede · 3mo ago

Terrible? It's fine. What's your accent that it's terrible? It even pulls last names from my address book and spells them right.

coder543 · 3mo ago

Terrible relative to everything else that exists today. I have a neutral American accent.

Maybe you just don’t know what you’re missing? Google’s default speech to text is still bad compared to Whisper and Parakeet, but even Google’s is markedly better than Apple’s.

I cannot think of a single speech to text system that I’ve run into in the past 5 years that is less accurate than the one Apple ships.

Sure, Apple’s speech to text is incredible compared to what was on the flip phone I had 20 years ago. Terrible is relative. Much better options exist today, and they’re under very permissive licenses. Apple’s refusal to offer a better, more accessible experience to their users is frustrating when they wouldn’t even have to pay a licensing fee to ship something better. Whisper was released under a permissive license nearly 4 years ago.

Apple also restricts third party keyboards to an absurdly tiny amount of memory, so it isn’t even possible to ship a third party keyboard that provides more accurate on-device speech to text without janky workarounds (requiring the user to open the keyboard's own app first each time).

swindmill · 3mo ago

Have you tried https://handy.computer ?

alfanick · 3mo ago

Not bad, checks almost all the boxes I want. A) Good quality, locally run model, and surprisingly fast and working on my CPU. B) It transcribes after the session is finished (i.e. after releasing push-to-talk, or after stopping the listening). C) Ha nice, post-processing. D) Still not solved: truly realtime transcription with latency hiding, i.e. start typing as soon as you recognize sounds (or after some logical pause, such as the end of a sentence). E) Written in Rust, with a web-browser config UI. F) Global shortcuts are super finicky: it doesn't recognize my default "Mic" button, fair enough, let me remap to some unused F24... Doesn't recognize F24 due to a missing keycode.

It's there, doesn't feel native though. Good integration, not great though (Linux Mint/Cinnamon).