
0 points · bmodel · 1y ago · 0 comments

We intercept API calls and use our own implementation to forward them to a remote machine. No eBPF (which I believe needs to run in the kernel).
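For readers who want a picture of the general idea, here is a minimal sketch of call interception and forwarding. It is not our actual implementation: the class names, the `add`/`scale` operations, and the JSON wire format are all hypothetical, and the "remote machine" is simulated in-process so the example runs on its own. In a real setup the transport would be a network connection.

```python
import json


class RemoteBackend:
    """Stand-in for the remote machine: decodes a request and runs it."""

    def __init__(self):
        # Hypothetical operations implemented on the remote side.
        self._ops = {
            "add": lambda a, b: a + b,
            "scale": lambda xs, k: [x * k for x in xs],
        }

    def handle(self, payload: bytes) -> bytes:
        request = json.loads(payload)
        result = self._ops[request["op"]](*request["args"])
        return json.dumps({"result": result}).encode()


class InterceptingClient:
    """Looks like a local API, but forwards every call through `transport`."""

    def __init__(self, transport):
        # transport: callable taking request bytes and returning reply bytes.
        self._transport = transport

    def __getattr__(self, op):
        # Called only for names not found normally, i.e. the API calls.
        def call(*args):
            payload = json.dumps({"op": op, "args": args}).encode()
            reply = self._transport(payload)
            return json.loads(reply)["result"]

        return call


backend = RemoteBackend()
api = InterceptingClient(backend.handle)  # swap in a socket-based transport for real use
print(api.add(2, 3))
print(api.scale([1, 2, 3], 10))
```

The caller only ever sees `api.add(...)`; every round trip through `transport` is where the added latency comes from.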

As for latency, we've done a lot of work to minimize it. You can see the performance we get running inference on BERT from Hugging Face here: https://youtu.be/qsOBFQZtsFM?t=64. It's still slower than local (mainly for training workloads), but not by as much as you'd expect. We're aiming to reach near parity in the next few months!


2 comments · 2 top-level

samstave · 1y ago

When you release a self-hosted version, it would be really neat to see it running across HFT-focused NICs that have huge TCP buffers...

https://www.arista.com/assets/data/pdf/HFT/HFTTradingNetwork...

Basically, taking into account the large buffers and the super-time-sensitive nature of HFT networking optimizations, I wonder if your TCP<-->GPU might benefit from both the HW and the learnings of HFT stylings?

ZeroCool2u · 1y ago

Got it. eBPF modules run as part of the kernel, but they're still user space programs.

I would consider using a larger model for demonstrating inference performance, as I have 7B models deployed on CPU at work, but GPU is still important for training BERT-size models.
