One scenario I can think of is roleplaying - but I would assume that the slow streaming speed was kind of a feature there.
70B does significantly better in this regard. Nowhere close to perfection, but the frequency of WTFs about the LLM's output is [subjectively] drastically lower.
Speed can be useful in RP if we run multiple LLM-based agents (like "plot", "goal checker", "inventory", "validation", "narrator") that function-call each other to achieve some goal.
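To make the latency point concrete, here is a minimal sketch of one RP turn passing through those agents in sequence. Everything in it is hypothetical: `fake_llm` is a stub standing in for a real model call, the fixed agent order is just one possible wiring, and the 2 seconds per call is an assumed number, not a measurement.

```python
from typing import Callable, List, Tuple

# Agent roles taken from the comment above; the order is an assumption.
AGENT_ORDER: List[str] = ["plot", "goal checker", "inventory", "validation", "narrator"]


def fake_llm(role: str, prompt: str) -> str:
    """Hypothetical stand-in for a real model call: tags the prompt with the role."""
    return f"[{role}] {prompt}"


def run_turn(user_input: str, llm: Callable[[str, str], str] = fake_llm) -> Tuple[str, int]:
    """Pass one user message through every agent in sequence and count model calls."""
    calls = 0
    message = user_input
    for role in AGENT_ORDER:
        message = llm(role, message)  # each agent sees the previous agent's output
        calls += 1
    return message, calls


reply, n_calls = run_turn("open the chest")
seconds_per_call = 2.0  # assumed latency of one model call
print(f"{n_calls} sequential calls, roughly {n_calls * seconds_per_call:.0f}s per user turn")
```

With sequential calls the per-turn wait grows linearly with the number of agents, which is where faster inference would pay off.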
What exact use case did google.com enable you to do that made it worthwhile for everyone to immediately start using? It let you access nytimes.com? Access amazon.com? No, it let you ask off the wall, asinine, long tail questions no one else asked.
Or maybe an MMO with a town of NPCs.
Current "at home" inference tends to be limited by how much RAM your graphics card has, but system RAM scales better.
I just got llama3.1-8b (standard and instruct). However, I cannot do anything with it on my current hardware. Can you recommend the best AI model that I can: 1) self-host 2) run on 16GB RAM with no dedicated graphics card and an old Intel i5 3) use on Debian without installing a bunch of exo-repo mystery code?
Any recommendation, direct or semi-related, would be appreciated - I'm doing my 'research' but haven't made much progress or had any questions answered.
LM Studio [1] makes it very easy to run models locally and play with them. Llama 3.1 will only run in quantized form with 16GB RAM, and that cripples it quite badly, in my opinion.
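A rough back-of-the-envelope for why quantization is needed, assuming the 8B model asked about earlier in the thread and a nominal 8 billion parameters (the real count is slightly higher). This counts the weights only; the KV cache, activations, and the OS all need memory on top, so treat the numbers as a lower bound.

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


NOMINAL_8B_PARAMS = 8e9  # assumption: round figure for an "8B" model

for bits in (16, 8, 4):
    gib = weight_memory_gib(NOMINAL_8B_PARAMS, bits)
    print(f"{bits:>2}-bit weights: {gib:.1f} GiB")
```

At 16 bits the weights alone come to about 14.9 GiB, which leaves almost nothing of a 16GB machine for everything else; at 4 bits they drop to under 4 GiB.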
You may try Phi-3 Mini, which has only 3.8B parameters and can still do fun things.
The downside of the switching fabrics is that optimizing a design to fit an FPGA can sometimes take days.
I'd expect tokens out at 1 GHz aggregate. Anything less than 1 MHz is a joke... OK, not a joke, but surprisingly slow.
Personally, I'd put all the parameters in NOR flash, then cycle through the row lines sequentially to load the parameters into the MAC. You could load all the inputs in parallel as fast as the dynamic power limits of the chip allow. If you use either DMA or a hardware ring buffer to push all the tokens through the layers, you could keep the throughput going with various sizes of models, etc.
Obviously with only one MAC you couldn't have a single stream at 1 GHz, but you could have 4000 separate streams of 250,000 tokens/second.
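The arithmetic behind that aggregate figure, using only the two numbers from the comment (the variable names are mine):

```python
streams = 4_000                        # separate token streams
tokens_per_stream_per_sec = 250_000    # tokens/second on each stream

aggregate = streams * tokens_per_stream_per_sec
print(f"{aggregate:,} tokens/second aggregate")  # 1,000,000,000, i.e. 1 GHz
```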
I wonder why it doesn't output a billion tokens per second.
Still, the price of one of these would be nuts if they sold them. Upwards of 1 million?
https://x.com/dsa/status/1828481132108873979?s=46&t=uB6padbn...
But I’m happy they got this far. It’s an ambitious vision, and it’s extra competition in a field where it’s severely lacking.
More discussion on official post: https://news.ycombinator.com/item?id=41369705