You need an API key - I got one from https://cloud.cerebras.ai/ but I'm not sure if there's a waiting list at the moment - then you can do this:
pipx install llm # or brew install llm or uv tool install llm
llm install llm-cerebras
llm keys set cerebras
# paste key here
Then you can run lightning fast prompts like this:
llm -m cerebras-llama3.1-70b 'an epic tail of a walrus pirate'
Here's a video of that running, it's very speedy: https://static.simonwillison.net/static/2024/cerebras-is-fas...
I just tried that out with the same prompt and it's fast, but not as fast as Cerebras: https://static.simonwillison.net/static/2024/gemini-flash-8b...
For our use case, we may get 1 audio file at a time, we may get 10. Of course queuing them is possible but we decided to prioritize speed & reliability over self hosting.
[1]: https://arxiv.org/pdf/2409.11055v1 [2]: https://lmarena.ai/
At that rate it doesn't matter if the first try results in an unwanted answer; you'll be able to run it once or twice more in quick succession.
I hope their hardware stays relevant as this field continues to evolve
Fast iteration is a killer feature, for sure, but at this time I'd rather focus on quality for it to be worth the effort.
There are LLMs today that are amazing at coding, and when you allow them to iterate (e.g. respond to compiler errors), the quality is pretty impressive. If you can run an LLM 3x faster, you can fit a much bigger feedback loop into the same period of time.
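To make the "respond to compiler errors" loop concrete, here is a minimal sketch. Everything in it is an assumption for illustration: `fake_llm` is a hypothetical stand-in for a real model call, and Python's built-in `compile` plays the role of the compiler. The point is only that each extra round costs one more generation, so a faster model buys more rounds in the same wall-clock time.

```python
from typing import Callable, Optional, Tuple


def check_python(source: str) -> Optional[str]:
    """Return None if the source compiles, otherwise a short error message."""
    try:
        compile(source, "<llm-output>", "exec")
        return None
    except SyntaxError as exc:
        return f"{exc.msg} (line {exc.lineno})"


def iterate_until_valid(
    generate: Callable[[str], str], task: str, max_rounds: int = 3
) -> Tuple[Optional[str], int]:
    """Ask for code, feed compiler errors back, stop once it compiles."""
    prompt = task
    for round_no in range(1, max_rounds + 1):
        candidate = generate(prompt)
        error = check_python(candidate)
        if error is None:
            return candidate, round_no
        # Put the error message into the next prompt so the model can react to it.
        prompt = f"{task}\n\nYour previous attempt failed with: {error}\nPlease fix it."
    return None, max_rounds


def fake_llm(prompt: str) -> str:
    """Hypothetical model: forgets a colon first, fixes it when told about the error."""
    if "previous attempt failed" in prompt:
        return "def add(a, b):\n    return a + b\n"
    return "def add(a, b)\n    return a + b\n"


code, rounds = iterate_until_valid(fake_llm, "Write add(a, b).")
print(rounds)
print(code)
```

With a real model you would swap `fake_llm` for an API call and probably run tests as well as the compile check.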
There are efforts to enable LLMs to "think" by using Chain-of-thought, where the LLM writes out reasoning in a "proof" style list of steps. Sometimes, like with a person, they'd reach a dead-end logic wise. If you can run 3x faster, you can start to run the "thought chain" as more of a "tree" where the logic is critiqued and adapted, and where many different solutions can be tried. This can all happen in parallel (well, each sub-branch).
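A rough sketch of the "chain becomes a tree" idea, written as a plain beam search. This is not any particular paper's or vendor's method; `propose`, `score`, and `is_done` are hypothetical hooks that a real system would back with LLM calls (generate next steps, critique a partial chain, check for a final answer). A toy arithmetic puzzle stands in for the reasoning task.

```python
from typing import Callable, List, Sequence, Tuple

Path = Tuple[str, ...]


def tree_search(
    propose: Callable[[Path], Sequence[str]],
    score: Callable[[Path], float],
    is_done: Callable[[Path], bool],
    beam_width: int = 2,
    max_depth: int = 6,
) -> Path:
    """Expand every kept path, critique the results, keep only the best few."""
    frontier: List[Path] = [()]
    for _ in range(max_depth):
        # Each expansion is independent, so a real system could run these in parallel.
        candidates = [path + (step,) for path in frontier for step in propose(path)]
        if not candidates:
            break
        candidates.sort(key=score, reverse=True)
        frontier = candidates[:beam_width]  # dead ends fall off the beam here
        for path in frontier:
            if is_done(path):
                return path
    return frontier[0]


# Toy task (assumption for illustration): reach TARGET from 1 using "+1" and "*2".
TARGET = 10


def value(path: Path, start: int = 1) -> int:
    total = start
    for step in path:
        total = total + 1 if step == "+1" else total * 2
    return total


result = tree_search(
    propose=lambda path: ["+1", "*2"],
    score=lambda path: -abs(TARGET - value(path)),
    is_done=lambda path: value(path) == TARGET,
)
print(result, value(result))
```

The beam width is where speed shows up: a 3x faster model lets you keep roughly three times as many branches alive for the same latency budget.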
Then there are "agent" use cases, where an LLM has to take actions on its own in response to real-world situations. Speed really impacts user-perception of quality.
But you're assuming that it'll always be validated by humans. I'd imagine that most validation (and subsequent processing, especially going forward) will be done by machines.
python3 examples/basic/chat.py -m cerebras/llama3.1-70b
Specifying the model and setting up basic chat is simple (and there are numerous other examples in the examples folder in the repo):
import langroid.language_models as lm
import langroid as lr
# Select the Cerebras-hosted model by name
llm_config = lm.OpenAIGPTConfig(chat_model="cerebras/llama3.1-70b")
agent = lr.ChatAgent(
    lr.ChatAgentConfig(llm=llm_config, system_message="Be helpful but concise")
)
task = lr.Task(agent)
task.run()
[1] https://github.com/langroid/langroid
[2] https://github.com/langroid/langroid/blob/main/examples/basi...
[3] Guide to using Langroid with non-OpenAI LLM APIs https://langroid.github.io/langroid/tutorials/local-llm-setu...
It is starting to look like you can linearly boost utility by exponentially scaling token usage per query. If so, we might see companies slowing down on scaling parameters and instead focusing on scaling token usage.
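To spell out what "linear utility from exponential tokens" would mean: it is the same as saying utility grows with the logarithm of the token budget. The toy model below is purely an assumption to illustrate the shape of that curve (the `utility` function and its constants are made up, not measured from any model): every doubling of tokens buys the same fixed increment.

```python
import math


def utility(tokens: int, base: float = 0.0, gain_per_doubling: float = 1.0) -> float:
    """Hypothetical log-shaped utility curve: constant gain per doubling of tokens."""
    return base + gain_per_doubling * math.log2(tokens)


budgets = [1_000, 2_000, 4_000, 8_000]
# Gain from each successive doubling of the token budget.
gains = [utility(b) - utility(a) for a, b in zip(budgets, budgets[1:])]
print(gains)
```

Under that assumption, cheaper and faster tokens matter a lot, since each further step of quality costs twice as many tokens as the last.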
And then there are use cases like OpenAI's o1, where most tokens aren't even generated for the benefit of a human, but as input for itself.
The first implementation of inference on the Wafer Scale Engine utilized only a fraction of its peak bandwidth, compute, and IO capacity. Today’s release is the culmination of numerous software, hardware, and ML improvements we made to our stack to greatly improve the utilization and real-world performance of Cerebras Inference.
We’ve re-written or optimized the most critical kernels such as MatMul, reduce/broadcast, element wise ops, and activations. Wafer IO has been streamlined to run asynchronously from compute. This release also implements speculative decoding, a widely used technique that uses a small model and large model in tandem to generate answers faster.
A big question is what they're using as their draft model; there are ways to do it losslessly, but they could also choose to trade off accuracy for a bigger increase in speed.
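For anyone unfamiliar with the technique, here is a toy sketch of the lossless (greedy) variant of speculative decoding. It says nothing about how Cerebras actually implements it. The `target` and `draft` functions are made-up stand-ins for the large and small models, operating on integer "tokens". The draft proposes `k` tokens, the target checks them, and only tokens the target itself would have produced are kept, so the output matches plain greedy decoding with the target.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one call to the model per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(
    target: NextToken, draft: NextToken, prompt: List[int], n_new: int, k: int = 4
) -> Tuple[List[int], int]:
    """Return the generated sequence and the number of target verification passes."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(tokens) < goal:
        # 1. The cheap draft model guesses the next k tokens.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. The target checks all guesses. Real systems do this in one batched
        #    forward pass; this toy version loops but counts it as a single pass.
        target_passes += 1
        accepted: List[int] = []
        for guess in proposal:
            correct = target(tokens + accepted)
            accepted.append(correct)
            if guess != correct:
                break  # first mismatch: keep the target's token, discard the rest
        else:
            accepted.append(target(tokens + accepted))  # all matched: free extra token
        tokens.extend(accepted)
    return tokens[:goal], target_passes


def target(tokens: List[int]) -> int:
    """Toy 'large model' (assumption): a fixed deterministic rule."""
    return (tokens[-1] * 3 + 1) % 7


def draft(tokens: List[int]) -> int:
    """Toy 'small model' (assumption): agrees with the target except at every 4th position."""
    guess = target(tokens)
    return guess if len(tokens) % 4 else (guess + 1) % 7


spec_out, passes = speculative_decode(target, draft, [1], n_new=20)
plain_out = greedy_decode(target, [1], n_new=20)
print(spec_out == plain_out, passes)
```

The lossy option mentioned above would correspond to accepting draft tokens that are merely "close enough" to what the target would have said, which this sketch deliberately does not do.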
It seems they also support only a very short sequence length (1k tokens).
> At 16 RU, and peak sustained system power of 23kW, the CS-3 packs the performance of a room full of servers into a single unit the size of a dorm room mini-fridge.
It's pretty impressive looking hardware.
As far as I know, Nvidia still has a monopoly on the training part.
https://github.com/microsoft/BitNet
"bitnet.cpp achieves speedups of 1.37x to 5.07x on ARM CPUs, with larger models experiencing greater performance gains. Additionally, it reduces energy consumption by 55.4% to 70.0%, further boosting overall efficiency. On x86 CPUs, speedups range from 2.37x to 6.17x with energy reductions between 71.9% to 82.2%. Furthermore, bitnet.cpp can run a 100B BitNet b1.58 model on a single CPU, achieving speeds comparable to human reading (5-7 tokens per second), significantly enhancing the potential for running LLMs on local devices. "
I believe we'll get to hear more interesting things about BitNet in the future.
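Some back-of-the-envelope arithmetic on why a 100B model on a single CPU is plausible at all. This is a rough sketch under stated assumptions: it counts weight storage only (no activations, KV cache, or packing overhead), takes the "1.58" in the model name as bits per weight, and uses 16-bit weights as the comparison point.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Weight storage only, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 100e9  # the 100B model from the quote

fp16_gb = weight_memory_gb(N_PARAMS, 16)       # 16-bit baseline (assumption)
ternary_gb = weight_memory_gb(N_PARAMS, 1.58)  # idealized 1.58 bits per weight

print(f"fp16:    {fp16_gb:.2f} GB")
print(f"ternary: {ternary_gb:.2f} GB")
```

An actual implementation may pack weights less tightly than the idealized 1.58 bits, so treat the second number as a lower bound.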