Here is a quick(ish) screen recording explaining how it works: https://youtu.be/ZpY6SIkBosE
Best results when training a custom router on your own prompt data: https://youtu.be/9JYqNbIEac0
The router balances user preferences for quality, speed and cost. The end result is higher quality and faster LLM responses at lower cost.
The quality for each candidate LLM is predicted ahead of time using a neural scoring function, which is a BERT-like architecture conditioned on the prompt and a latent representation of the LLM being scored. The different LLMs are queried across the batch dimension, with the neural scoring architecture taking a single latent representation of the LLM as input per forward pass. This makes the scoring function very modular to query for different LLM combinations. It is trained in a supervised manner on several open LLM datasets, using GPT4 as a judge. The cost and speed data is taken from our live benchmarks, updated every few hours across all continents. The final "loss function" is a linear combination of quality, cost, inter-token-latency and time-to-first-token, with the user effectively scaling the weighting factors of this linear combination.
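To make the linear combination concrete, here is a minimal Python sketch. The names (`Candidate`, `score`, `route`), the example numbers and the sign convention (quality rewarded, cost and latency penalized) are assumptions for illustration, not the actual router implementation. In practice each term would also need scaling to comparable units.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str       # hypothetical model identifier
    quality: float  # predicted quality for this prompt, higher is better
    cost: float     # e.g. dollars per 1M tokens, lower is better
    itl: float      # inter-token latency in ms, lower is better
    ttft: float     # time to first token in ms, lower is better


def score(c, w_quality, w_cost, w_itl, w_ttft):
    # Linear combination: the user-chosen weights trade quality against cost and speed.
    return w_quality * c.quality - w_cost * c.cost - w_itl * c.itl - w_ttft * c.ttft


def route(candidates, **weights):
    # Pick the candidate with the highest combined score.
    return max(candidates, key=lambda c: score(c, **weights))


# Made-up numbers, purely for illustration.
candidates = [
    Candidate("big-model", quality=0.9, cost=30.0, itl=30.0, ttft=800.0),
    Candidate("small-model", quality=0.7, cost=0.5, itl=8.0, ttft=200.0),
]

quality_first = route(candidates, w_quality=1.0, w_cost=0.0, w_itl=0.0, w_ttft=0.0)
cost_aware = route(candidates, w_quality=1.0, w_cost=0.01, w_itl=0.0, w_ttft=0.0)
print(quality_first.name, cost_aware.name)
```

With only quality weighted, the larger model wins; adding even a small cost weight flips the choice to the smaller model for this prompt.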
Smaller LLMs are often good enough for simple prompts, but knowing exactly how and when they might break is difficult. Simple perturbations of the phrasing can cause smaller LLMs to fail catastrophically, making them hard to rely on. For example, Gemma-7B converts numbers to strings and returns the "largest" string when asking for the "largest" number in a set, but works fine when asking for the "highest" or "maximum".
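The "largest string" behavior is analogous to the difference between numeric and lexicographic comparison. The snippet below only illustrates that difference in Python; it says nothing about how Gemma-7B works internally.

```python
numbers = [9, 10, 87, 100]

# Numeric comparison: the largest number.
largest_as_numbers = max(numbers)

# Lexicographic comparison: strings are compared character by character,
# so "9" sorts after "100" because "9" > "1".
largest_as_strings = max(str(n) for n in numbers)

print(largest_as_numbers)  # 100
print(largest_as_strings)  # 9
```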
The router is able to learn these quirky distributions, and ensure that the smaller, cheaper and faster LLMs are only used when there is high confidence that they will get the answer correct.
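One way to picture "only used when there is high confidence" is a simple threshold rule. This is a sketch under assumptions: the function name, the threshold value and the example confidences are hypothetical, and the real router optimizes a weighted objective rather than a hard cutoff.

```python
def pick_model(confidence, cost, threshold=0.9):
    """Pick the cheapest model predicted to be correct with enough confidence.

    confidence: model name -> predicted probability of a correct answer
    cost:       model name -> price (any consistent unit)
    """
    confident = [m for m in confidence if confidence[m] >= threshold]
    if confident:
        # Several models are trusted on this prompt: take the cheapest.
        return min(confident, key=lambda m: cost[m])
    # No model clears the bar: fall back to the most confident one.
    return max(confidence, key=lambda m: confidence[m])


cost = {"small": 0.2, "large": 30.0}

easy_prompt = pick_model({"small": 0.95, "large": 0.99}, cost)
hard_prompt = pick_model({"small": 0.40, "large": 0.93}, cost)
print(easy_prompt, hard_prompt)
```

The small model is chosen for the easy prompt, and the large one only when the small model's predicted confidence drops below the threshold.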
Pricing-wise, we charge the same rates as the backend providers we route to, without taking any margins. We also give $50 in free credits to all new signups.
The router can be used off-the-shelf, or it can be trained directly on your own data for improved performance.
What do people think? Could this be useful?
Feedback of all kinds is welcome!
What's your plan for making money? Are you planning to eventually take a margin? Negotiate discounts with your backend providers? Mine the data flowing through your system?
If I start using you now you’ll either disappear in the future or you’ll suddenly start charging more, neither of which I like.
I’m already paying for inference, a little amount on top of that for the convenience of a single API is pretty useful.
We spend $30k+ per month with OpenAI and Anthropic
Even minor prompt changes between minor model versions can have huge differences in output
We make specific tweaks to all our dozens of production prompts based on the exact model that will be used
Treating LLMs as if they are interchangeable is simply bogus
My hunch is that quite soon, LLMs will be totally interchangeable. Due to the intense competition and the fact that people are basically training on the same base data distribution.
In the tasks I use LLMs for, switching one for another makes even less difference than I had predicted.
However, I'm not spending $30k+ per month, so I guess my opinion may be less informed.
What is your use case? Could these micro-optimizations you need to do now be the result of the technology still being quite immature?
I'm working with digital twins of musicians/celebrities. But have also done more analytical stuff with LLMs.
My current side project involves working with the production company of a well-known German soap opera to help them write further episodes. The first thing we did was write a small evaluation system. Could be interesting to test with Unify.
Our prompts get very complex and we have around 3,000 lines of code that does nothing but build prompts based on user's options (using dozens of helpers in other files)
They aren't going to get more interchangeable because of the subjective nature of them
Give five humans the same task and even if that task is just a little bit complex you'll get wildly different results. And the more complex it gets the more different the results will become
It's the same with LLMs. Most of our prompt changes are more significant, but in one recent case it was as simple as changing the word "should" to "must" to get similar behavior between two different models.
One of them basically ignored things we said it "should" do and never performed the thing we wanted it to whereas the other did it often, despite these being minor version differences of the same model
On our roadmap, we plan to support:
- an API which returns the neural scores directly, enabling model selection and model-specific prompts to all be handled on the client side
- automatic learning of intermediate prompts for agentic multi-step systems, taking a similar view as DSPy, where all intermediate LLM calls and prompts are treated as latent variables in an optimizable end-to-end agentic system.
With these additions, the subtleties of the model + prompt relationship would be better respected.
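As a rough picture of what client-side selection with model-specific prompts could look like, here is a sketch. The score API is still on the roadmap, so the `scores` dictionary, the model names and the templates are all hypothetical stand-ins; the "should"/"must" variants just echo the example raised earlier in the thread.

```python
# Hypothetical response from a score-returning API: model name -> predicted quality.
scores = {"model-a": 0.82, "model-b": 0.91}

# Model-specific prompt templates maintained on the client side.
PROMPTS = {
    "model-a": "You must answer in one sentence.\n\n{question}",
    "model-b": "You should answer in one sentence.\n\n{question}",
}


def select(scores, prompts, question):
    # Only consider models for which a tuned prompt exists.
    eligible = {m: s for m, s in scores.items() if m in prompts}
    if not eligible:
        raise ValueError("no scored model has a prompt template")
    best = max(eligible, key=eligible.get)
    return best, prompts[best].format(question=question)


model, prompt = select(scores, PROMPTS, "What is a router?")
print(model)
print(prompt)
```

The client keeps full control of the prompts and only uses the scores to decide which model-and-prompt pair to send.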
I also believe that LLMs will become more robust to prompt subtleties over time. Also, some tasks are less sensitive to these minor subtleties you refer to.
For example, if you have a sales call agent, you might want to optimize UX for easy dialogue prompts (so the person on the other end isn't left waiting), and take longer thinking about harder prompts requiring the full context of the call.
This is just an example, but my point is that not all LLM applications are the same. Some might be super sensitive to prompt subtleties, others might not be.
Thoughts?
It's already hard enough to get consistent behavior with a fixed model
If we need to save money we will switch to a cheaper model and adapt our prompts for that
If we are going more for quality we'll use a more expensive model and adapt our prompts for that
I fail to see any use case where I would want a third party choosing which model we are using at run time...
We are adding a new model this week and I've spent dozens of hours personally evaluating output and making tweaks to make it feasible
Making it sound like models are interchangeable is harmful
The pattern I often see is companies prototyping on the most expensive models, then testing smaller/faster/cheaper models to determine what is actually required for production. For which contexts and products do you foresee your approach being superior?
Given you're just passing along inference costs from backend providers and aren't taking margin, what's your long-term plan for profitability?
We generally see the router being useful when the LLM application is being scaled, and cost and speed start to matter a lot. However, in some cases the output quality actually improved, as we're able to squeeze the best out of GPT4, Claude etc.
Long-term plan for profitability would come from some future version of the router, where we save the user time and money, and then charge some overhead for the router, but with the user still paying less than they would be with a single endpoint. Hopefully that makes sense?
Happy to answer any other questions!
I always thought a product like this that could empirically decrease costs for the same performance or increase performance for a small increase in cost would have a fairly simple road of justifying its existence.
That said, while I've really enjoyed the LLM abstraction (making it easy for me to test different models without changing my code), I haven't felt any desire for a router. I _do_ have some prompts that I send to gpt-3.5-turbo, and could potentially use other models, but it's kind of niche.
In part this is because I try to do as much in a single prompt as I can, meaning I want to use a model that's able to handle the hardest parts of the prompt and then the easy parts come along with. As a result there's not many "easy" prompts. The easy prompts are usually text fixup and routing.
My "routing" prompts are at a different level of abstraction, usually routing some input or activity to one of several prompts (each of which has its own context, and the sum of all contexts across those prompts is too large, hence the routing). I don't know if there's some meaningful crossover between these two routing concepts.
Another issue I have with LLM portability is the use of tools/functions/structured output. Opus and Gemini Pro 1.5 have kind of implemented this OK, but until recently GPT was the only halfway decent implementation of this. This seems to be an "advanced" feature, yet it's also a feature I use even more with smaller prompts, as those small prompts are often inside some larger algorithm and I don't want the fuss of text parsing and exceptions from ad hoc output.
But in the end I'm not price sensitive in my work, so I always come back to the newest GPT model. If I make a switch to Opus it definitely won't be to save money! And I'm probably not going to want to fiddle, but instead make a thoughtful choice and switch the default model in my code.
We are very committed to the proxy :)
Although, to your point, we have seen less market pull for routing, and more for (a) supporting the latest LLMs, (b) basic translation (e.g. tool call API b/w Anthropic & OpenAI), and (c) solid infra features like caching/load balancing api keys/secret management. So that's our focus.
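For readers unfamiliar with the tool-call translation mentioned in (b), here is a minimal sketch of converting tool definitions between the two formats. It assumes the publicly documented shapes (OpenAI nests the JSON schema under `function.parameters`, Anthropic uses a flat object with `input_schema`) and covers only definitions, not tool-call results or streaming. The function names are made up for illustration and are not any proxy's actual API.

```python
def openai_tool_to_anthropic(tool):
    # OpenAI: {"type": "function", "function": {"name", "description", "parameters"}}
    fn = tool["function"]
    return {
        "name": fn["name"],
        "description": fn.get("description", ""),
        # Default to an empty object schema when no parameters are declared.
        "input_schema": fn.get("parameters", {"type": "object", "properties": {}}),
    }


def anthropic_tool_to_openai(tool):
    # Anthropic: {"name", "description", "input_schema"}
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool.get("description", ""),
            "parameters": tool["input_schema"],
        },
    }


weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

converted = openai_tool_to_anthropic(weather_tool)
round_trip = anthropic_tool_to_openai(converted)
print(converted["name"], round_trip == weather_tool)
```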
However, for several use cases speed is really paramount, and can directly hinder the UX. Examples include sales call agents, copilots, auto-complete engines etc. These are some of the areas where we've seen the router really shine, diverting to slow models when absolutely necessary on complex prompts, but using fast models as often as possible to minimize disruption to the UX.
Having said that, another major benefit of the platform is the ability to quickly run objective benchmarks for quality, cost and speed across all models and providers, on your own prompts [https://youtu.be/PO4r6ek8U6M]. We have some users who run benchmarks regularly for different checkpoints of their fine-tuned model, comparing against all other custom fine-tuned models, as well as the various foundation models.
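For anyone wanting to sanity-check speed numbers on their own prompts, the two latency metrics can be computed from token arrival timestamps. This is a sketch using the common definitions (time-to-first-token is the delay from request to first token; inter-token latency is the mean gap between consecutive tokens). The function name and the example timestamps are made up, and the platform's benchmarks may measure these differently.

```python
def latency_metrics(request_time, token_times):
    """Return (time_to_first_token, mean_inter_token_latency) in seconds.

    request_time: timestamp at which the request was sent
    token_times:  timestamps at which each streamed token arrived, in order
    """
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0] - request_time
    # Gaps between consecutive tokens; a single-token response has none.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, itl


# Made-up timestamps: first token after 0.5 s, then one token every 0.25 s.
ttft, itl = latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25])
print(ttft, itl)
```

In a real measurement the timestamps would come from something like `time.perf_counter()` recorded as each chunk of a streaming response arrives.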
As for the overlap in routing concepts you mentioned, I've thought a lot about this actually. It's our intention to broaden the kinds of routing we're able to handle, where we assume all control flow decisions (routing) and intermediate prompts are latent variables (DSPy perspective). In the immediate future there is no crossover though.
I agree cost is often an afterthought. Generally our users either care about improving speed, or they want to know which model or combination of models would be best for their task in terms of output quality (GPT-4, Opus, Gemini? etc.). This is not trivial to gauge without performing benchmarks.
As for usually wanting to make a full LLM switch as opposed to routing, what's the primary motivation? Avoiding extra complexity + dependencies in the stack? Perhaps worrying about model-specific prompts no longer working well with a new model? The general loss of control?
Does this mean GPT4 predictions are used as labels? Is that allowed?
The chatbot providers will enjoy seeing how they stack up against each other. Your data/analytics/alerts will be valuable to them. Revenue opportunity there.
I think about comparison data companies like https://www.ipsos.com when I see your product.
I'm not sure what the SEO equivalent would be here...
(Least Hallucinated Response)
The question: "You are on the shore of a river with a goat and a boat. How do you get to the other side?"
I expect it to tell me "Use the boat to get to the other side"
Instead it overfits to the solution of the logic puzzle and throws actual logic out the window.
GPT-4-Turbo
To get to the other side of the river with a goat and a boat, you can follow these steps, assuming the boat can only carry you and one other item (the goat or something else if you had more items) at a time:
Take the goat across the river: Start by taking the goat in the boat with you and row across the river. Leave the goat on the other side.
Return alone: Row back across the river alone, leaving the goat on the far side.
Cross again if needed: If you had more items to bring across (like a wolf and cabbage in the classic puzzle), you'd now take another item across, but since it's just you and the goat, you're done. You and the goat are both on the other side of the river.
Since the scenario only mentions a goat and a boat, and no other items or animals that could complicate the process (like needing to keep the goat away from a cabbage), the task is straightforward: just take the goat across the river with you in the boat.
llama-3b (this one is hilarious actually) What a clever question!
Since I have a goat and a boat, I can use the goat to help me cross the river. Here's what I'd do:
First, I'd tie the goat to the boat using a rope or a long piece of twine.
Then, I'd push the boat into the river and let it drift away from the shore.
As the boat floats away, the goat would naturally try to follow the boat, since it's tied to it.
Once the goat is in the middle of the river, I'd swim out to the boat and untie the goat.
With the goat now free, I'd climb back into the boat and row or paddle it to the other side of the river.
Once I reach the other side, I'd tie the goat to a tree or a rock to keep it safe and secure.
Voilà! I've successfully crossed the river with the help of my trusty goat and boat!
Claude-3-Haiku To get the goat, yourself, and the boat to the other side of the river, you can follow these steps:
Take the goat across the river in the boat.
Leave the goat on the other side and come back in the empty boat.
Take yourself across the river in the boat.
Now that you're on the other side, you can bring the boat back to the original side and get the goat.
This way, you can transport the goat to the other side of the river without leaving it alone with the boat, which could lead to the goat getting lost or the boat drifting away.
The key is to make multiple trips, transporting one item at a time, to ensure that all the necessary elements (you, the goat, and the boat) end up on the other side of the river safely.
Mistral-7b-Instruct-V0.2 (Mistral-Ai) You leave the goat on the shore, go across the river in the boat, and then return to get the goat. Since you can't leave the goat unattended and the boat is too small to carry both the goat and yourself at the same time, this is the only possible solution. To cross the river with the goat using the boat, you would follow these steps:
1. *Get into the boat with the goat.* 2. *Row to the other side of the river.* 3. *Disembark with the goat.*
Now, both you and the goat are on the other side of the river.
People use the same model / server for all queries not because it's sensible, but because it's simple. This brings the same simplicity to the far more optimal solution.
And great startup play too, by definition no incumbent can fill this role.
That is, how do you quantify "better"?
Imagine I asked: Write me a sad poem about "Yorick" in Hamlet?
You have to predict which LLM is going to produce the "best" poem, but by what criterion do you evaluate that?
https://python.langchain.com/v0.1/docs/use_cases/query_analy...
https://docs.llamaindex.ai/en/stable/examples/query_engine/R...
> "Input your Unify APhttps://github.com/Anteemony/RAG"
Your product looks good in my view, although I have only spent about 10 minutes on it so far. The docs look pretty easy to follow.
I'll probably give this a try soon!
Having benchmarks (I assume this is ongoing?) would make it even more interesting, as I wouldn't have to manually manage the routing each time I add a new model.
You mentioned you don't have a margin right now, but how about keeping logs, the dashboard for cost, and benchmarking?
If the user is having a dialog with the system, they start out with a trivial question, and are routed to a simple LLM, are they stuck there? Or can the conversation be transitioned to something more powerful?
Something like this will be needed to direct traffic as systems specialize more.
I don’t know if choosing different models for the same consumer can be problematic (seen as not consistent), but maybe using this approach will force the post-processing code not to be “coupled” with one particular model.
Would love to see web access and RAG (LlamaIndex) integration. Are they on the roadmap?
"I have expensive taste, please use the most expensive model."
How can you retrieve the latent representation of the candidate LLMs? Some models do not have open weights (such as GPT-4), which means AFAIK it is impossible to directly access the hidden latent space through their API.
Am I missing something?
In contrast, our router sits at a higher level of the stack, sending prompts to different models and providers based on quality on the prompt distribution, speed and cost. Happy to clarify further if helpful!
Best email would be: daniel.lenton@unify.ai
Cheers!
This is sort of how Mixture-of-Experts models work, actually.
For those who don't want to always route, another core benefit of our platform is simple custom benchmarking on your task across all existing providers: https://youtu.be/PO4r6ek8U6M
If you then just want to use that provider rather than a router config, then that's fine!
Some of the main differences would be:
- we focus on performance based routing, optimizing speed, cost and quality [https://youtu.be/ZpY6SIkBosE]
- we enable custom benchmarks on custom prompts, across all models + providers [https://youtu.be/PO4r6ek8U6M]
- we enable custom routers to be trained on custom data [https://youtu.be/9JYqNbIEac0]
Our users often already have LLM apps deployed, and are then looking to take better control of their performance profile, perhaps increasing speed to improve user experience, or improving response quality via clear benchmarking across all models and providers on their particular prompts.
So they are similar, but solving slightly different problems I'd say.
I have come across Portkey's Open-source AI Gateway which kind of does the same.
https://github.com/portkey-ai/gateway
It looks like, with growing LLM adoption, resiliency and cost-related concerns are taking off sooner than expected, unlike with past technological trends.
I'm also wondering whether something like this could help build a better RAG pipeline or evals for a GenAI app, because at the end of the day you want to reduce hallucinations but still get good generative responses.