Show HN: liteLLM Proxy Server: 50+ LLM Models, Error Handling, Caching (opens in new tab)

(github.com)

140 pointsij232y ago34 comments

Hello hacker news,

I’m the maintainer of liteLLM() - package to simplify input/output to OpenAI, Azure, Cohere, Anthropic, Hugging face API Endpoints: https://github.com/BerriAI/litellm/

We’re open sourcing our implementation of liteLLM proxy: https://github.com/BerriAI/litellm/blob/main/cookbook/proxy-...

TLDR: It has one API endpoint /chat/completions and standardizes input/output for 50+ LLM models + handles logging, error tracking, caching, streaming

What can liteLLM proxy do? - It’s a central place to manage all LLM provider integrations

- Consistent Input/Output Format - Call all models using the OpenAI format: completion(model, messages) - Text responses will always be available at ['choices'][0]['message']['content']

- Error Handling Using Model Fallbacks (if GPT-4 fails, try llama2)

- Logging - Log Requests, Responses and Errors to Supabase, Posthog, Mixpanel, Sentry, Helicone

- Token Usage & Spend - Track Input + Completion tokens used + Spend/model

- Caching - Implementation of Semantic Caching

- Streaming & Async Support - Return generators to stream text responses

You can deploy liteLLM to your own infrastructure using Railway, GCP, AWS, Azure

Happy completion() !

Show HN: liteLLM Proxy Server: 50+ LLM Models, Error Handling, Caching

(github.com)

140 pointsij232y ago34 comments

Hello hacker news,

I’m the maintainer of liteLLM() - package to simplify input/output to OpenAI, Azure, Cohere, Anthropic, Hugging face API Endpoints: https://github.com/BerriAI/litellm/

We’re open sourcing our implementation of liteLLM proxy: https://github.com/BerriAI/litellm/blob/main/cookbook/proxy-...

TLDR: It has one API endpoint /chat/completions and standardizes input/output for 50+ LLM models + handles logging, error tracking, caching, streaming

What can liteLLM proxy do? - It’s a central place to manage all LLM provider integrations

- Consistent Input/Output Format - Call all models using the OpenAI format: completion(model, messages) - Text responses will always be available at ['choices'][0]['message']['content']

- Error Handling Using Model Fallbacks (if GPT-4 fails, try llama2)

- Logging - Log Requests, Responses and Errors to Supabase, Posthog, Mixpanel, Sentry, Helicone

- Token Usage & Spend - Track Input + Completion tokens used + Spend/model

- Caching - Implementation of Semantic Caching

- Streaming & Async Support - Return generators to stream text responses

You can deploy liteLLM to your own infrastructure using Railway, GCP, AWS, Azure

Happy completion() !

34 comments

27 comments · 12 top-level

kiratp2y ago· 2 in thread

This would be super useful if it supported local/in-K8-cluster models.

Most production use cases probably need some sort of fall back of small -> medium -> large -> GPT4.

Given the costs + low quota limits, I would be surprised if any significant portion of the market is falling back from one expensive proprietary* API to another.

With the rapid improvements in model servers like llama.cpp and vllm.ai, providing an abstraction layer for “fastest model server of the month” would be useful.

ij23OP2y ago

What local/in-K8-cluster models servers would you recommend adding ?

Should we add support for llama.cpp and vllm.ai in the proxy server ? Or should we assume you can host them on your own infra and the proxy server requests your hosted model ?

kiratp2y ago

IMO don’t try to be the one stop shop to host models. There are too many players with all sorts of advancements (eg: stopping grammar, continuous batching, novel quantization etc.) and you won’t be able to keep up.

There is a ton of boilerplate around the actual model server that’s just busy work , but if done wrong can be a huge performance suck. Solve that.

Build the proxy that works with the most model servers out there. Do it in a way that once you have mindshare, the model server makers will be find it easy to put up a PR so that they can claim your proxy supports their server.

Don’t take a hard dependency on non-OSS stuff - being able to build an “on-prem” solution (read “deployed into customer’s VPC”) is table stakes for anyone to use your offering for a lot of enterprise use cases.

Edit: another unsolved problem - different models need slightly different prompts to solve the same problem well…

1 more reply

jmorgan2y ago· 2 in thread

The idea of an LLM proxy is super compelling. There's a lot of powerful ideas baked into the proxy form factor – I think you've listed out quite a few of them. It reminds me a bit of what Cloudflare did for the web: both making it faster and safer/easier. Have you considered local LLMs at all for Llama 2? A few people and I have been working on https://github.com/jmorganca/ollama/ and was thinking it would be helpful to be able to augment it with a proxy layer like this. Not only that, but it might help folks dynamically choose to run locally (vs against a cloud LLM) for certain prompts.