OpenAI compatibility (opens in new tab)

(ollama.ai)

643 pointsCasteil2y ago188 comments

188 comments

131 comments · 34 top-level

ultrasaurus2y ago· 20 in thread

The improvements in ease of use for locally hosting LLMs over the last few months have been amazing. I was ranting about how easy https://github.com/Mozilla-Ocho/llamafile is just a few hours ago [1]. Now I'm torn as to which one to use :)

1: Quite literally hours ago: https://euri.ca/blog/2024-llm-self-hosting-is-easy-now/

keriati12y ago

I think it is even easier right now for companies to self host an inference server with basic rag support:

- get a Mac Mini or Mac Studio - just run ollama serve, - run ollama web-ui in docker - add some coding assitant model from ollamahub with the web-ui - upload your documents in the web-ui

No code needed, you have your self hosted LLM with basic RAG giving you answers with your documents in context. For us the deepseek coder 33b model is fast enough on a Mac Studio with 64gb ram and can give pretty good suggestions based on our internal coding documentation.

vergessenmir2y ago

Personally I'd recommend Ollama, because they have a good model (dockeresque), the APIs are quite more widely supported

You can mix models in a single model file, it's a feature I've been experimenting with lately

Note: you don't have to rely on their model Library, you can use your own. Secondly, support for new models is through their bindings with llama.cpp

xyc2y ago

The pace of progress here is pretty amazing. I loved how easy it is to get llamafile up and running, but I missed feature complete chat interfaces, so I built one based off it: https://recurse.chat/.

I still need GPT-4 for some tasks, but in daily usage it's replaced much of ChatGPT usage, especially since I can import all of my ChatGPT chat history. Also curious to learn about what people want to do with local AI.

SOLAR_FIELDS2y ago

My primary use case would be to feed large internal codebases into an LLM with a much larger context window than what GPT-4 offers. Curious what the best options here are, in terms of model choice, speed, and ideas for prompt engineering

2 more replies

littlestymaar2y ago

What's up with the landing page though? Unless I'm not well awaken, there doesn't seem to be a download section or anything.

jondwillis2y ago

I’ve been using Ollama with Mixtral-7B on my MBP for local development and it has been amazing.

gnicholas2y ago

I have used it too and am wondering why it starts responding so much faster than other similar-sized models I've tried. It doesn't seem quite as good as some of the others, but it is nice that the responses start almost immediately (on my 2022 MBA with 16 GB RAM).

Does anyone know why this would be?

1 more reply

castles2y ago

To clarify - did you mean Mixtral (8x)7b, or Mistral 7b?

1 more reply

a_wild_dandan2y ago

I've always used `llamacpp -m <model> -p <prompt>`. Works great as my daily driver of Mixtral 8x7b + CodeLlama 70b on my MacBook. Do alternatives have any killer features over Llama.cpp? I don't want to miss any cool developments.

CasteilOP2y ago

70b is probably going to be a bit slow for most on M-series MBPs (even with enough RAM), but Mixtral 8x7b does really well. Very usable @ 25-30T/s (64GB M1 Max), whereas 70b tends to run more like 3.5-5T/s.

'llama.cpp-based' generally seems like the norm.

Ollama is just really easy to set up & get going on MacOS. Integral support like this means one less thing to wire up or worry about when using a local LLM as a drop-in replacement for OpenAI's remote API. Ollama also has a model library[1] you can browse & easily retrieve models from.

Another project, Ollama-webui[2] is a nice webui/frontend for local LLM models in Ollama - it supports the latest LLaVA for multimodal image/prompt input, too.

[1] https://ollama.ai/library/mixtral

[2] https://github.com/ollama-webui/ollama-webui

1 more reply

skp19952y ago

I have found deepseek coder 33B to be better than codellama 70B (personal opinion tho).. I think the best parts of deepseek are around the fact that it understands multi-file context the best.

2 more replies

ultrasaurus2y ago

Based on a day's worth of kicking tires, I'd say no -- once you have a mix that supports your workflow the cool developments will probably be in new models.

I just played around with this tool and it works as advertised, which is cool but I'm up and running already. (For anyone reading this though who, like me, doesn't want to learn all the optimization work... I might see which one is faster on your machine)

livrem2y ago

With all the models I tried there was a quite a bit of fiddling for each one to get the correct command-line flags and a good prompt, or at least copy-paste some command-line from HF. Seems like every model needs its own unique prompt to give good results? I guess that is what the wrappers take care of? Other than that llama.cpp is very easy to use. I even run it on my phone in Termux, but only with a tiny model that is more entertaining than useful for anything.

1 more reply

mirekrusin2y ago

ollama is extremely convenient wrapper around llamacpp.

they separate serving heavy weights from model definition and usage itself.

what that means is weights of some model, let's say mixtral are loaded on the server process (and kept in memory for 5m as default) and you interact with it by using modelfile (inspired by dockerfile) - all your modelfiles that inherit FROM mixtral will reuse those weights already loaded in memory, so you can instantly swap between different system prompts etc - those appear as normal models to use through cli or ui.

the effect is that you have very low latency, very good interface - for programming api and ui.

ps. it's not only for macs

open weight models + (llama.app) as ollama + ollama-webui = real openai.

myaccountonhn2y ago

Curious if anyone has any recommendation for what LLM model to use today if you want a code assistant locally. Mistral?

thrdbndndn2y ago

From the blog article:

> A few pip install X’s and you’re off to the races with Llama 2! Well, maybe you are, my dev machine doesn’t have the resources to respond on even the smallest model in less than an hour.

I never tried to run these LLMs on my own machine -- is it this bad?

I guess if I only have a moderate GPU, say a 4060TI, there is no chance I can play with it, then?

pitched2y ago

I would expect that 4060ti to get about 20-25 tokens per second on Mixtral. I can read at roughly 10-15 tokens per second so above that is where I see diminishing returns for a chatbot. Generating whole blog articles might have you sit waiting for a minute or so though.

2 more replies

jsjohnst2y ago

The Apple M1 is very useable with ollama using 7B parameter models and is virtually as “fast” as ChatGPT in responding. Obviously not same quality, but still useful.

Eisenstein2y ago

You can load a 7B parameter model quantized at Q4_K_M as gguf. I don't know ollama, but you can load it in koboldcpp -- use cuBLAS and gpu layers 100 context 2048 and it should fit it all into 8GB of VRAM. For quantized models look at TheBloke on huggingface -- Mistral 7B is a good one to try.

1 more reply

jwr2y ago

On an M3 MacBook Pro with 32GB of RAM, I can comfortably run 34B models like phind-codellama:34b-v2-q8_0.

Unfortunately, having tried this and a bunch of other models, they are all toys compared to GPT-4.

swyx2y ago· 11 in thread

I know a few people privately unhappy that openai api compatibility is becoming a community standard. Apart from some awkwardness around data.choices.text.response and such unnecessary defensive nesting in the schema, I don't really have complaints.

wonder what pain points people have around the API becoming a standard, and if anyone has taken a crack at any alternative standards that people should consider.

simonw2y ago

I want it to be documented.

I'm fine with it emerging as a community standard if there's a REALLY robust specification for what the community considers to be "OpenAI API compatible".

Crucially, that standard needs to stay stable even if OpenAI have released a brand new feature this morning.

So I want the following:

- A very solid API specification, including error conditions

- A test suite that can be used to check that new implementations conform to that specification

- A name. I want to know what it means when software claims to be "compatible with OpenAI-API-Spec v3" (for example)

Right now telling me something is "OpenAI API compatible" really isn't enough information. Which bits of that API? Which particular date-in-time was it created to match?

londons_explore2y ago

It's a JSON API... JSON API's tend to be more... 'flexible'.

To consume them, just assume that every field is optional and extra fields might appear at any time.

swyx2y ago

and disappear at any time... was a leetle bit unsettled by the sudden deprecation of "functions" for "tools" with only minor apparante benefit

2 more replies

te_chris2y ago

Amen! The lack of decent errors from OpenAI is the most annoying. They'll silently return 400 with no explanation. Let's hope that doesn't catch on.

OpenAI compatible just seems to mean 'you can format your prompt like the `messages` array'.

1 more reply

Patrick_Devine2y ago

TBH, we debated about this a lot before adding it. It's weird being beholden to someone else's API which can dictate what features we should (or shouldn't) be adding to our own project. If we add something cool/new/different to Ollama will people even be able to use it since there isn't an equivalent thing in the OpenAI API?

minimaxir2y ago

That's more of a marketing problem than a technical problem. If there is indeed a novel use case with a good demo example that's not present in OpenAI's API, then people will use it. And if it's really novel, OpenAI will copy it into their API and thus the problem is no longer an issue.

The power of open source!

1 more reply

satellite22y ago

At some point, (probably in a relatively close future), there will be the AI Consortium (AIC) to decide what enters the common API?

minimaxir2y ago

That's why it's good as an option to minimize friction and reduce lock-in to OpenAI's moat.

sheepscreek2y ago

I would take an imperfect standard over no standard any day!

dimask2y ago

There is a difference between a standard and a monopoly, though.

tracerbulletx2y ago

It's so trivially easy to create your own web server in your language of choice that calls directly into llama.cpp functions with the bindings for your language of choice it doesn't really matter all that much. If you want more control you can get with just a little more work. You don't really need these plug and play things.

Havoc2y ago· 8 in thread

I don’t quite follow why people use ollama ? It sounds like lama.cpp with less features and training wheels

Is it just ease of use or is there something I’m missing?

dizhn2y ago

Downloading and activating models is very convenient. This llm stuff is really complicated and every little bit helps at the beginning. I only started two weeks ago and was very frustrated. A tool that just works is good for that kind of thing. Of course at that point you think it's their models and there's something special they are doing to the models etc. Honestly no tool that allows easy downloads goes out of their way to say they are just downloading TheBloke's gguf files and that the same models will run anywhere. (minus ollama's blob format on disk) :)

mark_l_watson2y ago

I started by using lama.cpp directly, and a few other options. I now just use Ollama because it is simple to download models, keep software and models up to date, and really easy to run a local REST query service. I like spending more time playing with application ideas and less time running infrastructure. Of course, lama.cpp under the hood provides the magic.

sp3322y ago

The CLI for llama.cpp is very clunky IMO. I put some kind of UI on it when I want to get something done.

spmurrayzzz2y ago

It also ships with an openai-compatible server implementation as well now that you could point your UI at (if you wanted to run leaner w/out ollama).

https://github.com/ggerganov/llama.cpp/blob/master/examples/...

__loam2y ago

It's always ease of use lol. Thinking the best technology wins is a fallacy.

boarush2y ago

Ollama is just easier to use and serve the model on a local http server. I personally use it for testing stuff with llama-index as well. Pretty useful to say the least with zero configuration issues.

skp19952y ago

not sure why you are getting downvoted, its a very valid question. Its kind of down to the ergonomics of running LLM. Downloading a user friendly CLI tool with good UX beats having to clone a repo and run make files. llama.cpp is the better option if you want to do anything non-trivial when it comes to LLMs

titaniumtown2y ago

It's a wrapper around llama.cpp that provides a stable api

arbuge2y ago· 7 in thread

Genuinely curious to ask HN this: what are you using local models for?

codazoda2y ago

I got the most use out of it on an airplane with no wifi. It let me keep working on a coding solution without the internet because I could ask it quick questions. Magic.

mysteria2y ago

I use it for personal entertainment, both writing and roleplaying. I put quite a bit of effort into my own responses and actively edit the output to get decent results out of the larger 30B and 70B models. Trying out different models and wrangling the LLM to write what you want is part of the fun.

teruakohatu2y ago

Experimenting, as well as a cheaper alternative to cloud/paid models. Local models don't have the encyclopaedic knowledge as huge models such as GPT 3.5/4, but they can perform tasks well.

chown2y ago

I use it to compare outputs from different models (along with OpenAI, MistralAI) and pick-and-choose-and-compose those outputs. I wrote an app[1] that facilitates this. This also allows me to work offline mode and not having to worry about sharing client's data to OpenAI or Mistral AI

[1]: https://msty.app

RamblingCTO2y ago

I built myself a hacky alternative to the chat UI from openAI and implemented ollama to test different models locally. Also, openAI chat sucks, the API doesn't seem to suck as much. Chat is just useless for coding at this point.

/e: https://github.com/ChristianSch/theta

amelius2y ago

I'm hoping someone will write a tool to do project estimations. Like instead of my manager asking me "how long would it take you to implement X,Y,Z ...", he could use the LLM instead.

It doesn't even need to be very accurate because my own estimations aren't either :)

dimask2y ago

I used them to extract data from relatively unstructured reports into structured csv format. For privacy/gdpr reasons it was not something I could use an online model for. Saved me from a lot of manual work, and it did not hallucinate stuff as far as I could see.

thedangler2y ago· 6 in thread

Is Ollama model I can use locally to use for my own project and keep my data secure?

jasonjmcghee2y ago

Ollama is an easy way to run local models on Mac/linux. See https://ollama.ai they have a web UI and a terminal/server approach

MOARDONGZPLZ2y ago

I would not explicitly count on that. I’m a big fan of Ollama and use it every day but they do have some dark patterns that make me question a usecase where data security is a requirement. So I don’t use it where that is something that’s important.

jasonjmcghee2y ago

Ollama team are a few very down to earth, smart people. I really liked the folks I've met. I can't imagine they are doing anything malicious and I'm sure would address any issues (log them on GitHub) / entertain PRs to address any legitimate concerns

mbernstein2y ago

Examples?

slimsag2y ago

like what? If you're gonna accuse a project of shady stuff, at least give examples :)

1 more reply

v3ss0n2y ago

Opensource project so you can find evidence of foul play . Prove it or it is bs

ilaksh2y ago· 5 in thread

I think it's a little misleading to say it's compatible with OpenAI because I expect function or tool calling when you say that.

It's nice that you have the role and content thing but that was always fairly trivial to implement.

When it gets to agents you do need to execute actions. In the agent hosting system I started, I included a scripting engine, which makes me think that maybe I need to set up security and permissions for the agent system and just let it run code. Which is what I started.

So I guess I am not sure I really need the function/tool calling. But if I see a bunch of people actually am standardizing on tool calls then maybe I need it in my framework just because it will be expected. Even if I have arbitrary script execution.

minimaxir2y ago

The documentation is upfront about which features are excluded: https://github.com/ollama/ollama/blob/main/docs/openai.md

Function calling/tool choice is done at the application level and currently there's no standard format, and the popular ones are essentually inefficient bespoke system prompts: https://github.com/langchain-ai/langchain/blob/master/libs/l...

e12e2y ago

> Function calling/tool choice is done at the application level and currently there's no standard format,

Is this true for open ai - or just everything else?

ianbicking2y ago

I was drawn to Gemini Pro because it had function/tool calling... but it works terribly. (I haven't tried Gemini Ultra yet; unclear if it's available by API?)

Anyway, probably best that they didn't release support that doesn't work.

williamstein2y ago

Gemini Ultra is not available via API yet, at least according to the Google reps we talked with today. There's a waiting list. I suspect they are figuring out how to charge for API access, among other things. The announcement today only seemed to have pricing for the "$20/month" thing.

osigurdson2y ago

It makes obvious sense to anyone with experience with OpenAI APIs.

ptrhvns2y ago· 5 in thread

FYI: the Linux installation script for Ollama works in the "standard" style for tooling these days:

    curl https://ollama.ai/install.sh | sh

However, that script asks for root-level privileges via sudo the last time I checked. So, if you want the tool, you may want to download the script and have a look at it, or modify it depending on your needs.

Vinnl2y ago

They have manual install instructions [0], and judging by those, what it does is set up a SystemD service that automatically runs on startup. But if you're just looking to play around, I found that downloading [1], making it executable (chmod +x ollama-linux-amd64), and then running it, worked just fine. All without needing root.

[0] https://github.com/ollama/ollama/blob/main/docs/linux.md#man...

[1] https://ollama.ai/download/ollama-linux-amd64

dizhn2y ago

The ollama binary goes into /usr/bin which it doesn't have to but it's convenient. I haven't checked what else needs root access.

riffic2y ago

we have package managers in this day and age, lol.

jazzyjackson2y ago

do package managers make promises that they only distribute code that's been audited to not pwn you? I'm not sure I see the difference if I decided I'm going to run someone's software whether I install it with sudo apt install vs sudo curl | bash

2 more replies

jampekka2y ago

Sadly most of them kinda suck, especially for packagers.

2 more replies

behnamoh2y ago· 5 in thread

ollama seems like taking a page from langchain book: develop something that's open source but get it so popular that attracts VC money.

I never liked ollama, maybe because ollama builds on llama.cpp (a project I truly respect) but adds so much marketing bs.

For example, the @ollama account on twitter keeps shitposting on every possible thread to advertise ollama. The other day someone posted something about their Mac setup and @ollama said: "You can run ollama on that Mac."

I don't like it when +500 people are working tirelessly on llama.cpp and then guys like langchain, ollama, etc. rip off the benefits.

slimsag2y ago

Make something better, then. (I'm not being dismissive, I really genuinely mean it - please do)

I don't know who is behind Ollama and don't really care about them. I can agree with your disgust for VC 'open source' projects. But there's a reason they become popular and get investment: because they are valuable to people, and people use them.

If Ollama was just a wrapper over llama.cpp, then everyone would just use llama.cpp.

It's not just marketing, either. Compare the README of llama.cpp to the Ollama homepage, notice the stark contrast of how difficult getting llama.cpp connected to some dumb JS app is compared to Ollama. That's why it becomes valuable.

The same thing happened with Docker and we're just now barely getting a viable alternative after Docker as a company imploded, Podman Desktop, and even then it still suffers from major instability on e.g. modern macs.

The sooner open source devs in general learn to make their projects usable by an average developer, the sooner it will be competitive with these VC-funded 'open source' projects.

behnamoh2y ago

llama.cpp already has OpenAI compatible API.

It takes literally one line to install it (git clone and then make).

It takes one line to run the server as mentioned on their examples/server README.

    ./server -m <model> <any additional arguments like mmlock>

homarp2y ago

>notice the stark contrast of how difficult getting llama.cpp connected to some dumb JS app is compared to Ollama.

Sorry, I'm new to ollama 'ecosystem'.

From llama.cpp readme, I ctrl-F-ed "Node.js: withcatai/node-llama-cpp" and from there, I got to https://withcatai.github.io/node-llama-cpp/guide/

Can you explain how ollama does it 'easier' ?

FanaHOVA2y ago

ggml is also VC backed, so that has nothing to do with it.

udev40962y ago

I didn't know ollama was VC funded

lolpanda2y ago· 4 in thread

The compatibility layer can be also built in libraries. For example, Langchain has llm() which can work with multiple LLM backend. Which do you prefer?

avereveard2y ago

I'd prefer it in library but there are a number of issues with that currently, the larger of it being that the landscape moves too fast and library wrappers aren't keeping up. the other is, what if the world standardize on a terrible library like langchain we'd be stuck with it for a long time since maintenance cost of non uniform backend tend to kill possible runner ups. So for now the uniform api seems the choice of convenience.

Szpadel2y ago

but this means you need each library to support each llm, and I think this is the same issue what is with object storage where basically everyone support S3 compatible API

it's great to have some standard API even if that's isn't perfect, but having second API that allows you to use full potential (like B2 for backblaze) is also fine

so there isn't one model fits all, and if your model have different capabilities, then imo you should provide both options

SOLAR_FIELDS2y ago

This is hopefully much better than the s3 situation due to its simplicity. Many offerings that say “s3 compatible api” often mean “we support like 30% of api endpoints”. Granted often the most common stuff is supported and some stuff in the s3 api really only makes sense in AWS, but a good hunk of the s3 api is just hard or annoying to implement and a lot of vendors just don’t bother. Which ends up being rather annoying because you’ll pick some vendor and try to use an s3 client with it only to find out you can’t because of the 10% of the calls your client needs to make that are unsupported.

mise_en_place2y ago

Before OpenAI released their app I was using langchain in a system that I built. It was a very simple SMS interface to LLMs. I preferred working with langchain's abstractions over directly interfacing with the GPT4 API.

eclectic292y ago· 4 in thread

What's the use case of Ollama? Why should I not use llama.cpp directly?

TheCoreh2y ago

It's like a docker/package manager for the LLMs. You can easily install them, discover new ones, update them via a standardized, simple CLI. It also auto updates effortlessly.

dizhn2y ago

Yesterday I learned it also deduplicates skmiler model files.

jpdus2y ago

I have the same question. Noticed that Ollama got a lot of publicity and seems to be well received, but what exactly is the advantage over using llama.cpp (which also has a built-in server with OpenAI compatibility nowadays?) Directly?

visarga2y ago

ollama swaps models from the local library on the fly, based on the request args, so you can test against a bunch of models quickly

1 more reply

mrtimo2y ago· 3 in thread

I am business prof. I wanted my students to try out ollama (with web-ui), so I built some directions for doing so on google cloud [1]. If you use a spot instance you can run it for 18 cents an hour.

[1] https://docs.google.com/document/d/1OpZl4P3d0WKH9XtErUZib5_2...

ijustlovemath2y ago

The way you've set this up, your students could be too late to claim admin and have their instance hijacked. Very insecure. Would highly recommend you make them use an SSH key from git-bash; it's no more technical than anything you already have.

dizhn2y ago

You can run a lot of things on Google Colab for free as well. KoboldCPP has a nice premade thing on their website that can even load different models.

teruakohatu2y ago

Very useful thanks

ben_w2y ago· 3 in thread

I had trouble installing Ollama last time I tried, I'm going to try again tomorrow.

I've already got a web UI that "should" work with anything that matches OpenAI's chat API, though I'm sure everyone here knows how reliable air-quotes like that are when a developer says them.

https://github.com/BenWheatley/YetAnotherChatUI

ben_w2y ago

Turns out my failure to install last time was due to thinking that the instructions on the python library blog post were complete installation instructions for the whole thing.

> pip install ollama

- https://ollama.ai/blog/python-javascript-libraries

is just the python libraries, not ollama itself, which the libraries need, and without which they will just…

> httpx.ConnectError: [Errno 61] Connection refused

Install the main app from the big friendly download button, and this problem fixed itself: https://ollama.ai/download

regularfry2y ago

If you don't care about the electron app and just want the API, you can `go generate ./... && go build && ./ollama serve` and you're off to the races. No installation needed.

ben_w2y ago

I made my web interface before I'd even heard of Ollama, and because I wanted a PAYG interface for GPT-4.

You also don't need to actually install my web UI, as it runs from the github page and the endpoint and API key are both configurable by the user during a chat session.

Also (a) the ollama command line interface is good enough for what I actually want, (b) my actual problem was not realising I'd only installed the python and not the underlying model.

laingc2y ago· 3 in thread

What's the current state-of-the-art in deploying large, "self-hosted" models to scalable infrastructure? (e.g. AWS or k8s)

Example use case would be to support a web application with, say, 100k DAU.

kkielhofner2y ago

Nvidia Triton Inference Server with the TensorRT-LLM backend:

https://github.com/triton-inference-server/tensorrtllm_backe...

It’s used by Mistral, AWS, Cloudflare, and countless others.

vLLM, HF TGI, Rayserve, etc are certainly viable but Triton has many truly unique and very powerful features (not to mention performance).

100k DAU doesn’t mean much, you’d need to get a better understanding of the application, input tokens, generated output tokens, request rates, peaks, etc not to mention required time to first token, tokens per second, etc.

Anyway, the point is Triton is just about the only thing out there for use in this general range and up.

Palmik2y ago

Do you have a source on Mistral API, etc. being based on TensoRT-LLM? And what are the main distinguishing features?

What I like about vLLM is the following:

- It exposes AsyncLLMEngine, which can be easily wrapped in any API you'd like.

- It has a logit processor API making it simple to integrate custom sampling logic.

- It has decent support for interference of quantized models.

1 more reply

laingc2y ago

Very helpful answer, thank you!

osigurdson2y ago· 3 in thread

Smart. When they do come, will the embedding vectors be OpenAI compatible? I assume this is quite hard to do.

minimaxir2y ago

Embeddings as an I/O schema are just text-in, a list of numbers out. There are very few embedding models which require enough preprocessing to warrant an abstraction. (A soft example is the new nomic-embed-text-v1, which requires adding prefix annotations: https://huggingface.co/nomic-ai/nomic-embed-text-v1 )

osigurdson2y ago

Yes of course (syntactically it is just float[] getEmbeddings(text)) but are the numbers close to what OpenAI would produce? I assume no.

1 more reply

dragonwriter2y ago

Probably not, embedding vectors aren't conpatible across different embedding models, and other tools presenting OAI-compatible APIs don't use OAI-compatible embedding models (e.g., oobabooga lets you configure different embeddings models, but none of them produce compatible vectors to the OAI ones.)

hubraumhugo2y ago· 2 in thread

It feels absolutely amazing to build AI startup right now:

- We first struggled with token limits [solved]

- We had issues with consistent JSON ouput [solved]

- We had rate limiting and performance issues for the large 3rd party models [solved]

- We wanted to reduce costs by hosting our own OSS models for small and medium complex tasks [solved]

It's like your product becomes automatically cheaper, more reliable, and more scalable with every new major LLM advancement.

Obivously you still need to build up defensibility and focus on differentiating with everything “non-AI”.

topicseed2y ago

> We first struggled with token limits [solved]

How has this been solved in your opinion? Do you mean with recent versions with much bigger limits but also heaps more expensive?

gitfan862y ago

The limits still exist but for certain use cases larger limits have helped

1 more reply

theogravity2y ago· 2 in thread

Isn't LangChain supposed to provide abstractions that 3rd parties shouldn't need to conform to OpenAI's API contract?

I know not everyone uses LangChain, but I thought that was one of the primary use-cases for it.

minimaxir2y ago

Which just then creates lock-in for LangChain's abstractions.

ludwik2y ago

Which are pretty awful btw - every project at my job that started with LangChain openly regrets it - the abstractions, instead of making hard things easy, trend to make the way things hard (and hard to debug and maintain).

3 more replies

lxe2y ago· 2 in thread

Does ollama support loaders other than llamacpp? I'm using oobabooga with exllama2 to run exl2 quants on a dual NVIDIA gpu, and nothing else seems to beat performance of it.

_ink_2y ago

I tried that, but failed to get the GPU split working. Do you have a link on how to do that?

lxe2y ago

Do what exactly? I have no issues with GPU split on oobabooga with either exl2 or gguf.

shay_ker2y ago· 1 in thread

Is Ollama effectively a dockerized HTTP server that calls llama.cpp directly? For the exception of this newly added OpenAI API ;)

okwhateverdude2y ago

More like an easy-mode llama.cpp that does a cgo wrapping of the lib (now; before they built patched llama.cpp runners and did IPC and managed child processes) and it does a few clever things to auto figure out layer splits (if you have meager GPU VRAM). The easy mode is that it will auto-load whatever model you'd like per request. They also implement docker-like layers for their representation of a model allowing you to overlay parameters of configuration and tag it. So far, it has been trivial to mix and match different models (or even the same models just with different parameters) for different tasks within the same application.

init02y ago· 1 in thread

Trying to openai am I missing something?

    import OpenAI from 'openai'

    const openai = new OpenAI({
      baseURL: 'http://localhost:11434/v1',
      apiKey: 'ollama', // required but unused
    })

    const chatCompletion = await 
      openai.chat.completions.create({
      model: 'llama2',
      messages: [{ role: 'user', content: 'Why is the sky blue?' }],
    })

    console.log(completion.choices[0].message.content)

I am getting the below error:

    return new NotFoundError(status, error, message, headers);
                   ^
    NotFoundError: 404 404 page not found

xena2y ago

Remove the v1

LightMachine2y ago· 1 in thread

Gemini Ultra release day, and a minor post on ollama OpenAI compatibility gets more points lol

subarctic2y ago

Who cares about another closed LLM that's no better than GPT 4? I think there's more exciting potential in open weights LLMs that you can run on your own machine and do whatever you want with.

bulbosaur1232y ago· 1 in thread

Anyone actually tested it with GPT4 api to see how well it performs?

minimaxir2y ago

That's not what this announcement is: it's an I/O schema for OSS local LLMs.

slimsag2y ago

Useful! At work we are building a better version of Copilot, and support bringing your own LLM. Recently I've been adding an 'OpenAI compatible' backend, so that if you can provide any OpenAI compatible API endpoint, and just tell us which model to treat it as, then we can format prompts, stop sequences, respect max tokens, etc. according to the semantics of that model.

I've been needing something exactly like this to test against in local dev environments :) Ollama having this will make my life / testing against the myriad of LLMs we need to support way, way easier.

Seems everyone is centralizing behind OpenAI API compatibility, e.g. there is OpenLLM and a few others which implement the same API as well.

patelajay2852y ago

We've been working on a project that provides this sort of easy swapping between open source (via HF, VLLM) & commercial models (OpenAI, Google, Anthropic, Together) in Python: https://github.com/datadreamer-dev/DataDreamer

It's a little bit easier to use if you want to do this without an HTTP API, directly in Python.

SamPatt2y ago

Ollama is great. If you want a GUI, LMStudio and Jan are great too.

I'm building a React Native app to connect mobile devices to local LLM servers run with these programs.

https://github.com/sampatt/lookma

Roark662y ago

There has been a lot of progress with tools like llama.cpp and ollama, but despite slightly more difficult setup I prefer huggingface transformer based stuff(TGI for hosting, openllm proxy for (not at all)OpenAI compatibility). Why? Because you can bet the latest newest models are going to be supported in huggingface transformers library.

Llama.cpp is not far behind, but I find the well structured python code of transformers easy to modify and extend(with context free grammars, function calling etc) than just waiting for your favourite alternate runtime support a new model.

tosh2y ago

I wonder why ollama didn't namespace the path (e.g. under "/openai") but in any case this is great for interoperability.

678j53672y ago

Ollama is very good and runs better than some of the other tooling I have tried. It also Just Works™. I ran Dolphin Mixtral 7b on a Raspberry pi 4 off a 32 gig SD card. Barely had room. I asked it for a cornbread recipe, stepped away for a few hours and it had generated two characters. I was surprised it got that far if I am being honest.

syntaxing2y ago

Wow perfect timing. I personally love it. There’s so many projects out there that use OpenAI’s API whether you like it or not. I wanted to try this unit test writer notebook that OpenAI has but with Ollama. It was such a pain in the ass to fix it that I just didn’t bother cause it was just for fun. Now it should be 2 line of code change.

Implicated2y ago

Love it! Ollama has been such a wonderful project (at least, for me).

jhoechtl2y ago

How does ollama compare to H2o? We dabbled a bit with H2o and it looks very promising

https://gpt.h2o.ai/

jacooper2y ago

Does ollama support ROCm? It's not clear from their github repo if it does.

philprx2y ago

How does Ollama compare to LocalGPT ?

v01d4lph42y ago

This is super neat! Thanks folks!

udev40962y ago

Awesome!

j / k navigate · click thread line to collapse

188 comments

131 comments · 34 top-level

ultrasaurus2y ago· 20 in thread

1: Quite literally hours ago: https://euri.ca/blog/2024-llm-self-hosting-is-easy-now/

keriati12y ago

I think it is even easier right now for companies to self host an inference server with basic rag support:

- get a Mac Mini or Mac Studio - just run ollama serve, - run ollama web-ui in docker - add some coding assitant model from ollamahub with the web-ui - upload your documents in the web-ui

vergessenmir2y ago

Personally I'd recommend Ollama, because they have a good model (dockeresque), the APIs are quite more widely supported

You can mix models in a single model file, it's a feature I've been experimenting with lately

Note: you don't have to rely on their model Library, you can use your own. Secondly, support for new models is through their bindings with llama.cpp

xyc2y ago

The pace of progress here is pretty amazing. I loved how easy it is to get llamafile up and running, but I missed feature complete chat interfaces, so I built one based off it: https://recurse.chat/.

SOLAR_FIELDS2y ago

2 more replies

littlestymaar2y ago

What's up with the landing page though? Unless I'm not well awaken, there doesn't seem to be a download section or anything.

jondwillis2y ago

I’ve been using Ollama with Mixtral-7B on my MBP for local development and it has been amazing.

gnicholas2y ago

Does anyone know why this would be?

1 more reply

castles2y ago

To clarify - did you mean Mixtral (8x)7b, or Mistral 7b?

1 more reply

a_wild_dandan2y ago

CasteilOP2y ago

'llama.cpp-based' generally seems like the norm.

Another project, Ollama-webui[2] is a nice webui/frontend for local LLM models in Ollama - it supports the latest LLaVA for multimodal image/prompt input, too.

[1] https://ollama.ai/library/mixtral

[2] https://github.com/ollama-webui/ollama-webui

1 more reply

skp19952y ago

I have found deepseek coder 33B to be better than codellama 70B (personal opinion tho).. I think the best parts of deepseek are around the fact that it understands multi-file context the best.

2 more replies

ultrasaurus2y ago

Based on a day's worth of kicking tires, I'd say no -- once you have a mix that supports your workflow the cool developments will probably be in new models.

livrem2y ago

1 more reply

mirekrusin2y ago

ollama is extremely convenient wrapper around llamacpp.

they separate serving heavy weights from model definition and usage itself.

the effect is that you have very low latency, very good interface - for programming api and ui.

ps. it's not only for macs

open weight models + (llama.app) as ollama + ollama-webui = real openai.

myaccountonhn2y ago

Curious if anyone has any recommendation for what LLM model to use today if you want a code assistant locally. Mistral?

thrdbndndn2y ago

From the blog article:

> A few pip install X’s and you’re off to the races with Llama 2! Well, maybe you are, my dev machine doesn’t have the resources to respond on even the smallest model in less than an hour.

I never tried to run these LLMs on my own machine -- is it this bad?

I guess if I only have a moderate GPU, say a 4060TI, there is no chance I can play with it, then?

pitched2y ago

2 more replies

jsjohnst2y ago

The Apple M1 is very useable with ollama using 7B parameter models and is virtually as “fast” as ChatGPT in responding. Obviously not same quality, but still useful.

Eisenstein2y ago

1 more reply

jwr2y ago

On an M3 MacBook Pro with 32GB of RAM, I can comfortably run 34B models like phind-codellama:34b-v2-q8_0.

Unfortunately, having tried this and a bunch of other models, they are all toys compared to GPT-4.

swyx2y ago· 11 in thread

wonder what pain points people have around the API becoming a standard, and if anyone has taken a crack at any alternative standards that people should consider.

simonw2y ago

I want it to be documented.

I'm fine with it emerging as a community standard if there's a REALLY robust specification for what the community considers to be "OpenAI API compatible".

Crucially, that standard needs to stay stable even if OpenAI have released a brand new feature this morning.

So I want the following:

- A very solid API specification, including error conditions

- A test suite that can be used to check that new implementations conform to that specification

- A name. I want to know what it means when software claims to be "compatible with OpenAI-API-Spec v3" (for example)

Right now telling me something is "OpenAI API compatible" really isn't enough information. Which bits of that API? Which particular date-in-time was it created to match?

londons_explore2y ago

It's a JSON API... JSON API's tend to be more... 'flexible'.

To consume them, just assume that every field is optional and extra fields might appear at any time.

swyx2y ago

and disappear at any time... was a leetle bit unsettled by the sudden deprecation of "functions" for "tools" with only minor apparante benefit

2 more replies

te_chris2y ago

Amen! The lack of decent errors from OpenAI is the most annoying. They'll silently return 400 with no explanation. Let's hope that doesn't catch on.

OpenAI compatible just seems to mean 'you can format your prompt like the `messages` array'.

1 more reply

Patrick_Devine2y ago

minimaxir2y ago

The power of open source!

1 more reply

satellite22y ago

At some point, (probably in a relatively close future), there will be the AI Consortium (AIC) to decide what enters the common API?

minimaxir2y ago

That's why it's good as an option to minimize friction and reduce lock-in to OpenAI's moat.

sheepscreek2y ago

I would take an imperfect standard over no standard any day!

dimask2y ago

There is a difference between a standard and a monopoly, though.

tracerbulletx2y ago

Havoc2y ago· 8 in thread

I don’t quite follow why people use ollama ? It sounds like lama.cpp with less features and training wheels

Is it just ease of use or is there something I’m missing?

dizhn2y ago

mark_l_watson2y ago

sp3322y ago

The CLI for llama.cpp is very clunky IMO. I put some kind of UI on it when I want to get something done.

spmurrayzzz2y ago

It also ships with an openai-compatible server implementation as well now that you could point your UI at (if you wanted to run leaner w/out ollama).

https://github.com/ggerganov/llama.cpp/blob/master/examples/...

__loam2y ago

It's always ease of use lol. Thinking the best technology wins is a fallacy.

boarush2y ago

Ollama is just easier to use and serve the model on a local http server. I personally use it for testing stuff with llama-index as well. Pretty useful to say the least with zero configuration issues.

skp19952y ago

titaniumtown2y ago

It's a wrapper around llama.cpp that provides a stable api

arbuge2y ago· 7 in thread

Genuinely curious to ask HN this: what are you using local models for?

codazoda2y ago

I got the most use out of it on an airplane with no wifi. It let me keep working on a coding solution without the internet because I could ask it quick questions. Magic.

mysteria2y ago

teruakohatu2y ago

Experimenting, as well as a cheaper alternative to cloud/paid models. Local models don't have the encyclopaedic knowledge as huge models such as GPT 3.5/4, but they can perform tasks well.

chown2y ago

[1]: https://msty.app

RamblingCTO2y ago

/e: https://github.com/ChristianSch/theta

amelius2y ago

I'm hoping someone will write a tool to do project estimations. Like instead of my manager asking me "how long would it take you to implement X,Y,Z ...", he could use the LLM instead.

It doesn't even need to be very accurate because my own estimations aren't either :)

dimask2y ago

thedangler2y ago· 6 in thread

Is Ollama model I can use locally to use for my own project and keep my data secure?

jasonjmcghee2y ago

Ollama is an easy way to run local models on Mac/linux. See https://ollama.ai they have a web UI and a terminal/server approach

MOARDONGZPLZ2y ago

jasonjmcghee2y ago

mbernstein2y ago

Examples?

slimsag2y ago

like what? If you're gonna accuse a project of shady stuff, at least give examples :)

1 more reply

v3ss0n2y ago

Opensource project so you can find evidence of foul play . Prove it or it is bs

ilaksh2y ago· 5 in thread

I think it's a little misleading to say it's compatible with OpenAI because I expect function or tool calling when you say that.

It's nice that you have the role and content thing but that was always fairly trivial to implement.

minimaxir2y ago

The documentation is upfront about which features are excluded: https://github.com/ollama/ollama/blob/main/docs/openai.md

e12e2y ago

> Function calling/tool choice is done at the application level and currently there's no standard format,

Is this true for open ai - or just everything else?

ianbicking2y ago

I was drawn to Gemini Pro because it had function/tool calling... but it works terribly. (I haven't tried Gemini Ultra yet; unclear if it's available by API?)

Anyway, probably best that they didn't release support that doesn't work.

williamstein2y ago

osigurdson2y ago

It makes obvious sense to anyone with experience with OpenAI APIs.

ptrhvns2y ago· 5 in thread

FYI: the Linux installation script for Ollama works in the "standard" style for tooling these days:

    curl https://ollama.ai/install.sh | sh

Vinnl2y ago

[0] https://github.com/ollama/ollama/blob/main/docs/linux.md#man...

[1] https://ollama.ai/download/ollama-linux-amd64

dizhn2y ago

The ollama binary goes into /usr/bin which it doesn't have to but it's convenient. I haven't checked what else needs root access.

riffic2y ago

we have package managers in this day and age, lol.

jazzyjackson2y ago

2 more replies

jampekka2y ago

Sadly most of them kinda suck, especially for packagers.

2 more replies

behnamoh2y ago· 5 in thread

ollama seems like taking a page from langchain book: develop something that's open source but get it so popular that attracts VC money.

I never liked ollama, maybe because ollama builds on llama.cpp (a project I truly respect) but adds so much marketing bs.

I don't like it when +500 people are working tirelessly on llama.cpp and then guys like langchain, ollama, etc. rip off the benefits.

slimsag2y ago

Make something better, then. (I'm not being dismissive, I really genuinely mean it - please do)

If Ollama was just a wrapper over llama.cpp, then everyone would just use llama.cpp.

The sooner open source devs in general learn to make their projects usable by an average developer, the sooner it will be competitive with these VC-funded 'open source' projects.

behnamoh2y ago

llama.cpp already has OpenAI compatible API.

It takes literally one line to install it (git clone and then make).

It takes one line to run the server as mentioned on their examples/server README.

    ./server -m <model> <any additional arguments like mmlock>

homarp2y ago

>notice the stark contrast of how difficult getting llama.cpp connected to some dumb JS app is compared to Ollama.

Sorry, I'm new to ollama 'ecosystem'.

From llama.cpp readme, I ctrl-F-ed "Node.js: withcatai/node-llama-cpp" and from there, I got to https://withcatai.github.io/node-llama-cpp/guide/

Can you explain how ollama does it 'easier' ?

FanaHOVA2y ago

ggml is also VC backed, so that has nothing to do with it.

udev40962y ago

I didn't know ollama was VC funded

lolpanda2y ago· 4 in thread

The compatibility layer can be also built in libraries. For example, Langchain has llm() which can work with multiple LLM backend. Which do you prefer?

avereveard2y ago

Szpadel2y ago

but this means you need each library to support each llm, and I think this is the same issue what is with object storage where basically everyone support S3 compatible API

it's great to have some standard API even if that's isn't perfect, but having second API that allows you to use full potential (like B2 for backblaze) is also fine

so there isn't one model fits all, and if your model have different capabilities, then imo you should provide both options

SOLAR_FIELDS2y ago

mise_en_place2y ago

eclectic292y ago· 4 in thread

What's the use case of Ollama? Why should I not use llama.cpp directly?

TheCoreh2y ago

It's like a docker/package manager for the LLMs. You can easily install them, discover new ones, update them via a standardized, simple CLI. It also auto updates effortlessly.

dizhn2y ago

Yesterday I learned it also deduplicates skmiler model files.

jpdus2y ago

visarga2y ago

ollama swaps models from the local library on the fly, based on the request args, so you can test against a bunch of models quickly

1 more reply

mrtimo2y ago· 3 in thread

I am business prof. I wanted my students to try out ollama (with web-ui), so I built some directions for doing so on google cloud [1]. If you use a spot instance you can run it for 18 cents an hour.

[1] https://docs.google.com/document/d/1OpZl4P3d0WKH9XtErUZib5_2...

ijustlovemath2y ago

dizhn2y ago

You can run a lot of things on Google Colab for free as well. KoboldCPP has a nice premade thing on their website that can even load different models.

teruakohatu2y ago

Very useful thanks

ben_w2y ago· 3 in thread

I had trouble installing Ollama last time I tried, I'm going to try again tomorrow.

I've already got a web UI that "should" work with anything that matches OpenAI's chat API, though I'm sure everyone here knows how reliable air-quotes like that are when a developer says them.

https://github.com/BenWheatley/YetAnotherChatUI

ben_w2y ago

Turns out my failure to install last time was due to thinking that the instructions on the python library blog post were complete installation instructions for the whole thing.

> pip install ollama

- https://ollama.ai/blog/python-javascript-libraries

is just the python libraries, not ollama itself, which the libraries need, and without which they will just…

> httpx.ConnectError: [Errno 61] Connection refused

Install the main app from the big friendly download button, and this problem fixed itself: https://ollama.ai/download

regularfry2y ago

If you don't care about the electron app and just want the API, you can `go generate ./... && go build && ./ollama serve` and you're off to the races. No installation needed.

ben_w2y ago

I made my web interface before I'd even heard of Ollama, and because I wanted a PAYG interface for GPT-4.

You also don't need to actually install my web UI, as it runs from the github page and the endpoint and API key are both configurable by the user during a chat session.

Also (a) the ollama command line interface is good enough for what I actually want, (b) my actual problem was not realising I'd only installed the python and not the underlying model.

laingc2y ago· 3 in thread

What's the current state-of-the-art in deploying large, "self-hosted" models to scalable infrastructure? (e.g. AWS or k8s)

Example use case would be to support a web application with, say, 100k DAU.

kkielhofner2y ago

Nvidia Triton Inference Server with the TensorRT-LLM backend:

https://github.com/triton-inference-server/tensorrtllm_backe...

It’s used by Mistral, AWS, Cloudflare, and countless others.

vLLM, HF TGI, Rayserve, etc are certainly viable but Triton has many truly unique and very powerful features (not to mention performance).

Anyway, the point is Triton is just about the only thing out there for use in this general range and up.

Palmik2y ago

Do you have a source on Mistral API, etc. being based on TensoRT-LLM? And what are the main distinguishing features?

What I like about vLLM is the following:

- It exposes AsyncLLMEngine, which can be easily wrapped in any API you'd like.

- It has a logit processor API making it simple to integrate custom sampling logic.

- It has decent support for interference of quantized models.

1 more reply

laingc2y ago

Very helpful answer, thank you!

osigurdson2y ago· 3 in thread

Smart. When they do come, will the embedding vectors be OpenAI compatible? I assume this is quite hard to do.

minimaxir2y ago

osigurdson2y ago

Yes of course (syntactically it is just float[] getEmbeddings(text)) but are the numbers close to what OpenAI would produce? I assume no.

1 more reply

dragonwriter2y ago

hubraumhugo2y ago· 2 in thread

It feels absolutely amazing to build AI startup right now:

- We first struggled with token limits [solved]

- We had issues with consistent JSON ouput [solved]

- We had rate limiting and performance issues for the large 3rd party models [solved]

- We wanted to reduce costs by hosting our own OSS models for small and medium complex tasks [solved]

It's like your product becomes automatically cheaper, more reliable, and more scalable with every new major LLM advancement.

Obivously you still need to build up defensibility and focus on differentiating with everything “non-AI”.

topicseed2y ago

> We first struggled with token limits [solved]

How has this been solved in your opinion? Do you mean with recent versions with much bigger limits but also heaps more expensive?

gitfan862y ago

The limits still exist but for certain use cases larger limits have helped

1 more reply

theogravity2y ago· 2 in thread

Isn't LangChain supposed to provide abstractions that 3rd parties shouldn't need to conform to OpenAI's API contract?

I know not everyone uses LangChain, but I thought that was one of the primary use-cases for it.

minimaxir2y ago

Which just then creates lock-in for LangChain's abstractions.

ludwik2y ago

3 more replies

lxe2y ago· 2 in thread

Does ollama support loaders other than llamacpp? I'm using oobabooga with exllama2 to run exl2 quants on a dual NVIDIA gpu, and nothing else seems to beat performance of it.

_ink_2y ago

I tried that, but failed to get the GPU split working. Do you have a link on how to do that?

lxe2y ago

Do what exactly? I have no issues with GPU split on oobabooga with either exl2 or gguf.

shay_ker2y ago· 1 in thread

Is Ollama effectively a dockerized HTTP server that calls llama.cpp directly? For the exception of this newly added OpenAI API ;)

okwhateverdude2y ago

init02y ago· 1 in thread

Trying to openai am I missing something?

    import OpenAI from 'openai'

    const openai = new OpenAI({
      baseURL: 'http://localhost:11434/v1',
      apiKey: 'ollama', // required but unused
    })

    const chatCompletion = await 
      openai.chat.completions.create({
      model: 'llama2',
      messages: [{ role: 'user', content: 'Why is the sky blue?' }],
    })

    console.log(completion.choices[0].message.content)

I am getting the below error:

    return new NotFoundError(status, error, message, headers);
                   ^
    NotFoundError: 404 404 page not found

xena2y ago

Remove the v1

LightMachine2y ago· 1 in thread

Gemini Ultra release day, and a minor post on ollama OpenAI compatibility gets more points lol

subarctic2y ago

Who cares about another closed LLM that's no better than GPT 4? I think there's more exciting potential in open weights LLMs that you can run on your own machine and do whatever you want with.

bulbosaur1232y ago· 1 in thread

Anyone actually tested it with GPT4 api to see how well it performs?

minimaxir2y ago

That's not what this announcement is: it's an I/O schema for OSS local LLMs.

slimsag2y ago

Seems everyone is centralizing behind OpenAI API compatibility, e.g. there is OpenLLM and a few others which implement the same API as well.

patelajay2852y ago

It's a little bit easier to use if you want to do this without an HTTP API, directly in Python.

SamPatt2y ago

Ollama is great. If you want a GUI, LMStudio and Jan are great too.

I'm building a React Native app to connect mobile devices to local LLM servers run with these programs.

https://github.com/sampatt/lookma

Roark662y ago

tosh2y ago

I wonder why ollama didn't namespace the path (e.g. under "/openai") but in any case this is great for interoperability.

678j53672y ago

syntaxing2y ago

Implicated2y ago

Love it! Ollama has been such a wonderful project (at least, for me).

jhoechtl2y ago

How does ollama compare to H2o? We dabbled a bit with H2o and it looks very promising

https://gpt.h2o.ai/

jacooper2y ago

Does ollama support ROCm? It's not clear from their github repo if it does.

philprx2y ago

How does Ollama compare to LocalGPT ?

v01d4lph42y ago

This is super neat! Thanks folks!

udev40962y ago

Awesome!

j / k navigate · click thread line to collapse