Quite literally hours ago: https://euri.ca/blog/2024-llm-self-hosting-is-easy-now/
- get a Mac Mini or Mac Studio
- just run ollama serve
- run ollama web-ui in docker
- add some coding assistant model from ollamahub with the web-ui
- upload your documents in the web-ui
No code needed: you have your self-hosted LLM with basic RAG giving you answers with your documents in context. For us the DeepSeek Coder 33B model is fast enough on a Mac Studio with 64 GB of RAM and can give pretty good suggestions based on our internal coding documentation.
You can mix models in a single model file; it's a feature I've been experimenting with lately.
Note: you don't have to rely on their model library; you can use your own. Also, support for new models comes through their bindings to llama.cpp.
I still need GPT-4 for some tasks, but in daily usage it's replaced much of my ChatGPT usage, especially since I can import all of my ChatGPT chat history. I'm also curious to learn what people want to do with local AI.
Does anyone know why this would be?
'llama.cpp-based' generally seems like the norm.
Ollama is just really easy to set up & get going on macOS. Integrated support like this means one less thing to wire up or worry about when using a local LLM as a drop-in replacement for OpenAI's remote API. Ollama also has a model library[1] you can browse & easily retrieve models from.
Another project, Ollama-webui[2], is a nice web UI/frontend for local LLMs in Ollama. It supports the latest LLaVA for multimodal image/prompt input, too.
I just played around with this tool and it works as advertised, which is cool, but I'm up and running already. (For anyone reading this who, like me, doesn't want to learn all the optimization work: you might want to check which one is faster on your machine.)
they separate serving heavy weights from model definition and usage itself.
what that means is that the weights of some model, let's say mixtral, are loaded in the server process (and kept in memory for 5 minutes by default), and you interact with it through a modelfile (inspired by the dockerfile). all your modelfiles that inherit FROM mixtral will reuse the weights already loaded in memory, so you can instantly swap between different system prompts etc. those appear as normal models to use through the cli or ui.
the effect is that you have very low latency, very good interface - for programming api and ui.
ps. it's not only for macs
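To make the modelfile idea above concrete, here is a rough sketch that renders two modelfiles sharing one base model but carrying different system prompts. FROM, SYSTEM and PARAMETER are Modelfile instructions; the helper name `renderModelfile`, the variant names and the prompts are all made up for illustration.

```javascript
// Render the text of a Modelfile for a given base model.
// renderModelfile and the variant names below are hypothetical helpers,
// not part of Ollama itself.
function renderModelfile({ base, system, temperature }) {
  const lines = [`FROM ${base}`];
  if (temperature !== undefined) {
    lines.push(`PARAMETER temperature ${temperature}`);
  }
  lines.push(`SYSTEM """${system}"""`);
  return lines.join('\n') + '\n';
}

// Two variants that inherit from the same base, so (per the comment above)
// they would share the weights already loaded by the server.
const variants = {
  'mixtral-reviewer': renderModelfile({
    base: 'mixtral',
    system: 'You are a strict code reviewer.',
    temperature: 0.2,
  }),
  'mixtral-tutor': renderModelfile({
    base: 'mixtral',
    system: 'You explain code to beginners.',
  }),
};

for (const [name, text] of Object.entries(variants)) {
  console.log(`# ${name}\n${text}`);
}
```

Each rendered file could then be saved and registered with something like `ollama create mixtral-reviewer -f <path to the file>`, after which it shows up as its own model name.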
open weight models + (llama.app) as ollama + ollama-webui = real openai.
> A few pip install X’s and you’re off to the races with Llama 2! Well, maybe you are, my dev machine doesn’t have the resources to respond on even the smallest model in less than an hour.
I never tried to run these LLMs on my own machine -- is it this bad?
I guess if I only have a moderate GPU, say a 4060TI, there is no chance I can play with it, then?
Unfortunately, having tried this and a bunch of other models, they are all toys compared to GPT-4.
I wonder what pain points people have around the API becoming a standard, and whether anyone has taken a crack at alternative standards that people should consider.
I'm fine with it emerging as a community standard if there's a REALLY robust specification for what the community considers to be "OpenAI API compatible".
Crucially, that standard needs to stay stable even if OpenAI have released a brand new feature this morning.
So I want the following:
- A very solid API specification, including error conditions
- A test suite that can be used to check that new implementations conform to that specification
- A name. I want to know what it means when software claims to be "compatible with OpenAI-API-Spec v3" (for example)
Right now telling me something is "OpenAI API compatible" really isn't enough information. Which bits of that API? Which particular date-in-time was it created to match?
To consume them, just assume that every field is optional and extra fields might appear at any time.
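A minimal sketch of what "every field is optional" can look like on the consuming side. It assumes the usual chat completion shape (`choices[0].message.content`, `finish_reason`, `usage.total_tokens`); the function name `extractReply` and the fallback values are my own choices, not anything a spec mandates.

```javascript
// Read a chat-completion-style response without trusting any field to exist.
// Unknown extra fields are simply ignored.
function extractReply(response) {
  const choice = response?.choices?.[0];
  const content = choice?.message?.content;
  return {
    // Fall back to an empty string when content is missing or not a string.
    text: typeof content === 'string' ? content : '',
    finishReason: choice?.finish_reason ?? 'unknown',
    totalTokens: response?.usage?.total_tokens ?? null,
  };
}

// A response with an unexpected extra field and no usage block.
console.log(extractReply({
  choices: [{ message: { role: 'assistant', content: 'Hi' }, finish_reason: 'stop' }],
  some_new_field: 123,
}));

// A response with nothing in it at all.
console.log(extractReply({}));
```

Optional chaining keeps the reader from throwing when a server omits a block entirely, which seems to be the common failure mode across "compatible" backends.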
OpenAI compatible just seems to mean 'you can format your prompt like the `messages` array'.
The power of open source!
Is it just ease of use or is there something I’m missing?
https://github.com/ggerganov/llama.cpp/blob/master/examples/...
[1]: https://msty.app
It doesn't even need to be very accurate because my own estimations aren't either :)
It's nice that you have the role and content thing but that was always fairly trivial to implement.
When it gets to agents you do need to execute actions. In the agent hosting system I started, I included a scripting engine, which makes me think that maybe I need to set up security and permissions for the agent system and just let it run code. Which is what I started.
So I guess I am not sure I really need the function/tool calling. But if I see that a bunch of people are actually standardizing on tool calls, then maybe I need it in my framework just because it will be expected, even if I have arbitrary script execution.
Function calling/tool choice is done at the application level and currently there's no standard format, and the popular ones are essentially inefficient bespoke system prompts: https://github.com/langchain-ai/langchain/blob/master/libs/l...
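For anyone wondering what "application-level" tool calling via a system prompt looks like, here is a small sketch. The JSON convention (`{"tool": ..., "args": ...}`), the tool names and every function here are invented for illustration; this is not LangChain's format or anyone else's.

```javascript
// Hypothetical tool registry; names and behaviour are made up for this sketch.
const tools = {
  add: ({ a, b }) => a + b,
  upper: ({ text }) => String(text).toUpperCase(),
};

// System prompt asking the model to answer with a JSON tool call.
function buildSystemPrompt() {
  const names = Object.keys(tools).join(', ');
  return `You can call these tools: ${names}. ` +
    'To call one, reply with only JSON like {"tool": "<name>", "args": {...}}.';
}

// Parse the model's reply; return null when it is not a valid tool call.
function parseToolCall(reply) {
  try {
    const parsed = JSON.parse(reply);
    const known = parsed && typeof parsed.tool === 'string' &&
      Object.prototype.hasOwnProperty.call(tools, parsed.tool);
    if (known) {
      return { tool: parsed.tool, args: parsed.args ?? {} };
    }
  } catch {
    // Not JSON: treat the reply as ordinary assistant text.
  }
  return null;
}

// Run the tool if the reply is a tool call, otherwise pass the text through.
function handleReply(reply) {
  const call = parseToolCall(reply);
  return call ? tools[call.tool](call.args) : reply;
}

console.log(buildSystemPrompt());
console.log(handleReply('{"tool": "add", "args": {"a": 2, "b": 3}}'));
console.log(handleReply('The sky is blue.'));
```

The model replies here are hard-coded strings; in a real app they would come back from whatever chat endpoint you use, and the prompt tokens spent describing the tools on every request are part of why this approach is inefficient.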
Is this true for open ai - or just everything else?
Anyway, probably best that they didn't release support that doesn't work.
curl https://ollama.ai/install.sh | sh
However, the last time I checked, that script asked for root-level privileges via sudo. So, if you want the tool, you may want to download the script and have a look at it first, or modify it depending on your needs.
[0] https://github.com/ollama/ollama/blob/main/docs/linux.md#man...
I never liked ollama, maybe because ollama builds on llama.cpp (a project I truly respect) but adds so much marketing bs.
For example, the @ollama account on twitter keeps shitposting on every possible thread to advertise ollama. The other day someone posted something about their Mac setup and @ollama said: "You can run ollama on that Mac."
I don't like it when +500 people are working tirelessly on llama.cpp and then guys like langchain, ollama, etc. rip off the benefits.
I don't know who is behind Ollama and don't really care about them. I can agree with your disgust for VC 'open source' projects. But there's a reason they become popular and get investment: because they are valuable to people, and people use them.
If Ollama was just a wrapper over llama.cpp, then everyone would just use llama.cpp.
It's not just marketing, either. Compare the README of llama.cpp to the Ollama homepage, notice the stark contrast of how difficult getting llama.cpp connected to some dumb JS app is compared to Ollama. That's why it becomes valuable.
The same thing happened with Docker. We're only now getting a viable alternative, Podman Desktop, after Docker as a company imploded, and even then it still suffers from major instability on e.g. modern Macs.
The sooner open source devs in general learn to make their projects usable by an average developer, the sooner it will be competitive with these VC-funded 'open source' projects.
It takes literally one line to install it (git clone and then make).
It takes one line to run the server as mentioned on their examples/server README.
./server -m <model> <any additional arguments like mlock>
Sorry, I'm new to the ollama 'ecosystem'.
From the llama.cpp readme, I ctrl-F-ed "Node.js: withcatai/node-llama-cpp" and from there, I got to https://withcatai.github.io/node-llama-cpp/guide/
Can you explain how ollama does it 'easier'?
it's great to have some standard API even if it isn't perfect, but having a second API that lets you use the full potential (like B2 for backblaze) is also fine
so there isn't a one-size-fits-all model, and if your model has different capabilities, then imo you should provide both options
[1] https://docs.google.com/document/d/1OpZl4P3d0WKH9XtErUZib5_2...
I've already got a web UI that "should" work with anything that matches OpenAI's chat API, though I'm sure everyone here knows how reliable air-quotes like that are when a developer says them.
> pip install ollama
- https://ollama.ai/blog/python-javascript-libraries
is just the python libraries, not ollama itself, which the libraries need, and without which they will just…
> httpx.ConnectError: [Errno 61] Connection refused
Installing the main app from the big friendly download button fixed the problem: https://ollama.ai/download
You also don't need to actually install my web UI, as it runs from the github page and the endpoint and API key are both configurable by the user during a chat session.
Also (a) the ollama command line interface is good enough for what I actually want, (b) my actual problem was not realising I'd only installed the Python library and not the underlying model.
Example use case would be to support a web application with, say, 100k DAU.
https://github.com/triton-inference-server/tensorrtllm_backe...
It’s used by Mistral, AWS, Cloudflare, and countless others.
vLLM, HF TGI, Ray Serve, etc. are certainly viable but Triton has many truly unique and very powerful features (not to mention performance).
100k DAU doesn’t mean much on its own. You’d need a better understanding of the application: input tokens, generated output tokens, request rates, peaks, etc., not to mention the required time to first token, tokens per second, and so on.
Anyway, the point is Triton is just about the only thing out there for use in this general range and up.
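To show why the DAU figure alone says little, here is a back-of-envelope sketch that turns it into request and token rates. Every input besides the 100k DAU (requests per user, token counts, peak factor) is a made-up assumption; plug in your own measurements.

```javascript
// Rough load estimate from daily active users.
// All inputs except dau are hypothetical placeholders, not measurements.
function estimateLoad({ dau, requestsPerUserPerDay, inputTokens, outputTokens, peakFactor }) {
  const requestsPerDay = dau * requestsPerUserPerDay;
  const avgRps = requestsPerDay / 86400; // 86400 seconds in a day
  const peakRps = avgRps * peakFactor;   // peak traffic as a multiple of average
  return {
    avgRps,
    peakRps,
    peakInputTokensPerSec: peakRps * inputTokens,
    peakOutputTokensPerSec: peakRps * outputTokens,
  };
}

const load = estimateLoad({
  dau: 100000,
  requestsPerUserPerDay: 10, // assumption
  inputTokens: 500,          // assumption: average prompt size
  outputTokens: 200,         // assumption: average completion size
  peakFactor: 3,             // assumption: peak is 3x the daily average
});

console.log(load);
```

With these invented inputs the average works out to about 11.6 requests per second and the peak to roughly 35, or about 6,900 generated tokens per second at peak. Change requests per user from 10 to 1 and the whole problem shrinks tenfold, which is the point: the same DAU can mean very different hardware.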
What I like about vLLM is the following:
- It exposes AsyncLLMEngine, which can be easily wrapped in any API you'd like.
- It has a logit processor API making it simple to integrate custom sampling logic.
- It has decent support for inference of quantized models.
- We first struggled with token limits [solved]
- We had issues with consistent JSON output [solved]
- We had rate limiting and performance issues for the large 3rd party models [solved]
- We wanted to reduce costs by hosting our own OSS models for small and medium complex tasks [solved]
It's like your product becomes automatically cheaper, more reliable, and more scalable with every new major LLM advancement.
Obviously you still need to build up defensibility and focus on differentiating with everything “non-AI”.
I know not everyone uses LangChain, but I thought that was one of the primary use-cases for it.
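import OpenAI from 'openai'

// Point the official OpenAI client at the local Ollama server.
const openai = new OpenAI({
  baseURL: 'http://localhost:11434/v1',
  apiKey: 'ollama', // required but unused
})

// Send a single user message to the llama2 model.
const chatCompletion = await openai.chat.completions.create({
  model: 'llama2',
  messages: [{ role: 'user', content: 'Why is the sky blue?' }],
})

// Print the reply from the same variable the response was assigned to.
console.log(chatCompletion.choices[0].message.content)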
I am getting the below error:
  return new NotFoundError(status, error, message, headers);
         ^
NotFoundError: 404 404 page not found
I've been needing something exactly like this to test against in local dev environments :) Ollama having this will make my life / testing against the myriad of LLMs we need to support way, way easier.
Seems everyone is centralizing behind OpenAI API compatibility, e.g. there is OpenLLM and a few others which implement the same API as well.
It's a little bit easier to use if you want to do this without an HTTP API, directly in Python.
I'm building a React Native app to connect mobile devices to local LLM servers run with these programs.
Llama.cpp is not far behind, but I find the well-structured Python code of transformers easier to modify and extend (with context-free grammars, function calling, etc.) than waiting for your favourite alternate runtime to support a new model.