Ask HN: What is the current (Apr. 2024) gold standard of running an LLM locally?

195 pointsjs982y ago88 comments

There are many options and opinions about, what is currently the recommended approach for running an LLM locally (e.g., on my 3090 24Gb)? Are options ‘idiot proof’ yet?

88 comments

71 comments · 26 top-level

aantix2y ago· 13 in thread

Ollama is really easy.

brew install ollama

brew services start ollama

ollama pull mistral

Ollama you can query via http. It provides a consistent interface for prompting, regardless of model.

https://github.com/ollama/ollama/blob/main/docs/api.md#reque...

brushfoot2y ago

For Windows users without brew, there's a Windows installer:

https://ollama.com/download/windows

WinGet and Scoop apparently also have it. Chocolatey doesn't seem to.

ActorNightly2y ago

Works very well in WSL2 as well. I prefer that as you have to start the server manually, so it doesn't sit in the background.

crazypython2y ago

It lacks batching support (n>1), unfortunately, which is necessary for Loom-like applications

aantix2y ago

Loom - this application?

https://medium.com/@tofujoy77/loom-ai-uncovering-creative-wr...

brushfoot2y ago

How does Ollama distribute these models? The downloads page on the Llama website, https://llama.meta.com/llama-downloads/, has a heading that reads:

> Request access to Llama

Which to me gives the impression that access is gated and by application only. But Ollama downloaded it without so much as a "y".

Is that just Meta's website UI? Registration isn't actually required?

cymor2y ago

It doesn't distribute. The app pulls from hugging face.

1 more reply

whimsicalism2y ago

because this field is a baby, noncompliance is rampant

eatonphil2y ago

This repo doesn't seem to say anything about what it does with anything you pass it. There's no privacy policy or anything? It being open source doesn't necessarily mean it isn't passing all my data somewhere. I didn't see anything about this in the repo that everything definitely stays local.

Zambyte2y ago

I would be much more concerned if it had a privacy policy, to the point where just having one means I probably wouldn't use it. That is not common practice for Free Software that runs on your machine. The only network operations that Ollama has is managing LLMs (ie: download Mistral from their server).

joenot4432y ago

> I didn't see anything about this in the repo that everything definitely stays local.

If this is your concern, I'd encourage you to read the code yourself. If you find it meets the bar you're expecting, then I'd suggest you submit a PR which updates the README to answer your question.

okwhateverdude2y ago

I run the docker container locally. As far as I can tell, it doesn't call home or anything (from reading the source and from opensnitch). It is just a cgo wrapped llama.cpp that provides an HTTP API. It CAN fetch models from their library, but you can just as easily load in your own GGUF formatted llama models. They implement a docker-like layers mechanism for model configuration that is pretty useful.

aantix2y ago

It is free.. So why not just audit the source personally to put your mind at ease and build locally if that’s a major concern of yours?

1 more reply

m4632y ago

I associate brew with macos (where a 3090 would not venture)

But it seems like there's a linux brew.

CuriouslyC2y ago· 12 in thread

Given you're running a 3090 24gb, go with Oobabooga/Sillytavern, and don't come here for advice on this stuff, go to https://www.reddit.com/r/LocalLLaMA/, they usually have a "best current local model" thread pinned, and you're less likely to get incorrect advice.

pluc2y ago

Non-cancer link: https://old.reddit.com/r/LocalLLaMA/ (though that won't work on VPNs now)

(I just got a little emotional because these are things we used to say on reddit and now we say them about reddit. How the mighty have fallen)

notjulianjaynes2y ago

> (though that won't work on VPNs now)

Did for me just now, although as of a week or two ago reddit has been blocking many of my attempts to access through a VPN old. or not. Usually need to reconnect about 3 or 4 times before a page will load.

tommy_axle2y ago

VPNs were being blocked over the weekend (I checked after seeing https://news.ycombinator.com/item?id=39883747) but as of today within the last hour or so it seems like it works again for both old.redddit.com and regular reddit.com so something changed.

razodactyl2y ago

I miss when redditors were really nice xD That was a while ago now. (Or maybe people just make me more nervous these days lol)

Cheezmeister2y ago

RFC: Is there some straightforward way to use a Pi-hole like setup to 302 redirect `reddit.com` to `old.reddit.com`?

I'm sure there are myriad browser extensions that will do it at the DOM level, but that's such a heavy-handed solution, and also lol I'm not putting an extension on the cartesian product of all my browsers on all my machines in the service of dis-enshittifying one once-beloved social network.

3 more replies

fsckboy2y ago

>Given you're running a 3090 ... "best current local model" thread pinned, and you're less likely to get incorrect advice

why do you think my info about the 3090 https://en.wikipedia.org/wiki/IBM_3090 is going to be anything less than up-to-date?

m4632y ago

ibm 3090 makes 3 slots seem quite svelte.

on the other hand, 24gb of late 80's memory... how many acres of raised floor data center would that take?

brushfoot2y ago

That doesn't seem to be the case, at least right now. There's no pinned post as far as I can tell, and searching the sub for `best model` returns mostly questions.

CuriouslyC2y ago

It might be because a lot of people use chatbot arena leaderboard for that purpose now, check it out here and look for local models:

https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboar...

1 more reply

aliasaria2y ago

If you are looking for a downloadable app as an alternative to Oobabooga / Sillytavern, we built https://github.com/transformerlab/transformerlab-app as a way to easily download, interact, and even train local llms.

Der_Einzige2y ago

The reason everyone loves oobabooga is that it’s made with a maximalist design paradigm.

Everyone in the VC world misunderstands why oobabooga is successful and tries to embrace not maximalism.

Your example product to benchmark yourself against is blender, if you want to serious compete against oobabooga. You need maximalism

1 more reply

downrightmike2y ago

Just let reddit die already, the VPN blocking is just bullshit.

mateuszbuda2y ago· 6 in thread

Anyone can share experience with https://ollama.com/ ?

bovem2y ago

I love it. It is easy to install and containerized.

Its API is great if you want to integrate it with your code editor or create your own applications.

I have written a blog [1] on the process of deployment and integration with neovim and vscode.

I also created an application [2] to chat with LLMs by adding the context of a PDF document.

Update: I would like to add that because the API is simple and Ollama is now available on Windows I don’t have to share my GPU between multiple VMs to interact with it.

[1] https://www.avni.sh/posts/homelab/self-hosting-ollama/ [2] https://github.com/bovem/chat-with-doc

notjulianjaynes2y ago

Stupid easy. I was speaking with an LLM after pasting two lines into terminal.

cgopalan2y ago

I use it on my 2015 Macbook pro. Its amazing how quickly you can get set up, kudos to the authors. Its a dog in terms of response time for questions, but that's expected with my machine configuration.

Also they have Python (and less relevant to me) Javascript libraries. So I assume you dont have to go through LangChain anymore.

rib3ye2y ago

Installing additional LLMs is a single command. I am currently loving Dolphin.

jdwyah2y ago

super easy. fun to play with. fast.

we screwed around with it on a live stream: https://www.youtube.com/live/3YhBoox4JvQ?si=dkni5LY3EALnWVuE...

If you're writing something that will run on someone's local machine I think we're at the point where you can start building with the assumption that they'll have a local, fast, decent LLM.

auggierose2y ago

> If you're writing something that will run on someone's local machine I think we're at the point where you can start building with the assumption that they'll have a local, fast, decent LLM.

I don't believe that at all. I don't have any kind of local LLM. My mother doesn't, either. Nor does my sister. My girl-friend? Nope.

teeray2y ago· 4 in thread

Also, what are the recommended hardware options these days?

notjulianjaynes2y ago

If you're broke, get a refurbished/used Nvidia P40. Same amount of vram as a 3090, between 4 and 10 times cheaper depending on how cheap you can find a 3090.

Granted it's slower of course, but best bang for your buck on vram, so you can run larger models than on a smaller bit faster card might be able to. (Not an expert.)

Edit: if using in desktop tower, you'll need to cool it somehow. I'm using a 3D printed fan thingy, but some people have figured out how to use a 1080 ti APU cooler with it too.

1 more reply

aliasaria2y ago

If you're able to purchase a separate GPU, the most popular option is to get an NVIDIA RTX3090 or RTX4090.

Apple Mac M2 or M3's are becoming a viable option because of MLX https://github.com/ml-explore/mlx . If you are getting an M series Mac for LLMs, I'd recommend getting something with 24GB or more of RAM.

ein0p2y ago

You don’t need MLX for this. Ollama, which is based on llama.cpp is GPU accelerated on a Mac. In particular it has better performance on quantized models. MLX can be used for eg fine tuning etc. It’s a bit faster than PyTorch for that.

whimsicalism2y ago

mlx is not super relevant

keriati12y ago· 3 in thread

We run coding assistance models on MacBook Pros locally, so here is my experience: On hardware side I recommend Apple M1 / M2 / M3 with at least 400Gb/s memory bandwidth. For local coding assistance this is perfect for 7B or 33B models.

We also run a Mac Studio with a bigger model (70b), M2 ultra and 192GB ram, as a chat server. It's pretty fast. Here we use Open WebUI as interface.

Software wise Ollama is OK as most IDE plugins can work with it now. I personally don't like the go code they have. Also some key features are missing from it that I would need and those are just never getting done, even as multiple people submitted PRs for some.

LM Studio is better overall, both as server or as chat interface.

I can also recommend CodeGPT plugin for JetBrains products and Continue plugin for VSCode.

As a chat server UI as I mentioned Open WebUI works great, I use it with together ai too as backend.

isoprophlex2y ago

An M2 ultra with 192 gb isn't cheap, did you have it lying around for whatever reason or do you have some very solid business case for running the model locally/on prem like that?

Or maybe I'm just working in cash poor environments...

Edit: also, can you do training / finetuning on an m2 like that?

keriati12y ago

We had some as build agent around already. We don't plan to do any fine tuning or training, so we did not explore this at all. However I don't think it is a viable option.

shostack2y ago

Can the Continue plugin handle multiple files in a directory of code?

bluedino2y ago· 3 in thread

Newbie questions:

What do you do with one of these?

Does it generate images? Write code? Can you ask it generic questions?

Do you have to 'train' it?

Do you need a large amount of storage to hold the data to train the model on?

aliasaria2y ago

You can chat with the models directly in the same way you can chat with GPT 3.5.

Many of the opensource tools that run these models let you also edit the system prompt, which lets you tweak their personality.

The more advanced tools let you train them, but most of the time, people are downloading pre-existing models and using them directly.

If you are training models, it depends what you are doing. Finetuning an existing pre-trained model requires lots of examples but you can often do a lot with, say, 1000 examples in a dataset.

If you are training a large model completely from scratch, then, yes, you need tons of data and very few people are doing that on their local machines.

montgomery_r2y ago

+1 on these questions. Can I run a local llm that will, for example - visit specified URLs and collect tabular data into csv format? - ingest a series of documents on a topic and answer questions about it - ingest all my PDF/MD/Word docs and answer questions about them?

aliasaria2y ago

Some of the tools offer a path to doing tool use (fetching URLs and doing things with them) or RAG (searching your documents). I think Oobabooga https://github.com/oobabooga/text-generation-webui offers the latter through plugins.

Our tool, https://github.com/transformerlab/transformerlab-app also supports the latter (document search) using local llms.

notjulianjaynes2y ago· 2 in thread

Jumping on this as a fellow idiot to ask for suggestions on having a local LLM generate a summary from a transcript with multiple speakers. How important is it that the transcript is well formatted (diarization, etc) first. My attempts have failed miserably thus far.

Edit: using a P40, whisper as ASR

petronic2y ago

My experience with more traditional (non-Whisper-based) diarization & transcription is that it's heavily dependent on how well the audio is isolated. In a perfect scenario (one speaker per audio channel, well-placed mics) you'll potentially see some value from it. If you potentially have scenarios where the speaker's audio is being mixed with other sounds or music, they'll often be flagged as an additional speaker (so, speaker 1 might also be speaker 7 and 9) - which can make for a less useful summary.

notjulianjaynes2y ago

You're saying that diarization quality is dependent on speaker isolation? I should have clarified, to my knowledge whisper does not perform that step, and whether I need to do diarization is what I'm trying to figure out. (Probably I do.) Pyannote.audio has been suggested, but I ran into some weird dependency thing I didn't feel like troubleshooting late last night, so I have not been successful in using it yet.

chasd002y ago· 1 in thread

If you want a code centric interface it's pretty easy to use llangchain with local models. For example, you specify a model from hugging face and it will download and run it locally.

https://python.langchain.com/docs/get_started/introduction

I like llangchain but it can get complex for use cases beyond a simple "give the llm a string, get a string back". I've found myself spending more time in llangchain docs than working on my actual idea/problem. However, it's still a very good framework and they've done an amazing job IMO.

edit: "Are options ‘idiot proof’ yet?" - from my limited experience, Ollama is about as straightforward as it gets.

ShamelessC2y ago

Why are you using two l’s to spell langchain?

deathmonger50002y ago· 1 in thread

I created a tool called Together Gift It because my family was sending an insane amount of gift related group texts during the holidays. My favorite one was when someone included the gift recipient in the group text about what gift we were getting for the person.

Together Gift It solves the problem the way you’d think: with AI. Just kidding. It solves the problem by keeping everything in one place. No more group texts. There are wish lists and everything you’d want around that type of thing. There is also AI.

https://www.togethergiftit.com/

deathmonger50002y ago

Sorry wrong thread :(

bababuriba2y ago

LMStudio has an easy interface, you can browse/search for models, it has a server if you need an API, etc.

it's pretty 'idiot proof', if you ask me.

https://lmstudio.ai

jillesvangurp2y ago

Gpt4all and Simon Willison's llm python tool are a nice way to get started; even on a modest laptop. A modest 14" mac book pro with 16GB goes a long way with most of the 7B models. Anything larger you need more ram.

ein0p2y ago

FWIW I use Ollama and Open WebUI. Ollama uses one of my RTX A6000 GPUs. I also use Ollama on my M3 MacBook Pro. Open WebUI even supports multimodal models.

theshrike792y ago

I use LM Studio[0] on my M-series macs and it's pretty plug and play.

I've got an Ollama instance running on a VPS providing a backend for a discord bot.

[0] https://lmstudio.ai

TachyonicBytes2y ago

I still feel that llamafiles[1] are the easiest way to do this, on most architectures. It's basically just running a binary with a few command-line options, seems pretty close to what you describe you want.

[1] https://github.com/Mozilla-Ocho/llamafile

ActorNightly2y ago

Ollama is the easiest. For coding, use VSCode Continue extension and point it to ollama server.

The thing to watch out for (if you have exposable income) is new RTX 5090. Rumors are floating they are going to have 48gb of ram per card. But if not, the ram bandwidth is going to be a lot faster. People who are on 4090 or 3090s doing ML are going to go to those, so you can pick up a second 3090 for cheap at which point you can load higher parameter models, however you will have to learn hugging face Accelerate library to support multi gpu inference (not hard, just some reading trial/error).

nlittlepoole2y ago

https://jan.ai/ is pretty idiot proof.

0xbadc0de52y ago

https://lmstudio.ai/ + Mixtral 8x7b

borissk2y ago

Interesting question, I'd like to know this also.

Guess it's going to be a variant of Llama or Grok.

whimsicalism2y ago

It depends if you want ease or speed and if you are batching.

Ease? Probably ollama

Speed and you are batching on gpu? vLLM

EarthAmbassador2y ago

Where can we find detailed info, or better a tool, to determine gear compatability?

_ea1k2y ago

ollama is the easiest, IMO. The CLI interface is pretty good too.

gpt4all is decent as well, and also provides a way to retrieve information from local documents.

LorenDB2y ago

Ollama is about as idiot proof as you'll get.

resource_waste2y ago

oobabooga + berkley sterling LM

Seriously, this is the insane duo that can get you going in moments with chatgpt3.5 quality.

tamarlikesdata2y ago

Hugging Face Transformers is your best bet. It's pretty straightforward and has solid docs, but you'll need to get your hands dirty a bit with setup and configs.

For squeezing every bit of performance out of your GPU, check out ONNX or TensorRT. They're not exactly plug-and-play, but they're getting easier to use.

And yeah, Docker can make life a bit easier by handling most of the setup mess for you. Just pull a container and you're more or less good to go.

It's not quite "idiot-proof" yet, but it's getting there. Just be ready to troubleshoot and tinker a bit.

database-theory2y ago

Paywall article: https://towardsdatascience.com/how-to-build-a-local-open-sou...

Source code: https://github.com/leoneversberg/llm-chatbot-rag

benreesman2y ago

The current HYPERN // MODERN // AI builds are using flox 1.0.2 to install llama.cpp. The default local model is dolphin-8x7b at ‘Q4_KM’ quantization (it lacks defaults for Linux/NXIDIA, that’s coming soon and it works, you just have to configure it manually and Mac gets more love because that’s what myself and the other main contributors have).

flox will also install properly accelerated torch/transformers/sentence-transfomers/diffusers/etc: they were kind enough to give me a preview of their soon-to-be-released SDXL environment suite (please don’t hold them to my “soon”, I just know it looks close to me). So you can do all the modern image stuff pretty much up to whatever is on HuggingFace.

I don’t have the time I need to be emphasizing this, but the last piece before I’m going to open source this is I’ve got a halfway decent sketch of a binary replacement/conplement for the OpenAI-compatible JSON/HTTP one everyone is using now.

I have incomplete bindings to whisper.cpp and llama.cpp for those modalities, and when it’s good enough I hope the bud.build people will accept it as a donation to the community managed ConnectRPC project suite.

We’re really close to a plausible shot at open standards on this before NVIDIA or someone totally locks down the protocol via the RT stuff.

edit: I almost forgot to mention. We have decent support for multi-vendor, mostly in practice courtesy of the excellent ‘gptel’, though both nvim and VSCode are planned for out-of-the-box support too.

The gap is opening up a bit again between the best closed and best open models.

This is speculation but I strongly believe the current opus API-accessible build is more than a point release, it’s a fundamental capability increase (though it has a weird BPE truncation issue that could just be a beta bug, but it could hint at something deeper.

It can produce verbatim artifacts from 100s of thousands of tokens ago and restart from any branch in the context, takes dramatically longer when it needs to go deep, and claims it’s accessing a sophisticated memory hierarchy. Personally I’ve never been slackjawed with amazement on anything in AI except my first night with SD and this thing.

j / k navigate · click thread line to collapse

88 comments

71 comments · 26 top-level

aantix2y ago· 13 in thread

Ollama is really easy.

brew install ollama

brew services start ollama

ollama pull mistral

Ollama you can query via http. It provides a consistent interface for prompting, regardless of model.

https://github.com/ollama/ollama/blob/main/docs/api.md#reque...

brushfoot2y ago

For Windows users without brew, there's a Windows installer:

https://ollama.com/download/windows

WinGet and Scoop apparently also have it. Chocolatey doesn't seem to.

ActorNightly2y ago

Works very well in WSL2 as well. I prefer that as you have to start the server manually, so it doesn't sit in the background.

crazypython2y ago

It lacks batching support (n>1), unfortunately, which is necessary for Loom-like applications

aantix2y ago

Loom - this application?

https://medium.com/@tofujoy77/loom-ai-uncovering-creative-wr...

brushfoot2y ago

How does Ollama distribute these models? The downloads page on the Llama website, https://llama.meta.com/llama-downloads/, has a heading that reads:

> Request access to Llama

Which to me gives the impression that access is gated and by application only. But Ollama downloaded it without so much as a "y".

Is that just Meta's website UI? Registration isn't actually required?

cymor2y ago

It doesn't distribute. The app pulls from hugging face.

1 more reply

whimsicalism2y ago

because this field is a baby, noncompliance is rampant

eatonphil2y ago

Zambyte2y ago

joenot4432y ago

> I didn't see anything about this in the repo that everything definitely stays local.

If this is your concern, I'd encourage you to read the code yourself. If you find it meets the bar you're expecting, then I'd suggest you submit a PR which updates the README to answer your question.

okwhateverdude2y ago

aantix2y ago

It is free.. So why not just audit the source personally to put your mind at ease and build locally if that’s a major concern of yours?

1 more reply

m4632y ago

I associate brew with macos (where a 3090 would not venture)

But it seems like there's a linux brew.

CuriouslyC2y ago· 12 in thread

pluc2y ago

Non-cancer link: https://old.reddit.com/r/LocalLLaMA/ (though that won't work on VPNs now)

(I just got a little emotional because these are things we used to say on reddit and now we say them about reddit. How the mighty have fallen)

notjulianjaynes2y ago

> (though that won't work on VPNs now)

tommy_axle2y ago

razodactyl2y ago

I miss when redditors were really nice xD That was a while ago now. (Or maybe people just make me more nervous these days lol)

Cheezmeister2y ago

RFC: Is there some straightforward way to use a Pi-hole like setup to 302 redirect `reddit.com` to `old.reddit.com`?

3 more replies

fsckboy2y ago

>Given you're running a 3090 ... "best current local model" thread pinned, and you're less likely to get incorrect advice

why do you think my info about the 3090 https://en.wikipedia.org/wiki/IBM_3090 is going to be anything less than up-to-date?

m4632y ago

ibm 3090 makes 3 slots seem quite svelte.

on the other hand, 24gb of late 80's memory... how many acres of raised floor data center would that take?

brushfoot2y ago

That doesn't seem to be the case, at least right now. There's no pinned post as far as I can tell, and searching the sub for `best model` returns mostly questions.

CuriouslyC2y ago

It might be because a lot of people use chatbot arena leaderboard for that purpose now, check it out here and look for local models:

https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboar...

1 more reply

aliasaria2y ago

Der_Einzige2y ago

The reason everyone loves oobabooga is that it’s made with a maximalist design paradigm.

Everyone in the VC world misunderstands why oobabooga is successful and tries to embrace not maximalism.

Your example product to benchmark yourself against is blender, if you want to serious compete against oobabooga. You need maximalism

1 more reply

downrightmike2y ago

Just let reddit die already, the VPN blocking is just bullshit.

mateuszbuda2y ago· 6 in thread

Anyone can share experience with https://ollama.com/ ?

bovem2y ago

I love it. It is easy to install and containerized.

Its API is great if you want to integrate it with your code editor or create your own applications.

I have written a blog [1] on the process of deployment and integration with neovim and vscode.

I also created an application [2] to chat with LLMs by adding the context of a PDF document.

Update: I would like to add that because the API is simple and Ollama is now available on Windows I don’t have to share my GPU between multiple VMs to interact with it.

[1] https://www.avni.sh/posts/homelab/self-hosting-ollama/ [2] https://github.com/bovem/chat-with-doc

notjulianjaynes2y ago

Stupid easy. I was speaking with an LLM after pasting two lines into terminal.

cgopalan2y ago

Also they have Python (and less relevant to me) Javascript libraries. So I assume you dont have to go through LangChain anymore.

rib3ye2y ago

Installing additional LLMs is a single command. I am currently loving Dolphin.

jdwyah2y ago

super easy. fun to play with. fast.

we screwed around with it on a live stream: https://www.youtube.com/live/3YhBoox4JvQ?si=dkni5LY3EALnWVuE...

If you're writing something that will run on someone's local machine I think we're at the point where you can start building with the assumption that they'll have a local, fast, decent LLM.

auggierose2y ago

> If you're writing something that will run on someone's local machine I think we're at the point where you can start building with the assumption that they'll have a local, fast, decent LLM.

I don't believe that at all. I don't have any kind of local LLM. My mother doesn't, either. Nor does my sister. My girl-friend? Nope.

teeray2y ago· 4 in thread

Also, what are the recommended hardware options these days?

notjulianjaynes2y ago

If you're broke, get a refurbished/used Nvidia P40. Same amount of vram as a 3090, between 4 and 10 times cheaper depending on how cheap you can find a 3090.

Granted it's slower of course, but best bang for your buck on vram, so you can run larger models than on a smaller bit faster card might be able to. (Not an expert.)

Edit: if using in desktop tower, you'll need to cool it somehow. I'm using a 3D printed fan thingy, but some people have figured out how to use a 1080 ti APU cooler with it too.

1 more reply

aliasaria2y ago

If you're able to purchase a separate GPU, the most popular option is to get an NVIDIA RTX3090 or RTX4090.

ein0p2y ago

whimsicalism2y ago

mlx is not super relevant

keriati12y ago· 3 in thread

We also run a Mac Studio with a bigger model (70b), M2 ultra and 192GB ram, as a chat server. It's pretty fast. Here we use Open WebUI as interface.

LM Studio is better overall, both as server or as chat interface.

I can also recommend CodeGPT plugin for JetBrains products and Continue plugin for VSCode.

As a chat server UI as I mentioned Open WebUI works great, I use it with together ai too as backend.

isoprophlex2y ago

An M2 ultra with 192 gb isn't cheap, did you have it lying around for whatever reason or do you have some very solid business case for running the model locally/on prem like that?

Or maybe I'm just working in cash poor environments...

Edit: also, can you do training / finetuning on an m2 like that?

keriati12y ago

We had some as build agent around already. We don't plan to do any fine tuning or training, so we did not explore this at all. However I don't think it is a viable option.

shostack2y ago

Can the Continue plugin handle multiple files in a directory of code?

bluedino2y ago· 3 in thread

Newbie questions:

What do you do with one of these?

Does it generate images? Write code? Can you ask it generic questions?

Do you have to 'train' it?

Do you need a large amount of storage to hold the data to train the model on?

aliasaria2y ago

You can chat with the models directly in the same way you can chat with GPT 3.5.

Many of the opensource tools that run these models let you also edit the system prompt, which lets you tweak their personality.

The more advanced tools let you train them, but most of the time, people are downloading pre-existing models and using them directly.

If you are training models, it depends what you are doing. Finetuning an existing pre-trained model requires lots of examples but you can often do a lot with, say, 1000 examples in a dataset.

If you are training a large model completely from scratch, then, yes, you need tons of data and very few people are doing that on their local machines.

montgomery_r2y ago

aliasaria2y ago

Our tool, https://github.com/transformerlab/transformerlab-app also supports the latter (document search) using local llms.

notjulianjaynes2y ago· 2 in thread

Edit: using a P40, whisper as ASR

petronic2y ago

notjulianjaynes2y ago

chasd002y ago· 1 in thread

If you want a code centric interface it's pretty easy to use llangchain with local models. For example, you specify a model from hugging face and it will download and run it locally.

https://python.langchain.com/docs/get_started/introduction

edit: "Are options ‘idiot proof’ yet?" - from my limited experience, Ollama is about as straightforward as it gets.

ShamelessC2y ago

Why are you using two l’s to spell langchain?

deathmonger50002y ago· 1 in thread

https://www.togethergiftit.com/

deathmonger50002y ago

Sorry wrong thread :(

bababuriba2y ago

LMStudio has an easy interface, you can browse/search for models, it has a server if you need an API, etc.

it's pretty 'idiot proof', if you ask me.

https://lmstudio.ai

jillesvangurp2y ago

ein0p2y ago

FWIW I use Ollama and Open WebUI. Ollama uses one of my RTX A6000 GPUs. I also use Ollama on my M3 MacBook Pro. Open WebUI even supports multimodal models.

theshrike792y ago

I use LM Studio[0] on my M-series macs and it's pretty plug and play.

I've got an Ollama instance running on a VPS providing a backend for a discord bot.

[0] https://lmstudio.ai

TachyonicBytes2y ago

[1] https://github.com/Mozilla-Ocho/llamafile

ActorNightly2y ago

Ollama is the easiest. For coding, use VSCode Continue extension and point it to ollama server.

nlittlepoole2y ago

https://jan.ai/ is pretty idiot proof.

0xbadc0de52y ago

https://lmstudio.ai/ + Mixtral 8x7b

borissk2y ago

Interesting question, I'd like to know this also.

Guess it's going to be a variant of Llama or Grok.

whimsicalism2y ago

It depends if you want ease or speed and if you are batching.

Ease? Probably ollama

Speed and you are batching on gpu? vLLM

EarthAmbassador2y ago

Where can we find detailed info, or better a tool, to determine gear compatability?

_ea1k2y ago

ollama is the easiest, IMO. The CLI interface is pretty good too.

gpt4all is decent as well, and also provides a way to retrieve information from local documents.

LorenDB2y ago

Ollama is about as idiot proof as you'll get.

resource_waste2y ago

oobabooga + berkley sterling LM

Seriously, this is the insane duo that can get you going in moments with chatgpt3.5 quality.

tamarlikesdata2y ago

Hugging Face Transformers is your best bet. It's pretty straightforward and has solid docs, but you'll need to get your hands dirty a bit with setup and configs.

For squeezing every bit of performance out of your GPU, check out ONNX or TensorRT. They're not exactly plug-and-play, but they're getting easier to use.

And yeah, Docker can make life a bit easier by handling most of the setup mess for you. Just pull a container and you're more or less good to go.

It's not quite "idiot-proof" yet, but it's getting there. Just be ready to troubleshoot and tinker a bit.

database-theory2y ago

Paywall article: https://towardsdatascience.com/how-to-build-a-local-open-sou...

Source code: https://github.com/leoneversberg/llm-chatbot-rag

benreesman2y ago

We’re really close to a plausible shot at open standards on this before NVIDIA or someone totally locks down the protocol via the RT stuff.

The gap is opening up a bit again between the best closed and best open models.

j / k navigate · click thread line to collapse