I think we're at least 10-15 years from being able to run low-latency agents that "RAG" themselves into the games they're part of: hundreds of them, some NPCs, others controlling game mechanics or checking whether another agent's output is acceptable or needs to be regenerated.
At the moment a MacBook Air with 16 GB can run Phi-3-medium (14B), which is extremely impressive, but at 7 tokens per second it's way too slow for any kind of gaming. You need roughly 100x the performance, and we need 5+ hardware generations before I can see this happening.
Unless there's some other application?
I think it's two-fold. The primary reason is that it's likely very difficult to maintain a designer's storyline vision and desired "atmosphere / feel", because LLMs currently "go off the rails" too easily. The second is that the teams with enough funding to properly fine-tune generative AI for dialogue, level/environment creation, character generation, etc. are generally making AAA or AAA-adjacent games, which already need so much of a consumer GPU's VRAM that there isn't much left over for large ML models to run in parallel.
I do think, though, that we should already be seeing indie games doing more with LLMs and 3D character/level/item generation than we are. Of course AI Dungeon has been trailblazing this for a long time, but I expected to see more widely-recognized success from other projects by now. I take this as a signal that it's hard to make a "good" game using AI generation. If anyone has suggestions for open-world games with a significant amount of AI generation, where player interaction can significantly affect the in-game universe, I'd be very interested in play-testing them. Any genre / style / budget. I just want to see more of what people are accomplishing in this space.
My hope is that there will be space both for the current style of game, where every aspect is created/designed by a human, and for games where the world is given an overall narrative/aesthetic/vision by its creators but the details are implemented by AI. That would allow true open-world play: you can finally walk into any shop, and RAG or similar techniques provide complete continuity over months or years of play, with characters remembering the conversations, interactions, and actions of you and anyone else playing in the same world.
I do think there's something of an "end-game" for this where a game is released that has no game at all in it, but rather generates games for each player based on what they want to play that day, and creates them as you play them. But I'd like to imagine that this won't replace other games (even if it does take a bit of the air out of the room), but rather exist alongside games with human-curated experiences.
I think we're currently stuck in a local minimum, where AI isn't up to the task of making a coherent, player-interactable world, but an incoherent, fragmented, non-interactable world (like No Man's Sky) isn't impressive enough.
But this thing still has a long way to go.
If anyone working on top games through mods wants to explore this, let me know; Next AI Labs would be interested in supporting such efforts.
It's all very exciting, if a little janky.
You're also forgetting that batch inference throughput is already an order of magnitude better than single-session inference.
I can't see LLMs in games being used for anything more than some random NPC voice quips. And whose voice would be used? Would voice actors be okay with this?
There are already too many bad games; we certainly don't need thousands more with AI-generated drivel for dialogue, although having human writers is not a panacea either.
> wget https://huggingface.co/jartine/TinyLlama-1.1B-Chat-v1.0-GGUF...
> chmod +x TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile
> ./TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile -ngl 999
You'll likely want to move beyond the first examples so you can choose models and methods. Either way, LlamaIndex has tons of great documentation and was originally built for this purpose. They also have a commercial parsing product with very generous free quotas (last I checked).
https://cookbook.openai.com/examples/parse_pdf_docs_for_rag
There are several other examples like this, but I got stuck in the jargon of LangChain, LlamaIndex, etc.
You can also upload files to ChatGPT and ask questions about them.
For example, another comment asked:
"If I have a big pile of PDFs and wanted to get an LLM to be really good at answering questions about what's in all those PDFs, would it be best for me to try running this locally?"
So what if you used a paid LLM to analyze these PDFs and extract the data, but then moved that data to a weaker LLM to run question-answer sessions on it? The idea is that you no longer need the better LLM at that point, since you've already extracted the data into a more efficient form.
In fact, if you split the data preprocessing into small enough steps, those could also be run on weaker LLMs. It would take a lot more time, but it is doable.
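A minimal sketch of that split, with the model calls stubbed out. The `call_strong_llm` / `call_weak_llm` functions are hypothetical stand-ins for whatever APIs you'd actually use (a paid hosted model vs. a cheap or local one); here they return canned strings so the pipeline shape is runnable:

```python
# Sketch: the strong (paid) model distills each PDF into compact facts once;
# the weak (cheap) model then answers every question against those facts.
# Both call_* functions are placeholders for real API calls.

def call_strong_llm(prompt: str) -> str:
    # Placeholder for the expensive model, used once per document.
    return "Q3 revenue: $12M; Q4 revenue: $15M"

def call_weak_llm(prompt: str) -> str:
    # Placeholder for the cheap model, used per question.
    return "Revenue grew from $12M in Q3 to $15M in Q4."

def extract_facts(pdf_text: str) -> str:
    # Expensive preprocessing step: distill the PDF into reusable facts.
    return call_strong_llm(f"Extract the key facts as bullet points:\n{pdf_text}")

def answer(question: str, facts: str) -> str:
    # Cheap runtime step: answer against the distilled facts only.
    return call_weak_llm(f"Using only these facts:\n{facts}\n\nAnswer: {question}")

facts = extract_facts("...full PDF text...")
print(answer("How did revenue change?", facts))
```

The point is that the expensive call amortizes over many questions, since the question-answer loop never touches the better model again.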
It, or something like it, could likely be applied to any form of generation, including what you are describing.
[0] - https://github.com/primeqa/primeqa/tree/4ae1b456dbe9f75276fe...
For example, ask the (better, costlier) Claude Opus to generate high-quality prompts, which get fed into (worse, cheaper) Claude Sonnet.
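As a rough sketch of that cascade, with the API calls stubbed out (the two functions are hypothetical stand-ins for real Opus and Sonnet calls, not the actual Anthropic API):

```python
# Sketch: the stronger model writes a high-quality prompt once; the cheaper
# model then executes that prompt on every input. Both functions are stubs.

def generate_prompt_with_opus(task: str) -> str:
    # Placeholder for one expensive call to the stronger model.
    return f"You are an expert assistant. Task: {task}. Be concise and accurate."

def run_with_sonnet(prompt: str, user_input: str) -> str:
    # Placeholder for the cheap per-request call to the weaker model.
    return f"[{prompt[:25]}...] applied to: {user_input}"

# The expensive call happens once; the cheap call runs per request.
reusable_prompt = generate_prompt_with_opus("summarize support tickets")
print(run_with_sonnet(reusable_prompt, "Ticket: login fails on mobile"))
```
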
Is there a way to reliably package these models with existing games and make them run locally? This would make inference virtually free, right?
From my limited understanding of this field, if smaller models could run on consumer hardware reliably and speedily, that would be a game changer.
Not on most consumer computers, which likely lack a dedicated GPU. My M2 struggles with a 7B model (it's the only workload that makes it warm), and the token speed is unbearable. I switched to remote APIs for the speed.
If you are targeting gamers with a GPU, the answer may change, but as others have pointed out, there are numerous issues here.
> This would make inference virtually free, right?
Yes-ish, if you're only counting your dollars. However, it will slow the player's computer down and have slow response times, which will impact adoption of your game.
If you want to go this route, I'd start with a 2B-sized model and not worry about shipping it nicely. Get some early users to see if this is the way forward.
I suspect that remote LLM calls with sophisticated caching (cross-user / per-conversation / pre-generated) are also worth exploring. IIRC, people suspected GPT-3.5-turbo was caching common queries and avoiding the LLM when it could, for speed.
You can also look into lower-parameter models (3B, for example) to determine whether the balance between accuracy and performance fits your use case.
>Is there a way to reliably package these models with existing games and make them run locally? This would make inference virtually free, right?
I don't have any knowledge of game dev, so I can't comment on that side, but yes, packaging it locally would make inference free.
8 GB VRAM cards can run 7B models
16 GB VRAM cards can run 13B models
24 GB VRAM cards can run up to 33B models
Now to your question: what can most computers run? You need to look at the tiny but specialized models. I would think 3B models could run reasonably well even on the CPU. IntelliJ has an absolutely microscopic (<1B) model that it uses locally for code completion. It's quite good, and I don't notice any delay.
I hope that helps; it's not 1:1, and it's a bit confusing.
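As back-of-the-envelope arithmetic behind those card-size rules of thumb (my own numbers, ignoring activation and KV-cache overhead, which add a GB or more on top):

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough weight-memory estimate in GB: parameters * bits / 8.
    Ignores activations and the KV cache, which need extra headroom."""
    return params_billion * bits_per_weight / 8

# fp16: a 7B model needs ~14 GB just for weights, hence quantization.
print(weight_gb(7, 16))   # 14.0
# 4-bit quantized: 7B ~3.5 GB, 13B ~6.5 GB, 33B ~16.5 GB, which lines up
# with the 8 / 16 / 24 GB card tiers once you leave headroom for the cache.
print(weight_gb(7, 4))    # 3.5
print(weight_gb(13, 4))   # 6.5
print(weight_gb(33, 4))   # 16.5
```
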
I own a 4090 and I can only run very heavily quantised 33B models. It's not really worth it.
My LLM server with a 16 GB GPU mainly runs Llama 3 with an expanded context window, which also costs much more memory.
I imagine that Copilot+ will become the target minimum spec for many local LLM products and that most local LLM vendors will use GPU instead of NPU if a good GPU is available.
Nvidia 4070 Ti has roughly the same performance: https://www.techpowerup.com/gpu-specs/geforce-rtx-4070-ti.c3...
Of course, I'm massively oversimplifying, but it should be in the ballpark.
One can run local LLMs even on a Raspberry Pi, although it will be horribly slow.
The underlying CLI tools do this, the app makes it easier to see and manage.
Thankfully, between Llama 3 8B [1] and Mistral 7B [2] you have two really capable generic instruction models you can use out of the box that could run locally for many folks. And the base models are straightforward to fine-tune if you need different capabilities specific to your game's use cases.
CPU/sysmem offloading is an option with gguf-based models but will hinder your latency and throughput significantly.
The quantized versions of the above models do fit easily in many consumer-grade GPUs (4-5 GB for the weights themselves, quantized at 4 bpw), but it really depends on how much of your VRAM budget you want to dedicate to model weights versus actually running your game.
[1] https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct
[2] https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2
Without a GPU I think it will likely be a poor experience, but it won't be long until you'll have to go out of your way to buy consumer hardware that doesn't integrate some kind of TPU.
Hope your game doesn’t have a big texture budget.
Even more than LLMs, I'm curious about how transformers can be used to produce more convincing game AI in areas where it is notoriously bad, like 4X games.
The game itself is not going to have much VRAM to work with though on older GPUs. Unless you use something fairly tiny like phi3-mini.
There are a lot more options if you can establish that the user has a 3090 or 4090.
They will say things like "It's a GPU inside a CPU". No, that's marketers telling you about integrated GPUs.
There is a huge divide between CPU and GPU people. GPU people are building applications. CPU people are... happy that they got anything to run at all.