I think we're at least 10-15 years from being able to run low-latency agents that "RAG" themselves into the games they're part of: hundreds of them, some NPCs, others controlling game mechanics or checking whether another agent's output is acceptable or needs to be regenerated.
At the moment a MacBook Air with 16 GB can run Phi-3-medium (14B), which is extremely impressive, but at 7 tokens per second it's way too slow for any kind of gaming. You need roughly 100x the performance, and we need 5+ hardware generations before I can see this happening.
Unless there's some other application?
I think it's two-fold. The primary reason is that it's likely very difficult to maintain a designer's storyline vision and desired "atmosphere / feel", because LLMs currently "go off the rails" too easily. The second is that the teams with enough funding to properly fine-tune generative AI for dialogue, level/environment creation, character generation, etc. are generally making AAA or AAA-adjacent games, which already need so much of a consumer GPU's VRAM that there isn't much left over for large ML models to run in parallel.
I do think, though, that we should already be seeing indie games doing more with LLMs and 3D character/level/item generation than we are. Of course AI Dungeon has been trailblazing this for a long time, but I expected to see more widely-recognized success from other projects by now. I take this as a signal that it's hard to make a "good" game using AI generation. If anyone has suggestions for open-world games with a significant amount of AI generation, where player interaction can significantly affect the in-game universe, I'd be very interested in play-testing them. Any genre / style / budget. I just want to see more of what people are accomplishing in this space.
My hope is that there will be space both for the current style of game, where every aspect is created/designed by a human, and for games where the world is given an overall narrative/aesthetic/vision by its creators but the details are implemented by AI. That would allow true open-world play: you can finally walk into any shop, and RAG or similar techniques provide complete continuity over months or years of play, with characters remembering the conversations, interactions, and actions of you and anyone else playing in the same world.
I do think there's something of an "end-game" for this where a game is released that has no game at all in it, but rather generates games for each player based on what they want to play that day, and creates them as you play them. But I'd like to imagine that this won't replace other games (even if it does take a bit of the air out of the room), but rather exist alongside games with human-curated experiences.
I think we're currently stuck in a local minimum, where AI isn't up to the task of making a coherent, player-interactable world, but an incoherent, fragmented, non-interactable world (like No Man's Sky) isn't impressive enough.
But this thing still has a long way to go.
If anyone working on top games through mods wants to explore this, let me know; Next AI Labs would be interested in supporting such efforts.
It's all very exciting, if a little janky.
You're also forgetting that batch inference throughput is already an order of magnitude better than single-session inference.
I can't see LLMs in games being used for anything more than some random NPC voice quips. And whose voice would be used? Would voice actors be okay with this?
There are already too many bad games; we certainly don't need thousands more with AI-generated drivel for dialogue, although having human writers is not a panacea either.
> wget https://huggingface.co/jartine/TinyLlama-1.1B-Chat-v1.0-GGUF...
> chmod +x TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile
> ./TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile -ngl 999
You'll likely want to move beyond the first examples so you can choose models and methods. Either way, LlamaIndex has tons of great documentation and was originally built for this purpose. They also have a commercial parsing product with very generous free quotas (last I checked).
https://cookbook.openai.com/examples/parse_pdf_docs_for_rag
There are several other examples like this, but I got stuck in the jargon of LangChain, LlamaIndex, etc.
You can also upload files to ChatGPT and ask questions about them.
For example, another comment asked:
"If I have a big pile of PDFs and wanted to get an LLM to be really good at answering questions about what's in all those PDFs, would it be best for me to try running this locally?"
So what if you used a paid LLM to analyze these PDFs and extract the data, but then moved that data to a weaker LLM to run question-answer sessions on it? The idea is that you no longer need the better LLM at that point, since you've already extracted the data into a more efficient form.
In fact, if you split the data preprocessing into small enough steps, those could also be run on weaker LLMs. It would take a lot more time, but it is doable.
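A minimal sketch of that split, with the model calls stubbed out. The `call_strong_llm` / `call_weak_llm` functions are hypothetical stand-ins for whatever APIs you'd actually use (a paid hosted model vs. a cheap or local one); here they return canned strings so the pipeline shape is runnable:

```python
# Sketch: the strong (paid) model distills each PDF into compact facts once;
# the weak (cheap) model then answers every question against those facts.
# Both call_* functions are placeholders for real API calls.

def call_strong_llm(prompt: str) -> str:
    # Placeholder for the expensive model, used once per document.
    return "Q3 revenue: $12M; Q4 revenue: $15M"

def call_weak_llm(prompt: str) -> str:
    # Placeholder for the cheap model, used per question.
    return "Revenue grew from $12M in Q3 to $15M in Q4."

def extract_facts(pdf_text: str) -> str:
    # Expensive preprocessing step: distill the PDF into reusable facts.
    return call_strong_llm(f"Extract the key facts as bullet points:\n{pdf_text}")

def answer(question: str, facts: str) -> str:
    # Cheap runtime step: answer against the distilled facts only.
    return call_weak_llm(f"Using only these facts:\n{facts}\n\nAnswer: {question}")

facts = extract_facts("...full PDF text...")
print(answer("How did revenue change?", facts))
```

The point is that the expensive call amortizes over many questions, since the question-answer loop never touches the better model again.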
It, or something like it, could likely be applied to any form of generation, including what you are describing.
[0] - https://github.com/primeqa/primeqa/tree/4ae1b456dbe9f75276fe...
For example, ask the (better, costlier) Claude Opus to generate high-quality prompts, which get fed into (worse, cheaper) Claude Sonnet.
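As a rough sketch of that cascade, with the API calls stubbed out (the two functions are hypothetical stand-ins for real Opus and Sonnet calls, not the actual Anthropic API):

```python
# Sketch: the stronger model writes a high-quality prompt once; the cheaper
# model then executes that prompt on every input. Both functions are stubs.

def generate_prompt_with_opus(task: str) -> str:
    # Placeholder for one expensive call to the stronger model.
    return f"You are an expert assistant. Task: {task}. Be concise and accurate."

def run_with_sonnet(prompt: str, user_input: str) -> str:
    # Placeholder for the cheap per-request call to the weaker model.
    return f"[{prompt[:25]}...] applied to: {user_input}"

# The expensive call happens once; the cheap call runs per request.
reusable_prompt = generate_prompt_with_opus("summarize support tickets")
print(run_with_sonnet(reusable_prompt, "Ticket: login fails on mobile"))
```
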
Is there a way to reliably package these models with existing games and make them run locally? This would make inference virtually free, right?
From my limited understanding of this field, if smaller models could run on consumer hardware reliably and speedily, that would be a game changer.
Not on most consumer computers, which likely lack a dedicated GPU. My M2 struggles with a 7B model (it's the only workload that makes it warm), and the token speed is unbearable. I switched to remote APIs for the speed.
If you are targeting gamers with a GPU, the answer may change, but as others have pointed out, there are numerous issues here.
> This would make inference virtually free, right?
Yes-ish, if you're only counting your dollars. However, it will slow the player's computer down and have slow response times, which will impact adoption of your game.
If you want to go this route, I'd start with a 2B-sized model and not worry about shipping it nicely. Get some early users to see if this is the way forward.
I suspect that remote LLM calls with sophisticated caching (cross-user / per-conversation / pre-generated) are also worth exploring. IIRC, people suspected GPT-3.5-turbo was caching common queries and avoiding the LLM when it could, for speed.
You can also look into lower-parameter models (3B, for example) to determine whether the balance between accuracy and performance fits your use case.
>Is there a way to reliably package these models with existing games and make them run locally? This would make inference virtually free, right?
I don't have any knowledge of game dev, so I can't comment on that side, but yes, packaging it locally would make inference free.
8 GB VRAM cards can run 7B models
16 GB VRAM cards can run 13B models
24 GB VRAM cards can run up to 33B models
Now to your question: what can most computers run? You need to look at the tiny but specialized models. I would think 3B models could run reasonably well even on the CPU. IntelliJ has an absolutely microscopic (<1B) model that it uses locally for code completion. It's quite good, and I don't notice any delay.
I hope that helps; it's not 1:1, and it's a bit confusing.
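As back-of-the-envelope arithmetic behind those card-size rules of thumb (my own numbers, ignoring activation and KV-cache overhead, which add a GB or more on top):

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough weight-memory estimate in GB: parameters * bits / 8.
    Ignores activations and the KV cache, which need extra headroom."""
    return params_billion * bits_per_weight / 8

# fp16: a 7B model needs ~14 GB just for weights, hence quantization.
print(weight_gb(7, 16))   # 14.0
# 4-bit quantized: 7B ~3.5 GB, 13B ~6.5 GB, 33B ~16.5 GB, which lines up
# with the 8 / 16 / 24 GB card tiers once you leave headroom for the cache.
print(weight_gb(7, 4))    # 3.5
print(weight_gb(13, 4))   # 6.5
print(weight_gb(33, 4))   # 16.5
```
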
I own a 4090 and I can only run very heavily quantised 33B models. It's not really worth it.
My LLM server with a 16 GB GPU mainly runs Llama 3 with an expanded context window, which also costs much more memory.
I imagine that Copilot+ will become the target minimum spec for many local LLM products and that most local LLM vendors will use GPU instead of NPU if a good GPU is available.
Nvidia 4070 Ti has roughly the same performance: https://www.techpowerup.com/gpu-specs/geforce-rtx-4070-ti.c3...
Of course, I'm massively oversimplifying, but it should be in the ballpark.
One can run local LLMs even on a Raspberry Pi, although it will be horribly slow.
The underlying CLI tools do this, the app makes it easier to see and manage.
Thankfully, between Llama 3 8B [1] and Mistral 7B [2] you have two really capable generic instruction models you can use out of the box that could run locally for many folks. And the base models are straightforward to fine-tune if you need different capabilities specific to your game's use cases.
CPU/sysmem offloading is an option with gguf-based models but will hinder your latency and throughput significantly.
The quantized versions of the above models do fit easily in many consumer-grade GPUs (4-5 GB for the weights themselves, quantized at 4 bpw), but it really depends on how much of your VRAM budget you want to dedicate to model weights versus actually running your game.
[1] https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct
[2] https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2
Without a GPU I think it will likely be a poor experience, but it won't be long until you'll have to go out of your way to buy consumer hardware that doesn't integrate some kind of TPU.
Hope your game doesn’t have a big texture budget.
Even more than LLMs, I'm curious about how transformers can be used to produce more convincing game AI in areas where it is notoriously bad, like 4X games.
The game itself is not going to have much VRAM to work with though on older GPUs. Unless you use something fairly tiny like phi3-mini.
There are a lot more options if you can establish that the user has a 3090 or 4090.
They will say things like "It's a GPU inside a CPU". No, that's marketers telling you about integrated GPUs.
There is a huge divide between CPU and GPU people. GPU people are building applications. CPU people are... happy that they got anything to run at all.