It’s slow: if I ask it to write a haiku, it’s slow on the order of “go brew some coffee and come back in 10 minutes”, but it does it very well. Run overnight on something like “summarize an analysis of topic X”, it does a reasonable job.
It can produce answers to questions only slightly less well than ChatGPT (3.5). The Wizard 13B model runs much faster, maybe 2-3 tokens per second.
It is free, private, and runs on a midrange laptop.
A little more than a month ago that wasn’t possible, at least not with my level of knowledge of the tooling involved. Now it requires little more than running an executable and minor troubleshooting of Python dependencies (on another machine it “just worked”).
So: Don’t think of these posts as “doing it just because you can and it’s fun to tinker”.
Vast strides are being made pretty much daily in both quality and efficiency, raising their utility while lowering the cost of usage, doing both to a very significant degree.
I'm sorry, but that's unusably slow. Even GPT-4 can take a retry or a prompt to fix certain types of issues. My experience is that the open options require a lot more attempts and manual prompt tuning.
I can't think of a single workload where that is usable. That said, once consumer GPUs are involved it does become usable.
Computers got much more powerful over the next 30 years, and ray tracing and various related techniques appeared in more tool sets and games, but they didn't fundamentally change the world of image generation or consumption. Most people still interact roughly as before, just with more detail in the eye candy.
Are we seeing these large language models today at a tipping point towards unfathomable societal impact, or as something like ray tracing in the 1990s? Will more compute power send us spiraling towards some large-model singularity, or just add more pixels until we are bored of seemingly endless checkerboard planes covered in spheres and cones... I don't know the answer, but it seems like we're seeing camps divided by this question of faith.
For example, the WizardLM model (5_1 quantized, as 4_x quantization makes the 7B models perform terribly):
Prompt: I went to the market and bought 10 apples. I gave 2 to my neighbor and 2 to my sister. Then I bought 5 more and ate 1. How many apples do I have?
Response:
Let me check that for you! You had 9 apples left after giving away 4, so you still have 9 - 4 = 5 apples left.
Prompt: I went to the market and bought 10 apples. I gave 2 to my neighbor and 2 to my sister. Then I bought 5 more and ate 1. How many apples do I have? Write out any calculations in full. List the steps used to come to the final result.
Response:
Sure, let's work through this problem together! Here are the steps we can use:
Step 1: Start with the initial number of apples you bought (10).
Step 2: Subtract the amount given to your neighbor and sister from the total (2 + 2 = 4). This gives us a new total of 6 apples.
Step 3: Add the additional 5 apples you bought later on to the remaining 6 apples, giving us a final total of 11 apples.
Step 4: Finally, subtract the apple you ate from the total (1), leaving us with 10 apples in total.
So the answer is 10 apples.
E.g., can this model solve math problems? ~50% of the time. Can this model identify math problems? ~99.9% of the time. OK, then divert math questions to the math plugin and mathematical-problem-language-parser, which has ~99.9% accuracy on supported math problems.
I know it's a stopgap, but it seems like it'd make these models super useful on certain tasks quicker.
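The routing idea above can be sketched roughly like this. It is only a minimal illustration: the names `safe_eval` and `route` are made up, the "is this math?" check is a naive regex standing in for a real classifier, and it only handles bare arithmetic, not word problems like the apples prompt (that would need the language parser mentioned above).

```python
import ast
import operator
import re

# Hypothetical helper names; nothing here comes from a real plugin API.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

# Naive stand-in for "identify math problems": digits, spaces, + - * / ( ) .
_ARITH = re.compile(r"^[\d\s.+\-*/()]+$")


def safe_eval(expr: str):
    """Evaluate +, -, *, / on numeric literals without calling eval()."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval"))


def route(prompt: str, llm):
    """Send bare arithmetic to the calculator, everything else to the model."""
    text = prompt.strip()
    if _ARITH.match(text) and any(c.isdigit() for c in text):
        try:
            return safe_eval(text)
        except (SyntaxError, ValueError, ZeroDivisionError):
            pass  # not valid arithmetic after all; fall through to the model
    return llm(prompt)
```

With the apples example reduced to an expression, `route("10 - 2 - 2 + 5 - 1", llm)` never touches the model, while `route("Write a haiku", llm)` is passed straight through to whatever callable `llm` is.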
In a way it just shows the amazing performance that will come from small future models
Is it comparable to GPT-4? No, it's not remotely close. But it's getting closer every week, and it very recently crossed the threshold of 'it can do stuff I would never send to a cloud service' (namely, reading all of my messages and pulling out action items).
These local projects are great because maybe eventually they will have an equivalent model that can be run on cheap parts.
It's also good for math lessons.
Dolly sucks for generating long-form content (not very creative) but if I need a summary or classification, it's quicker and easier to spin up dolly-3b than vicuna-13b.
I suspect OpenAI is routing prompts to select models based on similar logic.
These local models aren't as good as Bard or GPT-4.
If you want more than ~48GB, you're looking at HBM which is extremely expensive (HBM chips are very expensive, packaging+interposer is extremely expensive, designing and producing a new GPU is expensive).
Normal GPUs are limited by both their bus width (wider bus = more pins = harder to design, more expensive to produce, and higher power consumption) and GDDR6(X) (which currently maxes out at 2GB/chip), so on a 384-bit bus (4090/7900 XTX, don't expect anyone to make 512-bit buses anymore) you need 12x2GB (GDDR6 uses 32 pins per package), which gives you 24GB. You can double the memory capacity to 48GB, but that requires putting chips on the back of the GPU, which leads to a bunch of cooling issues (and GDDR6 is expensive).
Of course, even if they did all that they're selling expensive GPUs to a small niche market and cannibalizing sales of their own high end products (and even if AMD somehow managed to magic up a 128GB gpu for $700 people still wouldn't buy it because so much of the ML software is CUDA only).
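The capacity arithmetic in this comment can be written out as a tiny sketch. The function name and parameters are made up for illustration; the defaults (32 data pins per GDDR6 package, 2GB per chip) are the figures quoted above.

```python
def max_vram_gb(bus_width_bits: int,
                pins_per_chip: int = 32,
                gb_per_chip: int = 2,
                clamshell: bool = False) -> int:
    """Rough VRAM ceiling for a GDDR6 card, using the numbers from the thread."""
    chips = bus_width_bits // pins_per_chip  # one chip per 32-bit slice of the bus
    if clamshell:
        chips *= 2  # second set of chips on the back of the board
    return chips * gb_per_chip


print(max_vram_gb(384))                  # 24
print(max_vram_gb(384, clamshell=True))  # 48
```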
GDDR achieves higher speeds than normal DDR mainly by specifying much tighter tolerances on the electrical interface, and by using a wider interface to the memory chips. This means that using commodity GDDR (which is the only fast DRAM that will be reasonably cheap), you have fairly strict limitations on the maximum amount of RAM you can use with the same GPUs that are manufactured for consumer use. (Typically, at most a 4x difference between the lowest-end reasonable configuration and the highest-end one: 2x from higher-density modules and 2x from using a clamshell memory configuration, although often you only have one type of module for a new memory interface generation.)
If the product requires either a new memory or GPU die configuration, its cost will be very high.
The only type of memory that can support very different VRAM sizes for an efficiently utilized bus of the same size is HBM, and so far that is limited to the very high end.
I hadn't realized just how insane the bandwidth on the higher-end cards is, the 3090 being just shy of 1 TB/s, yes, one terabyte per second...
For comparison a couple of DDR5 sticks[2] will just get you north of 70GB/s...
[1]: https://www.anandtech.com/show/15978/micron-spills-on-gddr6x...
[2]: https://www.anandtech.com/show/17269/ddr5-demystified-feat-s...
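For anyone who wants to see where those two numbers come from, peak bandwidth is just bus width times per-pin data rate. A small sketch, with the caveat that the per-pin rates below are my assumptions and not from the comment: 19.5 Gbit/s GDDR6X on the 3090's 384-bit bus, and two DDR5-4800 sticks treated as 2 x 64-bit at 4.8 Gbit/s per pin.

```python
def peak_bandwidth_gb_s(bus_width_bits: int, gbit_per_pin: float) -> float:
    """Theoretical peak memory bandwidth in GB/s (bits to bytes is the / 8)."""
    return bus_width_bits * gbit_per_pin / 8


# Assumed figures, see the note above.
rtx_3090 = peak_bandwidth_gb_s(384, 19.5)     # 936 GB/s, "just shy of 1 TB/s"
ddr5_dual = peak_bandwidth_gb_s(2 * 64, 4.8)  # 76.8 GB/s, "north of 70GB/s"

print(f"3090: {rtx_3090:.1f} GB/s, DDR5: {ddr5_dual:.1f} GB/s, "
      f"ratio: {rtx_3090 / ddr5_dual:.1f}x")
```

These are theoretical peaks, so real-world throughput will be lower on both sides.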
I was actually wondering about this the other day. A fully maxed out Mac Studio is about $6K, and it comes with a "64-core GPU" and "128GB integrated memory" (whatever any of that means). Would that be enough to run a decent Llama?
It would be nice for Nvidia to release a chip targeted for medium compute/high memory, the lower binning of which should revolve around their max 384b bus on the 4090. But then, it would be hard to financially justify it on their end I suppose.
It won't work out unless it becomes common to run LLMs locally. Kind of a chicken-and-egg problem so I hope they try it!
At $work CGI assets sometimes grow pretty big and throwing more VRAM at the problem would be easier than optimizing the scenes in the middle of the workflow. They can be optimized, but that often makes it less ergonomic to work with them.
Perhaps asset-streaming (nanite&co) will make this less of an issue, but that's also fairly new.
Do LLM implementations already stream the weights layer by layer or in whichever order they're doing the evaluation or is PCIe bandwidth too limited for that?
Keep in mind video cards don't use the same kind of RAM as consumer CPUs do, they typically use GDDR or HBM.
They do. Well, not “medium performant”, but for VRAM-bound tasks they’d still be an improvement over CPUs if you could use them — iGPUs use main memory.
What they don’t have is support for them for popular GPGPU frameworks (though there was a third party CUDA-for-Intel-iGPUs a while ago.)
Was there a consumer market for them until recently?
Whether it will be co-located with a GPU for consumer hardware remains to be seen.
The thing to determine is how essential running LLMs locally is for consumers.
BigTech is pushing hard to make their clouds the only place to run LLMs unfortunately, so unless there is a killer app that is just better locally (like games were for GPUs), this might not change.
Therapy & relationship bots, like the movie 'Her'. It's ugly, but it's coming.
My system:
GPU: NVidia RTX 2070S (8GB VRAM)
CPU: AMD Ryzen 5 3600 (16GB RAM)
Here's the performance difference I see:
CPU only (./main -t 12)
llama_print_timings: load time = 15459.43 ms
llama_print_timings: sample time = 23.64 ms / 38 runs ( 0.62 ms per token)
llama_print_timings: prompt eval time = 9338.10 ms / 356 tokens ( 26.23 ms per token)
llama_print_timings: eval time = 31700.73 ms / 37 runs ( 856.78 ms per token)
llama_print_timings: total time = 47192.68 ms
GPU (./main -t 12 -ngl 22)
llama_print_timings: load time = 10285.15 ms
llama_print_timings: sample time = 21.60 ms / 35 runs ( 0.62 ms per token)
llama_print_timings: prompt eval time = 3889.65 ms / 356 tokens ( 10.93 ms per token)
llama_print_timings: eval time = 8126.90 ms / 34 runs ( 239.03 ms per token)
llama_print_timings: total time = 18441.22 ms
Intel Xeon Platinum 8259CL CPU @ 2.50GHz, 128 GB RAM, Tesla T4
./main -t 12 -m models/gpt4-alpaca-lora-30B-4bit-GGML/gpt4-alpaca-lora-30b.ggml.q5_0.bin
llama_print_timings: load time = 3725.08 ms
llama_print_timings: sample time = 612.06 ms / 536 runs ( 1.14 ms per token)
llama_print_timings: prompt eval time = 13876.81 ms / 259 tokens ( 53.58 ms per token)
llama_print_timings: eval time = 221647.40 ms / 534 runs ( 415.07 ms per token)
llama_print_timings: total time = 239423.46 ms
./main -t 12 -m models/gpt4-alpaca-lora-30B-4bit-GGML/gpt4-alpaca-lora-30b.ggml.q5_0.bin -ngl 30
llama_print_timings: load time = 7638.95 ms
llama_print_timings: sample time = 280.81 ms / 294 runs ( 0.96 ms per token)
llama_print_timings: prompt eval time = 2197.82 ms / 2 tokens ( 1098.91 ms per token)
llama_print_timings: eval time = 112790.25 ms / 293 runs ( 384.95 ms per token)
llama_print_timings: total time = 120788.82 ms
- the model I used was gpt4-x-vicuna-13B.ggml.q5_1.bin
- I used 'time' to measure the wall clock time of each command.
- My prompt was:
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Write a long blog post with 5 sections, about the pros and cons of emphasising procedural fluency over conceptual understanding, in high school math education.
### Response:
It seems like using the right number of cores saves ~25%, and then using the GPU for 26 layers saves another ~43%.
Timings for comparison with `--temp 0`:
-t 6 : 4m10.458s
-t 12 : 5m34.772s
-t 6 -ngl 26 : 2m23.278s
276.23 ms per token without GPU, using only -t 12
296.04 ms per token with GPU using -t 12 -ngl 22
How do you determine the right fit for -t and -ngl? I definitely compiled using make LLAMA_CUBLAS=1
I'm using models/ggml-vic7b-uncensored-q5_1.bin, does the model matter significantly?
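To put the figures from the last few comments on a common footing, here is a small sketch that converts "ms per token" into tokens per second and `time`-style wall-clock strings into percentage savings. The function names are made up; the input numbers are the ones quoted in the thread.

```python
def tokens_per_second(ms_per_token: float) -> float:
    """Convert llama.cpp's 'ms per token' figure into tokens per second."""
    return 1000.0 / ms_per_token


def parse_wall_clock(text: str) -> float:
    """Turn a `time`-style string such as '4m10.458s' into seconds."""
    minutes, seconds = text.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


def saving_pct(before_s: float, after_s: float) -> float:
    """Percentage of wall-clock time saved going from before_s to after_s."""
    return 100.0 * (before_s - after_s) / before_s


if __name__ == "__main__":
    # Eval-time figures from the RTX 2070S system.
    cpu_only, with_gpu = 856.78, 239.03
    print(f"CPU only: {tokens_per_second(cpu_only):.2f} tok/s")
    print(f"-ngl 22 : {tokens_per_second(with_gpu):.2f} tok/s")
    print(f"speedup : {cpu_only / with_gpu:.2f}x")

    # Wall-clock runs measured with `time`.
    t12 = parse_wall_clock("5m34.772s")
    t6 = parse_wall_clock("4m10.458s")
    t6_gpu = parse_wall_clock("2m23.278s")
    print(f"-t 12 -> -t 6        : {saving_pct(t12, t6):.0f}% saved")
    print(f"-t 6  -> -t 6 -ngl 26: {saving_pct(t6, t6_gpu):.0f}% saved")
```

Run on the 276.23 vs 296.04 ms per token pair, the same arithmetic gives a ratio below 1, i.e. that GPU run was slightly slower than CPU only.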
t: the number of physical cores on your CPU
ngl: just try increasing by one or two, until you start seeing inference crash due to 'out of memory' errors
> i5-10400
You have 6 cores, so try `-t 6`
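The "keep bumping -ngl until it falls over" advice could be automated along these lines. This is only a sketch: `runs_ok` is a hypothetical callable you would write yourself (for example, something that launches `./main` with the given `-ngl` and reports whether it finished without an out-of-memory error), and the step and limit defaults are arbitrary.

```python
from typing import Callable


def find_max_gpu_layers(runs_ok: Callable[[int], bool],
                        start: int = 0,
                        step: int = 2,
                        limit: int = 100) -> int:
    """Return the largest tested -ngl value for which runs_ok(ngl) was True.

    Assumes `start` itself is known to work (0 means CPU only).
    """
    best = start
    ngl = start + step
    while ngl <= limit and runs_ok(ngl):
        best = ngl
        ngl += step
    return best


# Stand-in for a real run: pretend anything above 26 layers runs out of VRAM.
print(find_max_gpu_layers(lambda ngl: ngl <= 26))  # 26
```

Stepping by two is faster but can stop one layer short of the real maximum; pass `step=1` if that last layer matters.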
Imagine I am hearing about this for the very first time: what did you do?
1. Download the weights for the model you want to use, e.g. gpt4-x-vicuna-13B.ggml.q5_1.bin
2. Clone the llama.cpp repo, and use 'make LLAMA_CUBLAS=1' to compile it with support for CUBLAS (BLAS on GPU).
3. Run the resulting 'main' executable, with the -ngl option set to 18, so that it tries to load 18 layers of the model into the GPU's VRAM, instead of the system's RAM.
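Step 3 boils down to assembling a command line. A minimal sketch that builds it without running anything: the helper name is made up, the `models/` directory is an assumption about where the weights were saved, and only the `-m`, `-t` and `-ngl` flags that appear elsewhere in this thread are used.

```python
import shlex
from typing import List


def llama_main_command(model_path: str, threads: int, gpu_layers: int = 0) -> List[str]:
    """Build the argv for llama.cpp's 'main' binary (nothing is executed here)."""
    cmd = ["./main", "-m", model_path, "-t", str(threads)]
    if gpu_layers > 0:
        # Only meaningful if the binary was compiled with LLAMA_CUBLAS=1.
        cmd += ["-ngl", str(gpu_layers)]
    return cmd


cmd = llama_main_command("models/gpt4-x-vicuna-13B.ggml.q5_1.bin", threads=6, gpu_layers=18)
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd)` from inside the llama.cpp checkout once the build from step 2 has finished.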
Which in turn has the following as the first link: https://arxiv.org/abs/2302.13971
Is it really quicker to ask here than just browse content for a bit, skimming some text or even using Google for one minute?
I don't know if it's quicker, but I trust human assessment a lot more than any machine-generated explanation. You're right, I could have asked ChatGPT or even Googled, but a small bit of context goes a long way and I'm clearly out of the loop here. It's possible that others arriving on HN might appreciate such an explanation, or that we're better off having lots of people making duplicated efforts to understand what they're looking at.
It is also possible to run fine-tuned versions like Vicuna with this, I think. Those versions are more focused on answering questions.
Literally the second line: "llama is a text prediction model similar to GPT-2, and the version of GPT-3 that has not been fine tuned yet"
I am looking for an open source model to do text summarization. OpenAI is too expensive for my use case because I need to pass lots of tokens.
It’s not based on llama.cpp but huggingface transformers but can also run on CPU.
It works well, can be distributed, and very conveniently provides the same REST API as OpenAI GPT.
You can run open-source models, but the software itself is closed-source and free for non-commercial use.
Worked OK for me with the default context size. 2048, like you see in most examples, was too slow for my taste.
OpenAI's paid GPT-4 has few restrictions and is still cheap.
... Not to mention GPT-4 with the browsing feature is vastly superior to any of the models you can run at home.
For LLMs this means I am allowed their full potential. I can generate smut, filth, illegal content of any kind for any reason. It’s for me to decide. It’s empowering, it’s the hacker mindset.
A good place to dig for prompt structures may be the 'text-generation-webui' commit log. For example https://github.com/oobabooga/text-generation-webui/commit/33...
It feels somewhat recursive since the input and output are natural language and so you would need another LLM to evaluate whether the model answered a prompt correctly.
Is there a use case for them I’m missing?
Additionally, don’t they all have fairly restrictive licenses?
Would love to run a bunch of models on the machine without dripping $$ to OpenAI, Modal or other providers...
[1] https://www.reddit.com/r/MachineLearning/comments/11i4olx/d_...
I'll let others chime in but you could still probably build something really powerful within your budget that is able to run various AI tasks.
Here’s a recent one:
https://www.reddit.com/r/LocalLLaMA/comments/13f5gwn/home_ll...
I have a ThinkStation P620 w/ThreadRipper Pro 3945WX (12c24t) with a GTX 1070 (and a second 1070 I could put in there) and there's 512GB of RAM on the box.
Does this need to be bare metal, or can it run in VM?
I'm currently running RHEL 9.2 w/KVM (as a VM host) with light usage so far.
Anyone have a recommended guide for AMD / Intel GPUs? I gather the 4 bit quantization is the special sauce for CUDA, but I’d guess there’d be something comparable for not-CUDA?
https://github.com/ROCm-Developer-Tools/HIPIFY/blob/master/R...
https://github.com/hughperkins/coriander
I have zero experience with these, though.
ML models are essentially trained to recognize patterns. Encryption algorithms are explicitly designed to resist that kind of analysis. LLMs are not magic.
The smallest GPU-only 7B 4-bit model requires 8GB VRAM, so it's either do CPU only or use the GPU offload above.
https://www.reddit.com/r/LocalLLaMA/comments/13fnyah/you_guy...
https://chat.lmsys.org/?arena (Click 'leaderboard')
If you're using oobabooga/text-generation-webui then you need to:
1. Re-install llama-cpp-python with support for CUBLAS:
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python --no-cache-dir --force-reinstall
2. Launch the web UI with the --n-gpu-layers flag, e.g. python server.py --model gpt4-x-vicuna-13B.ggml.q5_1.bin --n-gpu-layers 24