Run Llama 13B with a 6GB graphics card (opens in new tab)

(gist.github.com)

618 pointsrain13y ago266 comments

266 comments

144 comments · 28 top-level

naillo3y ago· 30 in thread

This is cool but are people actually getting stuff done with these models? I'm enthusiastic about their potential too but after playing with it for a day I'm at a loss for what to use it for anymore at this point

ineedasername3y ago

I can run the Wizard 30B ggml model in CPU mode using a Ryzen 5700 and 16GB of system RAM, not GPU VRAM. I’m using oobabooga as the front end.

It’s slow, but if I ask it to write a Haiku it’s slow on the order of “go brew some coffee and come back in 10 minutes” and does it very well. Running it overnight on something like “summarize an analysis of topic X it does a reasonable job.

It can produce answers to questions only slightly less well than ChatGPT (3.5). The Wizard 13B model runs much faster, maybe 2-3 tokens per second.

It is free, private, and runs on a midrange laptop.

A little more than a month ago that wasn’t possible, not with my level of knowledge of the tooling involved at least, now it requires little more than running an executable and minor troubleshooting of python dependencies (on another machine it “just worked”)

So: Don’t think of these posts as “doing it just because you can and it’s fun to tinker”

Vast strides are being made pretty much daily in both quality and efficiency, raising their utility while lowering the cost of usage, doing both to a very significant degree.

theaiquestion3y ago

> It’s slow, but if I ask it to write a Haiku it’s slow on the order of “go brew some coffee and come back in 10 minutes” and does it very well. Running it overnight on something like “summarize an analysis of topic X it does a reasonable job.

I'm sorry but that's unusably slow, even GPT-4 can take a retry or a prompt to fix certain type of issues. My experience is the open options require a lot more attempts/manual prompt tuning.

I can't think of a single workload where that is usable. That said once consumer GPUs are involved it does become usable

3 more replies

saltcured3y ago

I don't know if anybody is following this thread anymore, but I find it interesting how similarly your timelines match what it was like to experiment with POV-Ray (a ray-tracing renderer) back in the early 1990s. Your difference in problem scope was like whether you had "a couple spheres on a checkerboard plane" or something more like "a chess set". Things seemed to change rapidly due to Moore's Law and the changes in brute force computing power available to normal people.

Computers got much more powerful in the next 30 years, and ray-tracing or various related techniques appear in more tool sets and games, they didn't fundamentally change the world of image generation or consumption. Most people still roughly interact as before, just with more details in the eye candy.

Are we seeing these large language models today at a tipping point towards unfathomable societal impact, or as something like ray tracing in the 1990s? Will more compute power send us spiraling towards some large-model singularity, or just add more pixels until we are bored of seemingly endless checkerboard planes covered in spheres and cones... I don't know the answer, but it seems like we're seeing camps divided by this question of faith.

1 more reply

BaculumMeumEst3y ago

Wow you can run a 30B model on 16gb ram? Is it hitting swap?

3 more replies

sp3323y ago

What prompt do you use to get haikus?

2 more replies

Joeri3y ago

My experience is that even the "toy" 7B models can do pretty amazing things, but they're very sensitive to prompting in the right way to force the model down the right pathways. My assumption is that the larger models have more redundancy and therefore can be prompted less precisely while still going down the right pathway.

For example, the WizardLM model (5_1 quantized, as 4_x quantization makes the 7B models perform terribly):

Prompt: I went to the market and bought 10 apples. I gave 2 to my neighbor and 2 to my sister. Then I bought 5 more and ate 1. How many apples do I have?

Response:

  Let me check that for you! You had 9 apples left after giving away 4, so you still have 9 - 4 = 5 apples left.

Prompt: I went to the market and bought 10 apples. I gave 2 to my neighbor and 2 to my sister. Then I bought 5 more and ate 1. How many apples do I have? Write out any calculations in full. List the steps used to come to the final result.

Response:

  Sure, let's work through this problem together! Here are the steps we can use:
  Step 1: Start with the initial number of apples you bought (10).
  Step 2: Subtract the amount given to your neighbor and sister from the total (2 + 2 = 4). This gives us a new total of 6 apples.
  Step 3: Add the additional 5 apples you bought later on to the remaining 6 apples, giving us a final total of 11 apples.
  Step 4: Finally, subtract the apple you ate from the total (1), leaving us with 10 apples in total.
  So the answer is 10 apples.

mcnnowak3y ago

I've always wondered if we could get a "good enough" general model, then add plugins for shortcomings.

E.g., Can this model solve math problems? ~%50 of the time. Can this model identify math problems? ~99.9% of the time. Ok, then divert math questions to the math plugin and mathematical-problem-language-parser which has ~99.9% accuracy on supported math problems.

I know it's a stopgap, but it seems like it'd make these models super useful on certain tasks quicker.

olabyne3y ago

I agree with you. That's why all to the small model are showing some benchmarks putting them close to GPT3.5 or even 4, only because they use specific test tasks !

In a way it just shows the amazing performance that will come from small future models

Taek3y ago

I've successfully gotten at-home models (https://huggingface.co/NousResearch/GPT4-x-Vicuna-13b-fp16) to go through my messages and pull out key todo items. For example, reminding me to message my father about travel plans.

Is it comparable to GPT-4? No, it's not remotely close. But it's getting closer every week, and it very recently crossed the threshold of 'it can do stuff I would never send to a cloud service' (namely, reading all of my messages and pulling out action items).

MPSimmons3y ago

How do you deal with the 'too many tokens' problem? Like, For 13b, max tokens I've gotten it to is 508 - is there a way to increase the number it will accept?

1 more reply

instance3y ago

I tested on a serious use case and quality was subpar. For real use cases I had to either host the most powerful model you can get (e.g. LLaMA-65B or so) on a cloud machine, which again costs too much (you'll be paying like 500-1000 USD per month), or just go straight for GPT-3.5 on OpenAI. The latter economically makes most sense.

inferense3y ago

what real use case did you use it for?

2 more replies

raffraffraff3y ago

What about turning the cloud vm off except when you're actually using it?

2 more replies

m3kw93y ago

Most part if you actually need stuff done, you can afford 20$ for OpenAIs proven models.

These local projects are great because maybe eventually they will have a equivalent model that can be run on cheap parts

thelittleone3y ago

Knowledge cut off and sending potentially sensitive IP to a centralised and untrusted third party. This would likely limit the addressable market.

1 more reply

Beltiras3y ago

The knowledge cutoff date is brutal for some applications.

1 more reply

elorant3y ago

I've setup and use Vicuna-13b for text classification, summarization and topic modelling. Works like a charm.

It's also good for math lessons.

jstarfish3y ago

Maybe others' experiences are different, but I find smaller models to work just as well for "reductive" tasks.

Dolly sucks for generating long-form content (not very creative) but if I need a summary or classification, it's quicker and easier to spin up dolly-3b than vicuna-13b.

I suspect OpenAI is routing prompts to select models based on similar logic.

BOOSTERHIDROGEN3y ago

Would like to know how you setup this. A posts would be awesome.

1 more reply

redox993y ago

People are extensively using these models (more specifically the finetuned, uncensored ones) for role playing.

irzzy3y ago

I’m interested in this. Can you elaborate?

1 more reply

iforgotpassword3y ago

I haven't tried the fine-tuned variants yet, but when I played around with it shortly after the leak, it tended to quickly derail into nonsense when you let it complete sentences or paragraphs, especially when using other languages than English. When I tried to get it to produce Python code, most of the time it wasn't even syntactically correct.

jhbadger3y ago

Depends on what "getting stuff done" means. I find 13B models (running on my M1 Mac) useful for playing AI Dungeon-like games -- where you describe the environment and your character and you can play an RPG.

fredguth3y ago

GitHub Copilot is (or once was) a 13b model, according to Nat Friedman in the scale.ai interview. (https://youtu.be/lnufceCxwG0)

xrd3y ago

Can you share more about what you are doing or what you've seen others doing?

1 more reply

rain1OP3y ago

It's just for fun!

These local models aren't as good as Bard or GPT-4.

happycube3y ago

There are two major advantages though - you can retrain them, and they're not on the guardrails that the commercial models have.

1 more reply

logicchains3y ago

LLaMA 65B in my experience is comparable to GPT3.5 or the old Bard for usefulness at most things.

silvervine3y ago

I am aiming to connect it to my fav game engine to make a character speak in generative sentences. I am aiming to do this mainly for fun however.

Zetobal3y ago

We run some llamas to analyze user content.

holoduke3y ago· 23 in thread

Why does AMD or Intel not release a medium performant GPU with minimum 128gb of memory for a good consumer price. These models require lots of memory to 'single' pass an operation. Throughput could be bit slower. A 1080 Nvidia with 256gb of memory would run all these models fast right? Or am I forgetting something here.

elabajaba3y ago

Because they can't do that for a "good consumer price".

If you want more than ~48GB, you're looking at HBM which is extremely expensive (HBM chips are very expensive, packaging+interposer is extremely expensive, designing and producing a new GPU is expensive).

Normal GPUs are limited by both their bus width (wider bus = more pins = harder to design, more expensive to produce, and increases power consumption), and GDDR6(x) (which maxes out at 2GB/chip currently), so on a 384bit bus (4090/7900xtx, don't expect anyone to make a 512bit busses anymore) you need 12x2GB (GDDR6 uses 32 pins per package) which gives you 24GB. You can double the memory capacity to 48GB, but that requires putting the chips on the back of the GPU which leads to a bunch of cooling issues (and GDDR6 is expensive).

Of course, even if they did all that they're selling expensive GPUs to a small niche market and cannibalizing sales of their own high end products (and even if AMD somehow managed to magic up a 128GB gpu for $700 people still wouldn't buy it because so much of the ML software is CUDA only).

eurekin3y ago

3090 has a lot of vram chips on the back though

1 more reply

Tuna-Fish3y ago

It would not be trivial to do.

GDDR achieves higher speeds than normal DDR mainly by specifying much tighter tolerances on the electrical interface, and using wider interface to the memory chips. This means that using commodity GDDR (which is the only fast DRAM that will be reasonably cheap), you have fairly strict limitations on the maximum amount of RAM your can use with the same GPUs that are manufactured for consumer use. (Typically, at most 4x difference between the lowest-end reasonable configuration and the highest-end one, 2x from higher density modules and 2x from using clamshell memory configuration, although often you only have one type of module for a new memory interface generation.)

If the product requires either a new memory or GPU die configuration, it's cost will be very high.

The only type of memory that can support very different VRAM sizes for an efficiently utilized bus of the same size is HBM, and so far that is limited to the very high end.

magicalhippo3y ago

Anandtech has an article on the GDDR6X variant[1] that NVIDIA has in their 3000-cards, where they use a more complex encoding to transmit two bits per clock edge.

I hadn't realized just how insane the bandwidth on the higher-ends cards are, the 3090 being just shy of 1 TB/s, yes, one terrabyte per second...

For comparison a couple of DDR5 sticks[2] will just get you north of 70GB/s...

[1]: https://www.anandtech.com/show/15978/micron-spills-on-gddr6x...

[2]: https://www.anandtech.com/show/17269/ddr5-demystified-feat-s...

1 more reply

q7xvh97o2pDhNrh3y ago

Do you happen to know where Apple's integrated approach falls on this spectrum?

I was actually wondering about this the other day. A fully maxed out Mac Studio is about $6K, and it comes with a "64-core GPU" and "128GB integrated memory" (whatever any of that means). Would that be enough to run a decent Llama?

2 more replies

fennecfoxy3y ago

Well yeah I guess binned cards come into play, cheaper binned cards have a narrower bus. It seems there are quite a few models that aren't too heavy on compute but require a tonne of vram.

It would be nice for Nvidia to release a chip targeted for medium compute/high memory, the lower binning of which should revolve around their max 384b bus on the 4090. But then, it would be hard to financially justify it on their end I suppose.

0xcde4c3db3y ago

Probably because if they take that exact same GPU+VRAM configuration and slap it on a rackmount-optimized board, they can charge AI companies 5-10x the price for it.

jsheard3y ago

They don't even offer that much VRAM on cards aimed at those price-insensitive customers, Nvidias current lineup maxes out at 48GB for GDDR-based models or 80GB for HBM-based models. Even if money is no object there's still practical engineering limits on how much memory they can put on a card without sacrificing bandwidth.

1 more reply

vegabook3y ago

this is where the new third player, Intel, can (if it can tear itself away from identical behaviour in the consumer/server CPU market) hopefully break the duopoly. Love to see a 32 or 64GB card from Intel. Their software stack on Linux is competent enough (unlike the dumpster fire that is AMD's ROCm).

fomine33y ago

This must be the reason. AI homelab market is too small while enterprise market will become extremely big.

hackernudes3y ago

I don't think there was a market for it before LLMs. Still might not be (especially if they don't want to cannibalize data center products). Also, they might have hardware constraints. I wouldn't be that surprised if we see some high ram consumer GPUs in the future, though.

It won't work out unless it becomes common to run LLMs locally. Kind of a chicken-and-egg problem so I hope they try it!

the84723y ago

> I don't think there was a market for it before LLMs.

At $work CGI assets sometimes grow pretty big and throwing more VRAM at the problem would be easier than optimizing the scenes in the middle of the workflow. They can be optimized, but that often makes it less ergonomic to work with them.

Perhaps asset-streaming (nanite&co) will make this less of an issue, but that's also fairly new.

Do LLM implementations already stream the weights layer by layer or in whichever order they're doing the evaluation or is PCIe bandwidth too limited for that?

2 more replies

layer83y ago

Releasing a new model takes time, and it’s unclear how large the consumer market would actually be. Maybe they’re working on it right now.

kevingadd3y ago

The margins on VRAM are pretty bad for them since they don't manufacture it themselves. And every memory module they add needs additional power delivery and memory controller muscle to drive, so adding that memory is going to raise the cost of the card significantly. Most games and consumer workloads won't use all that extra memory.

Keep in mind video cards don't use the same kind of RAM as consumer CPUs do, they typically use GDDR or HBM.

dragonwriter3y ago

> Why does AMD or Intel not release a medium performant GPU with minimum 128gb of memory for a good consumer price.

They do. Well, not “medium performant”, but for VRAM-bound tasks they’d still be an improvement over CPUs if you could use them — iGPUs use main memory.

What they don’t have is support for them for popular GPGPU frameworks (though there was a third party CUDA-for-Intel-iGPUs a while ago.)

duxup3y ago

>for a good consumer price

Was there a consumer market for them until recently?

fooker3y ago

That's exactly what the next generation of 'accelerators' will be like.

Whether it will be co-located with a GPU for consumer hardware remains to be seen.

The thing to determine is how essential running LLMs locally is for consumers.

BigTech is pushing hard to make their clouds the only place to run LLMs unfortunately, so unless there is a killer app that is just better locally (like games were for GPUs), this might not change.

boppo13y ago

> unless there is a killer app that is just better locally

Therapy & relationship bots, like the movie 'Her'. It's ugly, but it's coming.

1 more reply

pbhjpbhj3y ago

There's a type of DMA for GPUs to access NVMe on the motherboard, IIRC. Perhaps that is a better solution here?

https://developer.nvidia.com/blog/gpudirect-storage/

boppo13y ago

Isn't pci-e latency dramatically higher than onboard vram?

andromeduck3y ago

Because then memory would be 90% of the BOM.

Kye3y ago

GDDR probably hasn't seen the same cost reduction benefits of volume DDR has.

TaylorAlexander3y ago

One question I have is: can they use cheaper kinds of RAM and still be perfectly usable for large ML models? They could put 4GB of GDDR and 128GB of cheap RAM maybe? I do realize as others are saying, this would be a new kind of card so they will need time to develop it. But would this work?

1 more reply

rahimnathwani3y ago· 14 in thread

On my system, using `-ngl 22` (running 22 layers on the GPU) cuts wall clock time by ~60%.

My system:

GPU: NVidia RTX 2070S (8GB VRAM)

CPU: AMD Ryzen 5 3600 (16GB VRAM)

Here's the performance difference I see:

CPU only (./main -t 12)

  llama_print_timings:        load time = 15459.43 ms
  llama_print_timings:      sample time =    23.64 ms /    38 runs   (    0.62 ms per token)
  llama_print_timings: prompt eval time =  9338.10 ms /   356 tokens (   26.23 ms per token)
  llama_print_timings:        eval time = 31700.73 ms /    37 runs   (  856.78 ms per token)
  llama_print_timings:       total time = 47192.68 ms

GPU (./main -t 12 -ngl 22)

  llama_print_timings:        load time = 10285.15 ms
  llama_print_timings:      sample time =    21.60 ms /    35 runs   (    0.62 ms per token)
  llama_print_timings: prompt eval time =  3889.65 ms /   356 tokens (   10.93 ms per token)
  llama_print_timings:        eval time =  8126.90 ms /    34 runs   (  239.03 ms per token)
  llama_print_timings:       total time = 18441.22 ms

guardiangod3y ago

I am testing it on an AWS instance and the speedup effect is not as consistent as I hope. The speedup varies between runs.

Intel Xeon Platinum 8259CL CPU @ 2.50GHz 128 GB RAM Tesla T4

  ./main -t 12 -m models/gpt4-alpaca-lora-30B-4bit-GGML/gpt4-alpaca-lora-30b.ggml.q5_0.bin

  llama_print_timings:        load time =  3725.08 ms
  llama_print_timings:      sample time =   612.06 ms /   536 runs   (    1.14 ms per token)
  llama_print_timings: prompt eval time = 13876.81 ms /   259 tokens (   53.58 ms per token)
  llama_print_timings:        eval time = 221647.40 ms /   534 runs   (  415.07 ms per token)
  llama_print_timings:       total time = 239423.46 ms

  ./main -t 12 -m models/gpt4-alpaca-lora-30B-4bit-GGML/gpt4-alpaca-lora-30b.ggml.q5_0.bin -ngl 30

  llama_print_timings:        load time =  7638.95 ms
  llama_print_timings:      sample time =   280.81 ms /   294 runs   (    0.96 ms per token)
  llama_print_timings: prompt eval time =  2197.82 ms /     2 tokens ( 1098.91 ms per token)
  llama_print_timings:        eval time = 112790.25 ms /   293 runs   (  384.95 ms per token)
  llama_print_timings:       total time = 120788.82 ms

rahimnathwani3y ago

Thanks. BTW:

- the model I used was gpt4-x-vicuna-13B.ggml.q5_1.bin

- I used 'time' to measure the wall clock time of each command.

- My prompt was:

  Below is an instruction that describes a task. Write a response that appropriately completes the request.
  ### Instruction:
  Write a long blog post with 5 sections, about the pros and cons of emphasising procedural fluency over conceptual understanding, in high school math education.
  ### Response:

eightysixfour3y ago

You will likely see a bit of a performance gain dropping your threads to 6. I'm on a 3700x and get a regression when using 16 threads instead of the real 8 cores.

rahimnathwani3y ago

Ha! You're right. Honestly, I couldn't recall how many cores I had, and I figured it was faster to run `cat /proc/cpuinfo |grep processor| wc -l` than to google the specs.

It seems like using the right # cores saves ~25%. and then using the GPU for 26 layers saves another ~43%.

Timings for comparison with `--temp 0`:

  -t 6 : 4m10.458s
  -t 12 : 5m34.772s
  -t 6 -ngl 26 : 2m23.278s

2 more replies

xrd3y ago

I'm experimenting with these settings. On my machine (with a 6GB NVidia GEForce, 32 GB of RAM and Intel i5-10400 CPU @ 2.90GHz) I get worse performance offloading to the GPU.

276.23 ms per token without GPU, using only -t 12

296.04 ms per token with GPU using -t 12 -ngl 22

How do you determine the right fit for -t and -ngl? I definitely compiled using make LLAMA_CUBLAS=1

I'm using models/ggml-vic7b-uncensored-q5_1.bin, does the model matter significantly?

rahimnathwani3y ago

> How do you determine the right fit for -t and -ngl?

t: the number of physical cores on your CPU

ngl: just try increasing by one or two, until you start seeing inference crash due to 'out of memory' errors

> i5-10400

You have 6 cores, so try `-t 6`

1 more reply

samstave3y ago

May you please ELI5 what is happening here...

Imagine I am first ever hearing about this, ;; what did you do?

rahimnathwani3y ago

0. Have a PC with an NVidia GPU, running Ubuntu, with the NVidia drivers and CUDA Toolkit already set up.

1. Download the weights for the model you want to use, e.g. gpt4-x-vicuna-13B.ggml.q5_1.bin

2. Clone the llama.cpp repo, and use 'make LLAMA_CUBLAS=1' to compile it with support for CUBLAS (BLAS on GPU).

3. Run the resulting 'main' executable, with the -ngl option set to 18, so that it tries to load 18 layers of the model into the GPU's VRAM, instead of the system's RAM.

2 more replies

cpill3y ago

Will this work with the leaked models or Alpaca?

MPSimmons3y ago

Yes

PaulWaldman3y ago

Any way to know the differences in power consumption?

Tuna-Fish3y ago

Probably significant savings.

rain1OP3y ago

That is a crazy speedup!!

GordonS3y ago

Is it really? Going from CPU to GPU, I would have expected a much better improvement.

2 more replies

s_dev3y ago· 7 in thread

[deleted]

capableweb3y ago

First link: https://github.com/ggerganov/llama.cpp

Which in turn has the following as the first link: https://arxiv.org/abs/2302.13971

Is it really quicker to ask here than just browse content for a bit, skimming some text or even using Google for one minute?

s_dev3y ago

>Is it really quicker to ask here than just browse content for a bit, skimming some text or even using Google for one minute?

I don't know if it's quicker but I trust human assessment a lot more than any machine generated explanations. You're right I could have asked ChatGPT or even Googled but a small bit of context goes a long way and I'm clearly out of the loop here -- it's possible others arrive on HN might appreciate such an explanation or we're better off having lots of people making duplicated efforts to understand what they're looking at.

1 more reply

djbusby3y ago

You gave an awesome answer in 2 minutes! Might be faster than reading!

1 more reply

rain1OP3y ago

llama is a text prediction model similar to GPT-2, and the version of GPT-3 that has not been fine tuned yet.

It is also possible to run fine tuned versions like vicuna with this. I think. Those versions are more focused on answering questions.

haunter3y ago

>I can't tell from the Gist alone

Literally the second line: "llama is a text prediction model similar to GPT-2, and the version of GPT-3 that has not been fine tuned yet"

rain1OP3y ago

I'm sorry! I added this improvement based on that persons question!

s_dev3y ago

Sorry -- I missed that. I'll delete my comments -- obviously I'm just an idiot asking dumb questions that have no value to anybody. I thought I read through it.

1 more reply

marcopicentini3y ago· 5 in thread

What do you use to host these models (like Vicuna, Dolly etc) on your own server and expose them using HTTP REST API? Is there an Heroku-like for LLM models?

I am looking for an open source models to do text summarization. Open AI is too expensive for my use case because I need to pass lots of tokens.

speedgoose3y ago

These days I use FastChat: https://github.com/lm-sys/FastChat

It’s not based on llama.cpp but huggingface transformers but can also run on CPU.

It works well, can be distributed and very conveniently provide the same REST API than OpenAI GPT.

itake3y ago

Do you know how well it performs compared to llama.cpp?

1 more reply

rain1OP3y ago

I haven't tried that but https://github.com/abetlen/llama-cpp-python and https://github.com/r2d4/openlm exists

inhumantsar3y ago

Weights and Biases is good for building/training models and Lambda Labs is a cloud provider for AI workloads. Lambda will only get you up to running the model though. You would still need to overlay some job management on top of that. I've heard Run.AI is good on that front but I haven't tried.

month133y ago

https://bellard.org/ts_server/ may be what you are after.

You can run open-source models, but the software itself is closed-source and free for non-commercial use.

mozillas3y ago· 5 in thread

I ran the 7B Vicuna (ggml-vic7b-q4_0.bin) on a 2017 MacBook Air (8GB RAM) with llama.cpp.

Worked OK for me with the default context size. 2048, like you see in most examples was too slow for my taste.

koheripbal3y ago

Given the current price (mostly free) off public llms I'm not sure what the use case of running out at home are yet.

OpenAIs paid GPT4 has few restrictions and is still cheap.

... Not to mention GPT4 with browsing feature is vastly superior to any home of the models you can run at home.

toxik3y ago

The point for me personally is the same as why I find it so powerful to self host SMTP, IMAP, HTTP. It’s in my hands, I know where it all begins and ends. I answer to no one.

For LLMs this means I am allowed their full potential. I can generate smut, filth, illegal content of any kind for any reason. It’s for me to decide. It’s empowering, it’s the hacker mindset.

2devnull3y ago

Many would be users can’t send their data data to openAI. Think HIPPA and other laws restricting data sharing. Federation or distribution of the models for local training is the other solution to that problem.

sagarm3y ago

I think it's mostly useful if you want to do your own fine tuning, or the data you are working with can't be sent to a third party for contractual, legal, or paranoid reasons.

1 more reply

int_19h3y ago

GPT4 API is still not universally available, for starters.

syntaxing3y ago· 4 in thread

This update is pretty exciting, I’m gonna try running a large model (65B) with a 3090. I have ran a ton of local LLM but the hardest part is finding out the prompt structure. I wish there is some sort of centralized data base that explains it.

rain1OP3y ago

Tell us how it goes! Try different numbers of layers if needed.

A good place to dig for prompt structures may be the 'text-generation-webui' commit log. For example https://github.com/oobabooga/text-generation-webui/commit/33...

int_19h3y ago

I tried llama-65b on a system with RTX 4090 + 64Gb of DDR5 system RAM. I can push up to 45 layers (out of 80) to the GPU, and the overall performance is ~800ms / token, which is "good enough" for real-time chat.

guardiangod3y ago

I got the alpaca 65B GGML model to run on my 64GB ram laptop. No GPU required if you can tolerate the 1 token per 3 seconds rate.

syntaxing3y ago

Supposedly the new update with GPU offloading will bring that up to 10 tokens per second! 1 token per second is painfully slow, that’s about 30s for a sentence.

1 more reply

tarr113y ago· 4 in thread

What is the state of the art on evaluating the accuracy of these models? Is there some equivalent to an “end to end test”?

It feels somewhat recursive since the input and output are natural language and so you would need another LLM to evaluate whether the model answered a prompt correctly.

tikkun3y ago

https://chat.lmsys.org/?arena (Click 'leaderboard')

klysm3y ago

It’s going to be very difficult to come up with any rigorous structure for automatically assessing the outputs of these models. They’re built using effectively human grading of the answers

sroussey3y ago

Llama cpp and others use perplexity:

https://huggingface.co/docs/transformers/perplexity

RockyMcNuts3y ago

hmmh, if we have the reinforcement learning part of reinforcement learning with human feedback, isn't that a model that takes a question/answer pair and rates the quality of the answer? it's sort of grading itself, it's like a training loss but it still tells us something?

dinobones3y ago· 4 in thread

What is HN’s fascination with these toy models that produce low quality, completely unusable output?

Is there a use case for them I’m missing?

Additionally, don’t they all have fairly restrictive licenses?

tbalsam3y ago

I never thought I'd see the day when a 13B model was casually referred to in a comments section as a "toy model".

az2263y ago

Compared to GPT2 it’s on par. Compared to GPT3, 3.5, or 4, it’s a toy. GPT2 is 4 years old, and in terms of LLMs, that’s several life times ago. In 5-10 years, GPT3 will be viewed as a toy. Note, “progress” will unlikely be as fast as it has been going forward.

2 more replies

andrewmcwatters3y ago

Start using it for tasks and you'll find limitations very quickly. Even ChatGPT excels at some tasks and fails miserably at others.

1 more reply

Zetobal3y ago

Maybe you forgot for what the H stands in HN... playful curiousity.

yawnxyz3y ago· 3 in thread

Could someone please share a good resource for building a machine from scratch, for doing simple-ish training and running open-source models like Llama? I'd love to run some of these and even train them from scratch, and I'd love to use that as an excuse to drop $5k on a new machine...

Would love to run a bunch of models on the machine without dripping $$ to OpenAI, Modal or other providers...

vonseel3y ago

I am no where near an expert on this subject, and this information is from a few months ago so maybe it's outdated, but people on Reddit[1] are claiming running the llama with 65B parameters would need like 20K+ of GPUs. A 40GB A100 looks like it's almost $8K on Amazon, and I'm sure you could do a lot with just one of those, but that's already beyond your $5K budget.

[1] https://www.reddit.com/r/MachineLearning/comments/11i4olx/d_...

I'll let others chime in but you could still probably build something really powerful within your budget that is able to run various AI tasks.

logicchains3y ago

You can get around 4-5 tokens per second on the 65B LLaMA with a 32 core 256GB ram Ryzen CPU, not sure how much it costs to build but can rent one from Hetzner for around two hundred bucks a month.

Joeri3y ago

There are some threads with hardware recommendations in the LocalLLaMa subreddit.

Here’s a recent one:

https://www.reddit.com/r/LocalLLaMA/comments/13f5gwn/home_ll...

bitL3y ago· 3 in thread

How about reloading parts of the model as the inference progresses instead of splitting it into GPU/CPU parts? Reloading would be memory-limited to the largest intermediate tensor cut.

regularfry3y ago

That would turn what's currently an L3 cache miss or a GPU data copy into a disk I/O stall. Not that it might not be possible to pipeline things to make that less of a problem, but it doesn't immediately strike me as a fantastic trade-off.

bitL3y ago

One can keep all tensors in the RAM, just push whatever needed to GPU VRAM, basically limited by PCIe speed. Or some intelligent strategy with read-ahead from SSD if one's RAM is limited. There are even GPUs with their own SSDs.

moffkalast3y ago

The Tensor Reloaded, starring Keanu Reeves

akulbe3y ago· 3 in thread

I've only ever been a consumer of ChatGPT/Bard. Never set up any LLM stuff locally, but the idea is appealing to me.

I have a ThinkStation P620 w/ThreadRipper Pro 3945WX (12c24t) with a GTX 1070 (and a second 1070 I could put in there) and there's 512GB of RAM on the box.

Does this need to be bare metal, or can it run in VM?

I'm currently running RHEL 9.2 w/KVM (as a VM host) with light usage so far.

xrd3y ago

Do you really have 512 GB of RAM on that box?

akulbe3y ago

Yes. I really do. Do you need to see pictures to believe it?

davidy1233y ago

You can run it with CPU or GPU in the VM, in the latter case you'd need to pass the GPU(s through, which depends what kind of VM you're using.

peatmoss3y ago· 2 in thread

From skimming, it looks like this approach requires CUDA and thus is Nvidia only.

Anyone have a recommended guide for AMD / Intel GPUs? I gather the 4 bit quantization is the special sauce for CUDA, but I’d guess there’d be something comparable for not-CUDA?

rain1OP3y ago

4-bit quantization is to reduce the amount of VRAM required to run the model. You can run it 100% on CPU if you don't have CUDA. I'm not aware of any AMD equivalent yet.

amelius3y ago

Looks like there are several projects that implement the CUDA interface for various other compute systems, e.g.:

https://github.com/ROCm-Developer-Tools/HIPIFY/blob/master/R...

https://github.com/hughperkins/coriander

I have zero experience with these, though.

2 more replies

hhh3y ago· 2 in thread

Instructions are a bit rough. The Micromamba thing doesn’t work, doesn’t say how to install it… you have to clone llama.cpp too

rain1OP3y ago

Apologies for that. I've added some extra micromamba setup commands that I should have included before!

I've also added the git clone command, thank you for the feedback

hhh3y ago

Appreciate it! This is much better!

dclowd99013y ago· 2 in thread

Has anyone tried running encryption algorithms through these models? I wonder if it could be trained to decrypt.

Hendrikto3y ago

That would be very surprising, given that any widely used cryptographic encryption algorithm has been EXTENSIVELY cryptanalyzed.

ML models are essentially trained to recognize patterns. Encryption algorithms are explicitly designed to resist that kind of analysis. LLMs are not magic.

dclowd99013y ago

All of what you said is true, for us. I know LLMs aren’t magic (lord knows I actually kind of understand the principles of how they operate), but they have a much greater computational and relational bandwidth than we’ve ever had access to before. So I’m curious if that can break down what otherwise appears to be complete obfuscation. Otherwise, we’re saying that encryption is somehow magic in a way that LLMs cannot possibly be.

2 more replies

blendergeek3y ago· 2 in thread

Is there a way to run any of these with only 4GB of VRAM?

washadjeffmad3y ago

Assuming an nvidia GPU and requisite system memory, use llama.cpp compiled with cublas support, then run with the -ngl [n layers] option. You'll need a model quantized after May 12 to work with this.

The smallest GPU-only 7B 4-bit model requires 8GB VRAM, so it's either do CPU only or use the GPU offload above.

blendergeek3y ago

Thank you! I'll give it a try.

anshumankmr3y ago· 1 in thread

How long before it runs on a 4 gig card?

rain1OP3y ago

You can offload only 10 layers or so if you want to run on a 4GB card

MuffinFlavored3y ago· 1 in thread

How many "B" (billions of parameters) is ChatGPT GPT-4?

sciolist3y ago

Information about GPT-4 was not released

ACV0013y ago· 1 in thread

The future is this - these models will be able to run on smaller and smaller hardware eventually being able to run on your phone, watch or embedded devices. The revolution is here and is inevitable. Similar to how computers evolved. We are still lucky that these models have no consciousness, still. Once they gain consciousness, that will mark the appearance of a new species (superior to us if anything). Also, luckily, they have no physical bodies and cannot replicate, so far...

olabyne3y ago

The phone part is already there ! https://mlc.ai/mlc-llm/ (granted, this is only a 7b-model running with 4bits)

tikkun3y ago

https://chat.lmsys.org/?arena (Click 'leaderboard')

Ambix3y ago

No need to convert models, 4bit LLaMA versions for GGML v2 available here:

https://huggingface.co/gotzmann/LLaMA-GGML-v2/tree/main

rahimnathwani3y ago

PSA:

If you're using oobabooga/text-generation-webui then you need to:

1. Re-install llama-cpp-python with support for CUBLAS:

  CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python --no-cache-dir --force-reinstall

2. Launch the web UI with the --n-gpu-layers flag, e.g.

  python server.py --model gpt4-x-vicuna-13B.ggml.q5_1.bin --n-gpu-layers 24

sroussey3y ago

I wish this used the webgpu c++ library instead, then it could be used in any GPU hardware.

qwertox3y ago

If I really want to do some playing around in this area, would it be good to get a RTX 4000 SFF which has 20 GB of VRAM but is a low-power card, which I want as it would be running 24/7 and energy prices are pretty bad in Germany, or would it make more sense to buy an Apple product with some M2 chip which apparently is good for these tasks as it shares CPU and GPU memory?

ranger_danger3y ago

Why can't these models run on the GPU while also using CPU RAM for the storage? That way people will performant-but-memory-starved GPUs can still utilize the better performance of the GPU calculation while also having enough RAM to store the model? I know it is possible to provide system RAM-backed GPU objects.

BlackLotus893y ago

This only uses llama correct? So the output should be the same as if you were only using llama.cpp. Am I the only one who doesn't get nearly the same quality of output using a quantized model compared to GPU? Some models I tried get astounding results when running on a GPU, but create only "garbage" when running on a CPU. Even when not quantized down to 4bit llama.cpp just doesn't compare for me. Am I alone with this?

alg_fun3y ago

wouldn't i be faster to use ram as a swap for vram?

avereveard3y ago

or like download oobabooga/text-generation-webui, any prequantized variant, and be done.

j / k navigate · click thread line to collapse

266 comments

144 comments · 28 top-level

naillo3y ago· 30 in thread

ineedasername3y ago

I can run the Wizard 30B ggml model in CPU mode using a Ryzen 5700 and 16GB of system RAM, not GPU VRAM. I’m using oobabooga as the front end.

It can produce answers to questions only slightly less well than ChatGPT (3.5). The Wizard 13B model runs much faster, maybe 2-3 tokens per second.

It is free, private, and runs on a midrange laptop.

So: Don’t think of these posts as “doing it just because you can and it’s fun to tinker”

Vast strides are being made pretty much daily in both quality and efficiency, raising their utility while lowering the cost of usage, doing both to a very significant degree.

theaiquestion3y ago

I'm sorry but that's unusably slow, even GPT-4 can take a retry or a prompt to fix certain type of issues. My experience is the open options require a lot more attempts/manual prompt tuning.

I can't think of a single workload where that is usable. That said once consumer GPUs are involved it does become usable

3 more replies

saltcured3y ago

1 more reply

BaculumMeumEst3y ago

Wow you can run a 30B model on 16gb ram? Is it hitting swap?

3 more replies

sp3323y ago

What prompt do you use to get haikus?

2 more replies

Joeri3y ago

For example, the WizardLM model (5_1 quantized, as 4_x quantization makes the 7B models perform terribly):

Prompt: I went to the market and bought 10 apples. I gave 2 to my neighbor and 2 to my sister. Then I bought 5 more and ate 1. How many apples do I have?

Response:

  Let me check that for you! You had 9 apples left after giving away 4, so you still have 9 - 4 = 5 apples left.

Response:

  Sure, let's work through this problem together! Here are the steps we can use:
  Step 1: Start with the initial number of apples you bought (10).
  Step 2: Subtract the amount given to your neighbor and sister from the total (2 + 2 = 4). This gives us a new total of 6 apples.
  Step 3: Add the additional 5 apples you bought later on to the remaining 6 apples, giving us a final total of 11 apples.
  Step 4: Finally, subtract the apple you ate from the total (1), leaving us with 10 apples in total.
  So the answer is 10 apples.

mcnnowak3y ago

I've always wondered if we could get a "good enough" general model, then add plugins for shortcomings.

I know it's a stopgap, but it seems like it'd make these models super useful on certain tasks quicker.

olabyne3y ago

I agree with you. That's why all to the small model are showing some benchmarks putting them close to GPT3.5 or even 4, only because they use specific test tasks !

In a way it just shows the amazing performance that will come from small future models

Taek3y ago

MPSimmons3y ago

How do you deal with the 'too many tokens' problem? Like, For 13b, max tokens I've gotten it to is 508 - is there a way to increase the number it will accept?

1 more reply

instance3y ago

inferense3y ago

what real use case did you use it for?

2 more replies

raffraffraff3y ago

What about turning the cloud vm off except when you're actually using it?

2 more replies

m3kw93y ago

Most part if you actually need stuff done, you can afford 20$ for OpenAIs proven models.

These local projects are great because maybe eventually they will have a equivalent model that can be run on cheap parts

thelittleone3y ago

Knowledge cut off and sending potentially sensitive IP to a centralised and untrusted third party. This would likely limit the addressable market.

1 more reply

Beltiras3y ago

The knowledge cutoff date is brutal for some applications.

1 more reply

elorant3y ago

I've setup and use Vicuna-13b for text classification, summarization and topic modelling. Works like a charm.

It's also good for math lessons.

jstarfish3y ago

Maybe others' experiences are different, but I find smaller models to work just as well for "reductive" tasks.

Dolly sucks for generating long-form content (not very creative) but if I need a summary or classification, it's quicker and easier to spin up dolly-3b than vicuna-13b.

I suspect OpenAI is routing prompts to select models based on similar logic.

BOOSTERHIDROGEN3y ago

Would like to know how you setup this. A posts would be awesome.

1 more reply

redox993y ago

People are extensively using these models (more specifically the finetuned, uncensored ones) for role playing.

irzzy3y ago

I’m interested in this. Can you elaborate?

1 more reply

iforgotpassword3y ago

jhbadger3y ago

fredguth3y ago

GitHub Copilot is (or once was) a 13b model, according to Nat Friedman in the scale.ai interview. (https://youtu.be/lnufceCxwG0)

xrd3y ago

Can you share more about what you are doing or what you've seen others doing?

1 more reply

rain1OP3y ago

It's just for fun!

These local models aren't as good as Bard or GPT-4.

happycube3y ago

There are two major advantages though - you can retrain them, and they're not on the guardrails that the commercial models have.

1 more reply

logicchains3y ago

LLaMA 65B in my experience is comparable to GPT3.5 or the old Bard for usefulness at most things.

silvervine3y ago

I am aiming to connect it to my fav game engine to make a character speak in generative sentences. I am aiming to do this mainly for fun however.

Zetobal3y ago

We run some llamas to analyze user content.

holoduke3y ago· 23 in thread

elabajaba3y ago

Because they can't do that for a "good consumer price".

eurekin3y ago

3090 has a lot of vram chips on the back though

1 more reply

Tuna-Fish3y ago

It would not be trivial to do.

If the product requires either a new memory or GPU die configuration, it's cost will be very high.

The only type of memory that can support very different VRAM sizes for an efficiently utilized bus of the same size is HBM, and so far that is limited to the very high end.

magicalhippo3y ago

Anandtech has an article on the GDDR6X variant[1] that NVIDIA has in their 3000-cards, where they use a more complex encoding to transmit two bits per clock edge.

I hadn't realized just how insane the bandwidth on the higher-ends cards are, the 3090 being just shy of 1 TB/s, yes, one terrabyte per second...

For comparison a couple of DDR5 sticks[2] will just get you north of 70GB/s...

[1]: https://www.anandtech.com/show/15978/micron-spills-on-gddr6x...

[2]: https://www.anandtech.com/show/17269/ddr5-demystified-feat-s...

1 more reply

q7xvh97o2pDhNrh3y ago

Do you happen to know where Apple's integrated approach falls on this spectrum?

2 more replies

fennecfoxy3y ago

Well yeah I guess binned cards come into play, cheaper binned cards have a narrower bus. It seems there are quite a few models that aren't too heavy on compute but require a tonne of vram.

0xcde4c3db3y ago

Probably because if they take that exact same GPU+VRAM configuration and slap it on a rackmount-optimized board, they can charge AI companies 5-10x the price for it.

jsheard3y ago

1 more reply

vegabook3y ago

fomine33y ago

This must be the reason. AI homelab market is too small while enterprise market will become extremely big.

hackernudes3y ago

It won't work out unless it becomes common to run LLMs locally. Kind of a chicken-and-egg problem so I hope they try it!

the84723y ago

> I don't think there was a market for it before LLMs.

Perhaps asset-streaming (nanite&co) will make this less of an issue, but that's also fairly new.

Do LLM implementations already stream the weights layer by layer or in whichever order they're doing the evaluation or is PCIe bandwidth too limited for that?

2 more replies

layer83y ago

Releasing a new model takes time, and it’s unclear how large the consumer market would actually be. Maybe they’re working on it right now.

kevingadd3y ago

Keep in mind video cards don't use the same kind of RAM as consumer CPUs do, they typically use GDDR or HBM.

dragonwriter3y ago

> Why does AMD or Intel not release a medium performant GPU with minimum 128gb of memory for a good consumer price.

They do. Well, not “medium performant”, but for VRAM-bound tasks they’d still be an improvement over CPUs if you could use them — iGPUs use main memory.

What they don’t have is support for them for popular GPGPU frameworks (though there was a third party CUDA-for-Intel-iGPUs a while ago.)

duxup3y ago

>for a good consumer price

Was there a consumer market for them until recently?

fooker3y ago

That's exactly what the next generation of 'accelerators' will be like.

Whether it will be co-located with a GPU for consumer hardware remains to be seen.

The thing to determine is how essential running LLMs locally is for consumers.

BigTech is pushing hard to make their clouds the only place to run LLMs unfortunately, so unless there is a killer app that is just better locally (like games were for GPUs), this might not change.

boppo13y ago

> unless there is a killer app that is just better locally

Therapy & relationship bots, like the movie 'Her'. It's ugly, but it's coming.

1 more reply

pbhjpbhj3y ago

There's a type of DMA for GPUs to access NVMe on the motherboard, IIRC. Perhaps that is a better solution here?

https://developer.nvidia.com/blog/gpudirect-storage/

boppo13y ago

Isn't pci-e latency dramatically higher than onboard vram?

andromeduck3y ago

Because then memory would be 90% of the BOM.

Kye3y ago

GDDR probably hasn't seen the same cost reduction benefits of volume DDR has.

TaylorAlexander3y ago

1 more reply

rahimnathwani3y ago· 14 in thread

On my system, using `-ngl 22` (running 22 layers on the GPU) cuts wall clock time by ~60%.

My system:

GPU: NVidia RTX 2070S (8GB VRAM)

CPU: AMD Ryzen 5 3600 (16GB VRAM)

Here's the performance difference I see:

CPU only (./main -t 12)

  llama_print_timings:        load time = 15459.43 ms
  llama_print_timings:      sample time =    23.64 ms /    38 runs   (    0.62 ms per token)
  llama_print_timings: prompt eval time =  9338.10 ms /   356 tokens (   26.23 ms per token)
  llama_print_timings:        eval time = 31700.73 ms /    37 runs   (  856.78 ms per token)
  llama_print_timings:       total time = 47192.68 ms

GPU (./main -t 12 -ngl 22)

  llama_print_timings:        load time = 10285.15 ms
  llama_print_timings:      sample time =    21.60 ms /    35 runs   (    0.62 ms per token)
  llama_print_timings: prompt eval time =  3889.65 ms /   356 tokens (   10.93 ms per token)
  llama_print_timings:        eval time =  8126.90 ms /    34 runs   (  239.03 ms per token)
  llama_print_timings:       total time = 18441.22 ms

guardiangod3y ago

I am testing it on an AWS instance and the speedup effect is not as consistent as I hope. The speedup varies between runs.

Intel Xeon Platinum 8259CL CPU @ 2.50GHz 128 GB RAM Tesla T4

  ./main -t 12 -m models/gpt4-alpaca-lora-30B-4bit-GGML/gpt4-alpaca-lora-30b.ggml.q5_0.bin

  llama_print_timings:        load time =  3725.08 ms
  llama_print_timings:      sample time =   612.06 ms /   536 runs   (    1.14 ms per token)
  llama_print_timings: prompt eval time = 13876.81 ms /   259 tokens (   53.58 ms per token)
  llama_print_timings:        eval time = 221647.40 ms /   534 runs   (  415.07 ms per token)
  llama_print_timings:       total time = 239423.46 ms

  ./main -t 12 -m models/gpt4-alpaca-lora-30B-4bit-GGML/gpt4-alpaca-lora-30b.ggml.q5_0.bin -ngl 30

  llama_print_timings:        load time =  7638.95 ms
  llama_print_timings:      sample time =   280.81 ms /   294 runs   (    0.96 ms per token)
  llama_print_timings: prompt eval time =  2197.82 ms /     2 tokens ( 1098.91 ms per token)
  llama_print_timings:        eval time = 112790.25 ms /   293 runs   (  384.95 ms per token)
  llama_print_timings:       total time = 120788.82 ms

rahimnathwani3y ago

Thanks. BTW:

- the model I used was gpt4-x-vicuna-13B.ggml.q5_1.bin

- I used 'time' to measure the wall clock time of each command.

- My prompt was:

  Below is an instruction that describes a task. Write a response that appropriately completes the request.
  ### Instruction:
  Write a long blog post with 5 sections, about the pros and cons of emphasising procedural fluency over conceptual understanding, in high school math education.
  ### Response:

eightysixfour3y ago

You will likely see a bit of a performance gain dropping your threads to 6. I'm on a 3700x and get a regression when using 16 threads instead of the real 8 cores.

rahimnathwani3y ago

Ha! You're right. Honestly, I couldn't recall how many cores I had, and I figured it was faster to run `cat /proc/cpuinfo |grep processor| wc -l` than to google the specs.

It seems like using the right # cores saves ~25%. and then using the GPU for 26 layers saves another ~43%.

Timings for comparison with `--temp 0`:

  -t 6 : 4m10.458s
  -t 12 : 5m34.772s
  -t 6 -ngl 26 : 2m23.278s

2 more replies

xrd3y ago

I'm experimenting with these settings. On my machine (with a 6GB NVidia GEForce, 32 GB of RAM and Intel i5-10400 CPU @ 2.90GHz) I get worse performance offloading to the GPU.

276.23 ms per token without GPU, using only -t 12

296.04 ms per token with GPU using -t 12 -ngl 22

How do you determine the right fit for -t and -ngl? I definitely compiled using make LLAMA_CUBLAS=1

I'm using models/ggml-vic7b-uncensored-q5_1.bin, does the model matter significantly?

rahimnathwani3y ago

> How do you determine the right fit for -t and -ngl?

t: the number of physical cores on your CPU

ngl: just try increasing by one or two, until you start seeing inference crash due to 'out of memory' errors

> i5-10400

You have 6 cores, so try `-t 6`

1 more reply

samstave3y ago

May you please ELI5 what is happening here...

Imagine I am first ever hearing about this, ;; what did you do?

rahimnathwani3y ago

0. Have a PC with an NVidia GPU, running Ubuntu, with the NVidia drivers and CUDA Toolkit already set up.

1. Download the weights for the model you want to use, e.g. gpt4-x-vicuna-13B.ggml.q5_1.bin

2. Clone the llama.cpp repo, and use 'make LLAMA_CUBLAS=1' to compile it with support for CUBLAS (BLAS on GPU).

3. Run the resulting 'main' executable, with the -ngl option set to 18, so that it tries to load 18 layers of the model into the GPU's VRAM, instead of the system's RAM.

2 more replies

cpill3y ago

Will this work with the leaked models or Alpaca?

MPSimmons3y ago

Yes

PaulWaldman3y ago

Any way to know the differences in power consumption?

Tuna-Fish3y ago

Probably significant savings.

rain1OP3y ago

That is a crazy speedup!!

GordonS3y ago

Is it really? Going from CPU to GPU, I would have expected a much better improvement.

2 more replies

s_dev3y ago· 7 in thread

[deleted]

capableweb3y ago

First link: https://github.com/ggerganov/llama.cpp

Which in turn has the following as the first link: https://arxiv.org/abs/2302.13971

Is it really quicker to ask here than just browse content for a bit, skimming some text or even using Google for one minute?

s_dev3y ago

>Is it really quicker to ask here than just browse content for a bit, skimming some text or even using Google for one minute?

1 more reply

djbusby3y ago

You gave an awesome answer in 2 minutes! Might be faster than reading!

1 more reply

rain1OP3y ago

llama is a text prediction model similar to GPT-2, and the version of GPT-3 that has not been fine tuned yet.

It is also possible to run fine tuned versions like vicuna with this. I think. Those versions are more focused on answering questions.

haunter3y ago

>I can't tell from the Gist alone

Literally the second line: "llama is a text prediction model similar to GPT-2, and the version of GPT-3 that has not been fine tuned yet"

rain1OP3y ago

I'm sorry! I added this improvement based on that persons question!

s_dev3y ago

Sorry -- I missed that. I'll delete my comments -- obviously I'm just an idiot asking dumb questions that have no value to anybody. I thought I read through it.

1 more reply

marcopicentini3y ago· 5 in thread

What do you use to host these models (like Vicuna, Dolly etc) on your own server and expose them using HTTP REST API? Is there an Heroku-like for LLM models?

I am looking for an open source models to do text summarization. Open AI is too expensive for my use case because I need to pass lots of tokens.

speedgoose3y ago

These days I use FastChat: https://github.com/lm-sys/FastChat

It’s not based on llama.cpp but huggingface transformers but can also run on CPU.

It works well, can be distributed and very conveniently provide the same REST API than OpenAI GPT.

itake3y ago

Do you know how well it performs compared to llama.cpp?

1 more reply

rain1OP3y ago

I haven't tried that but https://github.com/abetlen/llama-cpp-python and https://github.com/r2d4/openlm exists

inhumantsar3y ago

month133y ago

https://bellard.org/ts_server/ may be what you are after.

You can run open-source models, but the software itself is closed-source and free for non-commercial use.

mozillas3y ago· 5 in thread

I ran the 7B Vicuna (ggml-vic7b-q4_0.bin) on a 2017 MacBook Air (8GB RAM) with llama.cpp.

Worked OK for me with the default context size. 2048, like you see in most examples was too slow for my taste.

koheripbal3y ago

Given the current price (mostly free) off public llms I'm not sure what the use case of running out at home are yet.

OpenAIs paid GPT4 has few restrictions and is still cheap.

... Not to mention GPT4 with browsing feature is vastly superior to any home of the models you can run at home.

toxik3y ago

The point for me personally is the same as why I find it so powerful to self host SMTP, IMAP, HTTP. It’s in my hands, I know where it all begins and ends. I answer to no one.

For LLMs this means I am allowed their full potential. I can generate smut, filth, illegal content of any kind for any reason. It’s for me to decide. It’s empowering, it’s the hacker mindset.

2devnull3y ago

sagarm3y ago

I think it's mostly useful if you want to do your own fine tuning, or the data you are working with can't be sent to a third party for contractual, legal, or paranoid reasons.

1 more reply

int_19h3y ago

GPT4 API is still not universally available, for starters.

syntaxing3y ago· 4 in thread

rain1OP3y ago

Tell us how it goes! Try different numbers of layers if needed.

A good place to dig for prompt structures may be the 'text-generation-webui' commit log. For example https://github.com/oobabooga/text-generation-webui/commit/33...

int_19h3y ago

guardiangod3y ago

I got the alpaca 65B GGML model to run on my 64GB ram laptop. No GPU required if you can tolerate the 1 token per 3 seconds rate.

syntaxing3y ago

Supposedly the new update with GPU offloading will bring that up to 10 tokens per second! 1 token per second is painfully slow, that’s about 30s for a sentence.

1 more reply

tarr113y ago· 4 in thread

What is the state of the art on evaluating the accuracy of these models? Is there some equivalent to an “end to end test”?

It feels somewhat recursive since the input and output are natural language and so you would need another LLM to evaluate whether the model answered a prompt correctly.

tikkun3y ago

https://chat.lmsys.org/?arena (Click 'leaderboard')

klysm3y ago

It’s going to be very difficult to come up with any rigorous structure for automatically assessing the outputs of these models. They’re built using effectively human grading of the answers

sroussey3y ago

Llama cpp and others use perplexity:

https://huggingface.co/docs/transformers/perplexity

RockyMcNuts3y ago

dinobones3y ago· 4 in thread

What is HN’s fascination with these toy models that produce low quality, completely unusable output?

Is there a use case for them I’m missing?

Additionally, don’t they all have fairly restrictive licenses?

tbalsam3y ago

I never thought I'd see the day when a 13B model was casually referred to in a comments section as a "toy model".

az2263y ago

2 more replies

andrewmcwatters3y ago

Start using it for tasks and you'll find limitations very quickly. Even ChatGPT excels at some tasks and fails miserably at others.

1 more reply

Zetobal3y ago

Maybe you forgot for what the H stands in HN... playful curiousity.

yawnxyz3y ago· 3 in thread

Would love to run a bunch of models on the machine without dripping $$ to OpenAI, Modal or other providers...

vonseel3y ago

[1] https://www.reddit.com/r/MachineLearning/comments/11i4olx/d_...

I'll let others chime in but you could still probably build something really powerful within your budget that is able to run various AI tasks.

logicchains3y ago

You can get around 4-5 tokens per second on the 65B LLaMA with a 32 core 256GB ram Ryzen CPU, not sure how much it costs to build but can rent one from Hetzner for around two hundred bucks a month.

Joeri3y ago

There are some threads with hardware recommendations in the LocalLLaMa subreddit.

Here’s a recent one:

https://www.reddit.com/r/LocalLLaMA/comments/13f5gwn/home_ll...

bitL3y ago· 3 in thread

How about reloading parts of the model as the inference progresses instead of splitting it into GPU/CPU parts? Reloading would be memory-limited to the largest intermediate tensor cut.

regularfry3y ago

bitL3y ago

moffkalast3y ago

The Tensor Reloaded, starring Keanu Reeves

akulbe3y ago· 3 in thread

I've only ever been a consumer of ChatGPT/Bard. Never set up any LLM stuff locally, but the idea is appealing to me.

I have a ThinkStation P620 w/ThreadRipper Pro 3945WX (12c24t) with a GTX 1070 (and a second 1070 I could put in there) and there's 512GB of RAM on the box.

Does this need to be bare metal, or can it run in VM?

I'm currently running RHEL 9.2 w/KVM (as a VM host) with light usage so far.

xrd3y ago

Do you really have 512 GB of RAM on that box?

akulbe3y ago

Yes. I really do. Do you need to see pictures to believe it?

davidy1233y ago

You can run it with CPU or GPU in the VM, in the latter case you'd need to pass the GPU(s through, which depends what kind of VM you're using.

peatmoss3y ago· 2 in thread

From skimming, it looks like this approach requires CUDA and thus is Nvidia only.

Anyone have a recommended guide for AMD / Intel GPUs? I gather the 4 bit quantization is the special sauce for CUDA, but I’d guess there’d be something comparable for not-CUDA?

rain1OP3y ago

4-bit quantization is to reduce the amount of VRAM required to run the model. You can run it 100% on CPU if you don't have CUDA. I'm not aware of any AMD equivalent yet.

amelius3y ago

Looks like there are several projects that implement the CUDA interface for various other compute systems, e.g.:

https://github.com/ROCm-Developer-Tools/HIPIFY/blob/master/R...

https://github.com/hughperkins/coriander

I have zero experience with these, though.

2 more replies

hhh3y ago· 2 in thread

Instructions are a bit rough. The Micromamba thing doesn’t work, doesn’t say how to install it… you have to clone llama.cpp too

rain1OP3y ago

Apologies for that. I've added some extra micromamba setup commands that I should have included before!

I've also added the git clone command, thank you for the feedback

hhh3y ago

Appreciate it! This is much better!

dclowd99013y ago· 2 in thread

Has anyone tried running encryption algorithms through these models? I wonder if it could be trained to decrypt.

Hendrikto3y ago

That would be very surprising, given that any widely used cryptographic encryption algorithm has been EXTENSIVELY cryptanalyzed.

ML models are essentially trained to recognize patterns. Encryption algorithms are explicitly designed to resist that kind of analysis. LLMs are not magic.

dclowd99013y ago

2 more replies

blendergeek3y ago· 2 in thread

Is there a way to run any of these with only 4GB of VRAM?

washadjeffmad3y ago

Assuming an nvidia GPU and requisite system memory, use llama.cpp compiled with cublas support, then run with the -ngl [n layers] option. You'll need a model quantized after May 12 to work with this.

The smallest GPU-only 7B 4-bit model requires 8GB VRAM, so it's either do CPU only or use the GPU offload above.

blendergeek3y ago

Thank you! I'll give it a try.

anshumankmr3y ago· 1 in thread

How long before it runs on a 4 gig card?

rain1OP3y ago

You can offload only 10 layers or so if you want to run on a 4GB card

MuffinFlavored3y ago· 1 in thread

How many "B" (billions of parameters) is ChatGPT GPT-4?

sciolist3y ago

Information about GPT-4 was not released

ACV0013y ago· 1 in thread

olabyne3y ago

The phone part is already there ! https://mlc.ai/mlc-llm/ (granted, this is only a 7b-model running with 4bits)

tikkun3y ago