Is it using mmap and concealing the actual memory usage?
```
main: build = 942 (4f6b60c)
main: seed  = 1691400051
llama.cpp: loading model from /media/z/models/TheBloke_Llama-2-7b-chat-GGML/llama-2-7b-chat.ggmlv3.q5_1.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_head_kv  = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 1.0e-05
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 9 (mostly Q5_1)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.08 MB
llama_model_load_internal: mem required  = 4820.60 MB (+ 256.00 MB per state)
llama_new_context_with_model: kv self size  = 256.00 MB
llama_new_context_with_model: compute buffer total size = 71.84 MB
```
You can see the reported memory requirement of 4820.60 MB (+ 256.00 MB per state), yet the process monitor (on Ubuntu) shows the process using less than 400 MB.
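My understanding is that mmap-backed file pages are only charged to a process's resident set (what most process monitors display) once they are actually touched, which would explain the gap. Below is a minimal sketch, not taken from llama.cpp, that demonstrates this on Linux; the path `model.bin` is a placeholder for any large file:

```cpp
// Sketch: show that mmap'ing a file barely changes RSS until pages are read.
// Assumes Linux (/proc/self/statm) and a large file at the placeholder path.
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#include <cstdio>

// Resident set size of the current process, in bytes (Linux-specific).
static long rss_bytes() {
    long pages = 0;
    FILE* f = std::fopen("/proc/self/statm", "r");
    if (!f) return -1;
    // statm fields: size resident shared ...; skip size, read resident.
    if (std::fscanf(f, "%*ld %ld", &pages) != 1) pages = -1;
    std::fclose(f);
    return pages < 0 ? -1 : pages * sysconf(_SC_PAGESIZE);
}

int main() {
    const char* path = "model.bin";  // placeholder for a large model file
    int fd = open(path, O_RDONLY);
    if (fd < 0) { std::perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { std::perror("fstat"); return 1; }

    // Mapping reserves address space but loads nothing into physical memory.
    unsigned char* p = static_cast<unsigned char*>(
        mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
    if (p == MAP_FAILED) { std::perror("mmap"); return 1; }

    std::printf("RSS after mmap:    %ld MB\n", rss_bytes() / (1024 * 1024));

    // Touch every page: only now does the kernel fault the file contents in
    // and charge them to this process's resident set.
    volatile unsigned char sum = 0;
    const long page = sysconf(_SC_PAGESIZE);
    for (off_t i = 0; i < st.st_size; i += page) sum += p[i];

    std::printf("RSS after reading: %ld MB\n", rss_bytes() / (1024 * 1024));

    munmap(p, st.st_size);
    close(fd);
    return 0;
}
```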
This is the command:

```
./main -eps 1e-5 \
  -m /media/z/models/TheBloke_Llama-2-7b-chat-GGML/llama-2-7b-chat.ggmlv3.q5_1.bin \
  -t 13 \
  -p "[INST] <<SYS>>You are a helpful and concise assistant<</SYS>>Write a c++ function that calculates RMSE between two double lists using CUDA. Don't explain, just write out the code.[/INST]"
```
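If mmap is indeed the cause, then rerunning the same command with llama.cpp's `--no-mmap` flag (which disables memory-mapping and reads the model into the process's own memory) should make the process monitor report something close to the full ~4.8 GB figure.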