
0 points · TaylorAlexander · 3y ago · 0 comments

Following up: rebooting into the GUI was enough to get it to fit. I guess Xorg had just accumulated some cruft during my last boot, so I can run it alongside GNOME.

nvidia-smi reports this model is using 15475MiB after changing the max batch size from 32 to 8 (see the link in the post above).
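A rough sanity check on why the batch size matters so much, as a back-of-envelope sketch only. The constants below are assumptions, not numbers from this post: roughly 6.7e9 parameters, 32 layers, model dimension 4096, fp16 (2 bytes per value), a max sequence length of 512, and a key/value cache that is preallocated for the full batch size.

```python
# Back-of-envelope fp16 memory estimate for 7B inference.
# ASSUMPTIONS (not from the post): parameter count, layer count, model
# dimension, max_seq_len, and a KV cache preallocated for the full batch.
N_PARAMS = 6.7e9        # assumed parameter count
N_LAYERS = 32           # assumed number of transformer layers
DIM = 4096              # assumed model dimension
BYTES_PER_VALUE = 2     # fp16
MIB = 2**20


def weight_bytes() -> float:
    """Memory needed just to hold the weights."""
    return N_PARAMS * BYTES_PER_VALUE


def kv_cache_bytes(max_batch_size: int, max_seq_len: int = 512) -> int:
    """One key tensor and one value tensor per layer, each (batch, seq, dim)."""
    return 2 * N_LAYERS * max_batch_size * max_seq_len * DIM * BYTES_PER_VALUE


for batch in (32, 8):
    total = weight_bytes() + kv_cache_bytes(batch)
    print(f"max_batch_size={batch}: ~{total / MIB:.0f} MiB (weights + KV cache)")
```

Under those assumptions the cache alone drops from 8 GiB to 2 GiB when going from batch 32 to 8. The estimate ignores activations and the CUDA context, so it should land somewhat below what nvidia-smi shows.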

As others have stated, someone may have injected unknown code into the pickled checkpoint, so I recommend running this in Docker. I use this command to run the Docker image after getting the NVIDIA Docker stuff configured:

docker run --runtime=nvidia -it --mount type=bind,source=/MY_LLAMA_SOURCE_PATH,target=/llama --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04

Then install the necessary dependencies in that container (obviously you could make a Dockerfile), put your model and the tokenizer files (from the root dir of the download) into some directory (here models/LLaMA-7B), and run this:

torchrun example.py --ckpt_dir ./models/LLaMA-7B --tokenizer_path ./models/LLaMA-7B/tokenizer.model

Enjoy! The 7B parameter model is not amazing from my initial testing. I am very curious whether larger models can be run on 1 GPU by sequentially loading checkpoints. I don't know how all this stuff is organized.
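On the pickled-checkpoint concern mentioned above: one way to peek at a checkpoint without executing it is to walk the pickle opcodes and list which importable names it references. This is a minimal sketch using only the standard library. It assumes the checkpoint is in the zip-based format with an embedded `.pkl` member, the function names are made up for illustration, and the `STACK_GLOBAL` handling is a heuristic. A clean listing does not prove a file is safe, so the Docker isolation is still worth having.

```python
import collections
import pickle
import pickletools
import zipfile

# Opcodes that push a text string; used for the STACK_GLOBAL heuristic below.
STRING_OPS = {"SHORT_BINUNICODE", "BINUNICODE", "BINUNICODE8", "UNICODE"}


def list_pickle_globals(data):
    """Return the 'module.name' references in a pickle, without unpickling it."""
    found = set()
    recent_strings = []
    for opcode, arg, _pos in pickletools.genops(data):
        if opcode.name == "GLOBAL":
            # genops reports the argument as "module name"
            found.add(arg.replace(" ", ".", 1))
        elif opcode.name in STRING_OPS:
            recent_strings.append(arg)
        elif opcode.name == "STACK_GLOBAL":
            # Heuristic: assume the last two pushed strings are module and name.
            found.add(".".join(recent_strings[-2:]))
    return found


def list_checkpoint_globals(path):
    """Scan every .pkl member of a zip-format checkpoint (assumed layout)."""
    found = set()
    with zipfile.ZipFile(path) as archive:
        for name in archive.namelist():
            if name.endswith(".pkl"):
                found |= list_pickle_globals(archive.read(name))
    return found


# Small demo on an in-memory pickle rather than a real checkpoint.
demo = pickle.dumps(collections.OrderedDict(a=1), protocol=2, fix_imports=False)
print(sorted(list_pickle_globals(demo)))
```

Anything in the listing that is not an expected tensor-rebuilding or container type (for example something from `os` or `subprocess`) would be a reason not to load the file.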

Example output below.

Prompt:

Please respond to this question: As a large language model, what are three things that you find most important?

Output:

To get students to analyze what they are doing in their learning

To get students to analyze what they are doing in their learning so that they can find the best practices

To get students to analyze what they are doing in their learning so that they can find the best practices and then to make changes to those practices so that they can improve their learning

I don't think this question really made much sense because the sentence in the question is incomplete

To get students to analyze what they are doing in their learning so that they can find the best practices and then to make changes to those practices so that they can improve their learning

I don't think this question really made much sense because the sentence in the question is incomplete.

I think this question is in fact a good question, even though the initial sentence is incomplete, and I do think it makes sense.

I think this question is in fact a good question, even though the initial sentence is incomplete, and I do think it makes sense. (I think it is a good question but I am not sure it makes sense).


5 comments · 2 top-level

markasoftware · 3y ago · 3 in thread

I was able to run 7B on a CPU, inferring several words per second: https://github.com/markasoftware/llama-cpu

jpieper · 3y ago

Beginner pytorch user here... it looks like it is using only one CPU on my machine. Is it feasible to use more than one? If so, what options/env vars/code change are necessary?

markasoftware · 3y ago

Perhaps try setting `OMP_NUM_THREADS`, for example `OMP_NUM_THREADS=4 torchrun ...`.

But on my machine, it automatically used all 12 available physical cores. Setting OMP_NUM_THREADS=2 for example lets me decrease the number of cores being used, but increasing it to try and use all 24 logical threads has no effect. YMMV.
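If you would rather set the variable from a launcher script than on the shell command line, a minimal sketch looks like this. The helper name is hypothetical, and the demo command just echoes the variable back instead of starting `torchrun`; swap in your real command.

```python
import os
import subprocess
import sys


def run_with_omp_threads(cmd, num_threads):
    """Run cmd in a child process with OMP_NUM_THREADS set (hypothetical helper)."""
    env = dict(os.environ, OMP_NUM_THREADS=str(num_threads))
    return subprocess.run(cmd, env=env, capture_output=True, text=True)


# Demo: the child process only prints the value it sees.
result = run_with_omp_threads(
    [sys.executable, "-c", "import os; print(os.environ['OMP_NUM_THREADS'])"],
    4,
)
print(result.stdout.strip())
```

The variable has to be in the environment before the child process starts, which is why it is passed through `env` rather than set afterwards.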

TaylorAlexander (OP) · 3y ago

Nice!

byteknight · 3y ago

Looks like you need multiple GPUs for anything >7B.

https://github.com/facebookresearch/llama/issues/55#issuecom...
