1. Download the weights for the model you want to use, e.g. gpt4-x-vicuna-13B.ggml.q5_1.bin
2. Clone the llama.cpp repo, and use 'make LLAMA_CUBLAS=1' to compile it with support for CUBLAS (BLAS on GPU).
3. Run the resulting 'main' executable, with the -ngl option set to 18, so that it tries to load 18 layers of the model into the GPU's VRAM, instead of the system's RAM.
I think you need to quantize the model yourself from the float/Hugging Face versions. My understanding is that the quantization formats changed recently, and old quantized models no longer work.
The repo has now been updated with requantized models that work with the latest version, so you don't need to do that any more.
https://huggingface.co/TheBloke/gpt4-x-vicuna-13B-GGML/commi...
0. Make sure you have a PC with an NVidia GPU running Ubuntu, and you have already set up the NVidia drivers and CUDA Toolkit.
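A quick way to sanity-check step 0 before building is to see whether the driver utility and the CUDA compiler are on your PATH. This is only a minimal sketch: `check_tool` is a helper name made up for illustration, and finding `nvidia-smi` and `nvcc` on PATH does not prove the driver and toolkit versions are compatible with each other.

```bash
# Report whether a command is available on PATH.
# check_tool is a hypothetical helper, not part of llama.cpp or CUDA.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1"
  fi
}

check_tool nvidia-smi   # ships with the NVIDIA driver
check_tool nvcc         # ships with the CUDA Toolkit
```

If either one is reported missing, the `make LLAMA_CUBLAS=1` build is unlikely to succeed, so it is worth fixing that first.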
1. Download the weights for the model you want to use, such as "gpt4-x-vicuna-13B.ggml.q5_1.bin". You can use a web browser or a command-line tool like `wget` to download the file. For example:
```bash
wget https://example.com/path/to/gpt4-x-vicuna-13B.ggml.q5_1.bin
```
2. Clone the llama.cpp repository using `git`. Make sure you have `git` installed on your system. Open a terminal and navigate to the directory where you want to clone the repository, then execute the following command:
```bash
git clone https://github.com/example/llama.cpp.git
```
3. Navigate into the cloned `llama.cpp` directory:
```bash
cd llama.cpp
```
4. Compile the code with support for CUBLAS (BLAS on GPU) by running the following command:
```bash
make LLAMA_CUBLAS=1
```
5. Once compilation succeeds, run the resulting `main` executable with `-m` pointing at the weights you downloaded and the `-ngl` option set to 18. This will attempt to load 18 layers of the model into the GPU's VRAM instead of the system's RAM:
```bash
./main -m /path/to/gpt4-x-vicuna-13B.ggml.q5_1.bin -ngl 18
```
Make sure to replace the example URLs and filenames with the actual ones you are using.
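The value 18 for `-ngl` is specific to one GPU. A rough way to pick a value for a different card is to divide the model file size by the layer count and see how many layers fit in the VRAM you can spare. The sketch below is only a back-of-the-envelope estimate under stated assumptions: all layers are treated as equal in size, the KV cache, scratch buffers, and CUDA context overhead are ignored, and the numbers in the example call (a 9500 MiB file, a 4500 MiB VRAM budget) are illustrative rather than measured. The 40-layer count is that of 13B LLaMA models. `estimate_ngl` is a hypothetical helper, not part of llama.cpp.

```bash
# Estimate how many layers fit in a VRAM budget.
# Args: model file size (MiB), total layer count, VRAM budget (MiB).
# estimate_ngl is a hypothetical helper for illustration only.
estimate_ngl() {
  model_mib=$1; n_layers=$2; vram_mib=$3
  # Approximate size of one layer, assuming all layers are equal.
  per_layer=$(( model_mib / n_layers ))
  if [ "$per_layer" -le 0 ]; then echo "$n_layers"; return; fi
  fit=$(( vram_mib / per_layer ))
  # Never report more layers than the model has.
  if [ "$fit" -gt "$n_layers" ]; then fit=$n_layers; fi
  echo "$fit"
}

# Illustrative inputs: 9500 MiB model, 40 layers, 4500 MiB of free VRAM.
estimate_ngl 9500 40 4500   # prints 18
```

Treat the result as an upper bound and lower it if `main` runs out of memory while loading.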
Please note that these steps are based on the information you provided and assume you have the necessary dependencies and permissions to perform these actions. If you encounter any issues or errors during the process, please provide more details, and I'll be happy to assist you further.
-
Certainly! Here are the detailed steps for the tasks you mentioned in a table format:
| Step | Description |
|------|-------------|
| 0 | Make sure you have a PC with an NVidia GPU running Ubuntu, and you have already set up the NVidia drivers and CUDA Toolkit. |
| 1 | Download the weights for the model you want to use, such as "gpt4-x-vicuna-13B.ggml.q5_1.bin". |
| 2 | Clone the llama.cpp repository using `git`. |
| 3 | Compile the code with support for CUBLAS (BLAS on GPU) using `make LLAMA_CUBLAS=1`. |
| 4 | Run the resulting `main` executable with the `-ngl` option set to 18. |
Here's how the commands can be represented in a table format:
| Step | Command |
|------|---------|
| 1 | `wget https://example.com/path/to/gpt4-x-vicuna-13B.ggml.q5_1.bin` |
| 2 | `git clone https://github.com/example/llama.cpp.git` |
| 3 | `make LLAMA_CUBLAS=1` (run inside the cloned `llama.cpp` directory) |
| 4 | `./main -m /path/to/gpt4-x-vicuna-13B.ggml.q5_1.bin -ngl 18` |
Make sure to replace the example URLs and filenames with the actual ones you are using.