You can also run those on smaller cards by configuring the number of layers on the GPU. That should allow you to run the Q4/Q5 version on a 4090, or on older cards.
You could also run it entirely on the CPU/in RAM if you have 32GB (or ideally 64GB) of RAM.
The more you run in RAM the slower the inference.
No comments yet.