./llama-bench -m /data/ai/models/llm/gguf/mistral-7b-instruct-v0.1.Q4_K_M.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon 780M, compute capability 11.0, VMM: no
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
| llama 7B Q4_K - Medium | 4.07 GiB | 7.24 B | ROCm | 99 | pp512 | 242.69 ± 0.99 |
| llama 7B Q4_K - Medium | 4.07 GiB | 7.24 B | ROCm | 99 | tg128 | 15.33 ± 0.03 |
build: e11bd856 (3620)

Let’s be honest: it might not be awful, but it’s a nonstarter for encouraging local LLM adoption, and most people will prefer to pay pennies for API access instead (friction aside).
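To put the two throughput figures in wall-clock terms, here is a rough sketch that turns the `pp512` and `tg128` rates from the table into per-request latency. The function name and the example prompt sizes are made up for illustration, and it assumes throughput stays flat at the benchmarked rates, which is optimistic for prompts much longer than 512 tokens.

```python
# Throughput figures copied from the llama-bench table above (tokens/second).
PP_TOKENS_PER_S = 242.69  # pp512: prompt processing (prefill)
TG_TOKENS_PER_S = 15.33   # tg128: text generation


def estimate_latency(prompt_tokens: int, output_tokens: int) -> tuple[float, float]:
    """Return (seconds until the first token, total seconds) for one request.

    Assumption: both rates stay constant regardless of context length.
    """
    prefill_s = prompt_tokens / PP_TOKENS_PER_S
    generate_s = output_tokens / TG_TOKENS_PER_S
    return prefill_s, prefill_s + generate_s


if __name__ == "__main__":
    # Hypothetical prompt sizes: a short prompt vs. a long conversation history.
    for prompt_tokens in (512, 4096):
        first, total = estimate_latency(prompt_tokens, output_tokens=256)
        print(f"{prompt_tokens:>5} prompt tokens: "
              f"{first:5.1f}s to first token, {total:5.1f}s total")
```

Under these assumptions a 512-token prompt costs about 2 seconds before the first token appears, while a 4096-token prompt costs about 17 seconds, roughly as long as generating 256 tokens.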
However, for anyone looking to use a local model on a chip with the Radeon 890M:
- look into implementing (or waiting for) NPU support - XDNA2's 50 TOPS should provide more raw compute than the 890M for tensor math (w/ Block FP16)
- use a smaller, more appropriate model for your use case (3B or smaller models can fulfill most simple requests and will of course be faster)
- don't use long conversations - a new conversation starts with 0 context, so there is no prefill to wait for
- use `cache_prompt` - for bs=1 interactive use, you can save input/generations to cache so they are not reprocessed on the next turn
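As a minimal sketch of the `cache_prompt` tip, assuming you are running the llama.cpp HTTP server locally and talking to its `/completion` endpoint: the URL, the helper names, and the way the history string is assembled are illustrative assumptions, not part of the benchmark above.

```python
import json
import urllib.request

# Assumption: a llama.cpp server is listening locally (8080 is its default port).
SERVER_URL = "http://127.0.0.1:8080/completion"


def build_request(history: str, user_turn: str, n_predict: int = 128) -> dict:
    """Build a /completion payload that asks the server to reuse its prompt cache.

    The prompt is the previous history plus the new turn, so consecutive
    requests share a prefix that the server does not need to prefill again.
    """
    return {
        "prompt": history + user_turn,
        "n_predict": n_predict,
        "cache_prompt": True,
    }


def complete(payload: dict, url: str = SERVER_URL) -> str:
    """POST the payload and return the generated text (needs a running server)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["content"]
```

The cache only helps when each new prompt starts with exactly the same text as the previous one, so append to the history rather than rewriting or trimming it from the front.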
I don't think so. Humans scan for keywords very often. Nobody really reads every word. Faster-than-reading-speed inference is definitely beneficial.
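For a rough sense of scale, the sketch below converts the `tg128` rate from the benchmark into words per minute. The words-per-token ratio and the 250 wpm reading speed are loose rules of thumb assumed for illustration, not measurements, so treat the result as a ballpark only.

```python
TG_TOKENS_PER_S = 15.33  # tg128 result from the llama-bench table

# Assumptions (not from the benchmark): rough rules of thumb for English text.
WORDS_PER_TOKEN = 0.75
READING_WPM = 250


def generation_wpm(tokens_per_s: float, words_per_token: float = WORDS_PER_TOKEN) -> float:
    """Convert a generation rate in tokens/second to words/minute."""
    return tokens_per_s * words_per_token * 60


def speed_ratio(tokens_per_s: float, reader_wpm: float = READING_WPM) -> float:
    """How many times faster than the reader the model produces text."""
    return generation_wpm(tokens_per_s) / reader_wpm


if __name__ == "__main__":
    print(f"{generation_wpm(TG_TOKENS_PER_S):.0f} wpm generated, "
          f"{speed_ratio(TG_TOKENS_PER_S):.1f}x an assumed {READING_WPM} wpm reader")
```

With these assumptions, 15.33 t/s works out to roughly 690 words per minute: comfortably ahead of word-by-word reading, but with much less headroom for someone who is only scanning for keywords.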