undefined | Better HN

0 pointslhl2y ago0 comments

For more standardized speed benchmarking, I'd recommend benchmarking with something like `-c 128 -n 1920 --ignore-eos` (and skipping the `-p` entirely). The number that you care most about would be the "eval time" tokens/second - it tends to get slower as context increases, which is why it's sort of important to standardize. I think 2-3 t/s is about what's expected (Threadripper Pro 5000 w/ 8 channels of DDR-3200 should have an expected theoretical top memory bandwidth of 204.8 GB/s - memory bandwidth is the main limiting factor for most systems for LLM inferencing).

0 comments

No comments yet.