I just did:
./mixtral-8x7b-instruct-v0.1.Q8_0.llamafile --cli -t 16 -n 200 -p "In terms of Lasso"
I got 15 tokens per second for prompt evaluation and 8 tokens per second for regular (generation) eval.
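For a sense of scale, here is a back-of-the-envelope sketch of what those rates mean for the `-n 200` run above (the helper function is my own, not part of llamafile):

```python
# Rough time spent generating, given a measured throughput.
# At 8 tokens/second, the 200 tokens requested via -n 200
# take about 25 seconds of generation time.
def generation_seconds(n_tokens: int, tokens_per_second: float) -> float:
    return n_tokens / tokens_per_second

print(generation_seconds(200, 8))  # 25.0
```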
The same hardware can run things much faster on OSX, or with more aggressive quantization, but I prefer to run at Q8 or f16 even if it is slow. In the future I hope to use the GPU, the ANE, and the crazy 1.58- or 0.68-bit quantization schemes, but for now this does the trick handsomely.