Llama2 implementation on Mojo runs at high performance

(twitter.com)

2 points · juliangamble · 2y ago · 4 comments

version_five · 2y ago

Is there a real link somewhere? What flags was llama2.c built with for the comparison? (Edit: it's built with `make runfast`, which doesn't parallelize across cores... I wonder if that's part of it. I also wonder if BLAS is another reason; I assume Mojo has some accelerated linear algebra library.)

llama2.c is a toy and not optimized: its matmul is a for loop in C, and it relies entirely on the compiler for speedup. You'd need to compare against llama.cpp for anything credible.
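For readers who haven't looked at the code, here is a minimal sketch of the kind of loop being described: a plain matrix-vector product with an optional OpenMP pragma. It is written from memory in the style of llama2.c's `run.c`, not copied from it, so treat the exact signature and argument order as assumptions.

```c
/* Sketch of a naive matmul in the style discussed above (an assumption,
 * not a verbatim copy of run.c).
 * Computes xout = W @ x, where W is d x n (row-major), x has n entries
 * and xout has d entries. */
void matmul(float *xout, const float *x, const float *w, int n, int d) {
    int i;
    /* Without an OpenMP flag such as -fopenmp this pragma is ignored and
     * the loop runs on one core, which matches the single-threaded build
     * discussed in this thread. */
    #pragma omp parallel for private(i)
    for (i = 0; i < d; i++) {
        float val = 0.0f;
        for (int j = 0; j < n; j++) {
            val += w[i * n + j] * x[j];   /* dot product of row i with x */
        }
        xout[i] = val;
    }
}
```

Everything beyond this loop (vectorization, unrolling) is left to the compiler's optimizer, which is why the build flags matter so much for the comparison.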

atairov · 2y ago

Hi. Thanks for commenting on this. You're correct: llama2.c was built with `runfast`, which doesn't spread the work across cores via OpenMP. That keeps the comparison fair, since the Mojo version didn't use the `parallelize` helper either. I think one reason llama2.c isn't performing better is that it doesn't yet have explicit SIMD support. A SIMD implementation would probably make run.c considerably more complex, while the essential purpose of llama2.c is education. On the other hand, llama2.mojo, like the Mojo ecosystem as a whole, is in its early stages. I'm researching how to apply the full set of improvements Mojo offers.
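To illustrate the complexity trade-off mentioned above, here is a hedged sketch of a dot product with four independent accumulators. It is a portable stand-in for hand-written SIMD, not code from llama2.c or llama2.mojo; the name `dot_unrolled` is hypothetical. Splitting the sum this way makes the reordering of float additions explicit, which gives the compiler more freedom to vectorize even without fast-math flags.

```c
#include <stddef.h>

/* Hypothetical helper: dot product with four independent partial sums.
 * Each accumulator depends only on itself, so the four multiply-adds in
 * the loop body can map onto one SIMD lane each. */
float dot_unrolled(const float *a, const float *b, size_t n) {
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc0 += a[i]     * b[i];
        acc1 += a[i + 1] * b[i + 1];
        acc2 += a[i + 2] * b[i + 2];
        acc3 += a[i + 3] * b[i + 3];
    }
    float sum = (acc0 + acc1) + (acc2 + acc3);
    /* Handle the leftover elements when n is not a multiple of 4. */
    for (; i < n; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

Even this mild version is noticeably longer than the plain two-line loop, and the summation order differs from the naive one, so results can differ in the last bits for general inputs.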

version_five · 2y ago

Thanks for clarifying. I'm interested in what C is leaving on the table in terms of performance. I saw your GitHub implementation; I'd suggest submitting it as a Show HN if you haven't already. (Looks like you did submit it. Try again with a "Show HN:" prefix and maybe more people will notice.)

I noticed that it says Mojo is using six threads. Is that across cores, or is it something else? Do you know what it's running in the different threads?

I also saw some discussion in the llama2.c issues about using BLAS for the matmul. I'd be curious to know what speedup this gives.
