llama2.c is a toy and isn't optimized: its matmul is a plain for loop in C, and it relies entirely on the compiler for any speedup. You'd need to compare against llama.cpp for anything credible.
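For anyone who hasn't looked at the code, this is roughly the kind of loop I mean. It's a minimal sketch written from memory, not copied from the repo; the name `matmul_naive` and the row-major weight layout are my assumptions for illustration.

```c
#include <stddef.h>

/* Naive matrix-vector product: xout = W * x.
 * W is stored row-major with d rows and n columns (assumed layout),
 * x has n elements, xout has d elements.
 * No blocking, no hand-written SIMD: any vectorization is left to the
 * compiler (e.g. at -O3). */
void matmul_naive(float *xout, const float *x, const float *w, int n, int d) {
    for (int i = 0; i < d; i++) {
        float acc = 0.0f;
        for (int j = 0; j < n; j++) {
            acc += w[(size_t)i * n + j] * x[j];
        }
        xout[i] = acc;
    }
}
```

How fast this runs depends almost entirely on compiler flags and how well the inner loop auto-vectorizes, which is why it makes a weak baseline.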
I noticed that it says Mojo is using six threads. Is that across cores, or is it something else? Do you know what it's running in the different threads?
I also saw some discussion in the llama2.c issues about using BLAS for the matmul. I'd be curious to know what speedup this gives.