Looked at the author and realized it's Alfonso from the Graal team -- makes sense.
I wonder whether the "matmul" code could be further optimized with the Vector API and SIMD.
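For what it's worth, here is a rough sketch of what a Vector-API matmul could look like. This is a hypothetical standalone example, not the project's actual code: the signature (`xout = W * x` with `W` flattened row-major) mirrors the usual llama2-style matvec, but the real implementation may differ. It needs JDK 16+ run with `--add-modules jdk.incubator.vector`.

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class MatmulSketch {
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    // Hypothetical matvec: xout[i] = sum_j W[i][j] * x[j], with W stored
    // row-major in a flat float[] of length d * n.
    static void matmul(float[] xout, float[] x, float[] w, int n, int d) {
        for (int i = 0; i < d; i++) {
            FloatVector acc = FloatVector.zero(SPECIES);
            int j = 0;
            int upper = SPECIES.loopBound(n);
            for (; j < upper; j += SPECIES.length()) {
                FloatVector wv = FloatVector.fromArray(SPECIES, w, i * n + j);
                FloatVector xv = FloatVector.fromArray(SPECIES, x, j);
                acc = wv.fma(xv, acc); // fused multiply-add across SIMD lanes
            }
            float sum = acc.reduceLanes(VectorOperators.ADD);
            for (; j < n; j++) { // scalar tail for the remainder
                sum += w[i * n + j] * x[j];
            }
            xout[i] = sum;
        }
    }

    public static void main(String[] args) {
        float[] w = {1, 2, 3, 4, 5, 6}; // 2x3 matrix, row-major
        float[] x = {1, 1, 1};
        float[] out = new float[2];
        matmul(out, x, w, 3, 2);
        System.out.println(out[0] + " " + out[1]); // 6.0 15.0
    }
}
```

In practice the JIT often auto-vectorizes a plain scalar loop like this anyway, so whether the explicit Vector API wins would need benchmarking.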
Just in case anyone is interested in a Python version: I spent some time over the weekend and ported it to pure Python -- https://github.com/tairov/llama2.py
I never knew it would take only about 500 lines of core code to implement inference for such cutting-edge AI technology.
Is there any indication that it won't go from there to a final release soon?
Any abstraction for GPGPU or shader programming?
But it's a research project.