Looked at the author and realized it's Alfonso from the Graal team -- makes sense.
I wonder whether the "matmul" code could be further optimized with the Vector API and SIMD.
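For what it's worth, here is a rough sketch of what a Vector-API matmul could look like. This is a hypothetical standalone example, not the project's actual code: the signature (`xout = W * x` with `W` flattened row-major) mirrors the usual llama2-style matvec, but the real implementation may differ. It needs JDK 16+ run with `--add-modules jdk.incubator.vector`.

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class MatmulSketch {
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    // Hypothetical matvec: xout[i] = sum_j W[i][j] * x[j], with W stored
    // row-major in a flat float[] of length d * n.
    static void matmul(float[] xout, float[] x, float[] w, int n, int d) {
        for (int i = 0; i < d; i++) {
            FloatVector acc = FloatVector.zero(SPECIES);
            int j = 0;
            int upper = SPECIES.loopBound(n);
            for (; j < upper; j += SPECIES.length()) {
                FloatVector wv = FloatVector.fromArray(SPECIES, w, i * n + j);
                FloatVector xv = FloatVector.fromArray(SPECIES, x, j);
                acc = wv.fma(xv, acc); // fused multiply-add across SIMD lanes
            }
            float sum = acc.reduceLanes(VectorOperators.ADD);
            for (; j < n; j++) { // scalar tail for the remainder
                sum += w[i * n + j] * x[j];
            }
            xout[i] = sum;
        }
    }

    public static void main(String[] args) {
        float[] w = {1, 2, 3, 4, 5, 6}; // 2x3 matrix, row-major
        float[] x = {1, 1, 1};
        float[] out = new float[2];
        matmul(out, x, w, 3, 2);
        System.out.println(out[0] + " " + out[1]); // 6.0 15.0
    }
}
```

In practice the JIT often auto-vectorizes a plain scalar loop like this anyway, so whether the explicit Vector API wins would need benchmarking.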
Just in case anyone is interested in a Python version: I spent some time over the weekend and ported it to pure Python -- https://github.com/tairov/llama2.py
I never knew it would take only about 500 lines of core code to implement inference for such cutting-edge AI technology.
Is there any indication that it won't go from there to a final release soon?
Any abstraction for GPGPU or shader programming?
But it's a research project.