It's the Hugging Face transformers library, which is implemented in PyTorch.
In terms of speed, yes, running fp16 will indeed be faster on a vanilla GPU setup. However, most people are running 4-bit quantized versions, and the GPU quantization landscape has been a mess (see the GPTQ-for-llama project). llama.cpp has taken a totally different approach, and it looks like they are currently able to match native GPU perf via cuBLAS with much less effort and brittleness.
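
For a rough sense of why most people end up on 4-bit, here is a back-of-the-envelope sketch of weight memory at fp16 versus 4-bit. The 7B parameter count is just an assumed example size, and `weight_bytes` is a hypothetical helper, not part of any library. It ignores quantization overhead (per-group scales and zero points), activations, and the KV cache, so real numbers will be somewhat higher.

```python
def weight_bytes(n_params: int, bits_per_weight: float) -> float:
    """Approximate storage for the weights alone, in bytes."""
    return n_params * bits_per_weight / 8


# Assumed example model size (7 billion parameters); swap in your own.
n_params = 7_000_000_000

GIB = 2**30
fp16_gib = weight_bytes(n_params, 16) / GIB  # 16 bits per weight
q4_gib = weight_bytes(n_params, 4) / GIB     # 4 bits per weight, no overhead counted

print(f"fp16 weights:  {fp16_gib:.1f} GiB")
print(f"4-bit weights: {q4_gib:.1f} GiB")
print(f"ratio: {fp16_gib / q4_gib:.1f}x")
```

The 4x difference in weight size is usually what decides whether a model fits on a consumer GPU at all, which is a separate question from how fast each format runs once it fits.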