"99% written by DeepSeek-R1" according to the author.
To be fair, torch didn't try very hard to optimize on CPU either.
[0] 200MB is actually a very generous number. I tried to download some AI thing via pip3 the other day and it wanted 600MB or so of CUDA stuff. Meanwhile, I don't even have an Nvidia GPU.
Fwiw, there are always many attempts at optimizing code (assembly etc). This is good! Great to try new techniques. However, you get what you constrain. I've seen "optimized" code that drops checks the standard requires, according to the compiler authors. So if you don't explicitly tell your optimizer "this is a case I care about, this is the desired output", it will ignore that case.
Did we find a faster implementation than the compiler creates? Well, I mean, sure, if you don't know why the compiler is doing what it's doing.