perfect for learning how ml frameworks work under the hood :)
We had all these issues back in 2006 when my group was implementing autograd for C++ and, later, for a computer algebra system called Axiom. We knew it'd be ideal for NNs; I was trying to build this out for my brother, who was porting AI models to GPUs. (This did not work in 2006 for both HW & math reasons.)
Do you mean the method Theano uses? Anyway, the performance bottleneck often lies in matrix multiplication or 2D convolution (which can be reduced to matmul), so compiler-level autograd wouldn't save much time.
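For anyone wondering what "reduced to matmul" means in practice, here is a rough sketch of the usual im2col trick. It assumes a single channel, stride 1, and no padding. The names im2col, matmul and conv2d_via_matmul are made up for illustration and are not taken from autograd.c or any other library.

```c
#include <assert.h>
#include <stddef.h>

/* Unroll every kh x kw patch of a single-channel h x w image into one column.
   cols is row-major with (kh * kw) rows and (out_h * out_w) columns.
   Assumption: stride 1, no padding. */
static void im2col(const float *img, size_t h, size_t w,
                   size_t kh, size_t kw, float *cols) {
    assert(kh <= h && kw <= w);
    size_t out_h = h - kh + 1;
    size_t out_w = w - kw + 1;
    size_t n_cols = out_h * out_w;
    for (size_t ki = 0; ki < kh; ki++) {
        for (size_t kj = 0; kj < kw; kj++) {
            size_t row = ki * kw + kj; /* matches the flattened kernel index */
            for (size_t oi = 0; oi < out_h; oi++) {
                for (size_t oj = 0; oj < out_w; oj++) {
                    cols[row * n_cols + oi * out_w + oj] =
                        img[(oi + ki) * w + (oj + kj)];
                }
            }
        }
    }
}

/* Naive row-major matmul: c (m x n) = a (m x k) * b (k x n). */
static void matmul(const float *a, const float *b, float *c,
                   size_t m, size_t k, size_t n) {
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++) {
            float sum = 0.0f;
            for (size_t p = 0; p < k; p++) {
                sum += a[i * k + p] * b[p * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}

/* "Valid" cross-correlation (what ML frameworks usually call convolution):
   the flattened kernel is a 1 x (kh * kw) matrix multiplied by the im2col
   matrix. cols_scratch must hold kh * kw * out_h * out_w floats and out
   must hold out_h * out_w floats. */
static void conv2d_via_matmul(const float *img, size_t h, size_t w,
                              const float *kernel, size_t kh, size_t kw,
                              float *cols_scratch, float *out) {
    im2col(img, h, w, kh, kw, cols_scratch);
    matmul(kernel, cols_scratch, out, 1, kh * kw,
           (h - kh + 1) * (w - kw + 1));
}
```

With several output channels the kernel matrix just gets more rows, so the whole layer is still one matmul, which is why the matmul kernel tends to dominate the runtime.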
i think you might be interested in MLIR/IREE: https://github.com/openxla/iree
Edit: I asked this before I read the design decisions. The reasoning, as far as I understand, is that for simplicity there are no in-place operations, hence accumulation is done on a new tensor.
https://github.com/sueszli/autograd.c/blob/main/src/autograd...
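To make the trade-off concrete, here is a minimal sketch of what out-of-place gradient accumulation could look like, next to an in-place variant for comparison. This is not the code from the linked file: Tensor, tensor_create, tensor_free, tensor_add, accumulate_grad and accumulate_grad_inplace are all hypothetical names, and the tensor is assumed to be just a flat float buffer.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical minimal tensor: a flat float buffer (not autograd.c's type). */
typedef struct {
    float *data;
    size_t size;
} Tensor;

/* Allocate a tensor and copy `size` values into it. Returns NULL on failure. */
static Tensor *tensor_create(const float *values, size_t size) {
    Tensor *t = malloc(sizeof *t);
    if (t == NULL) return NULL;
    t->data = malloc(size * sizeof *t->data);
    if (t->data == NULL) {
        free(t);
        return NULL;
    }
    memcpy(t->data, values, size * sizeof *t->data);
    t->size = size;
    return t;
}

static void tensor_free(Tensor *t) {
    if (t == NULL) return;
    free(t->data);
    free(t);
}

/* Out-of-place add: returns a freshly allocated a + b, inputs untouched. */
static Tensor *tensor_add(const Tensor *a, const Tensor *b) {
    assert(a->size == b->size);
    Tensor *out = tensor_create(a->data, a->size);
    if (out == NULL) return NULL;
    for (size_t i = 0; i < out->size; i++) {
        out->data[i] += b->data[i];
    }
    return out;
}

/* Accumulate without in-place ops: build grad + delta as a new tensor,
   swap it in, and free the old one. Costs one allocation per call. */
static int accumulate_grad(Tensor **grad, const Tensor *delta) {
    if (*grad == NULL) {
        *grad = tensor_create(delta->data, delta->size);
        return *grad != NULL ? 0 : -1;
    }
    Tensor *sum = tensor_add(*grad, delta);
    if (sum == NULL) return -1;
    tensor_free(*grad);
    *grad = sum;
    return 0;
}

/* In-place alternative: no allocation, but it mutates a buffer that other
   graph nodes might still be holding a pointer to. */
static void accumulate_grad_inplace(Tensor *grad, const Tensor *delta) {
    assert(grad->size == delta->size);
    for (size_t i = 0; i < grad->size; i++) {
        grad->data[i] += delta->data[i];
    }
}
```

The out-of-place version only needs the one tensor_add primitive and never has to reason about aliasing, which is presumably the simplicity argument. The in-place version saves the allocation, but only stays correct as long as nothing else shares the gradient buffer.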
i wonder whether there is a more clever way to do this without sacrificing simplicity.
if you are interested in the technical details, the design specs are here: https://github.com/sueszli/autograd.c/blob/main/docs/design....
if you are working on similar mlsys or compiler-style projects and think there could be overlap, please reach out: https://sueszli.github.io/