Show HN: Autograd.c – A tiny ML framework built from scratch (opens in new tab)

(github.com)

85 pointssueszli6mo ago13 comments

built a tiny pytorch clone in c after going through prof. vijay janapa reddi's mlsys book: mlsysbook.ai/tinytorch/

perfect for learning how ml frameworks work under the hood :)

13 comments

10 comments · 3 top-level

spwa46mo ago· 6 in thread

Cool. But this makes me wonder. This negates most of the advantages of C. Is there a compiler-autograd "library"? Something that would compile into C specifically to execute as fast as possible on CPUs with no indirection at all.

thechao6mo ago

At best you'd be restricted to the forward mode, which would still double stack pressure. If you needed reverse mode you'd need 2x stack, and the back sweep over the stack based tape would have the nearly perfectly unoptimal "grain". If you allows the higher order operators (both push out and pull back), you're going to end up with Jacobians & Hessians over nontrivial blocks. That's going to need the heap. It's still better than an unbounded loop tape, though.

We had all these issues back in 2006 when my group was implementing autograd for C++ and, later, a computer algebra system called Axiom. We knew it'd be ideal for NN; I was trying to build this out for my brother who was porting AI models to GPUs. (This did not work in 2006 for both HW & math reasons.)

spwa46mo ago

Why not recompile every iteration? Weights are only updated at the end of the batch size at the earliest, and for distributed training, n batch sizes at the fastest, and generally only at the end of an iteration. In either case the cost of recompiling would be negligeable, no?

1 more reply

attractivechaos6mo ago

> Is there a compiler-autograd "library"?

Do you mean the method theano is using? Anyway, the performance bottleneck often lies in matrix multiplication or 2D-CNN (which can be reduced to matmul). Compiler autograd wouldn't save much time.

marcthe126mo ago

We would need to mirror jax architecture more. Since the jax is sort of jit arch wise. Basically you somehow need a good way to convert computational graph to machine code while at compile time also perform a set of operations on the graph.

justinnk6mo ago

I believe Enzyme comes close to what you describe. It works on the LLVM IR level.

https://enzyme.mit.edu

sueszliOP6mo ago

a heap-free implementation could be a really cool direction to explore. thanks!

i think you might be interested in MLIR/IREE: https://github.com/openxla/iree

PartiallyTyped6mo ago· 1 in thread

Any reason for creating a new tensor when accumulating grads over updating the existing one?

Edit: I asked this before I read the design decisions. Reasoning is, as far as I understand, that for simplificity no in-place operations hence accumulating it done on a new tensor.

sueszliOP6mo ago

yeah, exactly. it's for explicit ownership transfer. you always own what you receive, sum it, release both inputs, done. no mutation tracking, no aliasing concerns.

https://github.com/sueszli/autograd.c/blob/main/src/autograd...

i wonder whether there is a more clever way to do this without sacrificing simplicity.

sueszliOP6mo ago

woah, this got way more attention than i expected. thanks a lot.

if you are interested in the technical details, the design specs are here: https://github.com/sueszli/autograd.c/blob/main/docs/design....

if you are working on similar mlsys or compiler-style projects and think there could be overlap, please reach out: https://sueszli.github.io/

1 more reply

j / k navigate · click thread line to collapse

13 comments

10 comments · 3 top-level

spwa46mo ago· 6 in thread

thechao6mo ago

spwa46mo ago

1 more reply

attractivechaos6mo ago

> Is there a compiler-autograd "library"?

Do you mean the method theano is using? Anyway, the performance bottleneck often lies in matrix multiplication or 2D-CNN (which can be reduced to matmul). Compiler autograd wouldn't save much time.

marcthe126mo ago

justinnk6mo ago

I believe Enzyme comes close to what you describe. It works on the LLVM IR level.

https://enzyme.mit.edu

sueszliOP6mo ago

a heap-free implementation could be a really cool direction to explore. thanks!

i think you might be interested in MLIR/IREE: https://github.com/openxla/iree

PartiallyTyped6mo ago· 1 in thread

Any reason for creating a new tensor when accumulating grads over updating the existing one?

Edit: I asked this before I read the design decisions. Reasoning is, as far as I understand, that for simplificity no in-place operations hence accumulating it done on a new tensor.

sueszliOP6mo ago

yeah, exactly. it's for explicit ownership transfer. you always own what you receive, sum it, release both inputs, done. no mutation tracking, no aliasing concerns.

https://github.com/sueszli/autograd.c/blob/main/src/autograd...

i wonder whether there is a more clever way to do this without sacrificing simplicity.

sueszliOP6mo ago

woah, this got way more attention than i expected. thanks a lot.

if you are interested in the technical details, the design specs are here: https://github.com/sueszli/autograd.c/blob/main/docs/design....

if you are working on similar mlsys or compiler-style projects and think there could be overlap, please reach out: https://sueszli.github.io/

1 more reply

j / k navigate · click thread line to collapse