This doesn't matter. Just look at the performance achieved by the cuDNN kernels that back PyTorch: they're dynamically shaped and hit near-peak throughput. For dense linear algebra at the sizes of modern neural networks, optimizing away the loop-bound check won't help much.
> All tensors are lazy, so it can aggressively fuse operations.
This matters. PyTorch teams are trying to implement that now (they have LazyTensor, AITemplate, TorchDynamo), but I'm not sure of the status (it's been tried repeatedly).
> The backend is 10x+ simpler, meaning optimizing one kernel makes everything fast.
The first part of that sentence matters, the second part doesn't. Kernels are already fast and their reuse outside of being fused into each other (which you need a full linear algebra compiler to do) isn't very high. If you make sum fast, you have not made matrix multiplication fast even though MM has a sum in it. It just isn't that easy to compose operations and still hit 80+% of hardware efficiency.
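To make the "fast sum doesn't give you fast matmul" point concrete, here's a rough NumPy sketch (illustrative only, not tinygrad code): you can compose a matmul out of a broadcasted multiply plus a sum, but the naive composition materializes a giant intermediate, so even with both primitives individually fast you're nowhere near a fused, tiled GEMM kernel.

```python
import numpy as np

A = np.random.rand(256, 256).astype(np.float32)
B = np.random.rand(256, 256).astype(np.float32)

# "Composed" matmul: a broadcast MUL followed by a SUM reduce.
# This materializes a 256x256x256 intermediate (~64 MB), so even
# if MUL and SUM are each near peak, the composition is far from
# the efficiency of a fused GEMM.
composed = (A[:, :, None] * B[None, :, :]).sum(axis=1)

# Fused kernel: one call into a tuned BLAS GEMM.
fused = A @ B

assert np.allclose(composed, fused, atol=1e-3)
```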
But it is easier to iterate fast and build a seamless lazy compiler if your backend is simple. You can pattern match more easily and ensure you handle edge cases without insanely complicated things like alias analysis (which PyTorch has to do).
While this is true for most common GEMM-looking ops, if you stray off the beaten path things get slow (odd channel sizes, batch sizes, etc...). Right now in PyTorch, GroupNorm is 2x slower than BatchNorm. There's no fundamental reason, just that the kernels loop over the axes in a less-than-ideal order. Dynamic recompilation lets you change the loop order too, not just handle boundary conditions.
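To make "loop order" concrete, here's a toy sketch (plain Python/NumPy, nothing to do with the actual kernels): both nests below compute the same column sums, but on a row-major array the first walks memory sequentially while the second jumps a full row stride every access, which is exactly the kind of choice a recompiling backend is free to make per shape.

```python
import numpy as np

x = np.arange(12, dtype=np.float64).reshape(3, 4)  # row-major in memory
rows, cols = x.shape

# Order 1: outer loop over rows -> sequential memory walk on row-major data.
col_sums_a = np.zeros(cols)
for i in range(rows):
    for j in range(cols):
        col_sums_a[j] += x[i, j]

# Order 2: outer loop over columns -> strided access, much worse cache
# behavior at real sizes, even though the arithmetic is identical.
col_sums_b = np.zeros(cols)
for j in range(cols):
    for i in range(rows):
        col_sums_b[j] += x[i, j]

assert np.allclose(col_sums_a, col_sums_b)
assert np.allclose(col_sums_a, x.sum(axis=0))
```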
Yea, makes sense. I think there's something to be said for dynamic compilation solving this problem more elegantly than providing tons of hand-tuned kernels (PyTorch is 890MB lmao https://pypi.org/project/torch/#files), but I don't think it's a strict reason for a performance win.
> change the loop order too
Memory layout as well! I'm 100% for dynamic compilation, but I'm claiming that it really finds its stride when you fuse things.
How did you benchmark this? I think there are like 3 or 4 different GN implementations in PyTorch..
    for j in range(10):
        c[j] = a[j] + b[j]
    for j in range(10):
        d[j] = c[j] * 2

becomes

    for j in range(10):
        d[j] = (a[j] + b[j]) * 2

If you're interested, I've looked into symbolic laziness, which allows you to infer correct input sizes even when the constraints happen later. Can be useful for errors. https://dev-discuss.pytorch.org/t/loop-tools-lazy-frontend-e...
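A minimal sketch of how a lazy frontend can do that fusion (toy code, not loop_tool or tinygrad): nothing runs when ops are recorded, so by the time you realize the result you can emit one loop with no intermediate `c` buffer.

```python
import numpy as np

# Toy lazy expression: each node is just a function of the index j.
# Composing nodes composes the functions, so realize() runs ONE fused
# loop computing (a[j] + b[j]) * 2, never materializing c.
class Lazy:
    def __init__(self, fn):
        self.fn = fn

    def __add__(self, other):
        return Lazy(lambda j: self.fn(j) + other.fn(j))

    def __mul__(self, k):
        return Lazy(lambda j: self.fn(j) * k)

    def realize(self, n):
        return np.array([self.fn(j) for j in range(n)])

a = [1, 2, 3]
b = [10, 20, 30]
la = Lazy(lambda j: a[j])
lb = Lazy(lambda j: b[j])

d = ((la + lb) * 2).realize(3)
assert list(d) == [22, 44, 66]
```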
> It's extremely simple, and breaks down the most complex networks into 4 OpTypes:
>
> - UnaryOps operate on one tensor and run elementwise. RELU, LOG, RECIPROCAL, etc...
> - BinaryOps operate on two tensors and run elementwise to return one. ADD, MUL, etc...
> - ReduceOps operate on one tensor and return a smaller tensor. SUM, MAX
> - MovementOps operate on one tensor and move the data around, copy-free with ShapeTracker. RESHAPE, PERMUTE, EXPAND, etc...
>
> But how...where are your CONVs and MATMULs? Read the code to solve this mystery.
Ok, I was curious, so I read the code. The answer is that it represents a MATMUL as a 1x1 CONV. And it lied about CONV, which is a ProcessingOps.CONV and explicitly represented and implemented: https://github.com/geohot/tinygrad/blob/c0050fab8ff0bc667e40... Quite the letdown of figuring out this 'mystery'.

https://github.com/geohot/tinygrad/blob/master/tinygrad/lazy...
https://github.com/facebookresearch/loop_tool/blob/main/pyth...
The idea is basically this: https://news.ycombinator.com/item?id=28883086
# these are the llops your accelerator must implement, along with toCpu
UnaryOps = Enum("UnaryOps", ["NOOP", "NEG", "RELU", "EXP", "LOG", "SIGN", "RECIPROCAL"])
BinaryOps = Enum("BinaryOps", ["ADD", "SUB", "MUL", "DIV", "POW", "CMPEQ"])
ReduceOps = Enum("ReduceOps", ["SUM", "MAX"])
MovementOps = Enum("MovementOps", ["RESHAPE", "PERMUTE", "EXPAND", "FLIP", "STRIDED", "PAD", "SHRINK"])
ProcessingOps = Enum("ProcessingOps", ["CONV"])
https://github.com/geohot/tinygrad/blob/caea34c52996cde2ed46...

There is a MAX but not a MIN? Is that because min(x,y) = -max(-x,-y)? But then why is there a SUB? Why is there a RELU if it's only max(0,x)? Maybe MIN is just too rare to be worth implementing?
From: https://github.com/geohot/tinygrad/blob/master/tinygrad/tens...
def min(self, axis=None, keepdim=False): return -((-self).max(axis=axis, keepdim=keepdim))
All folded together, no slower than MAX.
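You can check the identity directly in plain NumPy, mirroring that tinygrad one-liner:

```python
import numpy as np

x = np.array([[3.0, -1.0, 2.0],
              [0.5,  7.0, -4.0]])

# min(x) == -max(-x) per axis, so a MIN llop is redundant given MAX and NEG.
min_via_max = -((-x).max(axis=1))

assert np.array_equal(min_via_max, x.min(axis=1))
```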
For example, the einsum "b k n p, k -> b k n p" broadcasts the second tensor b to b[None, :, None, None] and does element-wise multiplication. It can be changed to a vector product by writing "b k n p, k -> b n p", which for all intents and purposes is identical to a.transpose(0, 2, 3, 1) @ b.
I can easily recommend the einops package and using einsum; it simplifies things significantly.
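That equivalence is easy to sanity-check in NumPy (small illustrative shapes):

```python
import numpy as np

B, K, N, P = 2, 3, 4, 5
a = np.random.rand(B, K, N, P)
w = np.random.rand(K)

# Broadcasting form: "b k n p, k -> b k n p" (element-wise, keeps k).
broadcast = a * w[None, :, None, None]

# Contracted form: "b k n p, k -> b n p" sums over k ...
contracted = np.einsum("bknp,k->bnp", a, w)

# ... which is the same as moving k last and doing a matvec,
# and the same as reducing the broadcast product over k.
assert np.allclose(contracted, a.transpose(0, 2, 3, 1) @ w)
assert np.allclose(contracted, broadcast.sum(axis=1))
```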
Code looks simple and easy to follow, and I love how the comments are constantly mentioning hardware characteristics, making maxing the hardware the goal. It seems that it’s trying to achieve this by jitting optimal code for the operations at hand rather than hand-optimizing kernels, and betting that the small number of operations will make tuning the codegen tractable.
I haven’t kept up much with what’s happening in ML, but at least in the realm of columnar database engines, interpreting a series of hand-optimized kernels seems to be the dominant approach over compiling a vectorized query plan. Are compilers good enough at optimizing ML operations that specializing on input shape makes a difference over hand-tuned kernels?
thinc.by the creators of spaCy https://github.com/explosion/thinc
nnabla by Sony https://github.com/sony/nnabla
LibNC by Fabrice Bellard https://bellard.org/libnc/
Dlib dnn http://dlib.net/ml.html#add_layer
Also looks like a very cool project. Cherry Three (5nm tapeout)
=====
* Support DMA over PCI-E 4.0. 32 GB/s
* 16 cores
* 8M elements in on board RAM of each core (288 MB SRAM on chip)
* Shared ~16GB GDDR6 between cores. Something like 512 GB/s
* 16x 32x32x32 matmul = 32768 mults
* 1 PFLOP @ 1 ghz (finally, a petaflop chip)
* Target 300W, power savings from process shrink
* This card should be on par with a DGX A100 and sell for $2000
* At this point, we have won.
* The core Verilog is open source, all the ASIC speed tricks are not.
* Cherry will dominate the market for years to come, and will be in every cloud.
* Sell the company for $1B+ to anyone but NVIDIA
[0] https://github.com/geohot/tinygrad/blob/master/accel/cherry/...

https://github.com/geohot/tinygrad/blob/master/examples/stab...
    def summ(i, v): return i + v
    x = jax.lax.fori_loop(0, 100, summ, 5)

A for loop in TinyGrad or PyTorch looks like regular Python:

    x = 5
    for i in range(0, 100):
        x += 1
By the way, PyTorch also has JIT.

    >>> import jax
    >>> def a(y):
    ...     x = 0
    ...     for i in range(5):
    ...         x += y
    ...     return x
    ...
    >>> a(5)
    25
    >>> a_jit = jax.jit(a)
    >>> a_jit(5)
    DeviceArray(25, dtype=int32, weak_type=True)

Tinygrad is like a very, very lean PyTorch with a different philosophy -- it intends to keep the codebase and API surface very small and focus most of its energy on optimizing the way the output neural net runs on physical hardware.
The author, George Hotz, has observed in the last few years that neural net performance is hindered by lack of optimization here, particularly around memory accesses.
But otherwise very cool project :)
https://github.com/geohot/tinygrad/blob/master/.github/workf...
From my experience with game engines, it often turns out to be a bad idea (for performance and maintainability) to mix C/C++ and Lua or C#.
Another benefit to interactivity is when exploring/using bad code. In academia, you'll often be importing the worst and least-well-documented code you've ever seen.
Being able to interactively experiment with someones 500-line 0-documentation function is often a better path to understanding than directly reading the code.
But Python speed is one of the main motivations for a JS/TS based ML lib I’m working on: https://github.com/facebookresearch/shumai
It used to nail simplicity, but now it's a mess IMO.
I wouldn't say that 7500 stars is almost 9000 stars ;)
They are not over 9k yet, but closing in.
anyway, I just gave them my star ;)
If you care exclusively about minimalism, why not limit yourself to the Meijer-G function (or some other general-purpose alternative)?
It also doesn't support bfloat16 so is doomed to be 2x slower.
Yeah, not
> Considering the code style
I mean it is possible to read it, but I would not say it is optimized for it. Which I suppose betrays the goal.
They are not always the best tool for the job. There are lots of other ML techniques, such as SVMs, naive Bayes, k-nearest neighbors, decision trees, logistic regression, random forests, etc., that nobody is using because they lack the hype factor.
If something lacks keywords like neural network, deep learning, or reinforcement learning, then it is deemed not cool.
1. Collaborative filtering based on a sparse dataset of implicit interactions.
2. Many time series applications.
> also, DL generally requires more data whereas you can get by with ML on less data if you have domain knowledge
Sentiment analysis, classification.
AI is not a buzzword.
Yes and we are using NN for everything.
Yet, I had made this bashing comment about the Bible. IMHO anyone can believe in whatever they want: Christianity, Islam, anything. I (and I would say every friend of mine) don't care about what you believe in, but if you publicly preach some religion, prepare to be made fun of, or take a stand and try to defend your religion with arguments. But no blind faith here.
(Personally, if I like some religion it's Shinto.)
If God wanted, He could make himself apparent to everyone. Clearly that isn't the case; there is room to doubt or to believe no matter how smart or accomplished you are.
Actually I have a local OBS setup to record myself; instead of streaming I do recordings for my own inspection. The important part is to do the inspection afterwards. It works wonders.
[1] https://youtu.be/GXy5eVwnL_Q
[2] https://m.youtube.com/watch?v=Cb2KwcnDKrk
[3] no joke, 19.5h stream https://www.youtube.com/watch?v=xc0jGZYFQLQ
It's probably the fact he has an audience. I can't speak for him, but that'd sure as hell light a fire under my ass—or at least significantly reduce procrastination.
Also, I find in periods where I've worked ~17 hours straight that the tiredness calms my brain to the point I'm normal and makes focus easy, albeit difficult in a different way due to fatigue. There's a weird drone zone there that's nice. Not something to make a habit of, though.
I used to watch scanlime do 8 hour sw/hw sessions, really hope she comes back soon.
I have watched Brandon Falk do 10-11 hours of Rust programming, although he sometimes takes a break to play games for 4-5 hours (while on stream).
External motivation of having an audience would also help
If you watch Hotz's streams he takes small breaks to talk with chat and to meme around (just like everyone else during their work days) and he eats lunch and whatever (again just like everyone else).
What I'm trying to say is that Hotz isn't a superman on Adderall; he is just working on stuff he is excited about.
You get unlimited free storage of your streams for your personal use that way without the need for any local storage at all.
I haven't come across any limits or downsides to this yet but happy to be corrected.
My plan is to make this obs-ndi plugin work on ubuntu, so I will be able to record on ubuntu to take the load off of mac which is my primary laptop.
PS. I forgot to read the obs-ndi instructions properly; it works OK, so now I can delegate recording to the second laptop.
There is a problem, however - I work on a Mac M1 and OBS recordings take a full 4 out of 8 cores, so I actually only record when I notice I am starting to procrastinate. I remove all recordings afterwards to save space. OBS is turned on all the time, though.
I wanted to use obs-ndi to combine output from another laptop, but I have some issues with it, so I just record the Mac atm. I also have a powerful desktop on the side and can ssh between all of these by name, but the desktop is noisy so it's off most of the time. There's also a Raspberry Pi with a simple script that lets me turn the desktop on remotely via a Wake-on-LAN UDP packet; DNS is handled via https://www.noip.com, which my router has an integration with - though I've actually never used it. I did this setup to justify purchasing the powerful desktop in the first place :) humble brag, I know.
Here is screenshot of obs recording with sneak peak of my room https://imgur.com/a/m92R7Bx
I think it's just hyperfocus.
This will get downvoted, but reading the comments here I don't understand the cult/respect for him. Playing alongside the most successful CTF team ever (PPP), he won DEF CON two times. He made a startup with funding that makes a cool 'niche' product.
I just think a guy like Chris Lattner or Dave Cutler who made so much impact on real computing deserve so much more respect, but I guess that the norm here is to admire this guy.
And I think you're downplaying the achievements of Comma AI -- it may still be somewhat niche, but its product is better than Tesla Autopilot for highway driving (they aren't there on city driving / FSD yet), all with an absolutely tiny team.
> openpilot upgrades your Toyota Highlander Hybrid with automated lane centering at all speeds, and adaptive cruise control that automatically resumes from a stop.
Both are annoying artificial limitations Toyota put in, presumably to avoid abuse by inattentive drivers.
I mean it can't change lanes. What does it do exactly?