This doesn't matter. Just look at the performance achieved by the cuDNN kernels that back PyTorch: they're dynamically shaped and hit near-peak throughput. For dense linear algebra at the sizes of modern neural networks, optimizing away the loop-bound check won't help much.
> All tensors are lazy, so it can aggressively fuse operations.
This matters. PyTorch teams are trying to implement that now (they have LazyTensor, AITemplate, TorchDynamo), but I'm not sure of the status (it's been tried repeatedly).
> The backend is 10x+ simpler, meaning optimizing one kernel makes everything fast.
The first part of that sentence matters, the second part doesn't. Kernels are already fast and their reuse outside of being fused into each other (which you need a full linear algebra compiler to do) isn't very high. If you make sum fast, you have not made matrix multiplication fast even though MM has a sum in it. It just isn't that easy to compose operations and still hit 80+% of hardware efficiency.
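To make the "fast sum doesn't give you fast matmul" point concrete, here's a rough NumPy sketch (illustrative only, not tinygrad code): you can compose a matmul out of a broadcasted multiply plus a sum, but the naive composition materializes a giant intermediate, so even with both primitives individually fast you're nowhere near a fused, tiled GEMM kernel.

```python
import numpy as np

A = np.random.rand(256, 256).astype(np.float32)
B = np.random.rand(256, 256).astype(np.float32)

# "Composed" matmul: a broadcast MUL followed by a SUM reduce.
# This materializes a 256x256x256 intermediate (~64 MB), so even
# if MUL and SUM are each near peak, the composition is far from
# the efficiency of a fused GEMM.
composed = (A[:, :, None] * B[None, :, :]).sum(axis=1)

# Fused kernel: one call into a tuned BLAS GEMM.
fused = A @ B

assert np.allclose(composed, fused, atol=1e-3)
```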
But it is easier to iterate fast and build a seamless lazy compiler if your backend is simple. You can pattern match more easily and ensure you handle edge cases without insanely complicated things like alias analysis (which PyTorch has to do).
While this is true for most common GEMM-looking ops, if you stray off the beaten path things get slow (odd channel sizes, batch sizes, etc...). Right now in PyTorch, GroupNorm is 2x slower than BatchNorm. There's no fundamental reason, just that the kernels loop over the axes in a less-than-ideal order. Dynamic recompilation lets you change the loop order too, not just handle boundary conditions.
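To make "loop order" concrete, here's a toy sketch (plain Python/NumPy, nothing to do with the actual kernels): both nests below compute the same column sums, but on a row-major array the first walks memory sequentially while the second jumps a full row stride every access, which is exactly the kind of choice a recompiling backend is free to make per shape.

```python
import numpy as np

x = np.arange(12, dtype=np.float64).reshape(3, 4)  # row-major in memory
rows, cols = x.shape

# Order 1: outer loop over rows -> sequential memory walk on row-major data.
col_sums_a = np.zeros(cols)
for i in range(rows):
    for j in range(cols):
        col_sums_a[j] += x[i, j]

# Order 2: outer loop over columns -> strided access, much worse cache
# behavior at real sizes, even though the arithmetic is identical.
col_sums_b = np.zeros(cols)
for j in range(cols):
    for i in range(rows):
        col_sums_b[j] += x[i, j]

assert np.allclose(col_sums_a, col_sums_b)
assert np.allclose(col_sums_a, x.sum(axis=0))
```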
Yea, makes sense. I think there's something to be said for dynamic compilation solving this problem more elegantly than providing tons of hand-tuned kernels (PyTorch is 890MB lmao https://pypi.org/project/torch/#files), but I don't think it's a strict reason for a performance win.
> change the loop order too
Memory layout as well! I'm 100% for dynamic compilation, but I'm claiming that it really finds its stride when you fuse things.
How did you benchmark this? I think there are like 3 or 4 different GN implementations in PyTorch..
    for j in range(10):
        c[j] = a[j] + b[j]
    for j in range(10):
        d[j] = c[j] * 2

becomes

    for j in range(10):
        d[j] = (a[j] + b[j]) * 2

If you're interested, I've looked into symbolic laziness, which allows you to infer correct input sizes even when the constraints happen later. Can be useful for errors. https://dev-discuss.pytorch.org/t/loop-tools-lazy-frontend-e...
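A minimal sketch of how a lazy frontend can do that fusion (toy code, not loop_tool or tinygrad): nothing runs when ops are recorded, so by the time you realize the result you can emit one loop with no intermediate `c` buffer.

```python
import numpy as np

# Toy lazy expression: each node is just a function of the index j.
# Composing nodes composes the functions, so realize() runs ONE fused
# loop computing (a[j] + b[j]) * 2, never materializing c.
class Lazy:
    def __init__(self, fn):
        self.fn = fn

    def __add__(self, other):
        return Lazy(lambda j: self.fn(j) + other.fn(j))

    def __mul__(self, k):
        return Lazy(lambda j: self.fn(j) * k)

    def realize(self, n):
        return np.array([self.fn(j) for j in range(n)])

a = [1, 2, 3]
b = [10, 20, 30]
la = Lazy(lambda j: a[j])
lb = Lazy(lambda j: b[j])

d = ((la + lb) * 2).realize(3)
assert list(d) == [22, 44, 66]
```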
> It's extremely simple, and breaks down the most complex networks into 4 OpTypes:
>
> - UnaryOps operate on one tensor and run elementwise. RELU, LOG, RECIPROCAL, etc...
> - BinaryOps operate on two tensors and run elementwise to return one. ADD, MUL, etc...
> - ReduceOps operate on one tensor and return a smaller tensor. SUM, MAX
> - MovementOps operate on one tensor and move the data around, copy-free with ShapeTracker. RESHAPE, PERMUTE, EXPAND, etc...
>
> But how...where are your CONVs and MATMULs? Read the code to solve this mystery.
Ok, I was curious, so I read the code. The answer is that it represents a MATMUL as a 1x1 CONV. And it lied about CONV, which is a ProcessingOps.CONV and explicitly represented and implemented: https://github.com/geohot/tinygrad/blob/c0050fab8ff0bc667e40... Quite the letdown of figuring out this 'mystery'.

https://github.com/geohot/tinygrad/blob/master/tinygrad/lazy...
https://github.com/facebookresearch/loop_tool/blob/main/pyth...
The idea is basically this: https://news.ycombinator.com/item?id=28883086
# these are the llops your accelerator must implement, along with toCpu
UnaryOps = Enum("UnaryOps", ["NOOP", "NEG", "RELU", "EXP", "LOG", "SIGN", "RECIPROCAL"])
BinaryOps = Enum("BinaryOps", ["ADD", "SUB", "MUL", "DIV", "POW", "CMPEQ"])
ReduceOps = Enum("ReduceOps", ["SUM", "MAX"])
MovementOps = Enum("MovementOps", ["RESHAPE", "PERMUTE", "EXPAND", "FLIP", "STRIDED", "PAD", "SHRINK"])
ProcessingOps = Enum("ProcessingOps", ["CONV"])
https://github.com/geohot/tinygrad/blob/caea34c52996cde2ed46...

There is a MAX but not a MIN? Is that because min(x,y) = -max(-x,-y)? But then why is there a SUB? Why is there a RELU if it's only max(0,x)? Maybe MIN is just too rare to be worth implementing?
From: https://github.com/geohot/tinygrad/blob/master/tinygrad/tens...
def min(self, axis=None, keepdim=False): return -((-self).max(axis=axis, keepdim=keepdim))
All folded together, no slower than MAX.
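You can check the identity directly in plain NumPy, mirroring that tinygrad one-liner:

```python
import numpy as np

x = np.array([[3.0, -1.0, 2.0],
              [0.5,  7.0, -4.0]])

# min(x) == -max(-x) per axis, so a MIN llop is redundant given MAX and NEG.
min_via_max = -((-x).max(axis=1))

assert np.array_equal(min_via_max, x.min(axis=1))
```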
For example, the einsum "b k n p, k -> b k n p" broadcasts the second tensor b to b[None, :, None, None] and does element-wise multiplication. It can be changed to a vector product by writing "b k n p, k -> b n p", which for all intents and purposes is identical to a.transpose(0, 2, 3, 1) @ b.
I can easily recommend the einops package and using einsum; it simplifies things significantly.
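That equivalence is easy to sanity-check in NumPy (small illustrative shapes):

```python
import numpy as np

B, K, N, P = 2, 3, 4, 5
a = np.random.rand(B, K, N, P)
w = np.random.rand(K)

# Broadcasting form: "b k n p, k -> b k n p" (element-wise, keeps k).
broadcast = a * w[None, :, None, None]

# Contracted form: "b k n p, k -> b n p" sums over k ...
contracted = np.einsum("bknp,k->bnp", a, w)

# ... which is the same as moving k last and doing a matvec,
# and the same as reducing the broadcast product over k.
assert np.allclose(contracted, a.transpose(0, 2, 3, 1) @ w)
assert np.allclose(contracted, broadcast.sum(axis=1))
```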
Code looks simple and easy to follow, and I love how the comments are constantly mentioning hardware characteristics, making maxing the hardware the goal. It seems that it’s trying to achieve this by jitting optimal code for the operations at hand rather than hand-optimizing kernels, and betting that the small number of operations will make tuning the codegen tractable.
I haven’t kept up much with what’s happening in ML, but at least in the realm of columnar database engines, interpreting a series of hand-optimized kernels seems to be the dominant approach over compiling a vectorized query plan. Are compilers good enough at optimizing ML operations that specializing on input shape makes a difference over hand-tuned kernels?
thinc.by the creators of spaCy https://github.com/explosion/thinc
nnabla by Sony https://github.com/sony/nnabla
LibNC by Fabrice Bellard https://bellard.org/libnc/
Dlib dnn http://dlib.net/ml.html#add_layer
Also looks like a very cool project. Cherry Three (5nm tapeout)
=====
* Support DMA over PCI-E 4.0. 32 GB/s
* 16 cores
* 8M elements in on board RAM of each core (288 MB SRAM on chip)
* Shared ~16GB GDDR6 between cores. Something like 512 GB/s
* 16x 32x32x32 matmul = 32768 mults
* 1 PFLOP @ 1 ghz (finally, a petaflop chip)
* Target 300W, power savings from process shrink
* This card should be on par with a DGX A100 and sell for $2000
* At this point, we have won.
* The core Verilog is open source, all the ASIC speed tricks are not.
* Cherry will dominate the market for years to come, and will be in every cloud.
* Sell the company for $1B+ to anyone but NVIDIA
[0] https://github.com/geohot/tinygrad/blob/master/accel/cherry/...

https://github.com/geohot/tinygrad/blob/master/examples/stab...
    def summ(i, v): return i + v
    x = jax.lax.fori_loop(0, 100, summ, 5)

A for loop in TinyGrad or PyTorch looks like regular Python:

    x = 5
    for i in range(0, 100):
        x += 1
By the way, PyTorch also has JIT.

    >>> import jax
    >>> def a(y):
    ...     x = 0
    ...     for i in range(5):
    ...         x += y
    ...     return x
    ...
    >>> a(5)
    25
    >>> a_jit = jax.jit(a)
    >>> a_jit(5)
    DeviceArray(25, dtype=int32, weak_type=True)

Tinygrad is like a very, very lean PyTorch with a different philosophy -- it intends to keep the codebase and API surface very small and focus most of its energy on optimizing the way the output neural net runs on physical hardware.
The author, George Hotz, has observed in the last few years that neural net performance is hindered by lack of optimization here, particularly around memory accesses.
But otherwise very cool project :)
https://github.com/geohot/tinygrad/blob/master/.github/workf...
From my experience with game engines, it often turns out to be a bad idea (for performance and maintainability) to mix C/C++ and Lua or C#.
Another benefit to interactivity is when exploring/using bad code. In academia, you'll often be importing the worst and least-well-documented code you've ever seen.
Being able to interactively experiment with someones 500-line 0-documentation function is often a better path to understanding than directly reading the code.
But Python speed is one of the main motivations for a JS/TS based ML lib I’m working on: https://github.com/facebookresearch/shumai
It used to nail simplicity, but now it's a mess IMO.
I wouldn't say that 7500 stars is almost 9000 stars ;)
They are not over 9k yet, but closing in.
anyway, I just gave them my star ;)
If you care exclusively about minimalism, why not limit yourself to the Meijer-G function (or some other general-purpose alternative)?
It also doesn't support bfloat16 so is doomed to be 2x slower.
Yeah, not
> Considering the code style
I mean it is possible to read it, but I would not say it is optimized for it. Which I suppose betrays the goal.
They are not always the best tool for the job. There are lots of other ML techniques, such as SVMs, naive Bayes, k-nearest neighbors, decision trees, logistic regression, random forests, etc., that nobody is using because they lack the hype factor.
If something lacks keywords like neural network, deep learning, or reinforcement learning, then it is deemed not cool.
1. Collaborative filtering based on a sparse dataset of implicit interactions.
2. Many time series applications.
> also, DL generally requires more data whereas you can get by with ML on less data if you have domain knowledge
Sentiment analysis, classification.
AI is not a buzzword.
Yes and we are using NN for everything.
Yet, I had made this bashing comment about the Bible. IMHO anyone can believe in whatever they want: Christianity, Islam, anything. I (and I would say every friend of mine) don't care about what you believe in, but if you publicly preach some religion, prepare to be made fun of, or take a stand and try to defend your religion with arguments. But no blind faith here.
(Personally, if I like some religion it's Shinto.)
If God wanted, He could make himself apparent to everyone. Clearly that isn't the case; there is room to doubt or to believe no matter how smart or accomplished you are.
Actually I have a local OBS setup to record myself; instead of streaming I do recordings for my own inspection. The important part is to do the inspection afterwards. It works wonders.
[1] https://youtu.be/GXy5eVwnL_Q
[2] https://m.youtube.com/watch?v=Cb2KwcnDKrk
[3] no joke, 19.5h stream https://www.youtube.com/watch?v=xc0jGZYFQLQ
It's probably the fact he has an audience. I can't speak for him, but that'd sure as hell light a fire under my ass—or at least significantly reduce procrastination.
Also, I find in periods where I've worked ~17 hours straight that the tiredness calms my brain to the point I'm normal and makes focus easy, albeit difficult in a different way due to fatigue. There's a weird drone zone there that's nice. Not something to make a habit of, though.
I used to watch scanlime do 8 hour sw/hw sessions, really hope she comes back soon.
I have watched Brandon Falk do 10-11 hours of Rust programming, although he sometimes takes a break to play games for 4-5 hours (while on stream).
External motivation of having an audience would also help
If you watch Hotz's streams he takes small breaks to talk with chat and to meme around (just like everyone else during their work days) and he eats lunch and whatever (again just like everyone else).
What I'm trying to say is that Hotz isn't a superman on Adderall; he is just working on stuff he is excited about.
You get unlimited free storage of your streams for your personal use that way without the need for any local storage at all.
I haven't come across any limits or downsides to this yet but happy to be corrected.
My plan is to make this obs-ndi plugin work on ubuntu, so I will be able to record on ubuntu to take the load off of mac which is my primary laptop.
PS. I forgot to read the obs-ndi instructions properly; it works OK, so now I can delegate recording to the second laptop.
There is a problem, however - I work on a Mac M1 and OBS recordings take a full 4 out of 8 cores, so I actually only record when I notice I am starting to procrastinate. I remove all recordings afterwards to save space. OBS is turned on all the time, though.
I wanted to use obs-ndi to combine output from another laptop, but I have some issues with it, so I just record the Mac atm. I also have a powerful desktop on the side and can ssh between all of these by name, but the desktop is noisy so it's off most of the time. There's also a Raspberry Pi with a simple script that lets me turn the desktop on remotely via a Wake-on-LAN UDP packet; DNS is handled via https://www.noip.com, which my router has an integration with - though I've actually never used it. I did this setup to justify purchasing the powerful desktop in the first place :) humble brag, I know.
Here is screenshot of obs recording with sneak peak of my room https://imgur.com/a/m92R7Bx
I think it's just hyperfocus.
This will get downvoted, but reading the comments here I don't understand the cult/respect for him. Playing alongside the most successful CTF team ever (PPP), he won DEF CON two times. He made a startup with funding that makes a cool 'niche' product.
I just think a guy like Chris Lattner or Dave Cutler who made so much impact on real computing deserve so much more respect, but I guess that the norm here is to admire this guy.
And I think you're downplaying the achievements of Comma AI -- it may still be somewhat niche, but its product is better than Tesla Autopilot for highway driving (they aren't there on city driving / FSD yet), all with an absolutely tiny team.
> openpilot upgrades your Toyota Highlander Hybrid with automated lane centering at all speeds, and adaptive cruise control that automatically resumes from a stop.
Both are annoying artificial limitations Toyota put in, presumably to avoid abuse by inattentive drivers.
I mean it can't change lanes. What does it do exactly?