This book explains and executes every single line of code interactively, from low-level operations to high-level networks that do everything automatically. The code is built on the state-of-the-art, high-performance operations of oneDNN (Intel, CPU) and cuDNN (CUDA, GPU). It is very concise, readable, and understandable by humans.
https://aiprobook.com/deep-learning-for-programmers/
Here's the open source library built throughout the book:
https://github.com/uncomplicate/deep-diamond
Some chapters from the beginning of the book are available on my blog, as a tutorial series:
which I suppose can best be described as Lisp and Python having a baby. It was immense fun to code neural networks from scratch in it. I hope Clojure can find a bigger place in the world of ML.
[1] http://torch.ch/
For the high-performance parts of your code, a subset of Lush would generate C code and compile it. I imagined that this is what it was like to write the first version of C++, the one that generated C code.
Love it! :D What better way to define a neural network in code than an S-expression?
Python:
    model = Sequential()
    model.add(Conv2D(32, kernel_size=(3, 3),
                     activation='relu',
                     input_shape=(28, 28, 1)))
Clojure:
    (defonce net-bp
      (network (desc [128 1 28 28] :float :nchw)
Which one is more readable? Looking at the Clojure code, I see 128 1 28 28 thrown at me; without digging into the documentation I have no idea what's happening. You’re throwing away all the names of the arguments and using arbitrary words like “conv” to represent operations.
This is typical bad Clojure in my experience; write once, forget wtf the magic was, throw away and rewrite it again later.
Clojure doesn’t have to be incomprehensible arcane magic that does everything in 10 lines.
The more complex the code, the more important it is that what you do is clear and clearly documented.
Don’t write a 1 line regex to solve a complicated problem; it’s the wrong tool for that job, no matter how smart your substring matches are.
You don’t win a prize for making unmaintainable code.
I similarly think the goal of being as concise as possible in ML code is deeply misguided.
Regarding the magic, I believe you haven't read my writings related to this. Exactly the opposite - there is no magic other than usual Clojure-fu, which I explain in a layered way.
But it's difficult to reply to your critique precisely, because you haven't given any example of an approach that would be good Clojure. OK, give me an example of how you would do it in a comprehensible way (if what I provide is incomprehensible). You don't have to actually implement it. Show a non-working alternative. What would it look like?
Generate fully vectorized, stand-alone, human-readable C99 code for neural net inference, and understand exactly what's happening. For example, watch the code run with Linux's perf top and see the relative costs of each layer of the computation. Total transparency, no dependencies outside the C POSIX library
The generated code is like
__m512i wfs16 = _mm512_castsi256_si512(_mm512_cvtps_ph(wf25, _MM_FROUND_TO_NEAREST_INT|_MM_FROUND_NO_EXC));
wfs16 = _mm512_inserti64x4(wfs16, _mm512_cvtps_ph(wf26, _MM_FROUND_TO_NEAREST_INT|_MM_FROUND_NO_EXC), 1);
_mm512_mask_storeu_epi32(wfPtr1+230400+38400*i5+768*c2+128*k1+64*m2+16*f3, 3855, wfs16);
_mm512_mask_storeu_epi32(wfPtr1+345584+38400*i5+768*c2+128*k1+64*m2+16*f3, 61680, wfs16);
(which is a set of 4 lines that appear in the middle of an ~800 line function). That's not "human readable".
Sure you can use asan or gdb, but if gdb profiles slowly, what can you do? You're still at the mercy of the code generator to be able to optimize things.
I agree, if you don't know anything about how convolution is implemented (filter packing, data packing, matrix multiplication, sum unpacking), you could be lost. But it's very shallow compared to a JIT or CUDA library scheme, and a knowledgeable ML performance engineer would have no difficulty
The inference function (at the end of the C file) is a series of blocks, each block corresponding to a convolution or other complex operation. It's straightforward to see which, by looking at where the weights come from (a field in a struct that has the same name as the layer in your graph)
If you use perf top (for example) you can see which convolution was most expensive, and why. Does the shape of the tensor produce many small partial blocks around the edge, so the packing is inefficient (a lot of tile overhang), for example? You can see that by glancing at the code and seeing that there are many optimized blocks around the edges. As a rule, if NN-512 generates small code for a tensor (few edge cases) you have chosen an efficient tensor shape, with respect to the tile
Or you might find that batch normalization is being done at inference time (as in DenseNet), instead of being integrated into the convolution weights (as in ResNet), because there's fanout from the source and a ReLU in between. You can see that easily in the generated code (the batch norm fmadd instructions will appear in the packing or unpacking code)
Is the matrix multiplication slow because there are too few channels per group (as in ResNeXt)? Easy to see in perf, make your groups bigger. Are you using an inefficient filter shape, so we have to fall back to a slower general purpose convolution? You can easily see whether Winograd or Fourier was used
And so on
1. Any particular reason you chose to avoid GPUs?
2. Did you benchmark your code's performance against GPU-centric codes (ideally for the same problem and problem-size)?
For example, a Skylake-X cloud compute instance costs $10 per CPU-core per month at Vultr, and the NN-512 generated code does about 18 DenseNet121 inferences per CPU-core per second (in series, not batched)
In contrast, GPU cloud compute is almost unbelievably expensive. Even Linode charges $1000 per month, or $1.50 per hour (look at the GPU plans: https://www.linode.com/pricing/#row--compute)
As AVX-512 becomes better supported by Intel and AMD chips, it becomes more attractive as an alternative to expensive GPU instances for workloads with small amounts of inference mixed with other computation
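As a rough worked example of the CPU side of that comparison, the two figures quoted above can be turned into a throughput per dollar. This is only a back-of-the-envelope sketch: the 30-day month and the fully utilized core are assumptions added here, and no GPU throughput figure is given in the thread, so no GPU comparison is attempted.

```python
# Figures quoted above: $10 per CPU-core per month, about 18 DenseNet121
# inferences per CPU-core per second.
DOLLARS_PER_CORE_MONTH = 10.0
INFERENCES_PER_CORE_SECOND = 18

# Assumptions (not from the comment): a 30-day month and a core that is
# busy with inference 100% of the time.
SECONDS_PER_MONTH = 30 * 24 * 60 * 60

inferences_per_month = INFERENCES_PER_CORE_SECOND * SECONDS_PER_MONTH
inferences_per_dollar = inferences_per_month / DOLLARS_PER_CORE_MONTH

print(f"{inferences_per_month:,} inferences per core-month")      # 46,656,000
print(f"{inferences_per_dollar:,.0f} inferences per dollar")      # 4,665,600
```

A real workload would sit well below 100% utilization, so these numbers are an upper bound under the stated assumptions.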
For example, I am not aware that one can currently use your library to implement Wavenet, other audio generative models like Wavegrad, or transformers.
Keep up the good work.
Hmm, my experience is the opposite. When I used TensorFlow, there was no way I could figure out why something was slow or required huge amounts of memory. All I had was a gigantic black box.
Meanwhile, in PyTorch, all I have to do is run it with CUDA_LAUNCH_BLOCKING=1, and it will give me an accurate picture of exactly how many milliseconds each line takes! (Just print the current time before/after the line.) With nvprof it will even tell you which CUDA kernels are executing.
* Disclaimer: Haven't dabbled in ML for ~a year, so my view might be outdated now.
That was difficult to reason about.
I will agree with alevskaya that the compilation times were an issue in my particular research ten years ago. I was trying to build neural-networks for parsing that were created at run-time. Since each parse tree had a different computation graph, I was not able to use Theano since it required compiling every single type of parse tree computation graph it encountered during training.
[edit if you want more details: There is really interesting old-school work called "Recursive distributed representations" and later "Labelling recursive auto-associative memory" that used auto-encoders to consume a variable length sequence, e.g. text string, in a sequential fashion. My work with Yoshua Bengio---incomplete---was based upon the idea of doing unsupervised binary parsing of sentences using a hierarchical RAAM-style approach: At any given point in time, greedily find the two adjacent tokens that could be most easily compressed into one token with low reconstruction error. However, once you apply this recursively and end up with auto-encoding binary parse trees, you end up with a variety of different computation graphs, each of which required separate compilation.]
Like many things from Google, I always had the impression that the library, while better than alternatives at the time, is too tailored to Google use cases. And if you fall outside of them, bad luck.
Still, at work we find it easier to deploy and interoperate with other tools than PyTorch. Hell, we have a guy working in PyTorch who converts his work to ONNX so that we can then connect it to some tooling we already have from back when TF was our only backend.
Could there be a better way? Perhaps. But we have to ship models and TF "just* works" (with a big asterisk, yeah).
There was existing TF 1.0 code I was trying to extract gradients through (nsynth-wavenet). I spent over 8 hours on it unsuccessfully; I asked for help from a friend at Google who worked on TF and he couldn't figure it out either. I emailed the original author of the code and he acknowledged that he didn't know how to do it either, and he had an old notebook he could dig up that kinda would work with a lot of fixes.
The idea being that PyTorch can just be a high-level API executing lower-level TensorFlow under the hood.
I wonder if it could be used for something crazy, e.g. setting up a graph that generates shadertoy-like images on the GPU.
But this is a completely different kind of graph. The graphs being discussed here are differentiable DAGs of mathematical computations.
CUDA is NVIDIA-only, and vendor lock-in is bad for end users. CUDA, OpenCL, and Vulkan all require large runtimes that are not included in the OS, so software vendors like me need to redistribute and support them; I tend to avoid deploying libraries when I can.
> But JAX even lets you just-in-time compile your own Python functions into XLA-optimized kernels...
> This is the niche that Theano (or rather, Theano-PyMC/Aesara) fills that other contemporary tensor computation libraries do not: the promise is that if you take the time to specify your computation up front and all at once, Theano can optimize the living daylight out of your computation - whether by graph manipulation, efficient compilation or something else entirely - and that this is something you would only need to do once.
It is exactly what JAX does. There is a computational graph in JAX (it's encoded in XLA and specified with their NumPy-like syntax); it is built once, optimized, and then runs on the GPU.
Some specific questions:
> They provide ways of specifying and building computational graphs
Is the article talking about neural networks? As in, arrays of arrays of weights, where input values go through successive layers, and for each layer the same instruction is applied to some values with the respective weight?
Or is it talking about a graph as in, a functional graph, where manually written functions call other manually written functions? (hence why a later paragraph talks about if-else statements and for loops)
> Almost all tensor computation libraries support autodifferentiation in some capacity (either forward-mode, backward-mode, or both).
What are those?
From the Wikipedia article, it sounds like autodifferentiation basically means running f(x+dx)-f(x), but if there are entire frameworks handling it, then there's probably something fancier going on.
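For what it's worth, the difference between the finite-difference guess and what autodifferentiation does can be shown in a few lines. The sketch below is a toy forward-mode implementation using dual numbers; it is not how any of the libraries discussed here are implemented, and the names `Dual`, `forward_mode_derivative` and `finite_difference` are made up for illustration. The point is that each operation propagates an exact derivative alongside its value, so no step size dx is involved.

```python
import math

class Dual:
    """A number value + deriv * eps, where eps**2 == 0 (toy forward-mode AD)."""
    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)
    __rmul__ = __mul__

def sin(x):
    """sin that works on plain floats and on Dual numbers (chain rule)."""
    if isinstance(x, Dual):
        return Dual(math.sin(x.value), math.cos(x.value) * x.deriv)
    return math.sin(x)

def f(x):
    return x * x * x + 2 * x + sin(x)

def forward_mode_derivative(func, x):
    # Seed the input with derivative 1 and read the derivative of the output.
    return func(Dual(x, 1.0)).deriv

def finite_difference(func, x, dx=1e-6):
    # The approximation guessed at above: (f(x + dx) - f(x)) / dx.
    return (func(x + dx) - func(x)) / dx

exact = 3 * 1.5 ** 2 + 2 + math.cos(1.5)      # derivative worked out by hand
print(forward_mode_derivative(f, 1.5) - exact)  # agrees to rounding error
print(finite_difference(f, 1.5) - exact)        # small but visible truncation error
```

Reverse mode (backpropagation) applies the same chain rule but walks the recorded computation from outputs back to inputs, which is cheaper when there are many inputs and one output, as with a loss function over network weights.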
> According to the JAX quickstart, JAX bills itself as “NumPy on the CPU, GPU, and TPU, with great automatic differentiation for high-performance machine learning research”. Hence, its focus is heavily on autodifferentiation.
The earlier description makes it sound like JAX does some cutting-edge compilation stuff to transform semi-arbitrary functions (with ifs and elses and loops and stuff) into a function that returns its derivative.
So how can that stuff run on the GPU? It sounds like there would be a lot of branching code.
And how is that related to machine learning / neural networks?
It seems very active right now.
Here is some further information: https://pymc-devs.medium.com/the-future-of-pymc3-or-theano-i...
I haven't really found references to its new name "Aesara".
Apparently, the main new feature for Theano will be the JAX backend.
I wonder, though. My experience from working with Theano, including deep in the internals (trying to get further graph optimizations on theano.scan):
- Some parts of the code are not really clean.
- The code is extremely complex and hard to follow. See this: https://github.com/pymc-devs/Theano-PyMC/blob/master/theano/...
- This also made it very complicated to perform optimizations on the graph. See this: https://github.com/pymc-devs/Theano-PyMC/blob/master/theano/...
- In this specific case, it's also a problem of the API: theano.scan would return the whole sequence. But if you only need the last entry, i.e. y[-1], there is a very complicated optimization rule which checks for that. Basically many optimizations around theano.scan are very complicated because of that.
- Here is one attempt for some optimization on theano.scan: https://github.com/Theano/Theano/pull/3640
- The graph building and especially the graph optimizations are very slow. This is because all the logic is done in pure Python. If you have big graphs, even just building up the graph can take time, and the optimization passes will take much longer. This was one of the most annoying problems when working with Theano. The startup time to build the graph could easily be several minutes. I also doubt that you can optimize this very much in pure Python -- I think you would need to reimplement that in C++ or so. When switching to TensorFlow, building the graph felt almost instant in comparison. I wonder if they have any plans on this in this fork.
- On the other side, the optimizations on the graph are quite nice. You don't really have to care too much when writing code like log(softmax(z)) -- it will also rewrite it to be numerically stable.
- The optimizations even went so far as to check whether some op can work inplace on its input. This made writing ops more complicated, because if you want nice performance, you would write two versions, one which works inplace on the tensor and one which does not, and then two further versions if you want CUDA as well.
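To illustrate why the log(softmax(z)) rewrite mentioned in the list above matters, here is a minimal plain-Python sketch. It is not Theano's actual rewrite; the function names are made up, and the stable version simply uses the standard log-sum-exp identity log_softmax(z)_i = z_i - (m + log(sum_j exp(z_j - m))) with m = max(z).

```python
import math

def naive_log_softmax(z):
    """Literal log(softmax(z)): exponentiate, normalize, then take the log."""
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [math.log(e / total) for e in exps]

def stable_log_softmax(z):
    """Same quantity via log-sum-exp, shifting by max(z) before exponentiating."""
    m = max(z)
    log_sum = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - log_sum for v in z]

# For moderate inputs the two agree.
print(naive_log_softmax([1.0, 2.0, 3.0]))
print(stable_log_softmax([1.0, 2.0, 3.0]))

# For widely separated inputs, exp(-1000) underflows to 0.0 and the naive
# version ends up calling log(0.0), which raises ValueError in Python.
print(stable_log_softmax([0.0, -1000.0]))
```

A graph optimizer that recognizes the log(softmax(.)) pattern can substitute the second form automatically, which is the convenience described above.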
In 1D convolutions, the in-place version would need to use O(filter size) scratch space for lookahead, but this doesn't seem like it would be too significant. However, it might start to become significant in higher-dimensional convolutions.
Any particular example that occurs to you?
But in absolute terms, it could make a difference. Think of y = x + 1 vs y = x; y += 1. I would expect that the former is slightly faster. But actually I'm not really sure.
Actually, I implemented most of my native ops exactly in this way, i.e. I implemented the inplace version, and the non-inplace version would just additionally copy it and then call the inplace version.
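A minimal sketch of that pattern, using plain Python lists in place of tensors. The names `add_one_inplace` and `add_one` are hypothetical stand-ins for a native op and its out-of-place wrapper, not Theano code.

```python
def add_one_inplace(buf):
    """Hypothetical in-place kernel: overwrite buf with buf + 1 and return it."""
    for i in range(len(buf)):
        buf[i] += 1
    return buf

def add_one(x):
    """Out-of-place version: copy the input, then reuse the in-place kernel."""
    return add_one_inplace(list(x))

x = [1, 2, 3]
y = add_one(x)           # x is left untouched, y is a fresh buffer
print(x, y)              # [1, 2, 3] [2, 3, 4]
add_one_inplace(x)       # now x itself is modified
print(x)                 # [2, 3, 4]
```

The cost of the out-of-place path is then exactly one extra copy, which makes the trade-off against the in-place path easy to reason about.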
> Apparently, the main new feature for Theano will be the JAX backend.
The JAX transpilation feature arose as a quick example of how flexible Theano can be, both in terms of its "hackability" and its simple yet effective foundation (i.e. "static" graphs). It's definitely not the main focus of the fork, but it is easily the newest feature that stands out at the user-level.
The points you raised about the old Theano are actually the main focus, and we've already made large internal changes that address a few of them directly. At the very least, nearly all of them are on the roadmap toward our new library named "Aesara".
The `Scan` `Op` and its optimizations are definitely going to change, and I have no intention of sacrificing improvements for backward compatibility, or anything else that would constrain the extent of improvements. I too have dealt with the difficulties involved in writing Scan optimizations (e.g. https://github.com/pymc-devs/symbolic-pymc/blob/master/symbo...) and am painfully aware of how unnecessary most of them are.
> - The graph building and esp the graph optimizations are very slow. This is because all the logic is done in pure Python. ...
The most important graph optimization performance problems are not actually related to Python performance; they're demonstrably design and implementation induced. That is unless you're talking exclusively about graphs so large they reach the "natural" limits of Python performance by definition. Even then, a nearly one-to-one C translation isn't likely to solve those scaling problems.
For example, the graph optimization/rewriting framework would require entire graphs to be copied at multiple points in the process, and this was almost completely due to some design oddities. We've already made all of the large-scale changes needed in order to remedy this design constraint, so we're well on our way to fixing that. See https://github.com/pymc-devs/Theano-PyMC/pull/158
The rewriting process also doesn't track or use node information very well (or at all), so the whole optimization process itself can take an unnecessary number of passes through a graph. For instance, its "local" optimizations have a "tracking" option that specifies the `Op` types to which they apply; however, that feature isn't even used unless the local optimizations are applied by a `LocalOptGroup`. I've noticed at least a few instances in which these local optimizations are applied to inapplicable `Op`s on each visit to a node. Worse yet, within `LocalOptGroup` those local optimizations aren't applied directly to the relevant `Op`s, even though the requisite `Op` type-to-node information is readily available. In other words, optimizations could be directly applied to the relevant nodes in these cases and dramatically reduce the amount of blind graph traversals performed.
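The saving from applying rewrites only to the nodes whose `Op` type they track can be sketched with a toy count of rewrite attempts. Everything here is hypothetical (the node list, the rewrite names, and both functions); it is not the Theano or Aesara rewriting API, only an illustration of indexing nodes by op type instead of trying every rewrite on every node.

```python
from collections import defaultdict

# Toy graph: each node is just (name, op_type).
nodes = [("a", "add"), ("b", "mul"), ("c", "exp"), ("d", "add"), ("e", "log")]

# Hypothetical local rewrites, each declaring the op types it tracks.
rewrites = {"fuse_add": {"add"}, "log_exp_cancel": {"log", "exp"}}

def blind_attempts(nodes, rewrites):
    """Try every rewrite on every node, applicable or not; count the attempts."""
    return sum(1 for _node in nodes for _rewrite in rewrites)

def indexed_attempts(nodes, rewrites):
    """Index nodes by op type once, then visit only nodes a rewrite tracks."""
    by_op = defaultdict(list)
    for name, op in nodes:
        by_op[op].append(name)
    return sum(len(by_op[op]) for tracked in rewrites.values() for op in tracked)

print(blind_attempts(nodes, rewrites))    # 10
print(indexed_attempts(nodes, rewrites))  # 4
```

On a five-node graph the difference is trivial, but the blind count grows with (nodes x rewrites) while the indexed count grows only with the number of applicable (node, rewrite) pairs.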
At best, a reimplementation in a language with a better compiler, like C, would largely amount to a questionable brute-force attempt at performance, and the ease of manipulating graphs and developing graph rewrites would suffer. With Aesara, we're going for the opposite. We want a smarter framework and _more_ focus on domain-specific optimizations (e.g. linear/tensor algebra, statistics, computer science) from the domain experts themselves, so code transparency and ease of development really matters. When we need raw performance in specific areas of the code, we'll pinpoint those areas and write C extensions, in standard Python fashion.
> ... When switching to TensorFlow, building the graph felt almost instant in comparison. ...
Last I checked, TensorFlow had almost no default graph optimizations, aside from some basic CSE and minor canonicalization and algebraic simplifications in the `grappler` module, so it absolutely should be instantaneous. More importantly, TensorFlow isn't designed for graph rewriting, and definitely not at the Python level where rapid prototyping and testing is possible outside of Google.
Otherwise, if you're talking about initially _building_ a graph and not calling `theano.function`, there are no optimizations involved. Latency in that case would be something entirely different and well worth reproducing for an issue. If what you were observing was the effect of calling `theano.function`, the latency was most likely due to the C transpilation and subsequent compilation. That's a feature that necessarily takes time, but produces code that's often faster than TensorFlow even today.
In summary, the changes we're most focused on right now are for developers like yourself who have had to deal with the core of Theano, so, please, stop by the fork and help us make a better `Scan`!
By graph building, I actually meant graph compilation. In TF the first `session.run`, or in Theano the `theano.function`.
I did not get too much into the internals of the graph compilation + optimization (despite writing a couple of simple optimization passes of my own), so I don't really know whether something is done really inefficiently, but I can easily believe that. I agree, if something is inefficient there, it should be rewritten in a more efficient way. But I also think that even if you make it as efficient as it can be, it would still be slow compared to a C/C++/Rust implementation, easily by a factor of 100 or so. And even in C/C++ it can still be slow, when I consider how much time LLVM or GCC take in their optimization passes.
Yes, TensorFlow does not have much optimization, although I think the idea was always to extend that. But then, as you say, this also is one of the reasons the graph compilation is so fast. But comparing the runtime performance of Theano vs TF, in most cases, TF was just as fast or faster (which is likely dependent on the specific model; but as far as I remember, that was the general observation by the community). So because of that, I was questioning whether all that heavy graph optimization is really worth it. Numerical stability is another topic, of course. But you can also have some simple logic for that, e.g. implement your own `safe_log`, which checks if the input is `softmax(x)`, and then directly returns `log_softmax(x)`. See e.g. here: https://github.com/rwth-i6/returnn/blob/6cd6b7b3b3d3beb33140...
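The `safe_log` idea can be sketched on a toy symbolic graph. This is not the RETURNN implementation linked above; the `Node` class and the helper functions are hypothetical, and the only point is that the check happens at graph-construction time on the producing op, with no general rewrite pass involved.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    """A toy symbolic graph node: an op name plus its input nodes."""
    op: str
    inputs: tuple = ()

def var(name):
    return Node("var:" + name)

def softmax(x):
    return Node("softmax", (x,))

def log_softmax(x):
    return Node("log_softmax", (x,))

def safe_log(x):
    """Build log(x), but emit log_softmax(z) directly when x is softmax(z)."""
    if x.op == "softmax":
        return log_softmax(x.inputs[0])
    return Node("log", (x,))

z = var("z")
print(safe_log(softmax(z)))  # a log_softmax node applied to z
print(safe_log(z))           # an ordinary log node
```

The trade-off is that this only catches the pattern when the user calls `safe_log` directly on the softmax output, whereas a graph optimizer can find it anywhere in the graph.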
Btw, graph rewriting in TF is certainly also possible, and not so complicated. But it's not really optimized for that. You cannot rewrite parts of the graph inplace. You would need to create a new copy. (Although, technically, I think it would not be too complicated to allow for more graph rewriting, also inplace. But it was/is just not a high priority.)
About `Scan`: I think the main problem is the API itself. I think it would be easier if the underlying op were `WhileLoop` or so, very similar to `tf.while_loop`. Then everything becomes very natural. However, you would then need some good way to accumulate your outputs, if you actually want to have the logic of `scan`: something like `ys = concat(ys, [y])` inside the loop. It is then probably necessary to have specific optimizations to make that efficient, or to introduce something like `TensorArray`. But in both cases, I think this is easier than working with `Scan` as the underlying op for loops.
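A rough plain-Python sketch of that layering, with `scan` expressed on top of a generic while-loop primitive. The functions `while_loop`, `scan` and `last_only` are hypothetical and operate on Python lists, not on Theano or TensorFlow graphs; the list append stands in for `ys = concat(ys, [y])`.

```python
def while_loop(cond, body, state):
    """Generic loop primitive: apply body to the state while cond holds."""
    while cond(*state):
        state = body(*state)
    return state

def scan(step, xs, init):
    """scan built on while_loop: returns the final carry and all intermediate ones."""
    def cond(i, carry, ys):
        return i < len(xs)

    def body(i, carry, ys):
        carry = step(carry, xs[i])
        return i + 1, carry, ys + [carry]   # plays the role of concat(ys, [y])

    _, final, ys = while_loop(cond, body, (0, init, []))
    return final, ys

def last_only(step, xs, init):
    """When only y[-1] is needed, simply do not carry the accumulator."""
    def cond(i, carry):
        return i < len(xs)

    def body(i, carry):
        return i + 1, step(carry, xs[i])

    _, final = while_loop(cond, body, (0, init))
    return final

print(scan(lambda c, x: c + x, [1, 2, 3, 4], 0))       # (10, [1, 3, 6, 10])
print(last_only(lambda c, x: c + x, [1, 2, 3, 4], 0))  # 10
```

With this structure, the "only the last entry is needed" case is expressed directly by the caller rather than recovered afterwards by an optimization rule, which is the simplification argued for above.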
Btw, in the blog post, it is written that TF is focusing on dynamic graphs now. While this indeed was an important focus when TF2 was introduced, I'm not sure whether they might take a step back again. Of course this is just speculation. But I think even internally, they are seeing the problems with dynamic graphs, and many groups still use the non-eager mode with static graphs and don't have any intention to switch away from that.
> I get confused with tensor computation libraries (or computational graph libraries, or symbolic algebra libraries, or whatever they’re marketing themselves as these days).
Aren't tensors a sort of generalisation of matrices? How are they equivalent to graphs?