What I wish someone had told me about tensor computation libraries

(eigenfoo.xyz)

302 points · eigenfoo 5y ago · 85 comments

dragandj5y ago

Let me chip in with some self-promotion.

This book explains and executes every single line of code interactively, from low-level operations to high-level networks that do everything automatically. The code is built on the state-of-the-art, high-performance operations of oneDNN (Intel, CPU) and cuDNN (CUDA, GPU). It is very concise, readable, and understandable by humans.

https://aiprobook.com/deep-learning-for-programmers/

Here's the open source library built throughout the book:

https://github.com/uncomplicate/deep-diamond

Some chapters from the beginning of the book are available on my blog, as a tutorial series:

https://dragan.rocks

drzoltar5y ago

Machine Learning in Clojure reminds me of Yann LeCun’s ML course from 2010, where we used an adorable language called Lush:

http://lush.sourceforge.net/

which I suppose can best be described as Lisp and Python having a baby. It was immense fun to code neural networks from scratch in it. I hope Clojure can find a bigger place in the world of ML.

lr19705y ago

In those days there was a lovely LuaJIT-based tensor manipulation language, torch7 [1,2], developed by Leon Bottou. It later became the basis for PyTorch. I still believe that Lua in general and LuaJIT in particular are much superior to Python for Deep Learning.

[1] http://torch.ch/

[2] https://github.com/torch/torch7

YuriNiyazov5y ago

Another student of LeCun from NYU here. Can attest that lush is adorable. For example:

For the high-performance parts of your code, a subset of Lush would generate C code and compile it. I imagined that this is what it was like to write the first version of C++, the one that generated C code.

godelski5y ago

I'd actually love that material in C++/CUDA.

dragandj5y ago

If only C++ supported an interactive REPL and the rest of the Clojure/Lisp goodies, that might be possible. However, the code is CLOSELY related to the actual CUDA/C++ API. It's a lot simpler and more concise, but I explain everything so that you can use the relevant parts with the cuDNN and DNNL APIs in whichever language you're most proficient in.

kyllo5y ago

"Deep Learning in Clojure with Fewer Parentheses than Keras and Python"

Love it! :D What better way to define a neural network in code than an S-expression?

piokoch5y ago

I am not sure I am that enthusiastic. The problem with Lisp is not the number of parentheses but where they are and what their role is. In C-like languages parentheses help the parser and compiler, but they also help humans read the code. In Lisp they are there just for the sake of the parser. Let's look at the code:

Python:

  model = Sequential()
  model.add(Conv2D(32, kernel_size=(3, 3),
                 activation='relu',
                 input_shape=(28, 28, 1)))

Clojure:

  (defonce net-bp
    (network (desc [128 1 28 28] :float :nchw)

Which one is more readable? Looking at the Clojure code, I see 128 1 28 28 thrown at me; without digging into the documentation I have no idea what's happening.

runetech5y ago

I really enjoy your work. I bought your book recently and appreciate your approach of building an understanding of the library based on "first principles". Really appreciate this performant and elegant option for working with deep-learning in Clojure - Thank you!

dragandj5y ago

Thanks!

wokwokwok5y ago

Concise isn’t always better.

You’re throwing away all the names of the arguments and using arbitrary words like “conv” to represent operations.

This is typical bad clojure in my experience; write once, forget wtf the magic was, throw away and rewrite it again later.

Clojure doesn’t have to be incomprehensible arcane magic that does everything in 10 lines.

The more complex the code, the more important it is that what you do is clear and clearly documented.

Don’t write a 1 line regex to solve a complicated problem; it’s the wrong tool for that job, no matter how smart your substring matches are.

You don’t win a prize for making unmaintainable code.

I similarly think the goal of being concise in ML code is deeply misguided.

dragandj5y ago

How is "conv" arbitrary? There is a function object that represents a convolutional layer in the network. It is bound to two symbols (because why not). You can either use "convolution" if you prefer full names, or "conv" if you prefer shorter. It doesn't represent the operation, but the layer. There are functions (with longer names) representing the convolution operation, which follow cuDNN and DNNL naming schemes.

Regarding the magic, I believe you haven't read my writings related to this. Exactly the opposite - there is no magic other than usual Clojure-fu, which I explain in a layered way.

But it's difficult to reply exactly to your critique, because you haven't given any example of an approach that would be good Clojure. Ok, give me an example of how you would do it in a comprehensible way (if what I provide is incomprehensible). You don't have to actually implement it. Show a non-working alternative. What would it look like?

hansvm5y ago

Concision is a style choice to be used with care. Spending screen space on additional characters and descriptions detracts from the ability to fit more logic on the screen at once and grok the larger flow. Splashing symbolic alphabet soup into your IDE in the name of concision isn't usually a good idea, but naming something "conv" in the immediate local context of a convolutional layer doesn't seem so bad.

37ef_ced35y ago

NN-512 (https://NN-512.com)

Generate fully vectorized, stand-alone, human-readable C99 code for neural net inference, and understand exactly what's happening. For example, watch the code run with Linux's perf top and see the relative costs of each layer of the computation. Total transparency, no dependencies outside the C POSIX library

joshuamorton5y ago

In what sense is this "better"?

The generated code is like

    __m512i wfs16 = _mm512_castsi256_si512(_mm512_cvtps_ph(wf25, _MM_FROUND_TO_NEAREST_INT|_MM_FROUND_NO_EXC));
    fs16 = _mm512_inserti64x4(wfs16, _mm512_cvtps_ph(wf26, _MM_FROUND_TO_NEAREST_INT|_MM_FROUND_NO_EXC), 1);
    _mm512_mask_storeu_epi32(wfPtr1+230400+38400*i5+768*c2+128*k1+64*m2+16*f3, 3855, wfs16);
    _mm512_mask_storeu_epi32(wfPtr1+345584+38400*i5+768*c2+128*k1+64*m2+16*f3, 61680, wfs16);

(which is a set of 4 lines that appear in the middle of an ~800 line function).

That's not "human readable".

Sure you can use asan or gdb, but if gdb profiles slowly, what can you do? You're still at the mercy of the code generator to be able to optimize things.

37ef_ced35y ago

Google those _mm512_... intrinsics (they are part of GCC) to see what they mean. The code you pasted is converting single-precision floats to half-precision floats, and storing the half-precision floats to memory, 32 at a time. That's filter packing, which happens during initialization (and never during inference)

I agree, if you don't know anything about how convolution is implemented (filter packing, data packing, matrix multiplication, sum unpacking), you could be lost. But it's very shallow compared to a JIT or CUDA library scheme, and a knowledgeable ML performance engineer would have no difficulty

The inference function (at the end of the C file) is a series of blocks, each block corresponding to a convolution or other complex operation. It's straightforward to see which, by looking at where the weights come from (a field in a struct that has the same name as the layer in your graph)

If you use perf top (for example) you can see which convolution was most expensive, and why. Does the shape of the tensor produce many small partial blocks around the edge, so the packing is inefficient (a lot of tile overhang), for example? You can see that by glancing at the code and seeing that there are many optimized blocks around the edges. As a rule, if NN-512 generates small code for a tensor (few edge cases) you have chosen an efficient tensor shape, with respect to the tile

Or you might find that batch normalization is being done at inference time (as in DenseNet), instead of being integrated into the convolution weights (as in ResNet), because there's fanout from the source and a ReLU in between. You can see that easily in the generated code (the batch norm fmadd instructions will appear in the packing or unpacking code)

Is the matrix multiplication slow because there are too few channels per group (as in ResNeXt)? Easy to see in perf, make your groups bigger. Are you using an inefficient filter shape, so we have to fall back to a slower general purpose convolution? You can easily see whether Winograd or Fourier was used

And so on

gameswithgo5y ago

i can read it! but then i spent months fiddling with intel intrinsics as a hobby

yudlejoza5y ago

Great. Thanks!

1. Any particular reason you chose to avoid GPUs?

2. Did you benchmark your code's performance against GPU-centric codes (ideally for the same problem and problem-size)?

37ef_ced35y ago

The goal of NN-512 is efficient neural net inference on inexpensive, CPU-only cloud compute instances

For example, a Skylake-X cloud compute instance costs $10 per CPU-core per month at Vultr, and the NN-512 generated code does about 18 DenseNet121 inferences per CPU-core per second (in series, not batched)

In contrast, GPU cloud compute is almost unbelievably expensive. Even Linode charges $1000 per month, or $1.50 per hour (look at the GPU plans: https://www.linode.com/pricing/#row--compute)

As AVX-512 becomes better supported by Intel and AMD chips, it becomes more attractive as an alternative to expensive GPU instances for workloads with small amounts of inference mixed with other computation

ssivark5y ago

GPUs are typically useful for training (due to massive parallelism), but not for inference.

bravura5y ago

I just want to say that I'm very interested in this library and have commented on it before. I'd really like to see it reach feature parity with pytorch or theano and emit your C++ code on the backend.

For example, I am not aware that one can currently use your library to implement Wavenet, other audio generative models like Wavegrad, or transformers.

Keep up the good work.

DSingularity5y ago

Yummy. Thanks. Gonna bookmark that one.

yongjik5y ago

> with dynamically generated graphs, the computational graph is never actually defined anywhere: the computation is traced out on the fly and behind the scene. You can no longer do anything interesting with the computational graph: for example, if the computation is slow, you can’t reason about what parts of the graph are slow.

Hmm, my experience is the opposite. When I used Tensorflow, there was no way I could figure out why something was slow or required huge amounts of memory. All I had was a gigantic black box.

Meanwhile, in PyTorch, all I have to do is run it with CUDA_LAUNCH_BLOCKING=1, and it will give me an accurate picture of exactly how many milliseconds each line is taking! (Just print the current time before/after the line.) With nvprof it will even tell you which CUDA kernels are executing.

* Disclaimer: Haven't dabbled in ML for ~a year, so my view might be outdated now.
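
The "print the current time before/after the line" approach can be wrapped in a small context manager. This is a minimal, library-agnostic sketch using only the standard library; the name `timed` and the stand-in workload are hypothetical. For asynchronous GPU work the numbers are only meaningful if kernel launches block (as with CUDA_LAUNCH_BLOCKING=1 in the comment above) or the device is synchronized before the clock is read.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record wall-clock milliseconds spent in the `with` body under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = (time.perf_counter() - start) * 1000.0


timings = {}
with timed("matmul-like loop", timings):
    # Stand-in for one line of model code (assumption: any Python statement).
    total = sum(i * i for i in range(100_000))

for label, ms in timings.items():
    print(f"{label}: {ms:.3f} ms")
```

Wrapping each suspicious line in its own `with timed(...)` block gives a per-line breakdown without any profiler.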

whimsicalism5y ago

Eh. I love pytorch, but it can definitely be difficult to reason about at times. For instance, due to async dispatch on GPU, you could get assertion errors where a line fails, but the real error was actually several lines above.

That was difficult to reason about.

atorodius5y ago

Wouldn't this be fixed by CUDA_LAUNCH_BLOCKING=1? Or by putting a bunch of torch.cuda.synchronize() calls around the suspected lines.

jstrong5y ago

I'm a theano diehard, and I'll never get over how google came along, introduced a shittier version of theano, garnered worldwide acclaim for it, and killed the better library in the process.

alevskaya5y ago

Having written and debugged both Theano and TF plenty in the past, I think this is a somewhat uncharitable take, esp. recalling the absolutely enormous Theano compile times. :) I think Theano was genius, but a system that relied on python-string-based C++ code-emitters was always going to have trouble with long-term sustainability.

bravura5y ago

I am one of the authors of the Theano work. I am happy to hear that the Theano project is now being maintained again.

I will agree with alevskaya that the compilation times were an issue in my particular research ten years ago. I was trying to build neural-networks for parsing that were created at run-time. Since each parse tree had a different computation graph, I was not able to use Theano since it required compiling every single type of parse tree computation graph it encountered during training.

[edit if you want more details: There is really interesting old-school work called "Recursive distributed representations" and later "Labelling recursive auto-associative memory" that used auto-encoders to consume a variable length sequence, e.g. text string, in a sequential fashion. My work with Yoshua Bengio---incomplete---was based upon the idea of doing unsupervised binary parsing of sentences using a hierarchical RAAM-style approach: At any given point in time, greedily find the two adjacent tokens that could be most easily compressed into one token with low reconstruction error. However, once you apply this recursively and end up with auto-encoding binary parse trees, you end up with a variety of different computation graphs, each of which required separate compilation.]

cmarschner5y ago

Tensorflow 1.0 has its roots in how Theano was built. Same thing, a statically built graph that is run through a compilation step, with a numpy-like API. So what makes Theano such an ingenious concept while TF is regarded as “programming through a keyhole”?

dr_zoidberg5y ago

Here's my take about TF (in general, not particularly 1.x or 2.x):

Like many things from Google, I always had the impression that the library, while better than alternatives at the time, is too tailored to Google use cases. And if you fall outside of them, bad luck.

Still, at work we find it easier to deploy and interoperate with other tools than Pytorch. Hell, we have a guy working in Pytorch who converts his work to ONNX so that we can then connect those to some tooling we already have from back when TF was our only backend.

Could there be a better way? Perhaps. But we have to ship models and TF "just* works" (with a big asterisk, yeah).

bravura5y ago

I recently used TF 1.0 (former Theano author, current PyTorch user) and found TF 1.0 to be hellaciously difficult to grok and seemed to include a lot of unnecessary abstractions.

There was existing TF 1.0 code I was trying to extract gradients through (nsynth-wavenet). I spent over 8 hours on it unsuccessfully; I asked for help from a friend at Google who worked on TF and he couldn't figure it out either. I emailed the original author of the code and he acknowledged that he didn't know how to do it either, and he had an old notebook he could dig up that kinda would work with a lot of fixes.

bravura5y ago

I will say that I am very excited by the tftorch.py effort from @sillysaurusx: https://twitter.com/theshawwn/status/1311925180126511104

The idea being that pytorch can just be a high-level API executing lower-level tensorflow under the hood.

prideout5y ago

Are these libraries ever useful in non-deep learning applications? It sounds like Theano is a bit more general purpose, but why would I ever need it outside of a deep learning context?

I wonder if it could be used for something crazy, e.g. setting up a graph that generates shadertoy-like images on the GPU.

6gvONxR4sf7o5y ago

They are. Lots of numerical code benefits from GPU and lots of numerical code benefits from derivatives. Simulations, solvers, numerical optimization, good old fashioned statistics.

physicsyogi5y ago

Libraries like this enable differentiable programming, which lets you backprop through more than just neural networks. For instance, people have built a differentiable raytracer and plugged a physics engine into reinforcement learning to accelerate training.

https://en.wikipedia.org/wiki/Differentiable_programming

timkpaine5y ago

Idk about using these libraries, but it's almost impossible to find generic graph libraries that aren't designed around either ML or scheduling batches. One such example is my own, https://github.com/timkpaine/tributary

nerdponx5y ago

Interesting library & idea, almost like its own programming paradigm when you abstract away all the specificity for building software or running ETL jobs or whatever.

But this is a completely different kind of graph. The graphs being discussed here are differentiable DAGs of mathematical computations.

tcpekin5y ago

We use them for computational imaging reconstruction in electron microscopy.

Const-me5y ago

I wonder whether any of them has proper Windows support, i.e. DirectCompute?

CUDA is NVidia-only, and vendor lock-in is bad for end users. CUDA, OpenCL and VK all require large runtimes which are not included in the OS; software vendors like me need to redistribute and support them, and I tend to avoid deploying libraries when I can.

cygaril5y ago

Seems to have missed the existence of jax.jit, which basically constructs an XLA program (call it a graph if you like) from your Python function which can then be optimized.

JHonaker5y ago

In the section titled JAX:

> But JAX even lets you just-in-time compile your own Python functions into XLA-optimized kernels...

nestorD5y ago

The author gives that quote (from the JAX documentation) but does not seem to internalize it, as his conclusion says:

> This is the niche that Theano (or rather, Theano-PyMC/Aesara) fills that other contemporary tensor computation libraries do not: the promise is that if you take the time to specify your computation up front and all at once, Theano can optimize the living daylight out of your computation - whether by graph manipulation, efficient compilation or something else entirely - and that this is something you would only need to do once.

That is exactly what JAX does. There is a computational graph in JAX (it's encoded in XLA and specified with their numpy-like syntax); it is built once, optimized, and then runs on the GPU.
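
The "built once, then run" idea can be illustrated without JAX itself. The sketch below is a toy tracer in plain Python, not JAX's actual implementation: calling an ordinary function with placeholder objects records its arithmetic as a list of nodes, which can then be inspected or replayed on real numbers. `Tracer`, `trace`, and `run` are hypothetical names, and only `+` and `*` are supported.

```python
import operator


class Tracer:
    """Placeholder value that records operations instead of computing them."""

    def __init__(self, graph, name):
        self.graph, self.name = graph, name

    def _record(self, op, other):
        out = Tracer(self.graph, f"t{len(self.graph)}")
        # Other tracers are stored by name; plain numbers are stored as constants.
        rhs = other.name if isinstance(other, Tracer) else other
        self.graph.append((out.name, op, self.name, rhs))
        return out

    def __add__(self, other):
        return self._record("add", other)

    def __mul__(self, other):
        return self._record("mul", other)


OPS = {"add": operator.add, "mul": operator.mul}


def trace(fn, arg_names):
    """Run `fn` once on tracers and return (recorded graph, output name)."""
    graph = []
    out = fn(*(Tracer(graph, n) for n in arg_names))
    return graph, out.name


def run(graph, out_name, **inputs):
    """Replay a recorded graph on concrete numeric inputs."""
    env = dict(inputs)
    for name, op, lhs, rhs in graph:
        right = env[rhs] if isinstance(rhs, str) else rhs
        env[name] = OPS[op](env[lhs], right)
    return env[out_name]


def f(x, y):
    return x * y + x * 2


graph, out = trace(f, ["x", "y"])
print(graph)                       # three recorded nodes: mul, mul, add
print(run(graph, out, x=3, y=4))   # 18
```

Once the computation exists as data like this, an optimizer can rewrite it before anything runs, which is the property the article attributes to Theano.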

easde5y ago

TorchScript JIT (torch.jit.script) is similar for PyTorch.

komuher5y ago

Not even close. jax.jit allows you to compute almost anything using lax.fori_loop, lax.cond and other lax and jax constructs. pytorch jit does not allow that; it's just extra optimization for static pytorch functions.

PoignardAzur5y ago

Can someone ELI5 what the differences between the different libraries are? The article uses a lot of jargon, and something that frustrates me about getting into machine learning is that teaching material will either abstract away what the internals do or assume that you already know how the internals work.

Some specific questions:

> They provide ways of specifying and building computational graphs

Is the article talking about neural networks? As in, arrays of arrays of weights, where input values go through successive layers, and for each layer the same instruction is applied to some values with the respective weight?

Or is it talking about a graph as in, a functional graph, where manually written functions call other manually written functions? (hence why a later paragraph talks about if-else statements and for loops)

> Almost all tensor computation libraries support autodifferentiation in some capacity (either forward-mode, backward-mode, or both).

What are those?

From the wikipedia article, it sounds like autodifferentiation basically means running f(x+dx)-f(x), but if there are entire frameworks handling it, then there's probably something fancier going on.

> According to the JAX quickstart, JAX bills itself as “NumPy on the CPU, GPU, and TPU, with great automatic differentiation for high-performance machine learning research”. Hence, its focus is heavily on autodifferentiation.

The earlier description makes it sound like JAX does some cutting-edge compilation stuff to transform semi-arbitrary functions (with ifs and elses and loops and stuff) into a function that returns its derivative.

So how can that stuff run on the GPU? It sounds like there would be a lot of branching code.

And how is that related to machine learning / neural networks?
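
On the autodifferentiation question above: it is not the finite difference f(x+dx)-f(x). Forward mode, for instance, carries an exact derivative alongside every value and updates it with the chain rule at each operation. A minimal sketch using dual numbers in plain Python follows; the `Dual` class and `derivative` helper are hypothetical names for illustration, and real libraries handle tensors and many more operations.

```python
import math


class Dual:
    """A value paired with its derivative with respect to one chosen input."""

    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    @staticmethod
    def lift(x):
        # Plain numbers are constants, so their derivative is zero.
        return x if isinstance(x, Dual) else Dual(x)

    def __add__(self, other):
        o = Dual.lift(other)
        return Dual(self.val + o.val, self.dot + o.dot)

    __radd__ = __add__

    def __mul__(self, other):
        o = Dual.lift(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.val * o.val, self.dot * o.val + self.val * o.dot)

    __rmul__ = __mul__


def sin(x):
    x = Dual.lift(x)
    # Chain rule: sin(u)' = cos(u) * u'
    return Dual(math.sin(x.val), math.cos(x.val) * x.dot)


def derivative(f, x):
    """Evaluate df/dx at x by seeding the input's derivative with 1."""
    return f(Dual(x, 1.0)).dot


def f(x):
    if x.val > 0:  # ordinary Python control flow
        return x * x + 3 * x
    return sin(x)


print(derivative(f, 2.0))   # 7.0, since d/dx (x^2 + 3x) = 2x + 3
print(derivative(f, -1.0))  # equals cos(-1.0)
```

Because derivatives ride along with values, the `if` poses no problem in this sketch: only the branch actually taken is differentiated.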

dangirsh5y ago

Related: The Simple Essence of Automatic Differentiation - Conal Elliott

- https://www.youtube.com/watch?v=ne99laPUxN4

- https://arxiv.org/abs/1804.00746

galaxyLogic5y ago

Why are they called TENSOR computation libraries?

albertzeyer5y ago

I was not aware that the PyMC developers have forked and continued Theano: https://github.com/pymc-devs/Theano-PyMC

It seems very active right now.

Here is some further information: https://pymc-devs.medium.com/the-future-of-pymc3-or-theano-i...

I haven't really found references to its new name "Aesara".

Apparently, the main new feature for Theano will be the JAX backend.

I wonder, though. My experience from working with Theano, including deep in the internals (trying to get further graph optimizations on theano.scan):

- Some parts of the code are not really clean.

- The code is extremely complex and hard to follow. See this: https://github.com/pymc-devs/Theano-PyMC/blob/master/theano/...

- This also made it very complicated to perform optimizations on the graph. See this: https://github.com/pymc-devs/Theano-PyMC/blob/master/theano/...

- In this specific case, it's also a problem of the API: theano.scan would return the whole sequence. But if you only need the last entry, i.e. y[-1], there is a very complicated optimization rule which checks for that. Basically many optimizations around theano.scan are very complicated because of that.

- Here is one attempt for some optimization on theano.scan: https://github.com/Theano/Theano/pull/3640

- The graph building and esp the graph optimizations are very slow. This is because all the logic is done in pure Python. But if you have big graphs, even just building up the graph can take time, and the optimization passes will take much longer. This was one of the most annoying problems when working with Theano. The startup time to build the graph could easily take up some minutes. I also doubt that you can optimize this very much in pure Python -- I think you would need to reimplement that in C++ or so. When switching to TensorFlow, building the graph felt almost instant in comparison. I wonder if they have any plans on this in this fork.

- On the other hand, the optimizations on the graph are quite nice. You don't really have to care too much when writing code like log(softmax(z)) -- it will also optimize it to be numerically stable.

- The optimizations also went so far as to check whether some op can work inplace on its input. This made writing ops more complicated, because if you want nice performance, you would write two versions, one which works inplace on the tensor and another which does not. And then again 2 further versions if you want CUDA as well.
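
To illustrate the log(softmax(z)) point in the list above: the naive composition overflows for large inputs, while the algebraically equivalent log-sum-exp form does not. This is a plain-Python sketch of the numerical issue, not Theano's actual rewrite rule; both function names are hypothetical.

```python
import math


def log_softmax_naive(z):
    """Literal log(softmax(z)): exponentiate, normalize, then take the log."""
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [math.log(e / total) for e in exps]


def log_softmax_stable(z):
    """Same quantity as z - logsumexp(z), shifted by max(z) to avoid overflow."""
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - lse for v in z]


z = [1000.0, 0.0, -1000.0]
print(log_softmax_stable(z))
try:
    print(log_softmax_naive(z))
except OverflowError as err:
    print("naive version failed:", type(err).__name__)
```

A graph optimizer that recognizes the log(softmax(...)) pattern can substitute the stable form automatically, so the user never has to write it by hand.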

blt5y ago

Re. the last point, was trying to think of computations where 1) an efficient in-place version is possible, and 2) the most efficient out-of-place version is significantly faster than copying the input and executing the in-place version.

In 1D convolutions, the in-place version would need to use O(filter size) scratch space for lookahead, but this doesn't seem like it would be too significant. However, it might start to become significant in higher-dimensional convolutions.

Any particular example that occurs to you?

albertzeyer5y ago

In Big-O notation, there will not be any difference, because copying the data will just be O(N), and whatever you do in the op will be at least O(N), so no change.

But in absolute terms, it could make a difference. Think of y = x + 1 vs y = x; y += 1. I would expect that the former is slightly faster. But actually I'm not really sure.

Actually, I implemented most of my native ops exactly in this way, i.e. I implemented the inplace version, and the non-inplace version would just additionally copy it and then call the inplace version.
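
The pattern described here (write the inplace kernel once, and let the non-inplace version copy and delegate) might look like the following. Python lists stand in for tensors, and both function names are hypothetical.

```python
def add_one_inplace(buf):
    """In-place op: overwrite the input buffer and return it."""
    for i in range(len(buf)):
        buf[i] += 1
    return buf


def add_one(buf):
    """Out-of-place op: copy the input, then reuse the in-place kernel."""
    return add_one_inplace(list(buf))


x = [1.0, 2.0, 3.0]
y = add_one(x)          # x is left untouched, y is a fresh buffer
print(x, y)
add_one_inplace(x)      # x itself is modified
print(x)
```

The cost of this design is one extra pass over the data for the copy, which is the constant-factor difference discussed above.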

btwillard5y ago

Hello, I'm the person spearheading this Theano fork! Your comments match my experience with the old Theano very well, so I have to respond.

> Apparently, the main new feature for Theano will be the JAX backend.

The JAX transpilation feature arose as a quick example of how flexible Theano can be, both in terms of its "hackability" and its simple yet effective foundation (i.e. "static" graphs). It's definitely not the main focus of the fork, but it is easily the newest feature that stands out at the user-level.

The points you raised about the old Theano are actually the main focus, and we've already made large internal changes that address a few of them directly. At the very least, nearly all of them are on the roadmap toward our new library named "Aesara".

The `Scan` `Op` and its optimizations are definitely going to change, and I have no intention of sacrificing improvements for backward compatibility, or anything else that would constrain the extent of improvements. I too have dealt with the difficulties involved in writing Scan optimizations (e.g. https://github.com/pymc-devs/symbolic-pymc/blob/master/symbo...) and am painfully aware of how unnecessary most of them are.

> - The graph building and esp the graph optimizations are very slow. This is because all the logic is done in pure Python. ...

The most important graph optimization performance problems are not actually related to Python performance; they're demonstrably design and implementation induced. That is unless you're talking exclusively about graphs so large they reach the "natural" limits of Python performance by definition. Even then, a nearly one-to-one C translation isn't likely to solve those scaling problems.

For example, the graph optimization/rewriting framework would require entire graphs to be copied at multiple points in the process, and this was almost completely due to some design oddities. We've already made all of the large-scale changes needed in order to remedy this design constraint, so we're well on our way to fixing that. See https://github.com/pymc-devs/Theano-PyMC/pull/158

The rewriting process also doesn't track or use node information very well (or at all), so the whole optimization process itself can take an unnecessary number of passes through a graph. For instance, its "local" optimizations have a "tracking" option that specifies the `Op` types to which they apply; however, that feature isn't even used unless the local optimizations are applied by a `LocalOptGroup`. I've noticed at least a few instances in which these local optimizations are applied to inapplicable `Op`s on each visit to a node. Worse yet, within `LocalOptGroup` those local optimizations aren't applied directly to the relevant `Op`s, even though the requisite `Op` type-to-node information is readily available. In other words, optimizations could be directly applied to the relevant nodes in these cases and dramatically reduce the amount of blind graph traversals performed.

At best, a reimplementation in a language with a better compiler, like C, would largely amount to a questionable brute-force attempt at performance, and the ease of manipulating graphs and developing graph rewrites would suffer. With Aesara, we're going for the opposite. We want a smarter framework and _more_ focus on domain-specific optimizations (e.g. linear/tensor algebra, statistics, computer science) from the domain experts themselves, so code transparency and ease of development really matters. When we need raw performance in specific areas of the code, we'll pinpoint those areas and write C extensions, in standard Python fashion.

> ... When switching to TensorFlow, building the graph felt almost instant in comparison. ...

Last I checked, TensorFlow had almost no default graph optimizations, aside from some basic CSE and minor canonicalization and algebraic simplifications in the `grappler` module, so it absolutely should be instantaneous. More importantly, TensorFlow isn't designed for graph rewriting, and definitely not at the Python level where rapid prototyping and testing is possible outside of Google.

Otherwise, if you're talking about initially _building_ a graph and not calling `theano.function`, there are no optimizations involved. Latency in that case would be something entirely different and well worth reproducing for an issue. If what you were observing was the effect of calling `theano.function`, the latency was most likely due to the C transpilation and subsequent compilation. That's a feature that necessarily takes time, but produces code that's often faster than TensorFlow even today.

In summary, the changes we're most focused on right now are for developers like yourself who have had to deal with the core of Theano, so, please, stop by the fork and help us make a better `Scan`!

albertzeyer5y ago

Hey! Thanks for the answer!

By graph building, I actually meant graph compilation. In TF the first `session.run`, or in Theano the `theano.function`.

I did not get too much into the internals of the graph compilation + optimization (despite writing a couple of simple optimization passes of my own), so I don't really know whether sth is done really inefficiently, but I can easily believe that. I agree, if sth is inefficient there, it should be rewritten in a more efficient way. But I also think that even if you have it as efficient as it can be, it would still be slow compared to a C/C++/Rust implementation, easily by a factor of 100 or so. And even in C/C++ it can still be slow, when I consider how much time LLVM or GCC take in their optimization passes.

Yes, TensorFlow does not have much optimization, although I think the idea was always to extend that. But then, as you say, this also is one of the reasons the graph compilation is so fast. But comparing the runtime performance of Theano vs TF, in most cases, TF was just as fast or faster (which is likely dependent on the specific model; but as far as I remember, that was the general observation by the community). So because of that, I was questioning whether all that heavy graph optimization is really worth it. Numerical stability is another topic, of course. But you can also have some simple logic for that, e.g. implement your own `safe_log`, which checks if the input is `softmax(x)`, and then directly returns `log_softmax(x)`. See e.g. here: https://github.com/rwth-i6/returnn/blob/6cd6b7b3b3d3beb33140...
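
The `safe_log` idea can be sketched on a toy symbolic graph. This is not the RETURNN code linked above; `Node`, `var`, and the op names are hypothetical, and the check happens at graph-construction time rather than in a separate optimization pass.

```python
class Node:
    """A symbolic graph node: an op name plus its input nodes."""

    def __init__(self, op, *inputs):
        self.op, self.inputs = op, inputs

    def __repr__(self):
        if not self.inputs:
            return self.op
        return f"{self.op}({', '.join(map(repr, self.inputs))})"


def var(name):
    return Node(name)


def softmax(x):
    return Node("softmax", x)


def log(x):
    return Node("log", x)


def safe_log(x):
    """Return log_softmax(z) when x is softmax(z); otherwise a plain log."""
    if x.op == "softmax":
        (z,) = x.inputs
        return Node("log_softmax", z)
    return log(x)


z = var("z")
print(safe_log(softmax(z)))  # log_softmax(z)
print(safe_log(z))           # log(z)
```

The trade-off is that the user has to remember to call `safe_log`, whereas a graph optimizer applies the same substitution to any code that happens to produce the pattern.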

Btw, graph rewriting in TF is certainly also possible, and not so complicated. But it's not really optimized for that. You cannot rewrite parts of the graph inplace. You would need to create a new copy. (Although, technically, I think it would not be too complicated to allow for more graph rewriting, also inplace. But it was/is just not a high priority.)

About `Scan`: I think the main problem is the API itself. I think it is easier if the underlying op would be `WhileLoop` or so, very similar to `tf.while_loop`. Then everything becomes very natural. However, then you would need some good way to accumulate your outputs, if you actually want to have the logic of `scan`. Sth like `ys = concat(ys, [y])` inside the loop. And then it probably is necessary to have specific optimizations on that to make that efficient. Or introduce sth like `TensorArray`. But in both cases, I think this is easier than working with `Scan` as the underlying op for loops.
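
A rough sketch of that layering, in plain Python rather than Theano or TensorFlow: a generic `while_loop` is the primitive, and `scan` is built on top of it by accumulating outputs in the loop state. All names and signatures here are hypothetical, and a Python list plays the role of the accumulator (the `ys = concat(ys, [y])` step or a `TensorArray`).

```python
def while_loop(cond, body, state):
    """Repeatedly apply `body` to `state` while `cond(state)` holds."""
    while cond(state):
        state = body(state)
    return state


def scan(step, init, xs):
    """Apply `step(carry, x)` over xs; return (final carry, list of outputs)."""

    def cond(state):
        i, _, _ = state
        return i < len(xs)

    def body(state):
        i, carry, ys = state
        carry, y = step(carry, xs[i])
        return i + 1, carry, ys + [y]  # accumulate outputs in the loop state

    _, carry, ys = while_loop(cond, body, (0, init, []))
    return carry, ys


def cumsum_step(carry, x):
    total = carry + x
    return total, total


final, ys = scan(cumsum_step, 0, [1, 2, 3, 4])
print(final, ys)  # 10 [1, 3, 6, 10]
```

If only the last value is needed, calling `while_loop` directly with a state of `(i, carry)` never builds `ys` at all, so no special "only y[-1] is used" optimization rule is required.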

Btw, in the blog post, it is written that TF is focusing on dynamic graphs now. While this indeed was an important focus when TF2 was introduced, I'm not sure whether they might take a step back again. Of course this is just speculation. But I think even internally, they are seeing the problems with dynamic graphs, and many groups still use the non-eager mode with static graphs and don't have any intention to switch away from that.

MaxBarraclough5y ago

As someone who knows nothing about this area:

> I get confused with tensor computation libraries (or computational graph libraries, or symbolic algebra libraries, or whatever they’re marketing themselves as these days).

Aren't tensors a sort of generalisation of matrices? How are they equivalent to graphs?

ogogmad5y ago

The word tensor in this context refers to a multidimensional array, not to a tensor in the mathematical sense. The computation graph is simply a representation of a sequence of arithmetic operations that you're performing on some data.

MaxBarraclough5y ago

I see, thanks.

nautilus125y ago

The last arguments about why you would want a static graph, and even its drawbacks and complaints, sound basically similar to why you would want to do functional programming.

sidhu1f5y ago

For the heavy lifting of the actual linear algebra computations, these tensor computation libraries typically use some variant of BLAS or eigen.

kyllo5y ago

"Deep Learning in Clojure with Fewer Parentheses than Keras and Python"

Love it! :D What better way to define a neural network in code than an S-expression?

piokoch5y ago

Python:

  model = Sequential()
  model.add(Conv2D(32, kernel_size=(3, 3),
                 activation='relu',
                 input_shape=(28, 28, 1)))

Clojure:

  (defonce net-bp
    (network (desc [128 1 28 28] :float :nchw)

Which one is more readable? Looking at the Clojure code, I see 128 1 28 28 thrown at me; without digging into the documentation I have no idea what's happening.

runetech5y ago

dragandj5y ago

Thanks!

wokwokwok5y ago

Concise isn’t always better.

You’re throwing away all the names of the arguments and using arbitrary words like “conv” to represent operations.

This is typical bad clojure in my experience; write once, forget wtf the magic was, throw away and rewrite it again later.

Clojure doesn’t have to be incomprehensible arcane magic that does everything in 10 lines.

The more complex the code, the more important it is that what you do is clear and clearly documented.

Don’t write a 1 line regex to solve a complicated problem; it’s the wrong tool for that job, no matter how smart your substring matches are.

You don’t win a prize for making unmaintainable code.

I similarly think the goal of being concise in ML code is deeply misguided.

dragandj5y ago

Regarding the magic, I believe you haven't read my writings related to this. Exactly the opposite - there is no magic other than usual Clojure-fu, which I explain in a layered way.

hansvm5y ago

37ef_ced35y ago

NN-512 (https://NN-512.com)

joshuamorton5y ago

In what sense is this "better"?

The generated code is like

    __m512i wfs16 = _mm512_castsi256_si512(_mm512_cvtps_ph(wf25, _MM_FROUND_TO_NEAREST_INT|_MM_FROUND_NO_EXC));
    fs16 = _mm512_inserti64x4(wfs16, _mm512_cvtps_ph(wf26, _MM_FROUND_TO_NEAREST_INT|_MM_FROUND_NO_EXC), 1);
    _mm512_mask_storeu_epi32(wfPtr1+230400+38400*i5+768*c2+128*k1+64*m2+16*f3, 3855, wfs16);
    _mm512_mask_storeu_epi32(wfPtr1+345584+38400*i5+768*c2+128*k1+64*m2+16*f3, 61680, wfs16);

(which is a set of 4 lines that appear in the middle of an ~800 line function).

That's not "human readable".

Sure you can use asan or gdb, but if gdb profiles slowly, what can you do? You're still at the mercy of the code generator to be able to optimize things.

37ef_ced35y ago

And so on

gameswithgo5y ago

i can read it! but then i spent months fiddling with intel intrinsics as a hobby

yudlejoza5y ago

Great. Thanks!

1. Any particular reason you chose to avoid GPUs?

2. Did you benchmark your code's performance against GPU-centric codes (ideally for the same problem and problem-size)?

37ef_ced35y ago

The goal of NN-512 is efficient neural net inference on inexpensive, CPU-only cloud compute instances

In contrast, GPU cloud compute is almost unbelievably expensive. Even Linode charges $1000 per month, or $1.50 per hour (look at the GPU plans: https://www.linode.com/pricing/#row--compute)

ssivark5y ago

GPUs are typically useful for training (due to massive parallelism), but not for inference.

bravura5y ago

For example, I am not aware that one can currently use your library to implement Wavenet, other audio generative models like Wavegrad, or transformers.

Keep up the good work.

DSingularity5y ago

Yummy. Thanks. Gonna bookmark that one.

yongjik5y ago

Hmm, my experience is the opposite. When I used Tensorflow, there was no way I could figure out why something was slow or required huge amounts of memory. All I had was a gigantic black box.

* Disclaimer: Haven't dabbled in ML for ~a year, so my view might be outdated now.

whimsicalism5y ago

That was difficult to reason about.

atorodius5y ago

Wouldn't this be fixed by CUDA_LAUNCH_BLOCKING=1? Or by putting a bunch of torch.cuda.synchronize() calls around the suspected lines?

jstrong5y ago

I'm a theano diehard, and I'll never get over how google came along, introduced a shittier version of theano, garnered worldwide acclaim for it, and killed the better library in the process.

alevskaya5y ago

bravura5y ago

I am one of the authors of the Theano work. I am happy to hear that the Theano project is now being maintained again.

cmarschner5y ago

dr_zoidberg5y ago

Here's my take about TF (in general, not particularly 1.x or 2.x):

Like many things from Google, I always had the impression that the library, while better than alternatives at the time, is too tailored to Google use cases. And if you fall outside of them, bad luck.

Could there be a better way? Perhaps. But we have to ship models and TF "just* works" (with a big asterisk, yeah).

bravura5y ago

I recently used TF 1.0 (former Theano author, current PyTorch user) and found TF 1.0 to be hellaciously difficult to grok and seemed to include a lot of unnecessary abstractions.

bravura5y ago

I will say that I am very excited by the tftorch.py effort from @sillysaurusx: https://twitter.com/theshawwn/status/1311925180126511104

The idea being that pytorch can just be a high-level API executing lower-level tensorflow under the hood.

prideout5y ago

Are these libraries ever useful in non-deep learning applications? It sounds like Theano is a bit more general purpose, but why would I ever need it outside of a deep learning context?

I wonder if it could be used for something crazy, e.g. setting up a graph that generates shadertoy-like images on the GPU.

6gvONxR4sf7o5y ago

They are. Lots of numerical code benefits from GPU and lots of numerical code benefits from derivatives. Simulations, solvers, numerical optimization, good old fashioned statistics.

physicsyogi5y ago

https://en.wikipedia.org/wiki/Differentiable_programming

timkpaine5y ago

nerdponx5y ago

Interesting library & idea, almost like its own programming paradigm when you abstract away all the specificity for building software or running ETL jobs or whatever.

But this is a completely different kind of graph. The graphs being discussed here are differentiable DAGs of mathematical computations.

tcpekin5y ago

We use them for computational imaging reconstruction in electron microscopy.

Const-me5y ago

I wonder does any of them have proper Windows support, i.e. DirectCompute?

cygaril5y ago

Seems to have missed the existence of jax.jit, which basically constructs an XLA program (call it a graph if you like) from your Python function which can then be optimized.

JHonaker5y ago

In the section titled JAX:

> But JAX even lets you just-in-time compile your own Python functions into XLA-optimized kernels...

nestorD5y ago

The author gives that quote (from the JAX documentation) but does not seem to internalize it, as his conclusion says:

It is exactly what JAX does. There is a computational graph in JAX (it's encoded in XLA and specified with their NumPy-like syntax); it is built once, optimized, and then runs on the GPU.

easde5y ago

TorchScript JIT (torch.jit.script) is similar for PyTorch.

komuher5y ago

PoignardAzur5y ago

Some specific questions:

> They provide ways of specifying and building computational graphs

> Almost all tensor computation libraries support autodifferentiation in some capacity (either forward-mode, backward-mode, or both).

What are those?

From the wikipedia article, it sounds like autodifferentiation basically means running f(x+dx)-f(x), but if there are entire frameworks handling it, then there's probably something fancier going on.

So how can that stuff run on the GPU? It sounds like there would be a lot of branching code.

And how is that related to machine learning / neural networks?

dangirsh5y ago

Related: The Simple Essence of Automatic Differentiation - Conal Elliott

- https://www.youtube.com/watch?v=ne99laPUxN4

- https://arxiv.org/abs/1804.00746
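On the f(x+dx)-f(x) question: automatic differentiation is not a finite difference. A minimal sketch of forward mode, assuming a hypothetical `Dual` class written just for this illustration (not any library's API), carries each value together with its derivative and applies the chain rule operation by operation. The libraries discussed in the post mostly rely on reverse mode for training, which applies the same chain rule in the opposite order.

```python
import math


class Dual:
    """A value paired with its derivative (forward-mode AD)."""

    def __init__(self, val, dot=0.0):
        self.val = val
        self.dot = dot

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (uv)' = u'v + uv'.
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

    __rmul__ = __mul__


def sin(x):
    if isinstance(x, Dual):
        # Chain rule: d/dx sin(u) = cos(u) * u'.
        return Dual(math.sin(x.val), math.cos(x.val) * x.dot)
    return math.sin(x)


def f(x):
    return x * x + 3 * sin(x)


def derivative(func, x):
    # Seed the input with derivative 1 and read off the output's derivative.
    return func(Dual(x, 1.0)).dot


x0 = 2.0
h = 1e-6
print("forward-mode AD:  ", derivative(f, x0))
print("closed form:      ", 2 * x0 + 3 * math.cos(x0))
print("finite difference:", (f(x0 + h) - f(x0)) / h)
```

The AD result agrees with the closed form up to floating-point rounding, while the finite difference carries a truncation error that depends on `h`. Every step is ordinary arithmetic on pairs of numbers with no data-dependent branching, which is part of why it maps well onto GPUs. In neural network training the function being differentiated is the loss, and the derivatives with respect to the weights drive gradient descent.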

galaxyLogic5y ago

Why are they called TENSOR computation libraries?

albertzeyer5y ago

I was not aware that the PyMC developers have forked and continued Theano: https://github.com/pymc-devs/Theano-PyMC

It seems very active right now.

Here is some further information: https://pymc-devs.medium.com/the-future-of-pymc3-or-theano-i...

I haven't really found references to its new name "Aesara".

Apparently, the main new feature for Theano will be the JAX backend.

I wonder, though. My experience from working with Theano, including deep in the internals (trying to get further graph optimizations on theano.scan):

- Some parts of the code are not really clean.

- The code is extremely complex and hard to follow. See this: https://github.com/pymc-devs/Theano-PyMC/blob/master/theano/...

- This also made it very complicated to perform optimizations on the graph. See this: https://github.com/pymc-devs/Theano-PyMC/blob/master/theano/...

- Here is one attempt for some optimization on theano.scan: https://github.com/Theano/Theano/pull/3640

- On the other side, the optimizations on the graph are quite nice. You don't really have to care too much when writing code like log(softmax(z)) -- it will optimize it also to be numerically stable.
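To show why that rewrite matters numerically, here is a small plain-Python sketch comparing a literal log(softmax(z)) with the log-sum-exp form. The function names are made up for this example, and the stable version is one standard formulation, not necessarily the exact expression Theano's optimizer produces.

```python
import math


def naive_log_softmax(z):
    # Literal translation of log(softmax(z)).
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [math.log(e / total) for e in exps]


def stable_log_softmax(z):
    # Log-sum-exp trick: subtract the max before exponentiating.
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - lse for v in z]


z = [0.0, -800.0]
print(stable_log_softmax(z))  # [0.0, -800.0]

try:
    naive_log_softmax(z)
except ValueError as err:
    # exp(-800) underflows to 0.0, so the naive version takes log(0).
    print("naive version failed:", err)
```

With array libraries the naive form typically yields `-inf` rather than raising, which then poisons the loss and its gradients; either way the information in the small logit is lost.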

blt5y ago

Any particular example that occurs to you?

albertzeyer5y ago

In Big-O notation, there will not be any difference, because copying the data will just be O(N), and whatever you do in the op will be at least O(N), so no change.

But in absolute terms, it could make a difference. Think of y = x + 1 vs y = x; y += 1. I would expect that the former is slightly faster. But actually I'm not really sure.

btwillard5y ago

Hello, I'm the person spearheading this Theano fork! Your comments match my experience with the old Theano very well, so I have to respond.

> Apparently, the main new feature for Theano will be the JAX backend.

> - The graph building and esp the graph optimizations are very slow. This is because all the logic is done in pure Python. ...

> ... When switching to TensorFlow, building the graph felt almost instant in comparison. ...

In summary, the changes we're most focused on right now are for developers like yourself who have had to deal with the core of Theano, so, please, stop by the fork and help us make a better `Scan`!

albertzeyer5y ago

Hey! Thanks for the answer!

By graph building, I actually meant graph compilation. In TF the first `session.run`, or in Theano the `theano.function`.
