Tensor Comprehensions (opens in new tab)

(research.fb.com)

195 pointssmhx8y ago59 comments

59 comments

49 comments · 12 top-level

wrs8y ago· 12 in thread

>produce the high-performance codes that the machine learning community needs

Somewhat OT, but I've been wondering for a long time… Is the HPC community the only place the word "codes" is used like this? In usual CS parlance programming is done using a substance called "code" ("the high-performance code the community needs"), but in HPC literature the word "codes" is used, as if programming consisted of distinct objects. Does this arise from some divergent history (would I have called my LINPACK library punched card deck a "code"?) or what?

grandmczeb8y ago

Code is generally considered a mass noun among software engineers, but "codes" is pretty commonly used by academics, especially in other disciplines. In particular, physicists and mathematicians seem to use it pretty frequently, so that might explain why some in the HPC community use it as well.

I've also noticed that it seems more common among Europeans, but that might be just personal experience.

ngvrnd8y ago

Also as in "simulation codes". I associate it with physicists and Fortran programs, rightly or wrongly.

stochastic_monk8y ago

"Codes" tends to refer to error-correcting codes or encoding formats in my experience, while code refers to source or assembly code for machine execution.

One is a concept/idea/format while the other is a set of instructions.

r00fus8y ago

code:codes::math:maths?

1 more reply

fdej8y ago

Yes, this is just jargon specific to the HPC / numerical analysis field.

username2238y ago

And not without meaning. HPC people hold themselves to higher standards than the average Joes throwing stuff at the currently-fashionable "continuous integration server" until the lights turn green. "Codes" are often things that have been proven, written on a legal pad, typed into the machine, and finally validated on a set of test problems. Could you prove that your program has second-order numerical stability? https://en.wikipedia.org/wiki/Numerical_stability

1 more reply

HumanDrivenDev8y ago

Whenever I read or that, I assume it was a language issue. As a native English speaker "codes" sounds like a mistake.

evincarofautumn8y ago

It’s common in Indian English, as in “I have written a code to do the needful, but I have a doubt about it that perhaps you could answer.” Outside that dialect I’d call it an error.

bllguo8y ago

I've seen and heard 'codes' in scientific computing, but nowhere else. Seems to support your hypothesis

DoofusOfDeath8y ago

In computational acoustics, I've seen "codes" used more commonly, especially by those who started out in the punchcard days.

nicovasilache8y ago

punchcards .. days .. ouch that's painful :)

ribs8y ago

I started working with HPC users three years ago and yes, they're the only ones who used "codes" that way.

At least, the only ones in the last 30 years, I'm figuring - there is a reason we call it "code", and it's an old one!

phaedrus8y ago· 6 in thread

This web page is also the first I've heard of Halide and Polyhedral Compilation. This is exciting to me because I've been working on relational (database) data and logic comprehensions, and in a case of convergent evolution Halide looks a lot like my notation and Polyhedral Compilation looks much like diagrams I've been drawing on my whiteboard. Where can I learn more on this?

ozinenko8y ago

The paper has a couple of useful references.

Otherwise, we have a site with general information on polyhedral compilation http://polyhedral.info/ and Halide has its own site http://halide-lang.org/

richardlethin8y ago

The TC paper uses techniques that are in the R-Stream compiler. You can get more information about the R-Stream compiler at https://www.reservoir.com/research/publications/?cat=35 - the first paper there also validates the use of polyhedral compilation for deep learning last year. I particularly recommend the article about R-Stream in the Encyclopedia of Parallel Compilation, David Padua (ed). https://www.reservoir.com/publication/r-stream-compiler-2/

richardlethin8y ago

The R-Stream Encyclopedia article talks about how to raise to the polyhedral model from SSA; this is no problem especially for the super regular computations in Tensor computations used in deep learning. If you get in touch with me I can also get you a copy of the 2008 paper about R-Stream that described it: https://www.reservoir.com/publication/final-report-r-stream-...

By raising from C you don't need to use or learn a new language. Or one can generate C from a succinct notation or framework, and polyhedral optimize from there. That is what was in the R-Stream-TF paper.

aray8y ago

Is this paper published yet? The `Article` link doesn't go anywhere, and Google Scholar doesn't know where to find a copy.

richardlethin8y ago

Also, we're hiring: https://www.reservoir.com/company/careers/

nicovasilache8y ago

@phaedrus take a look at Weld from Stanford then, may also be useful to you. Edit: in addition to what @ozinenko pointed to of course..

falcor848y ago· 4 in thread

Could someone please explain how this compares to the TensorFlow approach? I can only assume that it's omitted from the article due to marketing reasons.

davesque8y ago

My understanding is that Tensor Comprehensions provides a way to automatically generate optimized CUDA code for algorithms written in a high-level language that more closely mirrors the notation used in mathematical formulas. So you could use it to automatically find more optimal low-level implementations for components used in libraries such as PyTorch and TensorFlow which usually call out to hand-written low-level implementations.

p1esk8y ago

How can it be both general, and fast? For example, CuDNN ops are fast because they are very specialized and highly tuned. Is convolution written with TC going to be as fast as CuDNN convolution?

Or, if TC's strength is in its generality, then what are the advantages over something like CuPy for Chainer?

Can someone give an example where TC shines?

2 more replies

smhxOP8y ago

Tensor Comprehensions are not a deep learning framework, but a way to write new optimized operators for deep learning. Hence, Tensor Comprehensions can be integrated with TensorFlow.

liuliu8y ago

There is XLA though: https://www.tensorflow.org/performance/xla/

But I guess unlike TC / Halide / TVM, that is less generic and much less customization.

tehsauce8y ago· 4 in thread

I'm a fan of evolutionary algorithms, but are they really effective enough here to be comparable to an engineer tuning code? They might be able to find a good configuration of a few canned options but real optimization often requires some creativity or at least an understanding of the hardware. Will certainly be interesting to see this in practice!

nicovasilache8y ago

Hello, evolutionary algorithms by themselves I am not sure and don't have enough experience with atm. In our particular context the key is the intermediate representation on which the evolution happens and the compiler behaviors it triggers. We are still far from competitive against highly tuned expert code in computation-bound regimes but we do see >50% shared memory BW peak usage in nvprof in multiple cases. This first iteration is aimed at addressing the immediate research productivity needs: no need to write hundreds of lines of framework integration and low-level kernels to get a reasonable CUDA performance. We think we are helping remove the first immediate bottleneck that are very frustrating in practice to ML researchers when their layer does not translate exactly in supported, optimized, vendor libraries.

ozinenko8y ago

The crucial part is the polyhedral optimizer which does indeed include several GPU-specific heuristics (multilevel parallelization, coalescing, etc) and specialization to tensor sizes. Evolutionary autotuner is used to tweak the parameters of the optimizer. As a result, TC can beat cublas and cudnn on certain networks; details in the report.

p1esk8y ago

What would be a relationship between TC and something like CuPy?

1 more reply

ychen3068y ago

Hand-tuning code doesn't scale to the kind of networks people write.

cgmg8y ago· 2 in thread

Related: http://tensor-compiler.org/codegen.html. This converts an expression in tensor index notation into executable code.

ozinenko8y ago

Sure, it is one of the works we cite. It seems to be mostly targeted at sparse computations and does not have GPU support.

Tensor Comprehensions does not try to manage memory and thus can be integrated into DL frameworks easily.

fredrikbk8y ago

That is correct (and we appreciate the citation). The tensor compiler (taco) has focused on compiling expressions that contain one or more sparse tensors so far and, even though it can generate code for dense expressions just fine, does not optimize these like TC, TCE, and XLA does. We are working on a scheduling language for it so that it will perform well across all types of formats and on GPUs.

pathsjs8y ago· 2 in thread

I could not find how to integrate this into a C++ program. Does it need ATen or is there a lower level of integration? Is there even the possibility of getting C bindings?

ozinenko8y ago

Tensor Comprehensions does not try to own memory allocation and CPU/GPU transfers. ATen is one simple way of getting that, which we used for tests. Anything convertible to DLPack tensors should work as long as nothing fancy happens with tensor shapes.

C bindings don't seem to be a priority.

Feel free to use the contacts provided in the documentation here: https://facebookresearch.github.io/TensorComprehensions/cont....

pathsjs8y ago

It makes sense to let client code handle allocations and whatnot. What I cannot find is how to pass one or more tensors (given, I suppose, by some shape parameters and a pointer to the data buffer) to Tensor Comprehensions.

I would have expected that operations expressed in the tensor language could be compiled once and for all (for a given target) into a DLL, and then it would be just a matter of passing the right buffer and shape parameters to a function. I see no reason why this should not be easily handled in C (which makes it easier to bind it from most languages).

If I understand correctly, DLPack is the format of choice to express a tensor, and ATen is not required, but I cannot find examples using DLPack

tomrod8y ago· 2 in thread

Neat!

This needs Python bindings, stat!

priy22018y ago

for clarity, we provide very "basic" python bindings. For example checkout https://facebookresearch.github.io/TensorComprehensions/mapp... for simple example of how we expose mapping options to python.

Also TC doesn't do data allocation itself and it requires tensor library from users to do that. So if you are using Caffe2 for example, you could use the TcOp that we ship inside TensorComprehensions for Caffe2 and using caffe2 pybindings, you can already write TC in python. No work needed for creating python bindings. We welcome PRs on this :)

As for other tensor libraries, like ATen based on TH* used in torch/pytorch, you have the ability to create tensor but these require integrating tensor library into TC and writing pybindings for them. PRs welcome on this as well.

nicovasilache8y ago

Python bindings are in there :) This needs a tensor library callable from python that works on GPUs. One direction we are going towards is PyTorch via ATen / Torch tensors; we already use the C++ parts of ATen. Of course any other CUDA tensor library with minimal alloc/copy/synchronize would work too. Send a PR? ;)

2 more replies

jabl8y ago· 1 in thread

Slightly resembles the Tensor Contraction Engine for quantum chemistry/physics (http://www.csc.lsu.edu/%7Egb/TCE/ ). Although it predates this by a couple of decades.

nicovasilache8y ago

Hello, Tensor Comprehensions absolutely use techniques that have been existing for a few years (Halide) or many decades (polyhedral model, Einstein notation(century old?), ...). The TCE is definitely also a motivational prior work which also uses the polyhedral model for optimizing loop nests. What we tried to achieve here, is a solid research tool to make a subset of underlying optimizations algorithms usable in practice by non-experts. One such optimization algorithm is described in our joint 2011 PoPL paper with authors of TCE (Loop transformations: convexity, pruning and optimization).

alexbeloi8y ago· 1 in thread

The most surprising thing to me is that they can parameterize a nontrivial section of the implementation space of a function, or that such a section exists that hasn't been optimized away by the compiler.

nicovasilache8y ago

The thing is that compilers usually quickly go to SSA form and it is not the best IR to optimize loops in. Then you fight it to extract a high-level IR and this little process makes it very easy to lose high-level information. We don't do this so we begin from a friendlier starting point.

amelius8y ago· 1 in thread

I think tensors are an overly crude way to model things. It's like computer science went back to the 60s and replaced all data structures by homogeneous blocks of memory.

Edit: of course a computer works best with blocks of memory; that doesn't mean a human developer should have the same view. As a simple example, think of the output vector of a classifier. Why is it a vector, and not a structure? Or think of the internals of an LSTM network; there is more structure in there than just tensors.

PeterisP8y ago

Tensors are the way to effectively execute things - modeling things is about the "human interface" to the data, but when you need to do stuff with very, very, very large quantities of it at a good performance, then you'd want the system to transform the data and desired operations from your "human-friendly" model to a "machine-friendly" model, and that's generally going to be homogenous blocks of memory, and "vectorizing" processing as much as possible so you have less or no item-specific logic but instead have matrix operations that do the same thing to many data items at once in parallel.

It's just like with OOP in game programming where performance matters - even if you want a nice object model for programmer convenience, you'd also want to ensure that you can store the object data into a homogeneous array sequentially in memory, as that gets you a major performance impact; I seem to recall that Carmack had an in-detail article about that some time ago, but can't easily find it.

charlescearl8y ago· 1 in thread

Out of curiosity, is there any overlap with objectives or performance of accelerate https://github.com/AccelerateHS/accelerate?

ozinenko8y ago

Tensor Comprehensions is mostly targeted at arithmetics operations that appear in DL workloads, and the notation strives to be usable for DL experts. Polyhedral optimizer is oriented towards imperative languages, so we won't end up doing the same optimizations. Really hard to compare. The spirit of making parallel programming simpler is common :)

grondilu8y ago· 1 in thread

From the documentation on arxiv:

> Variables not defined anywhere, implicitly become index variables.

That seems like a bold choice. Wasn't there a trend in programming languages, even very high level ones, to encourage variable declaration?

nicovasilache8y ago

This is one reason I also prefer to call TC a notation personally. We can't allocate and declare inside TC, that may change in the future but for now we went for the easiest entry point to Halide and polyhedral IR we could think of. You can lower simple C loops or other real languages in those IRs too but the programs are so much terser in TC that we have come to expect terseness.

j / k navigate · click thread line to collapse

59 comments

49 comments · 12 top-level

wrs8y ago· 12 in thread

>produce the high-performance codes that the machine learning community needs

grandmczeb8y ago

I've also noticed that it seems more common among Europeans, but that might be just personal experience.

ngvrnd8y ago

Also as in "simulation codes". I associate it with physicists and Fortran programs, rightly or wrongly.

stochastic_monk8y ago

"Codes" tends to refer to error-correcting codes or encoding formats in my experience, while code refers to source or assembly code for machine execution.

One is a concept/idea/format while the other is a set of instructions.

r00fus8y ago

code:codes::math:maths?

1 more reply

fdej8y ago

Yes, this is just jargon specific to the HPC / numerical analysis field.

username2238y ago

1 more reply

HumanDrivenDev8y ago

Whenever I read or that, I assume it was a language issue. As a native English speaker "codes" sounds like a mistake.

evincarofautumn8y ago

It’s common in Indian English, as in “I have written a code to do the needful, but I have a doubt about it that perhaps you could answer.” Outside that dialect I’d call it an error.

bllguo8y ago

I've seen and heard 'codes' in scientific computing, but nowhere else. Seems to support your hypothesis

DoofusOfDeath8y ago

In computational acoustics, I've seen "codes" used more commonly, especially by those who started out in the punchcard days.

nicovasilache8y ago

punchcards .. days .. ouch that's painful :)

ribs8y ago

I started working with HPC users three years ago and yes, they're the only ones who used "codes" that way.

At least, the only ones in the last 30 years, I'm figuring - there is a reason we call it "code", and it's an old one!

phaedrus8y ago· 6 in thread

ozinenko8y ago

The paper has a couple of useful references.

Otherwise, we have a site with general information on polyhedral compilation http://polyhedral.info/ and Halide has its own site http://halide-lang.org/

richardlethin8y ago

aray8y ago

Is this paper published yet? The `Article` link doesn't go anywhere, and Google Scholar doesn't know where to find a copy.

richardlethin8y ago

Also, we're hiring: https://www.reservoir.com/company/careers/

nicovasilache8y ago

@phaedrus take a look at Weld from Stanford then, may also be useful to you. Edit: in addition to what @ozinenko pointed to of course..

falcor848y ago· 4 in thread

Could someone please explain how this compares to the TensorFlow approach? I can only assume that it's omitted from the article due to marketing reasons.

davesque8y ago

p1esk8y ago

How can it be both general, and fast? For example, CuDNN ops are fast because they are very specialized and highly tuned. Is convolution written with TC going to be as fast as CuDNN convolution?

Or, if TC's strength is in its generality, then what are the advantages over something like CuPy for Chainer?

Can someone give an example where TC shines?

2 more replies

smhxOP8y ago

Tensor Comprehensions are not a deep learning framework, but a way to write new optimized operators for deep learning. Hence, Tensor Comprehensions can be integrated with TensorFlow.

liuliu8y ago

There is XLA though: https://www.tensorflow.org/performance/xla/

But I guess unlike TC / Halide / TVM, that is less generic and much less customization.

tehsauce8y ago· 4 in thread

nicovasilache8y ago

ozinenko8y ago

p1esk8y ago

What would be a relationship between TC and something like CuPy?

1 more reply

ychen3068y ago

Hand-tuning code doesn't scale to the kind of networks people write.

cgmg8y ago· 2 in thread

Related: http://tensor-compiler.org/codegen.html. This converts an expression in tensor index notation into executable code.

ozinenko8y ago

Sure, it is one of the works we cite. It seems to be mostly targeted at sparse computations and does not have GPU support.

Tensor Comprehensions does not try to manage memory and thus can be integrated into DL frameworks easily.

fredrikbk8y ago

pathsjs8y ago· 2 in thread

I could not find how to integrate this into a C++ program. Does it need ATen or is there a lower level of integration? Is there even the possibility of getting C bindings?

ozinenko8y ago

C bindings don't seem to be a priority.

Feel free to use the contacts provided in the documentation here: https://facebookresearch.github.io/TensorComprehensions/cont....

pathsjs8y ago

If I understand correctly, DLPack is the format of choice to express a tensor, and ATen is not required, but I cannot find examples using DLPack

tomrod8y ago· 2 in thread

Neat!

This needs Python bindings, stat!

priy22018y ago

nicovasilache8y ago

2 more replies

jabl8y ago· 1 in thread

Slightly resembles the Tensor Contraction Engine for quantum chemistry/physics (http://www.csc.lsu.edu/%7Egb/TCE/ ). Although it predates this by a couple of decades.

nicovasilache8y ago

alexbeloi8y ago· 1 in thread

nicovasilache8y ago

amelius8y ago· 1 in thread

I think tensors are an overly crude way to model things. It's like computer science went back to the 60s and replaced all data structures by homogeneous blocks of memory.

PeterisP8y ago

charlescearl8y ago· 1 in thread

Out of curiosity, is there any overlap with objectives or performance of accelerate https://github.com/AccelerateHS/accelerate?

ozinenko8y ago

grondilu8y ago· 1 in thread

From the documentation on arxiv:

> Variables not defined anywhere, implicitly become index variables.

That seems like a bold choice. Wasn't there a trend in programming languages, even very high level ones, to encourage variable declaration?

nicovasilache8y ago

j / k navigate · click thread line to collapse