In the end, they just put a dedicated coprocessor directly on their memory chip. They named it AI because of buzzwords and marketing bullshit...
They named it AI because it massively boosts embarrassingly parallel workloads. You can think of Processing In Memory as rendering the mapPartitions() operation free in Spark's MapReduce ML workloads.
Some algorithms like DNA sequencing have a tradeoff between map and reduce [1]: you spend more time generating higher quality matches between the short sequences (map), before sending them for global matching (reduce). And PIM lets you exploit that.
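To make the mapPartitions() analogy concrete, here is a minimal pure-Python sketch of the pattern (not Spark code, and not any vendor's PIM API). Each partition stands in for the data sitting next to one in-memory core. The helper names (split_into_partitions, map_partitions, local_sum_of_squares) are made up for illustration.

```python
from functools import reduce
from typing import Callable, Iterator, List


def split_into_partitions(data: List[int], num_partitions: int) -> List[List[int]]:
    """Split data into contiguous chunks, one per (hypothetical) memory bank."""
    size = max(1, -(-len(data) // num_partitions))  # ceiling division, at least 1
    return [data[i:i + size] for i in range(0, len(data), size)]


def map_partitions(
    partitions: List[List[int]],
    fn: Callable[[Iterator[int]], Iterator[int]],
) -> List[List[int]]:
    """Apply fn to every partition independently.

    No partition needs data from another one, which is what makes this
    step embarrassingly parallel and a candidate for running next to
    the memory that holds the partition.
    """
    return [list(fn(iter(part))) for part in partitions]


def local_sum_of_squares(rows: Iterator[int]) -> Iterator[int]:
    """Per-partition work: collapse a whole partition into one partial result."""
    yield sum(x * x for x in rows)


data = list(range(10))
partitions = split_into_partitions(data, num_partitions=4)

# "Map" side: heavy local work, one small partial result per partition.
partials = map_partitions(partitions, local_sum_of_squares)

# "Reduce" side: only the small partial results have to travel.
total = reduce(lambda a, b: a + b, (v for part in partials for v in part))
print(partials, total)
```

The more work the per-partition function does before yielding, the less data has to move for the reduce step, which is the tradeoff described above.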
For an order of magnitude: the average Intel CPU has about 60 GB/s of RAM bandwidth per socket. 256 GB of UPMEM's RAM lets you have 2.5 TB/s of local bandwidth to a computation unit (to 2560 'dumb' cores @ 400 MHz) [2].
[1] https://www.researchgate.net/publication/346703874_Variant_C...
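As a back-of-the-envelope check on the bandwidth figures quoted above, the sketch below works out the ratios. It assumes decimal units (1 TB = 1000 GB) and that a single socket would have to stream the whole 256 GB, which is a simplification. The variable names are illustrative only.

```python
# Figures taken from the comment above.
CPU_BANDWIDTH_GBPS = 60.0     # RAM bandwidth per Intel socket
PIM_BANDWIDTH_GBPS = 2500.0   # 2.5 TB/s aggregate local bandwidth
PIM_CORES = 2560              # 'dumb' in-memory cores
DATASET_GB = 256.0            # total PIM-enabled RAM

# Aggregate bandwidth advantage of the in-memory cores over one socket.
speedup = PIM_BANDWIDTH_GBPS / CPU_BANDWIDTH_GBPS

# Local bandwidth available to each individual in-memory core.
per_core_gbps = PIM_BANDWIDTH_GBPS / PIM_CORES

# Time for one full pass over the data when purely bandwidth-bound.
cpu_scan_s = DATASET_GB / CPU_BANDWIDTH_GBPS
pim_scan_s = DATASET_GB / PIM_BANDWIDTH_GBPS

print(f"aggregate bandwidth ratio: {speedup:.1f}x")
print(f"per-core local bandwidth:  {per_core_gbps:.2f} GB/s")
print(f"full scan, CPU socket:     {cpu_scan_s:.2f} s")
print(f"full scan, in-memory:      {pim_scan_s:.4f} s")
```

So each core only sees about 1 GB/s, and the gain comes entirely from having thousands of them working on local data at once.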
It strikes me these processors would be most helpful in pre-multiplies, filter operations, and perhaps for scatters. All that stuff is not just relevant to tensorflow / pytorch stuff but also databases. While I’m sure the “AI” labeling is pure marketing, I’d imagine Samsung would love to target workloads beyond deep learning training and inference.
(Meaning, relatively small chiplet AI processors with ram stacked on top of them.)
The reason is that as the precision used for the coefficients has gone down, the relative energy cost of computing on them has become a rounding error compared to the cost of moving data to the ALUs. And in AI there is very little cost to distributing the processing power across many small chips that are relatively far away from each other.
[1] O. Mutlu, S. Ghose, J. Gomez-Luna, R. Ausavarungnirun, A Modern Primer on Processing in Memory. https://arxiv.org/abs/2012.03112
32-bit multiplication: 3 pJ [1]
The energy savings come from not transporting data.
[1] http://www.sigmod2014.org/damon/slides/picojoule.kozyrakis.p...
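A rough sketch of why the savings come from data movement rather than arithmetic. Only the 3 pJ multiplication figure comes from the comment above. The DRAM fetch energy is an assumed placeholder, not a number from this thread, so substitute a measured value for your own platform.

```python
MULT_ENERGY_PJ = 3.0  # 32-bit multiplication, as quoted above

# ASSUMPTION: placeholder energy for fetching one 32-bit operand from
# off-chip DRAM. Not from this thread; replace with a measured figure.
ASSUMED_DRAM_FETCH_PJ = 640.0


def energy_per_op_pj(compute_pj: float, movement_pj: float) -> float:
    """Total energy for one operation: arithmetic plus operand movement."""
    return compute_pj + movement_pj


def movement_share(compute_pj: float, movement_pj: float) -> float:
    """Fraction of the per-operation energy spent moving data."""
    return movement_pj / energy_per_op_pj(compute_pj, movement_pj)


share = movement_share(MULT_ENERGY_PJ, ASSUMED_DRAM_FETCH_PJ)
print(f"data movement share of energy: {share:.1%}")
```

Under that assumption the multiply itself is well under 1% of the budget, so cutting the distance the operand travels matters far more than a cheaper multiplier.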
https://ieeexplore.ieee.org/document/9073325
Edit:
A more accessible (in both senses) survey paper on Near-Memory Computing:
Tip: if you contact the author(s) of a paper that is of interest to you and ask for a version of it, there's a good chance that they'll gladly accommodate. I think generally authors don't even have any financial benefit if you pay for the paper (it all goes to the publisher).
(obligatory scihub reference)
Hackable mainly because of the nature of neural networks - their architecture matters.
> Vastly more capable systems
I interpret this as specialized silicon that's mass produced? I urge you to remember how much academics and hobbyists gain from having FPGAs around, despite their relative bulkiness and mediocre parameters.
>Circuit and design techniques are presented for enhancing the performance and reliability of a 3-D-stacked high bandwidth memory-2 extension (HBM2E). A data-bus window extension technique is implemented to cope with reduced clock cycle time ranging from data-path architecture, through-silicon via (TSV) placement, and TSV-PHY alignment. A power TSV placement in the middle of array and at the chip edge along with a dedicated top metal for power mesh improves power IR drop by 62%. An on-die ECC (OD-ECC) scheme featuring a self-scrubbing function is designed to be orthogonal to system ECC. An uncorrectable bit error rate (UBER) is improved by 10^5 times with the proposed OD-ECC and scrubbing scheme. A memory built-in self-test (MBIST) block supports low-frequency cell and core test in a parallel manner and all channel at-speed operation with adjustable ac parameters. The proposed parallel-bit MBIST reduces test time by 66%. A 16-GB HBM2E fabricated in the second generation of 10-nm class DRAM process achieves a bandwidth up to 640 GB/s (5 Gb/s/pin) and provides a stable bit-cell operation at a high temperature
None of the items in the abstract has anything to do with AI.
> Rapidly evolving artificial intelligence (AI) technology, such as deep learning, has been successfully deployed in various applications, such as image recognition, health care, and autonomous driving. Such rapid evolution and successful deployment of AI technology have been possible owing to the emergence of accelerators, such as GPUs and TPUs, that have a higher data throughput.
Edit: You might be right, I peeked into the ISSCC programme looking for something relevant from Samsung, and they are presenting a paper titled "A 20nm 6GB Function-In-Memory DRAM, Based on HBM2 with a 1.2TFLOPS Programmable Computing Unit Using Bank-Level Parallelism, for Machine Learning Applications". However, there is a lot of overlap in paper authors, so I'd imagine it's the same team.
But seriously though, it seems to answer an ancient techie question of mine: since we're strobing memories millions or billions of times per second, couldn't they be doing more than storage with all those clocks?
It might open the door to more sexy error correction or caching.
However, I read the title as: "We couldn't think of anything good about the product, so we added a fashionable buzzword."
Same comments mostly say this has nothing to do with AI.