Samsung Develops Industry’s First High Bandwidth Memory with AI Processing Power

(news.samsung.com)

82 points · shihab · 5y ago · 38 comments

greatgib · 5y ago · 10 in thread

Does not look very fancy or innovative.

In the end, they just put a dedicated coprocessor directly with their memory chip. They named it AI because buzzword and marketing bullshit...

BenoitP · 5y ago

> They named it AI because buzzword and marketing bullshit...

They named it AI because it massively boosts embarrassingly parallel workloads. You can think of Processing In Memory as rendering the mapPartitions() operation free in Spark's MapReduce ML workloads.

Some algorithms like DNA sequencing have a tradeoff between map and reduce [1]: you spend more time generating higher quality matches between the short sequences (map), before sending them for global matching (reduce). And PIM lets you exploit that.

For an order of magnitude: the average Intel CPU has about 60 GB/s of RAM bandwidth per socket. 256 GB of UPMEM's RAM gives you 2.5 TB/s of local bandwidth to a computation unit (to 2560 'dumb' cores @ 400 MHz) [2].

[1] https://www.researchgate.net/publication/346703874_Variant_C...

[2] https://www.upmem.com/technology/
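
To put the figures quoted above side by side, here is a minimal back-of-the-envelope sketch. It uses only the numbers from the comment (60 GB/s per socket, 2.5 TB/s aggregate, 2560 cores); the decimal unit convention (1 TB = 1000 GB) and the function names are assumptions made for illustration, not anything taken from UPMEM's documentation.

```python
# Numbers quoted in the comment above; decimal units are an assumption.
CPU_SOCKET_GBPS = 60.0          # approximate RAM bandwidth per Intel socket
PIM_AGGREGATE_GBPS = 2.5 * 1000  # 2.5 TB/s of local bandwidth, in GB/s
PIM_CORES = 2560                 # number of 'dumb' in-memory cores


def bandwidth_ratio(pim_gbps: float, cpu_gbps: float) -> float:
    """How many times more aggregate bandwidth the PIM cores see."""
    return pim_gbps / cpu_gbps


def per_core_gbps(aggregate_gbps: float, cores: int) -> float:
    """Local bandwidth available to each in-memory core."""
    return aggregate_gbps / cores


if __name__ == "__main__":
    ratio = bandwidth_ratio(PIM_AGGREGATE_GBPS, CPU_SOCKET_GBPS)
    each = per_core_gbps(PIM_AGGREGATE_GBPS, PIM_CORES)
    print(f"aggregate bandwidth ratio: {ratio:.1f}x")
    print(f"bandwidth per in-memory core: {each:.2f} GB/s")
```

Under these assumptions the aggregate figure is roughly 40 times the per-socket bandwidth, while each individual core only sees about 1 GB/s, which is why the gain shows up on embarrassingly parallel work rather than on a single thread.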

choppaface · 5y ago

I feel like push-down operations might be a better analogy from the mapreduce world?

It strikes me these processors would be most helpful in pre-multiplies, filter operations, and perhaps for scatters. All that stuff is not just relevant to TensorFlow / PyTorch stuff but also databases. While I’m sure the “AI” labeling is pure marketing, I’d imagine Samsung would love to target workloads beyond deep learning training and inference.

ksec · 5y ago

Well, PIM or Processing In Memory or Computational Memory [1] isn't new. The question is what exactly they put in the memory, and what the target speedup is in that specific domain. This PR provides embarrassingly little detail.

[1] https://en.wikipedia.org/wiki/Computational_RAM

why_Mr_Anderson · 5y ago

embarrassingly parallel workloads - so...GPU?

Tuna-Fish · 5y ago

AI is one of the very few domains where processing-in-memory makes sense. This particular system is a little slapdash, but I feel strongly that in the very near future, this is the only architecture that will be used for AI.

(Meaning, relatively small chiplet AI processors with ram stacked on top of them.)

The reason for that is that as the precision used for the coefficients has gone down, the relative energy cost of doing computation on them has turned into a rounding error when compared to the cost of moving data to the ALUs. And in AI there is very little cost to distributing the processing power across many small chips that are relatively far away from each other.

krona · 5y ago

By AI you mean matrix multiplication on single/half-precision floats? You could be a bit more specific.

captain_price7 · 5y ago

AI is likely to be the biggest beneficiary of this architecture. The on-memory processing chips are likely to be simpler than the CPU ones (i.e. more akin to GPU cores), and allow parallelism, both of which point to numerical processing and AI.

jabberwcky · 5y ago

I like the idea of flipping DIMMs to get capacity and processing improvements, and also the thought that mass-produced memory with this tech could potentially significantly reduce the cost of AI hardware through commoditization.

jjcon · 5y ago

I don’t think many people outside of AI will have a use for this (at least at first) so why would you market it any other way?

smolder · 5y ago

Many computing tasks are both highly parallel and dependent on memory bandwidth. I think it'd be useful for almost any of them.

jabberwcky · 5y ago · 5 in thread

Dubious energy savings claims, but sounds like potentially awesome tech. Looking forward to their slides/paper next week.

GregarianChild · 5y ago

Another recent primer on in-memory / near-memory computing is [1]. Upmem [2] is also selling memory with on-board compute. A space that is slowly hotting up!

[1] O. Mutlu, S. Ghose, J. Gomez-Luna, R. Ausavarungnirun, A Modern Primer on Processing in Memory. https://arxiv.org/abs/2012.03112

[2] https://www.upmem.com/

BenoitP · 5y ago

Transport uncached 32 bits from RAM: 650 pJ

32-bit multiplication: 3 pJ [1]

The energy savings come from not transporting data.

[1] http://www.sigmod2014.org/damon/slides/picojoule.kozyrakis.p...
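
The arithmetic behind that claim can be sketched in a few lines. The two energy figures are the ones quoted in the comment; the helper name and the model of "one fetch per multiply" are assumptions made for illustration and ignore caches, operand reuse, and every other cost.

```python
# Energy figures quoted in the comment above, in picojoules.
TRANSPORT_PJ = 650.0  # moving 32 uncached bits from RAM
MULTIPLY_PJ = 3.0     # one 32-bit multiplication


def transport_share(n_fetches: int, n_multiplies: int) -> float:
    """Fraction of total energy spent moving data rather than computing."""
    moved = n_fetches * TRANSPORT_PJ
    computed = n_multiplies * MULTIPLY_PJ
    return moved / (moved + computed)


if __name__ == "__main__":
    # One fetch costs as much as this many multiplications.
    print(f"fetch / multiply energy ratio: {TRANSPORT_PJ / MULTIPLY_PJ:.0f}x")
    # Simplified model: every multiply needs one uncached 32-bit fetch.
    print(f"share of energy spent on transport: {transport_share(1, 1):.1%}")
```

With these numbers a single uncached fetch costs about as much as 200 multiplications, so in this simplified model well over 99% of the energy goes to data movement.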

shihab (OP) · 5y ago

A relevant paper from '19 (Behind paywall) -

https://ieeexplore.ieee.org/document/9073325

Edit:

A more accessible (in both senses) survey paper on Near-Memory Computing:

https://arxiv.org/abs/1908.02640.pdf

virgilp · 5y ago

> Behind paywall

Tip: if you contact the author(s) of a paper that is of interest to you and ask for a version of it, there's a good chance that they'll gladly accommodate. I think generally authors don't even have any financial benefit if you pay for the paper (it all goes to the publisher).

jabberwcky · 5y ago

Thanks

(obligatory scihub reference)

loa_in_ · 5y ago · 2 in thread

I speculate that the eventual ideal goal to strive towards will be RAM strip-to-strip processing: taking all of one module's data, feeding one layer, and dumping the results into the next module. The individual layers would be accessible for both read and write as ordinary RAM.

plutonorm · 5y ago

This is a great step and all, but shouldn't we be a little more adventurous? A unified understanding of computation and thermodynamics has the potential to enable systems that are vastly more capable. We are piddling around in the shallow end making incremental improvements. A few billion thrown in novel directions could reap extraordinary rewards.

https://arxiv.org/abs/1911.01968

loa_in_ · 5y ago

I understand where you're coming from, but I would rather see modular and stackable pieces affordable by ordinary users and hackable for power users.

Hackable mainly because of the nature of neural networks - their architecture matters.

> Vastly more capable systems

I interpret this as specialized silicon that's mass produced? I urge you to remember how much academics and hobbyists gain from having FPGAs around, despite their relative bulkiness and mediocre parameters.

tmotwu · 5y ago · 2 in thread

Paper with more details: https://ieeexplore.ieee.org/document/9240974

ksec · 5y ago

Is this really the same thing? I don't have an account so I couldn't read the whole thing.

> Circuit and design techniques are presented for enhancing the performance and reliability of a 3-D-stacked high bandwidth memory-2 extension (HBM2E). A data-bus window extension technique is implemented to cope with reduced clock cycle time ranging from data-path architecture, through-silicon via (TSV) placement, and TSV-PHY alignment. A power TSV placement in the middle of array and at the chip edge along with a dedicated top metal for power mesh improves power IR drop by 62%. An on-die ECC (OD-ECC) scheme featuring a self-scrubbing function is designed to be orthogonal to system ECC. An uncorrectable bit error rate (UBER) is improved by 10^5 times with the proposed OD-ECC and scrubbing scheme. A memory built-in self-test (MBIST) block supports low-frequency cell and core test in a parallel manner and all channel at-speed operation with adjustable ac parameters. The proposed parallel-bit MBIST reduces test time by 66%. A 16-GB HBM2E fabricated in the second generation of 10-nm class DRAM process achieves a bandwidth up to 640 GB/s (5 Gb/s/pin) and provides a stable bit-cell operation at a high temperature

None of the items in the abstract has anything to do with AI.

tmotwu · 5y ago

Ah right it's behind a paywall, sorry. The introduction opens with:

> Rapidly evolving artificial intelligence (AI) technology, such as deep learning, has been successfully deployed in various applications, such as image recognition, health care, and autonomous driving. Such rapid evolution and successful deployment of AI technology have been possible owing to the emergence of accelerators, such as GPUs and TPUs, that have a higher data throughput.

Edit: You might be right, I peeked into the ISSCC programme looking for something relevant from Samsung, and they are presenting a paper titled "A 20nm 6GB Function-In-Memory DRAM, Based on HBM2 with a 1.2TFLOPS Programmable Computing Unit Using Bank-Level Parallelism, for Machine Learning Applications". However, there is a lot of overlap in paper authors, so I'd imagine it's the same team.

spacemanmatt · 5y ago · 1 in thread

Now that's what I call edge computing.

But seriously though, it seems to answer an ancient techie question of mine: Since we're strobing memories millions/billions of times per second, couldn't they be doing more than storage with all those clocks?

artemonster · 5y ago

It's always a trade-off: you have to balance the raw area of packed memory cell rows against all the "support" fluff: column precharge circuitry, readout buffers, address decoders, etc. I am also unsure whether non-uniform random access times would break some of the abstractions about RAM as well. In NAND flash that sort of page-bank parallelism is integrated, since operations are slow.

pulse7 · 5y ago · 1 in thread

Can you use this processing-in-memory (PIM) to perform garbage collection in memory? (Like in this article from RISC-V board member Krste Asanovic: https://people.eecs.berkeley.edu/~krste/papers/maas-isca18-h...)

tgtweak · 5y ago

I think you would still need to interpret the output outside of the PIM, so it wouldn't be an on-DIMM or universal system by any means.

It might open the door to more sexy error correction or caching.

phendrenad2 · 5y ago · 1 in thread

Aren't there companies already putting CPUs in RAM? This isn't anything new.

tmotwu · 5y ago

Not the first PIM; it claims to be the industry's first HBM-PIM for DL/ML. It's actually a practical use case for hardware DL.

dekhn · 5y ago · 1 in thread

All of computing is an exercise in moving compute closer to the data.

tromp · 5y ago

That's a futile exercise if the computation involves repeatedly combining random bits of data.

nottorp · 5y ago

Hmm, the HN comments say that it's kinda interesting.

However, I read the title as: "We couldn't think of anything good about the product, so we added a buzzword in fashion."

Same comments mostly say this has nothing to do with AI.
