
0 points · zackmorris · 4y ago · 0 comments

Well, this all goes back to when I was heavily into C++, assembly and blitters in the mid to late 90s when I was trying to run a shareware game business. I realized almost immediately that the real bottleneck in games is memory bandwidth, not processing power. This was right at the time that Quake III came out and everyone was trying to get a Voodoo2, I think it was? CPUs with FPUs had only gone mainstream maybe 5 years before that, and people were still arguing about Pentium vs 486 DX4. I was on Mac, but I don't think I even had a PowerPC yet.

Then everyone got video cards and CPU performance stopped improving almost overnight. Sure, we got 200 MHz Pentiums, and then Intel jumped at warp speed to 1 GHz and then 2 GHz and then 3 GHz... but single-threaded performance per clock cycle wasn't any faster, and even today it is only maybe 3 times faster than it was then. What really happened is that all of the chip area went to branch prediction and caching.

When chips went from a few million transistors to a billion, I started asking why we couldn't just put dozens or hundreds of the old CPU cores on the new chips. As we all saw though, nobody listened or cared about that. So today we have behemoth chips that still choke when the web browser has a lot of tabs open.

Chips today have maybe 8 or 16 cores, and that's great. But it's 2 orders of magnitude less than the transistor budget could support. Apple's M1 is loosely trying to do what I'm asking. But it's making the mistake of having all of these proprietary/dedicated cores for SIMD stuff. I would scrap all of that, and go with a 2D array of general-purpose cores, each with its own local memory, communicating using web metaphors like content-addressable memory.

In fairness, I think the reason that real multicore CPUs never caught on is that we didn't have the languages to utilize them. But today we have MATLAB and various Lisps and higher-order methods that auto-parallelize loops by treating them as transformations on arrays. All of our languages should have been auto-parallelized by now anyway. And not with SIMD optimization magic: I mean by statically analyzing code and converting it all first into higher-order methods, then optimizing that intermediate code (I-code) so that the block copies are spread over multiple cores and memories. I can't remember the term for this, but it's basically divide and conquer, for example if fork/join scope was limited to a single function by the runtime. Scatter/gather and map/reduce are other terms for this.
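
As a rough illustration of the rewrite described above, here is a minimal Python sketch, not anything a real compiler emits. It takes an ordinary loop and re-expresses it as scatter, map and reduce, with the fork/join confined to a single function. The names `sum_of_squares_loop`, `split` and `sum_of_squares_fork_join` are made up for this example, and a thread pool stands in for the array of cores (in CPython it shows the structure rather than a real speedup).

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from operator import add


def sum_of_squares_loop(values):
    """The ordinary single-threaded loop a programmer would write."""
    total = 0
    for v in values:
        total += v * v
    return total


def split(values, parts):
    """Scatter: divide the input into at most `parts` contiguous chunks."""
    size = max(1, -(-len(values) // parts))  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def sum_of_squares_fork_join(values, workers=4):
    """The same computation as map + reduce over chunks.

    The fork and the join both happen inside this one function, so the
    caller still sees what looks like a single thread of execution.
    """
    chunks = split(values, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Fork: each chunk is handed to a worker running the plain loop.
        partials = list(pool.map(sum_of_squares_loop, chunks))
    # Join: gather the partial results and reduce them to one value.
    return reduce(add, partials, 0)
```

The caller would use `sum_of_squares_fork_join(data)` exactly like the plain loop. The idea in the post is that a compiler or runtime would derive the second form from the first automatically instead of the programmer writing it by hand.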

So right now we have to deal with promises and async and other patterns (I consider patterns an anti-pattern) when we could just be using an ordinary language like JavaScript or C, auto-parallelized to run on 256+ cores with something like terabytes per second of bandwidth, running many thousands of times faster than computers today, for far less effort because it appears as a single thread of execution. Then OpenCL or OpenGL or anything else could run like any other library above that, for people who prefer a higher-level interface.


klelatti · 4y ago

Hi, thanks for the extensive reply - a lot to digest and reflect on!

First of all, I think I broadly agree with the direction of your argument. In the early 2000s the decision was made to focus on single-core performance and SIMD extensions rather than embrace a massively multicore future. I guess Intel got burned by Itanium and decided that 100% compatibility with existing software was essential.

I think that road has run out now. Single-core performance improvements have slowed and big SIMD is dying (hello AVX-512!). Desktop core counts are stuck, but on the server you can use 128-core EC2 instances. How long before this appears in a box on your desk?

Massively multicore GPUs have taken over ML, but having tried to use GPUs for general-purpose computing, I found there are huge issues, e.g. the overhead in transferring data and limited GPU memory sizes. The good news is that if you use the right tools, you can use, say, OpenCL and write code that runs on both CPU and GPU and takes advantage of increasing core counts on both.

So I think we’re on the cusp of a change: much higher CPU core counts and developers having the tools to make use of those cores.

A couple of PSs:

It will be interesting to see whether someone tries putting lots of simple in-order cores on a single die (I think there are early RISC-V attempts at this).

The transputer in the 1980s was an early experiment in massively multicore CPU systems.

The Arm team knew early on that memory bandwidth was key and focused on that with the ARM1 (and were rejected by Intel when they asked for a higher-bandwidth x86 core). The rest is history!
