“The new top system, Fugaku, turned in a High Performance Linpack (HPL) result of 415.5 petaflops, besting the now second-place Summit system by a factor of 2.8x. Fugaku is powered by Fujitsu’s 48-core A64FX SoC, becoming the first number one system on the list to be powered by ARM processors. In single or further reduced precision, which are often used in machine learning and AI applications, Fugaku’s peak performance is over 1,000 petaflops (1 exaflops). The new system is installed at RIKEN Center for Computational Science (R-CCS) in Kobe, Japan.
(I'm not an HPC guy; this is just my opinion.)
The only US-owned state-of-the-art fabs in the US belong to Intel. Intel survives because they have a high margin on x86 CPUs. Today, TSMC announced 5nm, and the top supercomputer is ARM-based.
Apple seems to be going ARM. Chromebooks are ARM. Microsoft now offers Windows on ARM, on the Surface Pro X. Mobile never used x86. x86 is on the way out. What's left for Intel?
(Micron is still a major force in DRAM, amazingly.)
Is the "US owned" clarification to exclude Global Foundries' New York fab? :D
> Chromebooks are ARM
Maybe half of them.
> What's left for Intel?
Fabricating others' designs like TSMC?
But also, Intel isn't going away any time soon, just not being a monopoly anymore.
Technology: 28 nm and 14 nm. 7 nm planned. However, in August 2018, GlobalFoundries made the decision to suspend 7 nm development and planned production, citing the unaffordable costs to outfit Fab 8 for 7 nm production. GlobalFoundries held open the possibility of resuming 7 nm operations in the future if additional resources could be secured.
So, not a state of the art fab. Couldn't afford to keep up.
Intel has reported a record quarter every quarter for the last 2-3 years.
Fabless semiconductors are doing better than ever - nVidia, AMD, Apple, Qualcomm, Google TPU etc.
Most of the high performance ARM SoCs come from Apple and Qualcomm, both American companies.
Due to high demand: people needed 30%+ more processing power after they lost ~30% of it to the Spectre/Meltdown mitigations. When Intel gets manufacturing and supply back up to a reasonable standard, they'll start building inventory and selling cheaper chips again. Margin and ASP will decrease, and investors will check out.
According to https://en.wikipedia.org/wiki/Usage_share_of_operating_syste..., about 80% to 90% of the desktop and laptop computer market (counting the share of Windows + Linux devices in these categories). Intel won't starve.
Expect to see these numbers change drastically over the next years, as Zen 2 finally turned the ship around by making AMD CPUs not just competitive but, in many cases, the straight-up better, and still more affordable, choice.
Which is already reflected in current trends: barely any consumer-level hardware outlet still recommends Intel builds, down to the lack of PCIe 4.0 support and the fact that only very expensive Intel CPUs can outperform AMD CPUs in fringe use cases like single-core gaming performance, while still demanding a hefty price premium.
A premium that many people are simply not willing to pay anymore.
As a small data point, just look at the top 10 CPUs on price-comparison websites, like the German pcgameshardware [0]: 8 of the top 10 are AMD.
Which does not mean that Intel will starve, but it very much puts them in the position AMD has been in these past years: the underdog fighting an uphill battle to regain relevancy in the consumer sector.
That others can compete directly on price. Intel can probably do it technically, but will not have the margins they had with x86.
Are you forgetting Amazon's Graviton? And Nuvia seems to be in a good position to dethrone Apple in best ARM CPU race too.
The only thing that stops them is their will to do it.
Well I'd like to see them go all in on Desktop Linux.
But then I'm a dreamer.
I really like the NUC products. They have the knowledge and skills to do everything in house (CPU, RAM, SoC, radio/wifi/cell (or used to), storage) except perhaps industrial design (maybe they do that too). But I have never personally had a software experience from Intel that I remotely enjoyed.
Edit: Actually there is software I have used from Intel that I have really enjoyed, the BIOS for the NUC. So I stand corrected.
Jimmy Kimmel Can You Name a Country? strikes again https://www.youtube.com/watch?v=kRh1zXFKC_o
There's a lot that's non-mainstream in this, as with K, and it's partly influenced by the K experience. Unusually, it's all apparently specifically designed for the job, from the processor to the operating system (only partly GNU/Linux). Notably, despite the innovation, it should still run anything that can reasonably be built for aarch64 straight off and use the whole node, even if it doesn't run particularly fast; contrast GPU-based systems. (With something like simde, you may even be able to run typical x86-specific code.) However, the amount of memory per core is surprising -- even less than Blue Gene/Q -- and I wonder how that works out for the large-scale materials science work for which it's obviously prepared. Also note Fujitsu's attention to reliability, though the oft-quoted theory of failure rates in exascale-ish machines was obviously wrong; otherwise, as the Livermore CTO said, he'd be out of a job.
The bad news for anyone potentially operating a similar system in a university, for instance, is that the typical nightmare proprietary software is apparently becoming available for it...
I think it's a limitation of the technology. HBM2 provides amazing bandwidth, but capacity is quite limited. And it's not like DIMM slots where you can just insert more of them, the memory chips are bonded to the substrate chiplet-style.
This is very similar to high-end GPU's which also use HBM2 memory, e.g. NVIDIA A100 has 40 GB.
ARM has been making major strides in the high-performance area. The new AWS Graviton processors are pretty nice from what I have heard. And then there's ARM in the Mac. Yup, and Julia will run on all of these!
While I say all of this, I should also point out that the top500 benchmark pretty much is not representative of most real-life workloads, and is largely based on your ability to solve the largest dense linear solve you possibly can - something almost no real application does.
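For a sense of scale, HPL is essentially this dense solve pushed as large as the machine allows. A single-node toy version (the problem size here is an arbitrary choice, not anything from the benchmark rules):

```python
import time
import numpy as np

# Back-of-envelope HPL-style run: solve a dense Ax = b and estimate flop rate.
n = 2000
rng = np.random.default_rng(0)
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)

t0 = time.perf_counter()
x = np.linalg.solve(A, b)   # LU factorization + triangular solves
elapsed = time.perf_counter() - t0

flops = (2 / 3) * n**3      # leading-order flop count for LU
print(f"~{flops / elapsed / 1e9:.1f} GFLOP/s on this node")

# Sanity check: the solution actually satisfies Ax = b.
assert np.allclose(A @ x, b)
```

The real benchmark distributes this factorization across the whole machine, but the kernel being timed is the same.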
(The website is down, so I haven't been able to look at the specs of the actual machine).
https://www.fujitsu.com/global/about/resources/news/press-re...
>A64FX is the world's first CPU to adopt the Scalable Vector Extension (SVE), an extension of Armv8-A instruction set architecture for supercomputers. Building on over 60 years' worth of Fujitsu-developed microarchitecture, this chip offers peak performance of over 2.7 TFLOPS, demonstrating superior HPC and AI performance.
I wonder what BLAS they are using, and if the contributions are open sourced.
They also publish the HPCG benchmark, with sparse matrices, and, unsurprisingly, flops are an order of magnitude lower across the board. The Fujitsu chip scales a whole lot better than the usual Nvidia GPUs, though.
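A rough sketch of why sparse kernels post far lower flop numbers: a sparse matrix-vector product does only a couple of flops per byte fetched, so it is bandwidth-bound rather than compute-bound. The matrix below is a made-up 1-D stencil for illustration, not anything from HPCG:

```python
import numpy as np
from scipy import sparse

n = 100_000
# 1-D Poisson-like tridiagonal matrix in CSR format (illustrative stencil)
A = sparse.diags([-1, 2, -1], offsets=[-1, 0, 1], shape=(n, n), format="csr")
x = np.ones(n)

y = A @ x                        # sparse mat-vec, the core HPCG-style kernel
flops = 2 * A.nnz                # one multiply + one add per stored nonzero
bytes_moved = A.nnz * (8 + 4)    # float64 value + int32 column index, roughly
print(f"arithmetic intensity ~ {flops / bytes_moved:.2f} flops/byte")
```

Compare that fraction of a flop per byte with dense matmul, where intensity grows with problem size; this is why memory bandwidth, not peak flops, decides HPCG standings.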
Sounds right.
I was going to say what about large-scale optimization problems? But I realized that most typically only require sparse linear solves.
Newton-type methods do require the solution of dense Ax=b systems, but plain gradient descent doesn't, and the most visible/popular application of large-scale gradient methods today, neural networks, typically uses SGD, which requires no dense linear solves at all.
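A minimal sketch of that point about SGD, on a least-squares problem with made-up sizes: every step is just matrix-vector products and a subtraction, with no Ax=b solve anywhere.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 10))
b = A @ np.arange(10.0)                 # ground-truth weights 0..9, no noise

w = np.zeros(10)
lr = 0.01
for step in range(2000):
    i = rng.integers(0, 1000, size=32)  # random minibatch of 32 rows
    grad = A[i].T @ (A[i] @ w - b[i]) / 32
    w -= lr * grad                      # only mat-vecs, never a linear solve
```

With noiseless targets, the iterates converge to the true weights; the relevant point is that nothing here needs an O(n^3) factorization.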
I don't have any sense of how much these cost to manufacture. There ought to be a market for an A64FX-based rackmount server system. If the price isn't outrageous, I'd love to see these sold as an SBC.
IIRC the extension was specifically designed by Fujitsu for their needs, so it is not surprising.
That benchmark consists of sparse matrices, which are a much more realistic depiction of HPC workloads. The A64FX seems to scale a lot better with irregular access patterns than the Nvidia GPUs in the other systems do.
Edit: Turns out I was right. Maybe this link is better.
https://www.anandtech.com/show/15869/new-1-supercomputer-fuj...
https://www.hotchips.org/hc30/2conf/2.13_Fujitsu_HC30.Fujits...
Edit: seems to be a Fujitsu designed interconnect [0]. Wonder how much of the overall performance is dependent on the difference in communication.
https://www.fujitsu.com/global/documents/solutions/business-...
This delivers a huge boost in bandwidth.
HBM stands for High Bandwidth Memory. It offers up to 900 GB/s.
Now add the Tofu interconnect on top and you have a system finely tuned for maximizing data movement.
Remember: compute is cheap, communication is expensive.
You can have loads of GPUs and processors, but if you can't feed them data fast enough, they are useless.
"Report on the Fujitsu Fugaku (富岳) System", Jack Dongarra, Jun 22, 2020
https://www.dropbox.com/s/aqntdb43p6so0z5/fugaku-report.pdf?...
At least with the top500 benchmark, bandwidth is not a problem as long as you can run a large enough problem. It's a linear solve that spends all its time doing matmul (n^3 operations on n^2 data), so once the problem is big enough you can saturate the cores.
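That n^3-on-n^2 argument can be made concrete with a back-of-envelope arithmetic-intensity estimate. The hardware figures in the comment below are invented for illustration:

```python
# Why HPL becomes compute-bound for large n: an n x n solve does O(n^3) flops
# on O(n^2) data, so arithmetic intensity grows linearly with n.
def arithmetic_intensity(n: int) -> float:
    flops = (2 / 3) * n**3       # leading-order LU flop count
    bytes_moved = 8 * n * n      # the float64 matrix itself, ideal-cache case
    return flops / bytes_moved   # flops per byte

# A machine with, say, 1 TB/s of bandwidth and 10 TFLOP/s of compute needs
# intensity above 10 flops/byte to be compute-bound (numbers are made up).
for n in (1_000, 100_000):
    print(n, round(arithmetic_intensity(n), 1))
```

Even at modest sizes the intensity clears any realistic flops-to-bandwidth ratio, which is exactly why HPL flatters machines that sparse workloads would not.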
Car manufacturers need ridiculously expensive race cars to push the technology to get advancements in everyday car technology. Similarly, Top500 is one way to push the technology for not just computationally but also things like better power, heat and network management in processors and computers in general.
With server farms ever doubling, heat management of these systems may become a bigger contributor to environmental pollution than all vehicle exhaust. Rather than just spending on renewable sources of energy for these farms, it makes sense to optimize the energy consumed per processor. Hoping to see advancements in this area.
In my opinion, the next ARM/Intel will be the company that does energy-efficient processors.
No, extra heat just radiates away into space, nothing to worry about. The problem is in the GHG emissions to generate the energy to run (and yes, cool) these systems.
> it makes sense to optimize energy consumed per processor.
Yes, absolutely. Even if you don't care about the environment, power consumption limits chip performance. Doing more work with less power is key to producing faster chips.
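The classic first-order reason: CMOS dynamic power scales as C·V²·f, and running at a higher frequency usually demands a higher voltage too, so power grows much faster than linearly with clock speed. The numbers below are illustrative, not from any datasheet:

```python
# First-order CMOS dynamic power model: P = C * V^2 * f.
def dynamic_power(cap_farads: float, volts: float, hertz: float) -> float:
    return cap_farads * volts**2 * hertz

base = dynamic_power(1e-9, 1.0, 3e9)    # a hypothetical 3 GHz core
# +20% frequency, with the voltage bump stability typically requires:
turbo = dynamic_power(1e-9, 1.2, 3.6e9)
print(f"{turbo / base:.2f}x the power for 1.2x the speed")
```

This superlinear cost is why "more work per watt" (wider vectors, more cores, better memory systems) beats "more GHz" once you hit the power wall.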
Here's an idea, maybe we should just turn the moon into a giant data center?
We also have more power options on Earth. On the Moon you'd only really have solar power available and then only two weeks a month. On Earth you've got insolation for half a day every day and the option of other renewable sources.
This is all besides the phenomenal cost of building data centers on the Moon vs Earth.
(When I wrote this comment, I was seeing 500 status code errors when trying to load the page)
- Slides on Fugaku: https://www.fujitsu.com/global/Images/supercomputer-fugaku.p...
- SVE 512 instructions for armv8: https://www.fujitsu.com/global/Images/armv8-a-scalable-vecto...
ARM was a British company until it was bought by SoftBank a couple of years ago.
That video is from end of 2018... which makes it even more amazing.
[1]: https://trends.google.com/trends/explore?date=all&q=top500
Pretty much a GPU-like design with the ARMv8 instruction set bolted on so it can run its own OS.
PS This is meant to be sarcastic.
This seems unlikely unless Apple has decided to sell their chips for the first time ever.
I suppose it could be interesting from Apple's PR perspective, but I have serious doubts that the supercomputer owner would agree to this. What happens to them when Apple discontinues the current chip in favor of the next one?
https://www.nextplatform.com/2019/11/22/arm-supercomputer-ca...
> The number nine system on the Green500 is the top-performing Fugaku supercomputer, which delivered 14.67 gigaflops per watt. It is just behind Summit in power efficiency, which achieved 14.72 gigaflops/watt.
The writing was on the wall that RISC would win, but the x86 juggernaut appeared unbeatable.
What do you mean by "win" exactly? RISC is just an architectural choice and means nothing on its own. For reference, Google's TPUs, which - according to Google - deliver 30-80x better performance per Watt than contemporary CPUs, use a CISC design instead [1].
This whole "RISC vs. CISC" nonsense is quite inane, given that it's a design choice that's highly application-specific.
It's even debatable whether the A64FX can be considered a "pure" RISC design, considering the inclusion of SVE-512 and its 4 unspecified "assistant cores" [2]...
[1] https://cloud.google.com/blog/products/gcp/an-in-depth-look-...
[2] https://www.fujitsu.com/jp/Images/20180821hotchips30.pdf
Now, although Linpack is a better evaluation metric for a supercomputer than simply totaling up # of processors and RAM size, it's still a very specific benchmark of questionable real-world utility; people like it because it gives you a score, and that score lets you measure dick-size, err, computing power. It also, if you're feeling unscrupulous, lets you build a big worthless Linpack-solving machine which generates a good score but isn't as good for real use (an uncharitable person might put Roadrunner https://en.wikipedia.org/wiki/Roadrunner_(supercomputer) in this category)
Combine this with the fact that many applications are limited by network throughput rather than by CPU/SSD/RAM/PCIE, and performance becomes a hard thing to quantify even in terms of "how many ARM cores do i need to buy to make my CPU not be the bottleneck"
There are benchmarks for ARM linux compilation and ARM openjdk performance benchmarks which are a good start, but I don't know how to compare SKUs between those ARMs and the ones found in top500 supercomputers.
But the more interesting question for me is: on an embarrassingly parallel workload, how does Amazon’s full infrastructure compare to these top machines? I’d assume that Amazon keeps that a secret.
Either way you slice it, Amazon likely owns at least an order of magnitude more FLOPS than any single system on the top500. What they presumably don't have is the low latency interconnects, etc., needed for traditional supercomputing.
[1] https://datacenterfrontier.com/amazon-approaches-1-gigawatt-...
[2] https://www.eenews.net/stories/1060048034
[3] https://sustainability.aboutamazon.com/sustainable-operation...
A major part of what makes these machines special is their interconnect. Fujitsu is running a 6D torus interconnect with latencies well into the sub-microsecond range. The special sauce is the ability of cores to interact with each other with extreme bandwidth at extremely low latency.
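One way to see why high-dimensional tori are attractive: the worst-case hop count (diameter) of a torus is the sum of floor(k/2) over its dimensions, so spreading the same node count across more dimensions keeps worst-case hops low. The shapes below are made up for illustration, not Tofu-D's actual geometry:

```python
# Diameter (worst-case hop count) of a torus with the given per-dimension sizes.
def torus_diameter(dims):
    return sum(k // 2 for k in dims)

# Same rough node count, very different worst-case distances:
print(torus_diameter([152_064]))           # one giant 1-D ring: 76,032 hops
print(torus_diameter([8, 8, 8, 8, 8, 6]))  # hypothetical 6-D torus: 23 hops
```

Fewer worst-case hops means tighter bounds on collective-operation latency, which is what all-to-all-heavy HPC codes actually feel.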
What software would one use to distribute some workload between these two nodes, what would the latency and bandwidth be bottlenecked by (the network connection?), and what other key statistics would be important in measuring exactly how this cheap $400 (used) setup compares on price/watt/FLOP with top500 machines?
If you connect a lot of cloud instances to act as a giant distributed computing cluster, they'll receive/share/return data via network interfaces or, worse yet, the internet; this is far slower than what supercomputer interconnects do.
For many applications that solution would be more efficient than a supercomputer, but for applications that need a supercomputer it would be inefficient. It just depends on what you need to do, but in any case it would be a computing cluster not a supercomputer.
(my two cents, I’m not into that field)
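A toy strong-scaling model makes that gap concrete: per-step time is local compute plus synchronization cost. All numbers below are illustrative guesses, not measurements of any real system:

```python
# Per-step time = work / (p * per-node flops) + messages * latency.
def step_time(p, work_flops, flops_per_node, msgs_per_step, latency_s):
    return work_flops / (p * flops_per_node) + msgs_per_step * latency_s

work = 1e12   # flops per simulation step (made up)
node = 1e11   # flops/s per node (made up)
for name, lat in [("gigabit Ethernet", 50e-6), ("HPC interconnect", 1e-6)]:
    t = step_time(1024, work, node, msgs_per_step=100, latency_s=lat)
    print(f"{name}: {t * 1e3:.2f} ms/step")
```

With chatty, tightly-coupled workloads the latency term dominates as you add nodes, which is why a pile of cloud instances and a supercomputer with the same total flops behave so differently.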
"""The supercomputer developed for OpenAI is a single system with more than 285,000 CPU cores, 10,000 GPUs and 400 gigabits per second of network connectivity for each GPU server. Compared with other machines listed on the TOP500 supercomputers in the world, it ranks in the top five, Microsoft says."""
Each of the big three cloud providers could, if they chose to, build a #1 Top500 computer using what they have available (that includes the CPUs, GPUs, and interconnect). That said, it's unclear why they would: the profit would be lower than if they sold the equivalent hardware, without the low-latency interconnect, to "regular" customers. The supercomputer business isn't obscenely profitable.
- Moore's law is dead at the level of the transistor
- Architecture, HPC updates will keep coming for many years into the future
- AGI has already escaped Moore's law (i.e., development of a fully functional AGI will not be constrained by lack of Moore's law progress). And that's what really matters.
- Related note on AGI: it has escaped the data problem as well (as in we have the right kind of sensors: mainly cameras, microphones, and so on). That is, according to the categorization of AGI challenges in terms of hardware, data, algorithms, the only missing piece is the right set of algorithms.
- Somewhere between 2015 and 2025, multiple individual groups will have cracked the AGI problem independently. (But 2015 is in the past, which means there are likely groups out there that have cracked the problem and are keeping it a secret.)
- AGI-in-the-basement scenario is very doable and has been or will be done, many times over.
Ethernet may not be approaching Infiniband in raw speed and latency, but I think it's doing pretty decently with 10 Gbps going to every node.
Ethernet networks are definitely much more competitive today than 10 years ago, when Infiniband already had cheap 20 Gbps network cards but 10 Gbps Ethernet card were expensive and the network switches were a rarity.
TOP500 is not very important, but it's not meant to be much more than a simple benchmark that correlates roughly with real performance. Supercomputers aren't designed just to be number 1 on TOP500, it's a byproduct of their actual goal.
I'd expect that in the long run there will be only a few data center operators.