One of the reasons the M1 is good is, pure and simple, that it has a pretty enormous transistor budget; it's not solely because it's ARM.
It's also very hard to achieve more than 4X parallelism (though I think Ice Lake got 6X at some additional cost) in decode, making instruction level parallelism harder. X86's hack to get around this is SMT/hyperthreading to keep the core fed with 2X instruction streams, but that adds a lot more complexity and is a security minefield.
Last but not least: ARM's looser default memory model allows for more read/write reordering and a simpler cache.
ARM has a distinct simplicity and low-overhead advantage over X86/X64.
Furthermore, the high-performance ARM designs, starting with the Cortex-A77, started using the same trick---the 6-wide execution happens only when instructions are being fed from the decoded macro-op cache.
What percent of the die is an ARM instruction decoder?
I'm not familiar with how ARM's memory model affects the cache design. Source?
There's a lot of brute force, yes, but it's not the only reason. There are lots of smart design decisions as well.
Plus, most software from the last decade runs on some sort of VM or another (be it the JVM, the CLR, a JavaScript engine or even LLVM).
Soon (in years), x86 will only be needed by professionals that are tied to really old software. And those particular needs will probably be satisfied by decent emulation.
For example, the M1 has 128-bit-wide memory. This has been standard for decades on the desktop (dual channel), but unheard of in cellphones. The M1 also has similar amounts of cache to the new AMD and Intel chips, but that's several times more than the latest Snapdragon. Qualcomm also doesn't just design for the latest node. Most of their volume is on cheaper, less dense nodes.
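To put a rough number on the bus-width point: peak theoretical bandwidth is just the bus width in bytes times the transfer rate. A minimal sketch follows; the transfer rates (4266 MT/s for LPDDR4X on an M1-like part, 3200 MT/s for dual-channel DDR4 on a desktop) are my assumptions and do not come from the comment above, and the function name is made up for illustration.

```python
def peak_bandwidth_gb_s(bus_width_bits: int, mega_transfers_per_s: float) -> float:
    """Peak theoretical memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * mega_transfers_per_s / 1000


# Assumed transfer rates, for illustration only.
m1_like = peak_bandwidth_gb_s(128, 4266)       # 128-bit LPDDR4X-4266 (assumption)
desktop_like = peak_bandwidth_gb_s(128, 3200)  # dual-channel DDR4-3200 (assumption)
narrow_bus = peak_bandwidth_gb_s(64, 4266)     # hypothetical 64-bit bus, same rate

print(f"{m1_like:.1f} GB/s, {desktop_like:.1f} GB/s, {narrow_bus:.1f} GB/s")
```

Halving the bus width at the same transfer rate halves the peak figure, which is why the width matters as much as the memory type.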
The M1 isn't necessarily a win for Arm in general. Other manufacturers weren't competing before and it's yet to be seen if they will.
Currently Apple is the only company making performance-competitive ARM cores that can make a reasonable justification for an architecture switch.
Otherwise AMD's CPUs are still ahead of everyone else, including all other ARM CPU cores not made by Apple. And even Intel is still faster in places where performance matters more than power efficiency (eg, desktop & PC gaming)
upd: oh also in the HPC world, Fujitsu with the A64FX seems to be like the best thing ever now
But it can't just be competitive; it needs to be significantly better for the consumer space to care. Nobody is going to run Windows on ARM just to get equivalent performance to Windows on X86, especially not when that means most apps will be worse. That's what's really impressive about the M1, and so far is unique to Apple's ARM CPUs.
> oh also in the HPC world, Fujitsu with the A64FX seems to be like the best thing ever now
A64FX doesn't appear to be a particularly good CPU core; rather, it's a SIMD powerhouse. It's the AVX-512 problem: when you can use it, it can be great. But you mostly can't, so it's mostly dead weight. Obviously in the HPC space this is a different scenario entirely, but that's not going to translate to consumer space at all (and it's not an ARM advantage, either: 512-bit SIMD hit consumer space via x86 first with Intel's Rocket Lake).
but one factor that you can replicate is colocating memory, CPU, and GPU, the system-on-chip architecture. that's what Nvidia looks to be going after with Grace, and I'm sure they've learned lessons from their integrated designs e.g. Jetson. very excited to see how this plays out!
Not really, they are still just using the same ARM ISA as everyone else. The only hardware/software integration magic of the M1 so far seems to be the x86 memory model emulation mode, which others could definitely replicate.
> but one factor that you can replicate is colocating memory, CPU, and GPU, the system-on-chip architecture.
AMD introduced that in the x86 world back in 2013 with their Kaveri APU ( https://www.zdnet.com/article/a-closer-look-at-amds-heteroge... ), and it's been fairly typical since then for on-die integrated GPUs on all ISAs.
The current fastest supercomputer uses ARM.
More broadly, as to why the ISA doesn't make a big difference: the major differences are at the microarchitecture level, since OoO processors have such flexible dataflow machinery in them that you can kind of view the frontend as compiler technology. x86 and ARM are decades-old ISAs that have seen many, many rounds of iteration in the form of added instructions and even backwards-incompatible reboots at the 64-bit transition points, so most hindrances have been fixed.
In the olden days ISAs were important because processors were orders of magnitude simpler, and instructions were processed as-is very statically (to the point that microarchitectural artifacts like branch delay slots were enshrined in some ISAs). This meant that, e.g., the complexity of individual instructions could be a bottleneck to how fast a chip could be clocked. Or in CISC land your ISA might have been so complex that the CPU was a microcoded implementation of the ISA and didn't have any hardwired fast instructions...
The near future. A few years out, RISC-V is gonna change everything.
The magic of Apple's M1 comes from the engineers who worked on the CPU implementation and the TSMC process.
The architecture has some impact on performance, but I think it is simplicity and ease of implementation that factor most into how well it can perform (as per the RISC idea). In that sense Intel lags for small, fast and efficient processors because their legacy architecture pays a penalty for decoding and translation (into simpler ops) overhead. Eventually designs will abandon ARM for RISC-V for similar reasons as well as financial ones.
Really, today it's a question of who has the best implementation of any given architecture.
I have no idea what Apple's plans for the M1 chip are, but if they had manufacturing capacity, they could put oodles of these chips into datacenters and workstations the world over and basically eat the x86 high-performance market. The fact that the chip uses so little power (15W) means they can absolutely cram them into servers where CPUs can easily consume 180W. That means 10x the number of chips for the same power, and not all concentrated in one spot. A lot of very interesting server designs are now possible.
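Taking the numbers in the comment above at face value (15 W per M1, 180 W for a big server CPU), the arithmetic works out slightly better than 10x. A quick sketch; it ignores everything else in the power budget (memory, networking, conversion losses), which is an oversimplification on my part, and the function name is made up for illustration.

```python
def chips_per_budget(budget_watts: float, chip_watts: float) -> int:
    """How many chips of a given power draw fit in a fixed power budget."""
    if chip_watts <= 0:
        raise ValueError("chip_watts must be positive")
    return int(budget_watts // chip_watts)


# Figures quoted in the comment: 15 W per M1, 180 W per server CPU.
m1_per_server_cpu = chips_per_budget(180, 15)
print(m1_per_server_cpu)
```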
With Nvidia buying Arm and producing their own chipsets, that's no small advantage for companies that are not Nvidia (or Apple, who have a perpetual license already). If I were Intel, that's what I'd be looking at right now. Same, perhaps, for AMD. The clock is ticking on their x86-only strategy, and it takes time to develop new architectures, even if you do license somebody else's instruction set.
A counter argument to this would be software compatibility. Most of the porting effort to make Linux, Windows, and macOS run on Arm already happened years ago. It's a mature software ecosystem. Software is actually the hardest part of shipping new hardware architectures. Without that, hardware has no value.
And a counter argument to that is that Apple is showing instruction set emulation actually works reasonably well: it is able to run x86 software at reasonable performance on the M1. So, running natively matters less these days. If you look at QEMU, they have some interesting work going on around, e.g., an emulated GPU, where the goal is not to emulate some existing GPU but to create a virtual-only GPU device called Virgil 3D that can run efficiently on just about anything that supports OpenGL. Don't expect to set fps records, of course. The argument here is that the software ecosystem is increasingly easy to adapt to new chip architectures, as a lot of stuff does not require access to bare metal. Google uses this strategy with Android: native compilation happens (mostly) just in time after you ship your app to the app store.
But a million (est) new general purpose ARM computers hitting the population certainly affects the prioritizing of ARM issues in a bug tracker.
How many compilers didn't support ARM?
While they are using some future ARM core, and I've read rumors that future designs might try to emulate what has made Apple cores successful, we cannot say whether Apple designs will stagnate or continue to improve at the current rate.
There is potential for competition from Qualcomm after their Nuvia acquisition though.
The 40 core Xeon also costs around 10k.
There's rumors that the new iMac will have a 20 core M1 (16+4). I imagine that will be faster than even the top line $10k Xeon.
I have absolutely no doubt apple could put together a server based on the M1 which would wipe the floor with Intel if they wanted to. But I very much doubt they will since it is so far out of their core competencies these days.
I have absolutely no doubt apple could produce a ridiculously good server CPU from the M1. I doubt they will actually do it though.
No, the Apple Silicon chips use the ARM _instruction set_ but they do not use ARM's core design. Apple designs their core in house, much like Qualcomm does with Snapdragon. Both of these companies have an architectural license which allows them to do this.
It may be they don't want to detract from focus on the GPUs for vector computation so prefer a CPU without much vector muscle.
Also interesting that they're picking up an arm core rather than continuing with their own design. Something to do with the potential takeover (the merged company would only want to support so many micro-architectural lines)?
It's all greenfield and growing so far, they'll win more by having the very best products they can make on both sides.
There was no information whether it will have any good SVE2 implementation. On the contrary they insisted only on the integer performance and on the high-speed memory interface.
I'd suspect NVidia would be using the V1 here as it's the higher performing core, but there's no way to be certain.
"E" is efficiency, N is standard, V is high-speed. IIRC, N is the overall winner in performance/watt. Efficiency cores have the lowest clock speed (overall use the least amount of watts/power). V purposefully goes beyond the performance/watt curve for higher per-core compute capabilities
I doubt anyone really deliberately sets out to be like "haha yessss today I shall elide this woman's credentials", but this is one of those unconscious gender-bias things that is commonplace in our society and is probably best to try and make a point of avoiding.
https://news.cornell.edu/stories/2018/07/when-last-comes-fir...
https://metro.co.uk/2018/03/04/referring-to-women-by-their-f...
(etc etc)
I'd prefer they used "Hopper" instead, in the same way they have chosen to refer to previous architectures by the last names of their namesakes (Maxwell, Pascal, Ampere, Volta, Kepler, Fermi, etc). I'd see that as being more professionally respectful for her contributions.
But yes I very much like the idea of naming it after Hopper.
Vaguely related: J. K. Rowling's "real" full name is Joanne Rowling. The publisher "thought a book by an obviously female author might not appeal to the target audience of young boys".
There's another famous (in the UK at least) computer scientist called Hopper: Andy Hopper. So "G.B.M. Hopper", perhaps? That would have more gravitas than "Andy"!
I guess I'm not sure if "Hopper" refers to the product as a whole (like Tegra) and early leakers misunderstood that, or whether Hopper is the name of the microarchitecture and "Grace" is the product, or if it's changed from Hopper to Grace because they didn't like the name, or what.
Otherwise it's a little awkward to have products named both "grace" and "hopper"...
Unfortunately, at least in most Western societies, using the first names is the only way to refer unambiguously to women.
According to the tradition, in most Western countries the women do not have their own family names, but use either the family name of their father until marriage, or the family name of their husband after that.
So while Grace is the computer scientist, Hopper is her husband and Murray is her father. Using the name Grace makes clear who is honored.
Nowadays, in many places there are laws that allow women to choose their family names or to combine the family names.
Nevertheless, the old tradition is still entrenched, so searching for a certain woman, when the last information about her is many years old, can be difficult due to unpredictable family name changes.
Ideally, a human should keep forever the family name used at birth and the parents should choose one of their family names for the children.
I prefer the Spanish way, have two family names. We have been doing it for centuries, it baffles me that other countries find it so difficult to adopt a similar system.
Speaking in general terms, data rate and transaction rate don't necessarily match because a transaction might require the transmitter to wait for the receiver to check packet integrity and then issue acknowledgement to the transmitter before a new packet can be sent.
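A minimal sketch of the data-rate versus transaction-rate point, assuming the simplest possible protocol (send one packet, wait for the acknowledgement, send the next). The function name and all numbers are made up for illustration and don't describe any particular bus.

```python
def stop_and_wait_throughput(payload_bytes: float,
                             link_bytes_per_s: float,
                             ack_delay_s: float) -> float:
    """Effective payload rate when every packet must be ACKed before the next is sent."""
    transmit_time_s = payload_bytes / link_bytes_per_s
    return payload_bytes / (transmit_time_s + ack_delay_s)


# Hypothetical link: 1 MB/s raw rate, 1000-byte packets, 1 ms wait for each ACK.
raw_rate = 1_000_000
effective = stop_and_wait_throughput(1000, raw_rate, 0.001)
print(f"raw: {raw_rate} B/s, effective: {effective:.0f} B/s")
```

With these numbers the link spends as long waiting as it does sending, so the effective rate is about half the raw data rate.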
Yet another case, again, speaking in general terms, would be the case of having to insert wait states to deal with memory access or other processor architecture issues.
Simple example, on the STM32 processor you cannot toggle I/O in software at anywhere close to the CPU clock rate due to architectural constraints (to include the instruction set). On a processor running at 48 MHz you can only do a max toggle rate of about 3 MHz (toggle rate = number of state transitions per second).
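Working that example through with the figures given (48 MHz clock, about 3 MHz toggle rate) shows how many CPU cycles each pin transition costs. A small sketch with nothing STM32-specific in it; the function names are made up for illustration.

```python
def cycles_per_transition(cpu_hz: float, transitions_per_s: float) -> float:
    """CPU clock cycles spent per I/O state transition."""
    return cpu_hz / transitions_per_s


def square_wave_hz(transitions_per_s: float) -> float:
    """A full square-wave period needs two transitions (high, then low)."""
    return transitions_per_s / 2


cpu_hz = 48_000_000      # 48 MHz core clock, from the comment
toggle_rate = 3_000_000  # ~3 MHz toggle rate, from the comment

print(cycles_per_transition(cpu_hz, toggle_rate))  # cycles per transition
print(square_wave_hz(toggle_rate))                 # square-wave frequency in Hz
```

So each transition costs about 16 clock cycles, and the resulting square wave is about 1.5 MHz.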
PCIe has the optional "relaxed ordering" feature, allowing sending new packets before the ACK has been received from preceding ones. Not sure precisely how this works, if there is some TCP-like window scaling algorithm in play or not.
Wait, Nvidia's been making ARM CPUs for years now; most memorably Project Denver.
But another reason they won't do it is that TSMC has a finite amount of 5nm fab capacity. They can't make more of the chips than they already do.
ARM is more of a toolkit for building different purpose-built computers (you even see them show up in USB sticks), while x86 is a particular ISA that has a long history behind it. So you may see something like 'Amazon builds its own ARM computers'. That means they spun their own boards, built their own toolchains (more likely recompiled existing ones), and probably have their own OS distro to match. Each one of those is a fairly large endeavor. When you see something like 'Amazon builds its own x86 boards', they have shaved off the other two parts of that and are focusing on hardware. That they are building their own means they see the value in owning the whole stack. Also, having your own distro means you usually have to 'own' building the whole thing. I can go grab an x86 gcc stack from my repo provider; they will need to act as the repo owner, build it themselves, and keep up with the patches. Depending on what has been added, that can be quite the task all by itself.
CPU: LPDDR5X with ECC Memory at 500+GB/s Memory Bandwidth. ( Something Apple may dip into. R.I.P for Mac with upgradable Memory )
GPU: HBM2e at 2000 GB/s. Yes, three zeros, this is not a typo.
NVLink: 500GB/s
This will surely further solidify CUDA dominance. Not entirely sure how Intel's XE with OneAPI and AMD's ROCm is going to compete.
It's a good step forward but your average consumer GPU is already around a quarter to a third of that and a Radeon VII had 1000 GB/s two years ago.
I believe this leaves Apple, ARM, Fujitsu, and Marvell as the only companies currently designing and selling cores that implement the ARM instruction set. That may drop to 3 in the next generation, since it’s not obvious that Marvell’s ThunderX3 cores are really seeing enough traction to be worth the non-recurring engineering costs of a custom core. Are there any others?
Will they auto-detect workloads and cripple performance (like the mining stuff recently)? Only work through special drivers with extra licensing fees depending on the name of the building it is in (data center vs office)?
Still, every company does it differently.
For example, both NVIDIA and AMD compute GPUs are necessarily more expensive than gamer GPUs because of hardware costs (e.g. HBM).
However, NVIDIA gamer GPUs can do CUDA, while AMD gamer GPUs can't do ROCm.
The reason is that NVIDIA has 1 architecture for gaming and compute (Ampere), while AMD has two different architectures (RDNA and CDNA).
I understand the here-and-now AI applications. But this is smelling more like Big AI Hype than Big AI need.
They have interconnects from Mellanox, GPUs and their own CPUs now.
I suspect the supercomputing lists will be dominated by NVidia now.
Nvidia +4.68%,
Intel -4.65%
AMD -4.47%
There is bottled demand because Intel's failure to deliver was not fully anticipated by anyone.
NVIDIA is buying ARM.
Multiple competition investigations permitting.
Apple is also, I think, going to soldered-on / close-in RAM. Nvidia looks to be doing this too: CPU / GPU / RAM all close together, and it doesn't look like there are any upgrade options. Some thinking was that Apple was continuing to increase durability / reliability etc. with their RAM move.
Does anyone know the requirements for the LPDDR5X type of RAM mentioned here? Does this require soldering things (you obviously get lots more control if you spec chips yourself and solder them on)?
If other companies don't make genuine investments in ARM for the desktop, there's a real chance that Apple will get a huge and difficult-to-assail application performance advantage as application developers begin to focus on making Mac apps first, and port to x86 as an afterthought.
Something similar happened back in the day when Intel was the de facto king, and everything on other platforms was a handicapped afterthought.
I wouldn't want to have my desktops be 15 to 30% slower than Macs running the same software, simply because of emulation or lack of local optimizations.
So I'm really looking forward to ARM competition on the desktop.