1/ What specific design choices make the M1 so much better than the equivalent Intel chips? It looks like there are a bunch of changes -- 5nm silicon, a single memory pool, a combination of high-efficiency and high-power cores. Can someone explain to me how each of these changes helps Apple achieve the gains it did? Are these breakthrough architectural changes in chip design, or have these been discussed before?
2/ How did Apple manage to create this when Intel has been making chips for decades and that is the singular focus of the company? Is it the fact that macOS could be better optimized for the M1 chips? Given the design changes, that doesn't seem like the only reason.
My 10,000' view understanding:
- Small feature size. M1 is built on TSMC's 5nm process; Intel is still struggling to catch up even to TSMC's 7nm-class density
- more efficient transistors. TSMC's 5nm is a denser, improved FinFET process (it isn't GAAFET yet -- that arrives around the 3nm/2nm nodes), which means you can cram more active gate volume into less chip footprint. This means less heat and more speed
- layout. M1 is optimized for Apple's use case. General purpose chips do more things, so they need more real estate
- specialization. M1 offloads oodles of compute to hardware accelerators. No idea how their code interfaces with it, but I know lots of the demos involve easy-to-accelerate tasks like codecs and neural networks
- M1 has tons of cache, IIRC, something like 3x the typical amount
- some fancy stuff that allows them to really optimize the reorder buffer, decoder, and branch prediction, which also leverages all that cache
Specialization helps when it helps, but doesn't do much on typical programs and M1 still excels on those.
Specifically, there is a reference counting optimization on the M1 (very fast uncontended atomics, which the constant retain/release traffic hits) that dramatically helps performance of compiled Swift apps - something only worthwhile if you know the majority of what the chip will ever do is run Swift apps.
that wouldn't help in benchmarks, would it? Only in the most dishonest benchmarks would they compare x264 to a hardware h.264 encoder, for instance.
Example: I run a benchmark for AES encryption - a modern CPU will have circuitry designed explicitly for this task and its asm instructions, while an old CPU supporting only the base x86 instructions probably doesn't have a hardware solution. Is it unfair to compare them?
If the utilisation of the hardware accelerators is completely invisible to the user (no need to import special libraries), is it unfair that one CPU has a specific hardware implementation for common tasks and the other only has the generic circuitry?
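To make the "invisible acceleration" point concrete, here's a toy Python sketch (population count standing in for AES, since real AES hardware isn't reachable from a short self-contained example): the caller sees one function, and whether the fast "hardware-backed" path or the generic loop runs is completely transparent to them.

```python
import sys

def popcount_generic(n: int) -> int:
    """Generic bit-twiddling loop -- works everywhere, but slower."""
    count = 0
    while n:
        n &= n - 1   # clear the lowest set bit
        count += 1
    return count

def popcount(n: int) -> int:
    """Same interface, but uses the fast built-in when the runtime
    provides it (Python 3.10+ int.bit_count maps down to a single
    POPCNT-style instruction on supporting CPUs)."""
    if sys.version_info >= (3, 10):
        return n.bit_count()
    return popcount_generic(n)

print(popcount(0b10110110))   # → 5, caller never knows which path ran
```

Benchmarking `popcount` on two machines would compare "the task" fairly, even though one machine does it in dedicated circuitry and the other in a loop -- which is exactly the AES-NI question.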
Intel could probably make faster stuff than they currently do, but then their customers (PC manufacturers, for instance) would have to modify all their stuff as well, and they don't want to - or at least, they don't all want the same thing.
The real explanation is that Intel has been complacent and lazy. We had five generations of essentially the same chip. Enough is enough.
This is a crucial point. It's a coordination problem.
Drivers only need to be made for those components that Apple choose for their system vs the multitude combinations of GPU boards, drive systems, motherboard chips, wifi etc etc. that exist for a modern PC.
There's no steering committee for PCs (at least none that could be incorruptible by Intel) that could cause this change to happen industry wide. And there's been little appetite (yet) for Windows/Linux + ARM for consumer PCs to help this happen from the bottom-up.
Couldn't they remove obsolete instructions from the ISA & then emulate removed instructions? Sure, it would be slower than having them still in the ISA, but given software using them was written for older machines, it might wash out in the end.
Didn’t work out so well
For raw single-thread performance:
1. ARM64 is a fixed-width instruction set, so their frontend can decode more instructions in parallel.
2. They've got one honking monster of an out-of-order execution engine (a ~630-entry reorder buffer), which feeds:
3. 16 execution ports.
I think I understand 1): since they know the width, they can trivially divide the instruction stream among more parallel decoders, which feed the execution ports.
2) I believe this allows more "pre-work" to get done before it's actually needed, but then the "pre-work" just chills until
3) these things do the work, and there's an abnormally high number of them?
p.s. Any noob friendly reading is also appreciated!
For 2, imagine a boa constrictor swallowing a huge piece of prey. One mouth (CPU: the frontend) and one rear (CPU: the retirement phase). The instructions go in the frontend in program order. They are decoded into operations that pile up in the middle (the giant bulge in the boa constrictor). When an instruction is ready to go, one of the execution ports (item 3: think of 16 little stomachs) picks it up and executes it. Then at the back end, the retirement phase, instructions are committed in the order they appeared in the original program, so that the program computes the same result.
By making basically all of the pieces of this boa constrictor bigger and more numerous, it eats a lot more instructions per clock (on average). Making that bulge (the reorder buffer) huge allows the CPU to have high chance of some useful work to feed to one of its 16 stomachs.
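The boa-constrictor picture can be sketched in a few lines of Python. This is a toy model with made-up sizes and a fake dependency-annotated program, not a real M1 simulator: the frontend fills the reorder buffer in program order, a couple of "ports" pick up whichever buffered instructions have their dependencies satisfied, and retirement commits strictly in order from the head.

```python
from collections import deque

ROB_SIZE, PORTS = 8, 2   # toy numbers; the real M1 is ~630 entries, ~16 ports

# Each instruction: (name, names it depends on, latency in cycles)
program = [
    ("a", set(), 1), ("b", set(), 3), ("c", {"a"}, 1),
    ("d", {"b"}, 1), ("e", set(), 1), ("f", {"c", "e"}, 1),
]

def run(program):
    rob = deque()          # in-flight instructions, held in program order
    done, executing = set(), {}   # completed names; name -> cycles left
    retired, cycle, pc = [], 0, 0
    while len(retired) < len(program):
        cycle += 1
        # frontend: swallow instructions in program order while there's room
        while pc < len(program) and len(rob) < ROB_SIZE:
            rob.append(program[pc]); pc += 1
        # execution ports: start up to PORTS ready, not-yet-started ops
        started = 0
        for name, deps, lat in rob:
            if started == PORTS:
                break
            if name not in executing and name not in done and deps <= done:
                executing[name] = lat; started += 1
        # tick every op currently in flight
        for name in list(executing):
            executing[name] -= 1
            if executing[name] == 0:
                done.add(name); del executing[name]
        # retirement: commit strictly from the head, in original order
        while rob and rob[0][0] in done:
            retired.append(rob.popleft()[0])
    return retired, cycle

print(run(program))   # → (['a', 'b', 'c', 'd', 'e', 'f'], 4)
```

Six instructions finish in 4 cycles even though "b" alone takes 3, because independent work flows around it while it's in flight -- and retirement still comes out in program order.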
But you can't really do this in parallel, as the start of each word depends on the previous split already being known.
If it were simply law that every word in existence was 5 characters, you could parse this out with zero lookups, zero knowledge. "Accurately" isn't so much the issue; it's that with variable-length encoding you have to decode each instruction just to know where the next one starts.
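A toy sketch of that serial dependence (the variable-length encoding here is invented -- the first byte of each "instruction" is its length -- purely to show why the scan can't be parallelized, while fixed-width offsets fall out of pure arithmetic):

```python
def fixed_starts(code: bytes, width: int = 4):
    """Fixed-width ISA: instruction i starts at width*i, so all N
    decoders can be pointed at their instructions simultaneously."""
    return list(range(0, len(code), width))

def variable_starts(code: bytes):
    """Variable-length ISA (x86-style): you must decode instruction k
    just to learn where k+1 begins, so the scan is inherently serial."""
    starts, i = [], 0
    while i < len(code):
        starts.append(i)
        length = code[i]        # toy encoding: first byte = instruction length
        i += length             # unknowable without decoding this instruction
    return starts

print(fixed_starts(bytes(16)))                                 # → [0, 4, 8, 12]
print(variable_starts(bytes([2, 0, 3, 0, 0, 1, 4, 0, 0, 0])))  # → [0, 2, 5, 6]
```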
But in the end all of that went bye bye when Intel lost the process edge and therefore lost the transistor count advantage. Now with the 5nm process others can field gobs of transistors and they don't have the x86 frontend millstone around their necks. So ARM64 unlocked a lot of frontend bandwidth to feed even more execution ports. And with the transistor budget so high, 8 massive cores could be put on die.
Now, people have argued for decades that the instruction density of CISC is a major advantage, because that density would make better use of I-cache and bandwidth. But it looks like decode bandwidth is the thing. That, and RISC usually requires aligned instructions, which means that branch density cannot be too high, and branch prediction data structures are simpler and more effective. (Intel still has weird slowdowns if you have too many branches in a cache line).
It seems frontend effects are real.
I bet doing 8-wide x86 decoding would be tough, but once you've got a micro-op cache, it's doable so long as you have a cache hit. Zen 3 is 8-wide for the ~95% of the time you hit the micro-op cache.
The real question is: how does Apple keep that thing fed? An 8-wide decoder is pointless if most of the time you've got 6 empty pipelines: https://open.hpi.de/courses/parprog2014/items/aybclrPgY4nPyY... (discussing the ILP wall). M1 outperforms Zen 3 by 20% on the SPEC GCC benchmark at 1/3 lower clock speed. That's 80% more ILP than Zen 3, which is itself a large advance in ILP.
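For anyone checking the arithmetic behind "80% more ILP": 20% more work delivered at 2/3 the clock rate implies 1.2 / (2/3) = 1.8x the per-clock throughput.

```python
# Per-clock throughput ratio implied by the numbers quoted above
score_ratio = 1.2      # M1 / Zen 3 on the SPEC gcc benchmark
clock_ratio = 2 / 3    # "1/3 lower clock speed"
ipc_ratio = score_ratio / clock_ratio
print(f"{ipc_ratio:.2f}x per-clock work")   # → 1.80x per-clock work
```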
My point was more about that the fixed-width instructions allow trivial parallel decoding while x86 requires predicting the length of instructions or just brute forcing all possible offsets, which is costly.
> The real question is how does Apple keep that thing fed?
That's why there's such an enormous reorder buffer. It's so that there's a massive amount of potential work out there for execution ports to pick up and do. Of course, that's all wasted when you have a branch mispredict. I haven't seen anything specific about M1's branch prediction, but it is clearly top-notch.
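A quick sketch of why mispredicts hurt more as the window grows: everything in the reorder buffer younger than the mispredicted branch is speculative work that gets squashed. (Toy model with made-up instruction names; a real pipeline also pays a refill penalty on top of this.)

```python
def squash_cost(rob, mispredicted_branch):
    """Given the ROB contents in program order, return how many
    instructions survive a mispredict at `mispredicted_branch` and
    how many younger, speculative ones get thrown away."""
    idx = rob.index(mispredicted_branch)
    kept, wasted = rob[:idx + 1], rob[idx + 1:]
    return len(kept), len(wasted)

# A ~630-entry window with a branch near its head: enormous potential
# waste, which is why prediction accuracy matters so much at this scale.
rob = [f"i{n}" for n in range(630)]
print(squash_cost(rob, "i9"))   # → (10, 620)
```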
As a real-world example, the Xbox 360 had a unified memory architecture and the PS3 had a split along system/GPU lines. From a CPU performance perspective they were pretty close (although the SPUs in the PS3 could really go if you vectorized your data for them appropriately).
https://debugger.medium.com/why-is-apples-m1-chip-so-fast-3262b158cba2
https://medium.com/swlh/what-does-risc-and-cisc-mean-in-2020...
Apple M1 has 16 units that can pipeline their instructions.
Meaning, they can reorder sequential instructions that aren't dependent on each other to run in parallel. That is not threads or anything; it can be, and is being, done within a single-threaded program.
AMD and Intel top out at 4 such units, because their architecture is CISC and one instruction can be up to 15 bytes. M1 is RISC and instructions are a fixed 4 bytes long. Thus it is architecturally easier to reorder instructions for RISC than for CISC.
CISC used to be better because of its specialized instructions, but now Apple has stuffed their CPU with dedicated hardware for a lot of things, including machine learning, graphics processing, and encryption. Instead of specific instructions, Apple has specific hardware, and can get by with fewer instructions.
And since they control the hardware, the software SDKs, and the OS, they can actually get away with such radical changes. Intel and others can't, without a big change across the industry.
Source: https://debugger.medium.com/why-is-apples-m1-chip-so-fast-32...
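To illustrate the reordering claim with something runnable: a single-threaded instruction sequence still contains independent operations that could issue together. This toy Python scheduler (invented instruction names, unit latency, unlimited ports) just groups instructions by when their dependencies complete:

```python
# instruction -> set of instructions it must wait for
deps = {
    "load_a": set(), "load_b": set(),
    "mul":    {"load_a", "load_b"},
    "load_c": set(),
    "add":    {"mul", "load_c"},
}

def issue_groups(deps):
    """Greedy schedule: each 'cycle' issues every instruction whose
    dependencies have already completed. The number of groups is the
    critical-path length; wide groups are the exploitable ILP."""
    done, groups = set(), []
    while len(done) < len(deps):
        ready = sorted(i for i, d in deps.items()
                       if i not in done and d <= done)
        groups.append(ready)
        done |= set(ready)
    return groups

print(issue_groups(deps))
# → [['load_a', 'load_b', 'load_c'], ['mul'], ['add']]
```

Five sequential instructions collapse to three "cycles" because the three loads have no mutual dependencies -- that extraction is exactly what the out-of-order engine does in hardware, with no help from threads.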
2) Reordering is different than pipelining, and CPUs have done both for decades. The difference between the M1 and Intel/AMD is that the M1 is wider in spots and can do much more extensive reordering. The M1 can decode and issue 8 instructions at a time. AMD can do 4 or 8 depending on whether the instruction is coming from memory or a special cache for pre-decoded instructions. The M1 has a reorder buffer of over 600 instructions—meaning it can have 600 instructions waiting for completion at a time (e.g. some executing while others are waiting for data to come back from memory). Intel and AMD’s reorder buffers are half the size.
3) Special instructions and controlling the software interface has little to do with performance on general purpose code.
CISC instructions are still variable length. People can argue that micro-ops are RISC-like, but micro-code is an implementation detail very close to the hardware.
One of the key ideas of RISC was to push a lot of the heavy lifting over to the compiler. That is still the case: micro-ops cannot be rearranged by the compiler for optimal execution.
Time is far more critical when executing micro-ops than when compiling. It is an obvious advantage to let an advanced compiler rearrange code ahead of time rather than relying on precious silicon to do it at runtime.
While RISC processors have gained more specialized instructions over the years, e.g. for vector processing, they still lack the complex memory access modes that many CISC instructions have.
I think it's the power of vertical integration. When you aren't cobbling together a bunch of general-purpose bits for general-purpose consumers, you don't have to pay as many taxes. Sort of like SRP in software - a multi-purpose function is going to become super bloated and expensive to maintain, compared to several single-purpose alternatives.
https://en.wikipedia.org/wiki/Single-responsibility_principl...
Vertical integration is like taking horizontally-integrated business units and refactoring them per SOLID principles.
Intel CPUs don't know what memory they're talking to, i.e. they have to support a variety of memory. Likewise they don't necessarily know what OS they're running; how it context switches, etc. If it's virtualized, etc. Sure they have optimizations for those common cases, but the design is sort of accreted rather than derived from first principles.
To make an analogy, if you know your JS is running on v8, you can do a bunch of optimization so you don't fall off the cliff of the JIT, and get say 10x performance wins in many cases. But if you're writing JS abstractly then you may use different patterns that hit slow paths in different VMs. Performance is a leaky abstraction.
Are we seeing a deeper bifurcation of the industry: personal vs. server?
Maybe Intel and others can happily coexist?
I think you need to look at this from another angle. Yes, Apple did make some excellent choices, but the market was Intel's to lose.
The difference in the chips isn't limited to the 5nm, memory pooling, etc etc. Look at the base x86 vs ARM core architecture, and that is where you'll see the problem Intel had.
I'm sure there were discussions inside Intel which went along the lines of one person arguing that they had to start developing ARM based or other RISC based chips, and somebody else countering "but at Intel we build for the desktop, and servers, and RISC processors are toys, they're for mobile devices and tablets. They'll never catch-up with our..."
This change in architecture was a long time coming. As we all know, there is very little we do with our computers today that we can't also accomplish on a phone (or tablet). The processing requirements for the average person are not that large, and ARM chips, made by Apple, Qualcomm, Samsung, or anybody else, have improved to the point where they are up to some of the more demanding tasks - even able to play high-quality games at a good frame rate or edit video.
So, now we have to ask: what was delaying the move from x86 to ARM? Apple aren't the only ones making ARM-based computers. Microsoft has two generations of ARM-based Surface devices out, and I think Samsung has made one too. I'm sure there are others. This is a wave that has been building for a long time.
So, now we can look at why Apple was able to be so successful in their ARM launch compared to Microsoft and the lackluster reviews of Windows based ARM devices.
From my understanding, it isn't the 5nm technology, though I am no expert in chip design. However, as you state, Apple was able to pool memory and put the memory right on the chip package, which (from what I understand) saves the overhead of transferring data in and out, as well as allowing the CPU and GPU to share memory more efficiently.
As I understand it, the Qualcomm or other chips have a much smaller internal memory footprint, expecting the memory to be external to the CPU/GPU. Perhaps because this is just always the way it has been done.
Now this is where Apple's real breakthrough comes in. First off, they have the iOS App Store and all its apps now available to use on the desktop. This means all the video editing or gaming apps that were already designed for iOS can now run perfectly fine on the "new" ARM architecture. Then there is Rosetta 2. Apple understood how important running legacy software would be for a small number of their users, and I suspect they also had very good metrics on what those legacy programs were. They did an exceptional job on Rosetta (from what I understand) and should be commended for that. Though most users will likely never use Rosetta extensively, it goes a huge way toward making the M1 chip an absolute no-brainer.
Compare Rosetta to Microsoft's attempt at backward compatibility, and the difference seems glaring. HOWEVER, I think again this comes down to strategy and execution. Apple knows that only a small number of their customers need a small number of apps to run in Rosetta. Microsoft, having both a larger user base, AND much more bespoke software running on their platform, don't have this luxury.
I'm sure there are other factors, but my thinking is it's less about direct technology and more about flawed strategy/execution from Intel and absolutely amazing execution from Apple.
I'm very torn by this all tbh. I've been an Apple hater for a long time. Every Apple product I've bought has turned out to be crap (except my original generation 2 iPod, it was truly magical). I'm beginning to think Apple may have actually got the upper hand here.