1/ What specific design choices make the M1 so much better than the equivalent Intel chips? It looks like there are a bunch of changes -- 5nm silicon, a single memory pool, a combination of high-efficiency and high-power cores. Can someone explain to me how each of these changes helps Apple achieve the gains it did? Are these breakthrough architectural changes in chip design, or have these been discussed before?
2/ How did Apple manage to create this when Intel has been making chips for decades and that is the singular focus of the company? Is it the fact that macOS could be better optimized for the M1 chips? Given the design changes, that doesn't seem like the only reason.
My 10,000' view understanding:
- Small feature size. M1 is built on TSMC's 5nm process; Intel is still struggling to catch up even to TSMC's 7nm-class density
- more efficient transistors. TSMC's 5nm is a denser, improved FinFET process (it isn't GAAFET yet -- that arrives around the 3nm/2nm nodes), which means you can cram more active gate volume into less chip footprint. This means less heat and more speed
- layout. M1 is optimized for Apple's use case. General purpose chips do more things, so they need more real estate
- specialization. M1 offloads oodles of compute to hardware accelerators. No idea how their code interfaces with it, but I know lots of the demos involve easy-to-accelerate tasks like codecs and neural networks
- M1 has tons of cache, IIRC, something like 3x the typical amount
- some fancy stuff that allows them to really optimize the reorder buffer, decoder, and branch prediction, which also leverages all that cache
Specialization helps when it helps, but doesn't do much on typical programs and M1 still excels on those.
Specifically, there is a reference counting optimization on the M1 (very fast uncontended atomics, which the constant retain/release traffic hits) that dramatically helps performance of compiled Swift apps - something only worthwhile if you know the majority of what the chip will ever do is run Swift apps.
that wouldn't help in benchmarks, would it? Only in the most dishonest benchmarks would they compare x264 to a hardware h.264 encoder, for instance.
Example: I run a benchmark for AES encryption - a modern CPU will have circuitry designed explicitly for this task and its asm instructions, while an old CPU supporting only the base x86 instructions probably doesn't have a hardware solution. Is it unfair to compare them?
If the utilisation of the hardware accelerators is completely invisible to the user (no need to import special libraries), is it unfair that one CPU has a specific hardware implementation for common tasks and the other only has the generic circuitry?
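To make the "invisible acceleration" point concrete, here's a toy Python sketch (population count standing in for AES, since real AES hardware isn't reachable from a short self-contained example): the caller sees one function, and whether the fast "hardware-backed" path or the generic loop runs is completely transparent to them.

```python
import sys

def popcount_generic(n: int) -> int:
    """Generic bit-twiddling loop -- works everywhere, but slower."""
    count = 0
    while n:
        n &= n - 1   # clear the lowest set bit
        count += 1
    return count

def popcount(n: int) -> int:
    """Same interface, but uses the fast built-in when the runtime
    provides it (Python 3.10+ int.bit_count maps down to a single
    POPCNT-style instruction on supporting CPUs)."""
    if sys.version_info >= (3, 10):
        return n.bit_count()
    return popcount_generic(n)

print(popcount(0b10110110))   # → 5, caller never knows which path ran
```

Benchmarking `popcount` on two machines would compare "the task" fairly, even though one machine does it in dedicated circuitry and the other in a loop -- which is exactly the AES-NI question.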
Intel could probably make faster stuff than they currently do, but then their customers (PC manufacturers, for instance) would have to modify all their stuff as well, and they don't want to - or at least, they don't all want the same thing.
The real explanation is that Intel has been complacent and lazy. We had five generations of essentially the same chip. Enough is enough.
This is a crucial point. It's a coordination problem.
Drivers only need to be made for those components that Apple choose for their system vs the multitude combinations of GPU boards, drive systems, motherboard chips, wifi etc etc. that exist for a modern PC.
There's no steering committee for PCs (at least none that could be incorruptible by Intel) that could cause this change to happen industry wide. And there's been little appetite (yet) for Windows/Linux + ARM for consumer PCs to help this happen from the bottom-up.
Couldn't they remove obsolete instructions from the ISA & then emulate removed instructions? Sure, it would be slower than having them still in the ISA, but given software using them was written for older machines, it might wash out in the end.
Didn’t work out so well
For raw single-thread performance:
1. ARM64 is a fixed-width instruction set, so their frontend can decode more instructions in parallel.
2. They've got one honking monster of an out-of-order execution engine (a ~630-entry reorder buffer), which feeds:
3. 16 execution ports.
I think I understand 1): since they know the width, they can trivially divide the instruction stream among more parallel decoders, which feed the execution ports.
2) I believe this allows more "pre-work" to get done before it's actually needed, but then the "pre-work" just chills until
3) these things do the work, and there's an abnormally high number of them?
p.s. Any noob friendly reading is also appreciated!
For 2, imagine a boa constrictor swallowing a huge piece of prey. One mouth (CPU: the frontend) and one rear (CPU: the retirement phase). The instructions go in the frontend in program order. They are decoded into operations that pile up in the middle (the giant bulge in the boa constrictor). When an instruction is ready to go, one of the execution ports (item 3: think of 16 little stomachs) picks it up and executes it. Then at the back end, the retirement phase, instructions are committed in the order they appeared in the original program, so that the program computes the same result.
By making basically all of the pieces of this boa constrictor bigger and more numerous, it eats a lot more instructions per clock (on average). Making that bulge (the reorder buffer) huge allows the CPU to have high chance of some useful work to feed to one of its 16 stomachs.
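The boa-constrictor picture can be sketched in a few lines of Python. This is a toy model with made-up sizes and a fake dependency-annotated program, not a real M1 simulator: the frontend fills the reorder buffer in program order, a couple of "ports" pick up whichever buffered instructions have their dependencies satisfied, and retirement commits strictly in order from the head.

```python
from collections import deque

ROB_SIZE, PORTS = 8, 2   # toy numbers; the real M1 is ~630 entries, ~16 ports

# Each instruction: (name, names it depends on, latency in cycles)
program = [
    ("a", set(), 1), ("b", set(), 3), ("c", {"a"}, 1),
    ("d", {"b"}, 1), ("e", set(), 1), ("f", {"c", "e"}, 1),
]

def run(program):
    rob = deque()          # in-flight instructions, held in program order
    done, executing = set(), {}   # completed names; name -> cycles left
    retired, cycle, pc = [], 0, 0
    while len(retired) < len(program):
        cycle += 1
        # frontend: swallow instructions in program order while there's room
        while pc < len(program) and len(rob) < ROB_SIZE:
            rob.append(program[pc]); pc += 1
        # execution ports: start up to PORTS ready, not-yet-started ops
        started = 0
        for name, deps, lat in rob:
            if started == PORTS:
                break
            if name not in executing and name not in done and deps <= done:
                executing[name] = lat; started += 1
        # tick every op currently in flight
        for name in list(executing):
            executing[name] -= 1
            if executing[name] == 0:
                done.add(name); del executing[name]
        # retirement: commit strictly from the head, in original order
        while rob and rob[0][0] in done:
            retired.append(rob.popleft()[0])
    return retired, cycle

print(run(program))   # → (['a', 'b', 'c', 'd', 'e', 'f'], 4)
```

Six instructions finish in 4 cycles even though "b" alone takes 3, because independent work flows around it while it's in flight -- and retirement still comes out in program order.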
But you can't really do this in parallel, as the start of each word depends on the previous split already being known.
If it were simply law that every word in existence was 5 characters, you could parse this out with zero lookups, zero knowledge. "Accurately" isn't so much the issue; it's that with variable-length encoding you have to decode each instruction just to know where the next one starts.
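A toy sketch of that serial dependence (the variable-length encoding here is invented -- the first byte of each "instruction" is its length -- purely to show why the scan can't be parallelized, while fixed-width offsets fall out of pure arithmetic):

```python
def fixed_starts(code: bytes, width: int = 4):
    """Fixed-width ISA: instruction i starts at width*i, so all N
    decoders can be pointed at their instructions simultaneously."""
    return list(range(0, len(code), width))

def variable_starts(code: bytes):
    """Variable-length ISA (x86-style): you must decode instruction k
    just to learn where k+1 begins, so the scan is inherently serial."""
    starts, i = [], 0
    while i < len(code):
        starts.append(i)
        length = code[i]        # toy encoding: first byte = instruction length
        i += length             # unknowable without decoding this instruction
    return starts

print(fixed_starts(bytes(16)))                                 # → [0, 4, 8, 12]
print(variable_starts(bytes([2, 0, 3, 0, 0, 1, 4, 0, 0, 0])))  # → [0, 2, 5, 6]
```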
But in the end all of that went bye bye when Intel lost the process edge and therefore lost the transistor count advantage. Now with the 5nm process others can field gobs of transistors and they don't have the x86 frontend millstone around their necks. So ARM64 unlocked a lot of frontend bandwidth to feed even more execution ports. And with the transistor budget so high, 8 massive cores could be put on die.
Now, people have argued for decades that the instruction density of CISC is a major advantage, because that density would make better use of I-cache and bandwidth. But it looks like decode bandwidth is the thing. That, and RISC usually requires aligned instructions, which means that branch density cannot be too high, and branch prediction data structures are simpler and more effective. (Intel still has weird slowdowns if you have too many branches in a cache line).
It seems frontend effects are real.
I bet doing 8-wide x86 decoding would be tough, but once you've got a micro-op cache, it's doable so long as you have a cache hit. Zen 3 is 8-wide for the ~95% of the time you hit the micro-op cache.
The real question is: how does Apple keep that thing fed? An 8-wide decoder is pointless if most of the time you've got 6 empty pipelines: https://open.hpi.de/courses/parprog2014/items/aybclrPgY4nPyY... (discussing the ILP wall). M1 outperforms Zen 3 by 20% on the SPEC GCC benchmark at 1/3 lower clock speed. That's 80% more ILP than Zen 3, which is itself a large advance in ILP.
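For anyone checking the arithmetic behind "80% more ILP": 20% more work delivered at 2/3 the clock rate implies 1.2 / (2/3) = 1.8x the per-clock throughput.

```python
# Per-clock throughput ratio implied by the numbers quoted above
score_ratio = 1.2      # M1 / Zen 3 on the SPEC gcc benchmark
clock_ratio = 2 / 3    # "1/3 lower clock speed"
ipc_ratio = score_ratio / clock_ratio
print(f"{ipc_ratio:.2f}x per-clock work")   # → 1.80x per-clock work
```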
My point was more about that the fixed-width instructions allow trivial parallel decoding while x86 requires predicting the length of instructions or just brute forcing all possible offsets, which is costly.
> The real question is how does Apple keep that thing fed?
That's why there's such an enormous reorder buffer. It's so that there's a massive amount of potential work out there for execution ports to pick up and do. Of course, that's all wasted when you have a branch mispredict. I haven't seen anything specific about M1's branch prediction, but it is clearly top-notch.
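A quick sketch of why mispredicts hurt more as the window grows: everything in the reorder buffer younger than the mispredicted branch is speculative work that gets squashed. (Toy model with made-up instruction names; a real pipeline also pays a refill penalty on top of this.)

```python
def squash_cost(rob, mispredicted_branch):
    """Given the ROB contents in program order, return how many
    instructions survive a mispredict at `mispredicted_branch` and
    how many younger, speculative ones get thrown away."""
    idx = rob.index(mispredicted_branch)
    kept, wasted = rob[:idx + 1], rob[idx + 1:]
    return len(kept), len(wasted)

# A ~630-entry window with a branch near its head: enormous potential
# waste, which is why prediction accuracy matters so much at this scale.
rob = [f"i{n}" for n in range(630)]
print(squash_cost(rob, "i9"))   # → (10, 620)
```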
As a real-world example, the Xbox 360 had a unified memory architecture and the PS3 had a split along system/GPU lines. From a CPU performance perspective they were pretty close (although the SPUs in the PS3 could really go if you vectorized your data for them appropriately).
https://debugger.medium.com/why-is-apples-m1-chip-so-fast-3262b158cba2
https://medium.com/swlh/what-does-risc-and-cisc-mean-in-2020...
Apple M1 has 16 units that can pipeline their instructions.
Meaning, they can reorder sequential instructions that aren't dependent on each other to run in parallel. That is not threads or anything; it can be, and is being, done within a single-threaded program.
AMD and Intel top out at 4 such units, because their architecture is CISC and one instruction can be up to 15 bytes. M1 is RISC and instructions are a fixed 4 bytes long. Thus it is architecturally easier to reorder instructions for RISC than for CISC.
CISC used to be better because of its specialized instructions, but now Apple has stuffed their CPU with dedicated hardware for a lot of things, including machine learning, graphics processing, and encryption. Instead of specific instructions, Apple has specific hardware, and can get by with fewer instructions.
And since they control the hardware, the software SDKs, and the OS, they can actually get away with such radical changes. Intel and others can't, without a big change across the industry.
Source: https://debugger.medium.com/why-is-apples-m1-chip-so-fast-32...
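To illustrate the reordering claim with something runnable: a single-threaded instruction sequence still contains independent operations that could issue together. This toy Python scheduler (invented instruction names, unit latency, unlimited ports) just groups instructions by when their dependencies complete:

```python
# instruction -> set of instructions it must wait for
deps = {
    "load_a": set(), "load_b": set(),
    "mul":    {"load_a", "load_b"},
    "load_c": set(),
    "add":    {"mul", "load_c"},
}

def issue_groups(deps):
    """Greedy schedule: each 'cycle' issues every instruction whose
    dependencies have already completed. The number of groups is the
    critical-path length; wide groups are the exploitable ILP."""
    done, groups = set(), []
    while len(done) < len(deps):
        ready = sorted(i for i, d in deps.items()
                       if i not in done and d <= done)
        groups.append(ready)
        done |= set(ready)
    return groups

print(issue_groups(deps))
# → [['load_a', 'load_b', 'load_c'], ['mul'], ['add']]
```

Five sequential instructions collapse to three "cycles" because the three loads have no mutual dependencies -- that extraction is exactly what the out-of-order engine does in hardware, with no help from threads.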
2) Reordering is different than pipelining, and CPUs have done both for decades. The difference between the M1 and Intel/AMD is that the M1 is wider in spots and can do much more extensive reordering. The M1 can decode and issue 8 instructions at a time. AMD can do 4 or 8 depending on whether the instruction is coming from memory or a special cache for pre-decoded instructions. The M1 has a reorder buffer of over 600 instructions—meaning it can have 600 instructions waiting for completion at a time (e.g. some executing while others are waiting for data to come back from memory). Intel and AMD’s reorder buffers are half the size.
3) Special instructions and controlling the software interface has little to do with performance on general purpose code.
CISC instructions are still variable length. People can argue that micro-ops are RISC-like, but micro-code is an implementation detail very close to the hardware.
One of the key ideas of RISC was to push a lot of the heavy lifting over to the compiler. That is still the case: micro-ops cannot be rearranged by the compiler for optimal execution.
Time is far more critical when executing micro-ops than when compiling. It is an obvious advantage to let an advanced compiler rearrange code ahead of time rather than relying on precious silicon to do it at runtime.
While RISC processors have gained more specialized instructions over the years, e.g. for vector processing, they still lack the complex memory access modes that many CISC instructions have.
I think it's the power of vertical integration. When you aren't cobbling together a bunch of general-purpose bits for general-purpose consumers, you don't have to pay as many taxes. Sort of like SRP in software - a multi-purpose function is going to become super bloated and expensive to maintain, compared to several single-purpose alternatives.
https://en.wikipedia.org/wiki/Single-responsibility_principl...
Vertical integration is like taking horizontally-integrated business units and refactoring them per SOLID principles.
Intel CPUs don't know what memory they're talking to, i.e. they have to support a variety of memory. Likewise they don't necessarily know what OS they're running; how it context switches, etc. If it's virtualized, etc. Sure they have optimizations for those common cases, but the design is sort of accreted rather than derived from first principles.
To make an analogy, if you know your JS is running on v8, you can do a bunch of optimization so you don't fall off the cliff of the JIT, and get say 10x performance wins in many cases. But if you're writing JS abstractly then you may use different patterns that hit slow paths in different VMs. Performance is a leaky abstraction.
Are we seeing a deeper bifurcation of the industry: personal vs. server?
Maybe Intel and others can happily coexist?
I think you need to look at this from another angle. Yes, Apple did make some excellent choices, but the market was Intel's to lose.
The difference in the chips isn't limited to the 5nm, memory pooling, etc etc. Look at the base x86 vs ARM core architecture, and that is where you'll see the problem Intel had.
I'm sure there were discussions inside Intel which went along the lines of one person arguing that they had to start developing ARM based or other RISC based chips, and somebody else countering "but at Intel we build for the desktop, and servers, and RISC processors are toys, they're for mobile devices and tablets. They'll never catch-up with our..."
This change in architecture was a long time coming. As we all know, there is very little we do with our computers today that we can't also accomplish on a phone (or tablet). The processing requirements for the average person are not that large, and ARM chips, made by Apple, Qualcomm, Samsung, or anybody else, have improved to the point where they are up to some of the more demanding tasks - even able to play high-quality games at a good frame rate or edit video.
So, now we have to ask: what was delaying the move from x86 to ARM? Apple aren't the only ones making ARM-based computers. Microsoft has two generations of ARM-based Surface devices out, and I think Samsung has made one too. I'm sure there are others. This is a wave that has been building for a long time.
So, now we can look at why Apple was able to be so successful in their ARM launch compared to Microsoft and the lackluster reviews of Windows based ARM devices.
From my understanding, it isn't the 5nm technology, though I am no expert in chip design. However, as you state, Apple was able to pool memory and put the memory right on the chip package, which (from what I understand) saves the overhead of transferring data in and out, as well as allowing the CPU and GPU to share memory more efficiently.
As I understand it, the Qualcomm or other chips have a much smaller internal memory footprint, expecting the memory to be external to the CPU/GPU. Perhaps because this is just always the way it has been done.
Now this is where Apple's real breakthrough comes in. First off, they have the iOS App Store and all its apps now available to use on the desktop. This means all the video editing or gaming apps that were already designed for iOS can now run perfectly fine on the "new" ARM architecture. Then there is Rosetta 2. Apple understood how important running legacy software would be for a small number of their users, and I suspect they also had very good metrics on what those legacy programs were. They did an exceptional job on Rosetta (from what I understand) and should be commended for that. Though most users will likely never use Rosetta extensively, it goes a huge way toward making the M1 chip an absolute no-brainer.
Compare Rosetta to Microsoft's attempt at backward compatibility, and the difference seems glaring. HOWEVER, I think again this comes down to strategy and execution. Apple knows that only a small number of their customers need a small number of apps to run in Rosetta. Microsoft, having both a larger user base, AND much more bespoke software running on their platform, don't have this luxury.
I'm sure there are other factors, but my thinking is it's less about direct technology and more about flawed strategy/execution from Intel and absolutely amazing execution from Apple.
I'm very torn by this all tbh. I've been an Apple hater for a long time. Every Apple product I've bought has turned out to be crap (except my original generation 2 iPod, it was truly magical). I'm beginning to think Apple may have actually got the upper hand here.