I'm afraid the level of paranoia at Intel has decreased since then.
Now they are pivoting everywhere, but theirs is the only market with sufficient margins. And the prospect is that their margins will shrink because of competition and software emulation (which they are keeping under control by patent trolling).
Competitive pressure can make a company's new product worse than (in this case, less stable than) their previous products, e.g. the Samsung phone explosions. I still remember the story being that Samsung wanted to release their phone ahead of the iPhone, and I would imagine the testing went through a similarly stressful time as Intel's.
Of course, not all cases of taking such risks lead to disaster - imagine Intel rushing out new chips ahead of the competition and, 99 times out of 100, they end up performing well. But a unique characteristic of Intel's case is that these bugs, unlike a faulty battery design, are cumulative and carry forward into future product development, which means a few small wins in catching up with your competitor can also set up a massive failure in some later major battle.
Now imagine Intel's competitors are going through the exact same scenario. One possible outcome is that both Intel's and its competitors' products become less stable and more buggy over time, and until everyone's stuff is visibly broken, nobody ever finds the time to fix it.
There is a valid point there though - if you are testing for testing's sake and not finding anything extra through the extra effort, then you are wasting time and, potentially worse, lulling yourself into a false sense of security. Testing should be done for utility, not just in response to fear - you need to test intelligently, not just test a lot. Like TDD in software, good testing processes make life much easier and quality much higher; bad testing processes can be worse than useless.
Processor bugs are always a thing and always have been a thing - look at the list of bugs the Linux kernel scans for and works around, many of which pre-date the FDIV debacle.
What made FDIV special isn't that it was a bad bug, it was the recent change in marketing. Before then, processors were sold to manufacturers, who might tell the customer what was used; unless you were a hobbyist you didn't much care about the specifics. But the Pentium line was the first time a processor had been particularly marketed directly at the end user. It had started with the 486 lines a couple of years earlier, when "Intel Inside" was first a thing, but there was a huge push in that direction with the release of the first Pentium lines. Suddenly Joe Public was more aware of that detail, but was blissfully unaware that CPUs are complex beasts and generally not 100% perfect.
It didn't help that the bug was very easy to demonstrate in common applications like Excel, so Joseph & Josephine Public could see and understand the problem where they wouldn't have with, for example, the F00F bug, and it was easy to joke about ("We are Pentium of Borg. Division is futile. You will be approximated."), which fanned the rapid spread of the news. The fact that the bug only significantly affected fairly rare input combinations was lost in the mass discussion about how such a bug could happen at all.
I look at a statement like "Our competition is moving much faster than we are" as craven and lacking vision. At that point a wizened old Zen-master type figure should've stepped forward.
Competition isn't about imitating the competitor anyway, is it? It's about differentiation, right? Maybe not. But it's not like you can't easily market literally any reasonable decision you make. Paul Masson wineries bragged about selling "no wine before its time" and turned their lack of "velocity" into marketing cachet. (Even though they weren't even unique in that regard.) There's no theoretical reason why Intel couldn't market itself as "the accurate chipmaker," keep on validating "lavishly"(1) and let AMD rush headlong into this kind of bug.
(1) Obviously not... but unfortunately you never know it's not enough validation until it's not enough validation.
Well, that depends on the specific attributes on which there is competitive pressure. When it's on time to market, yes, quality will suffer. When it's on quality, products will be slower/more expensive, etc. Kind of similar to the oft-repeated quality triangle in software dev.
Which was disproved in practice.
The Skylake/Kaby hyperthread bug has been fixed in microcode and is no longer applicable. It's perfectly safe to run HT on these processors now.
The AMD Ryzen segfault remains unmitigated at this point in time. Phoronix rushed to declare the bug fixed because they got a binned RMA replacement but there are plenty of reports of it occurring in current-production processors to at least a moderate degree, roughly proportionate with ASIC/litho quality. It's unclear what the scope is w/r/t Epyc since Epyc is on a different stepping but also hasn't really ramped yet either. The early Epyc processors were essentially engineering samples (on the order of hundreds to single-digit thousands of samples) with no real (public) visibility into any binning that might be taking place.
The Ryzen high-address bug is no big deal, that's the kind of thing that gets patched all the time (like the Skylake HT bug). That's one thing Dan is glossing over here - there are tons of these bugs all the time and as long as there is an effective mitigation available it's no big deal.
The PTI patch can be viewed as making syscalls take somewhat longer (about double, iirc). Gamers and compute-oriented workloads will hardly be hurt at all. The average mixed workload sees about a 5% performance loss - not ideal, but not critical either. Losing 30% is real bad though, and that's what you will get on IO-heavy workloads that context-switch into the kernel a lot.
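If you want a rough sense of the raw syscall cost on your own machine, a throwaway microbenchmark is enough. This is a minimal sketch, assuming Linux and gcc; it times about the cheapest kernel entry there is (getpid via syscall(), to bypass any libc caching), and the absolute numbers are illustrative only:

    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    int main(void) {
        const long iters = 10 * 1000 * 1000;
        struct timespec start, end;

        clock_gettime(CLOCK_MONOTONIC, &start);
        for (long i = 0; i < iters; i++)
            syscall(SYS_getpid);   /* cheapest possible kernel round-trip */
        clock_gettime(CLOCK_MONOTONIC, &end);

        double ns = (end.tv_sec - start.tv_sec) * 1e9
                  + (end.tv_nsec - start.tv_nsec);
        printf("%.1f ns per syscall\n", ns / iters);
        return 0;
    }

Run it once with PTI enabled and once with it disabled; the delta is roughly the per-entry tax that the 30% figure for IO-heavy workloads is built on.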
The only real mitigation right now appears to be to give up hyperconvergence for the time being and harden those DB/NAS servers that are going to be pushing a lot of IO, so that you know there won't be hostile code running on them. That will allow you to safely disable PTI and sidestep the performance hit.
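For reference, PTI can be switched off at boot on patched x86-64 kernels; a minimal example (a kernel command-line parameter, e.g. in your GRUB config - double-check against your kernel's documentation):

    pti=off    # or equivalently: nopti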
Of course, Epyc was not that good at running databases in the first place, so you still might be better off sucking it up and running Intel even with the PTI patch. It will probably depend on your actual workload and the relative amount of IO vs processing.
Only if you can actually get the fix. My main home PC has this bug and the motherboard manufacturer (ASUS) has yet to ship a BIOS update with the fix.
Actually, KPTI doesn't only affect syscalls but also interrupts. It makes interrupts slower, which affects every workload.
Does this mean they could take a hit due to this bug?
(Edit: I'm assuming the USA, and I'm assuming bugs that were not known to the vendor at the time of the sale.)
A 30% performance reduction (like the page table isolation fixes) probably would be considered material.
Interesting - so if you need that particular product (say it has something specific you need, e.g. a program that only runs well on Intel) and no competitor offers that particular feature (e.g. the AMD CPU runs the program poorly), then they can sell you as defective a product as they want and you cannot recover damages?
Or to put it another way, there is no notion of "I would have still bought it because I needed it but knowledge of the defect would have lowered its market value"?
When Intel issues a microcode update to slow down aging Skylake processors so that everyone goes out to buy Cannonlake, you might be able to draw a comparison.
Unfortunately, this is incorrect on many levels.
First, under EU warranty laws, it's not just any bug that is covered, but defects whose absence was assured or could reasonably be expected. I'd expect disclaimers to allow for certain errata, for example. EDIT: user ta_wh posted an example of such a disclaimer in a sibling comment.
Second, the vendor is usually not the manufacturer, and therefore seldom in the position to fix the defect themselves.
Third, depending on the nature of the defect, the vendor might have other options besides fixing it/getting it fixed, e.g. a discount or a return.
If you're a company buying from more qualified vendors, then it might be a different story; however, at that point consumer law does not apply to you.
Is this true for software as well?
https://www.postgresql.org/message-id/20180102222354.qikjmf7...
Of course, this depends on workload — gaming will see different results than computationally heavy tasks.
It is likely that games using Vulkan, DX12, or OpenGL's AZDO functions will see a much lower performance impact (because they usually only do a handful of syscalls per frame) than games using older APIs, or even OpenGL's immediate mode (which, in the worst case, does one syscall per emitted vertex).
Perhaps with drivers written in the 90s for hardware from the 90s. Any OpenGL implementation worth its salt will buffer those requests on the client side until they need to be observed. Indeed, this was a big advantage in the heyday of DirectX 9, where D3D programmers had to count their draw calls, whereas with OpenGL you had way more leeway since the driver tended to be smarter and cached that stuff.
In theory with a modern driver using OpenGL's immediate mode API shouldn't need any more syscalls than building the vertex buffers in your program, setting up the necessary state and issuing a buffer draw command.
The only time where you'd need a syscall per emitted vertex would be if the GPU had OpenGL-like commands and your OpenGL implementation was a thin wrapper over that. I think one of ATI's very early GPUs worked like that (although the commands were per primitive, not per vertex).
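To make the buffering point concrete, here's a minimal sketch in C (classic fixed-function GL, illustrative only): despite three glVertex* calls, a sane modern driver just appends them to a client-side command buffer, and the actual kernel transition (e.g. an ioctl submitting that buffer) happens once at flush/swap time, not once per call.

    #include <GL/gl.h>

    /* Immediate-mode triangle: each call below writes into a
     * client-side buffer maintained by the driver; nothing here
     * needs to enter the kernel. The driver submits the batched
     * commands later, typically when the frame is flushed. */
    void draw_triangle(void) {
        glBegin(GL_TRIANGLES);
        glVertex2f(-0.5f, -0.5f);
        glVertex2f( 0.5f, -0.5f);
        glVertex2f( 0.0f,  0.5f);
        glEnd();
    }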
Remember, a lot of the Zen arch was developed by Jim Keller, who is the brains behind the Athlon 64.
It would be great if the page displayed the date that the article was posted/updated. It is not in the URL nor the sources. The only way to see the dates is in the RSS feed and even that is only for new articles.
Why?
Let me set the scene: It’s late in 2013. Intel is frantic about losing the mobile CPU wars to ARM. Meetings with all the validation groups. Head honcho in charge of Validation says something to the effect of: “We need to move faster. Validation at Intel is taking much longer than it does for our competition. We need to do whatever we can to reduce those times… we can’t live forever in the shadow of the early 90’s FDIV bug, we need to move on. Our competition is moving much faster than we are” - I’m paraphrasing.
Many of the engineers in the room could remember the FDIV bug and the ensuing problems caused for Intel 20 years prior. Many of us were aghast that someone highly placed would suggest we needed to cut corners in validation - that wasn’t explicitly said, of course, but that was the implicit message. That meeting there in late 2013 signaled a sea change at Intel to many of us who were there. And it didn’t seem like it was going to be a good kind of sea change. Some of us chose to get out while the getting was good. As someone who worked in an Intel Validation group for SOCs until mid-2014 or so I can tell you, yes, you will see more CPU bugs from Intel than you have in the past from the post-FDIV-bug era until recently.
So this is why Krzanich sold his stock. He knows the bug is his fault. Whoops. I think someone may "quit for personal reasons" soon.
https://www.fool.com/investing/2017/12/19/intels-ceo-just-so...
Edit: Looks like ARM64 was affected, but it has an architectural feature that makes the mitigation much easier: http://lists.infradead.org/pipermail/linux-arm-kernel/2017-N...
I guess I'm glad now that Apple put a two-year-old CPU in the early 2015 MacBook Pro! Besides my 2012 Mac Pro, that is the most expensive machine in the house!
What do you think, is this realistic?