A one in a million bug in Switch kernel (opens in new tab)

(gist.githubusercontent.com)

682 pointsvedanshbhartia4y ago88 comments

88 comments

65 comments · 16 top-level

broodbucket4y ago· 17 in thread

>Makes you think, do Linux, Windows and Mac handle this properly? Honestly, I doubt it!

Context switches, idle state transitions, etc tend to be fairly delicately handled as a common cause of CVEs and Heisenbugs. I'm sure there's still plenty of bugs but more attention ends up being paid to these things on general purpose operating systems. More eyeballs on the code, more security researchers, more hardware variants to expose things that were thought to be fine. Also fuzzing.

chc44y ago

For the record: Returning from an interrupt is fully serializing on x86, which I believe means that all modern operating systems on x86 will handle this properly.

https://pvk.ca/Blog/2019/01/09/preemption-is-gc-for-memory-r... is a very good blog post about exploiting this for a high-performance membarrier daemon.

throwawaylinux4y ago

But a thread can be context switched via a system call, so switching back to it will return to it via a return from system call which is not necessarily serializing on x86. Also memory operations and barriers executed in kernel mode have to be ordered correctly within a thread and context switches there can happen cooperatively so there may be no interrupt at all.

Also a serializing operation is not a memory barrier. It serializes execution of operations in the pipeline, not necessarily coherency operations after completion.

x86 is mostly TSO except possibly some cases of nontemporal stores and write combining memory types. I don't know the minutiae of the ISA and implementations any more but IIRC it could be possible that stores in a a write combining buffer can be visible out of order.

2 more replies

saagarjha4y ago

Right, but it's possible that this is just a side effect of x86 doing TSO, so any stores are visible just by executing the instruction, so I'm not convinced (at least by that snippet) that interrupts have to be fully serializing. If they're not, I wonder if it's possible to observe load-store reordering across a privilege boundary…

1 more reply

ajross4y ago

FWIW: I re-read this a bunch of times, and I don't understand how this isn't a hardware bug. How can an asynchronous interrupt be specified in any rigorous way if it does NOT act as a memory barrier to the interrupted code? Clearly the CPU isn't going to cache its in-flight state for every interrupt (and remember interrupts can be themselves interrupted!). So certainly "most" of its state is being serialized. And we're supposed to magically guess on a per-IP basis which state isn't? Yikes.

I mean, the fix is the same. But arguing about which OSes "handle this properly" is missing the point. The question to ask is which core IPs (and which configurations thereof, remember the Tegra in question has both A53 and A57 cores) require barriers on interrupt entry, and under what circumstances. If ARM isn't going to publish that errata then asking for OS authors to magically figure it out is just asking for bugs.

jmgao4y ago

Knowing the Tegra chip in question, I'd bet it's probably not ARM's fault. Tegra X1, unlike pretty much every other SoC, did big.LITTLE via cluster migration with a custom cache coherence system, instead of just having a bunch of heterogenous cores. It turns out that their custom cache coherence was unfixably broken and would randomly corrupt memory when doing migration between the big and little cores, so everyone was forced to just entirely disable one set of cores. At least NVIDIA managed to fix this one in software?

1 more reply

throwawaylinux4y ago

> FWIW: I re-read this a bunch of times, and I don't understand how this isn't a hardware bug. How can an asynchronous interrupt be specified in any rigorous way if it does NOT act as a memory barrier to the interrupted code? Clearly the CPU isn't going to cache its in-flight state for every interrupt (and remember interrupts can be themselves interrupted!). So certainly "most" of its state is being serialized. And we're supposed to magically guess on a per-IP basis which state isn't? Yikes.

Memory barriers only concern interactions with other agents that access memory.

Any given thread of execution is always consistent with respect to itself, including when taking interrupts.

"Serialized" is also not the same as a barrier and is not really related to memory consistency. Serialization does only matter within a single thread of execution.

saagarjha4y ago

Most OS vendors will just put a full barrier there and be done with it. Switch doesn't for some reason (and still doesn't, for some reason…not entirely sure why, maybe their context switch is ridiculously fast and the extra barrier would be very expensive?) but when you don't do that you run into problems like these. What's harder is processors storing microarchitectural state that you don't know about, but (post Spectre and Meltdown) CPU vendors have realized that this is not a great idea and started fixing it in microcode and/or hardware.

tedunangst4y ago

It makes sense to me you might need a memory flush when doing a core migration. Everything is still coherent from a single core perspective. But if you point a different core at the same PC, well, maybe it sees things differently that haven't been flushed.

1 more reply

Someone4y ago

> I don't understand how this isn't a hardware bug

From what I understand, if the chip’s documentation would say “all interrupt handlers must start with a memory barrier”, this would be a software bug.

Isn’t it the case that a hardware bug for which a workaround is documented before shipping is ‘just’ a misfeature? (In this case, supporting user-supplied interrupt handlers would be a bit complicated. When it gets installed, you’d have to check their first instruction after first making its memory page non-writable by user code)

Back to the workaround: they seem confident that this only is a problem when doing “user-mode cache operations (flush / clean / zero)”, and those, apparently, can all be fixed to set that TLS flag. If I were trying to break into this system, I would look at both assumptions.

In particular, can you clear that secret byte directly after the kernel set it, and get the old behavior back? Worse, does “user-mode cache operations” imply those are completely run in user mode (since they can make this fix, presumably using a library provided by Nintendo)? If so, what prevents you from using your own cache flush code that doesn’t set the flag?

rustybolt4y ago

I might be missing something but don't see your point.

If you want every interrupt to act as a memory barrier you can just insert a memory barrier in the interrupt handler. A reason not to do this is the overhead. Also, if you know the interrupt handler will not interact with memory from the thing it interrupted or migrate the task it interrupted between cores, it isn't necessary to have a memory barrier.

TuxSH4y ago

> and which configurations thereof, remember the Tegra in question has both A53 and A57 cores

Only the A57 cores are enabled at all, perhaps due to a hardware limitation. Consider the A53 missing for all intent and purposes.

svenpeter4y ago

Will Deacon replied on Twitter how at least Linux handles this correctly [1][2] and I assume that the same is true for XNU and Windows as well.

[1] https://twitter.com/WillDeacon/status/1506375874161086471

[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...

AceJohnny24y ago

Cache coherency on SoCs is one of the more hairy aspects, and one that's getting an increasing amount of attention from the software world.

Certainly for Mac it's a well-known topic, though I won't necessarily say it's all safe. Like I said: it's hairy.

(fun fact: when you have 10s-100s of millions of units out there, those "1 in a million" chances become all too frequent...)

not2b4y ago

Cache coherency for SoCs is usually proven with formal verification. It's a very good target for such techniques because the needed properties are straightforward to write down.

I'd be very surprised to see any respectable firm shipping a multicore chip where formal verification has not been done on the cache protocol.

Note that this isn't perfect; if a needed property was not written down and was not checked, errors can be missed. But people have been doing this for at least 15 years now, and there's academic work that is older.

Edit: this doesn't mean everyone does it right, there's at least one example in the comments about someone shipping buggy cache coherence.

klauspost4y ago

x86 Darwin had a bug where AVX512 K (opmask) registers would not be saved and restored on context switches when YMM registers had all 0 values. The kernel assumed that AVX512 wasn't used. They fixed it in Monterey 12.2, but still hasn't backported. Heisenbug for sure.

Ref: https://github.com/golang/go/issues/49233#issuecomment-96373...

KMag4y ago

It's unfortunate that more architectures aren't implemented with something like DEC Alpha's PALCode, where the CPU's firmware is essentially a single-tenant hypervisor. The OS kernel proper uses upcalls to the firmware for manipulating page tables, etc. If abstracted properly, the OS would allocate the storage space, but ask the firmware to serialize and deserialize processor state on context switches.

In retrospect, the Alpha went a bit too far with the permissiveness of its memory model, and it turned out they really did need single-byte load and store instructions. However, it was really an elegant high-level design, and the implementation team was top-notch (same folks worked on StrongArm, AMD Athlon, P.A. Semi's PWRficient, and Apple silicon).

Terry_Roll4y ago

I think it helps if the right tools exist in the first place.

throwaway53714y ago· 10 in thread

offtopic: I really wish reading preformatted text files on ios safari was good.. I have to export the file to Books in order to read it properly

terhechte4y ago

I just tried it out on iOS in Safari's reader mode, and it looks quite good. What don't you like about the output?

jez4y ago

The file has diagrams that are meant to be viewed on a screen with enough space to leave long lines unbroken. If you attempt to get things to fit onto a single line by turning the phone sideways, it just zooms the text, instead of reflowing the text onto a larger line width. The problem exists in both the default rendering and the reader view.

Opening in iBooks essentially prints the text file to a PDF, which defaults to US Letter paper size, which has the effect of making line widths large enough for most 80-character text files to fit without awkward mid-line soft breaks.

The only other solution I know of is to manually zoom the page out to 50%. Luckily the zoom setting is saved by domain, so in this case if you want all raw githubusercontent files to view zoomed out iOS will remember that, but on domains where it’s a mix of text and HTML it’s more annoying.

broodbucket4y ago

Firefox reader mode is nice for taking various webpages, whether it's a news article with bloated JS, a wiki page, simple plaintext etc and giving it a common presentation that you can configure to your liking.

No clue if that's a solution for you on iOS but it's a great feature.

zinekeller4y ago

> No clue if that's a solution for you on iOS but it's a great feature.

Unfortunately, iOS browsers tends to be just reskins of Safari (because it's required by Apple).

1 more reply

eqtn4y ago

iOS Firefox - https://i.imgur.com/1f5PElQ.jpeg

iOS Safari Reader mode - https://i.imgur.com/nDnBSAM.jpeg

iOS Safari - https://i.imgur.com/AhtWv9G.jpeg

kevincox4y ago

I think people need to use text files less. Or at least stop hard-wapping them.

opan4y ago

I love text files and kinda like hard wrapping as well.

Thrymr4y ago

Does that mean that a browser shouldn't be able to display a text file with explicit line breaks, though?

hoseja4y ago

Hard-wrapping is my bane. Especially in RFCs.

aspenmayer4y ago

https://gist.github.com/plutooo/2aadbd4a718e269df474079dd2e5...

throwawaybyeeee4y ago· 5 in thread

Reminds me of a similar bug that I worked on a few years ago that led to my single-line contribution to xnu (apologies for the dissertation):

We had increasing reports of devices panicking because the kernel stopped draining a buffer, causing the buffer to fill. This particular buffer should never fill, so if it does -> panic.

The first problem was that this bug was getting 'hot'. The bug needed to be fixed yesterday, and with the number of internal panics being reported, it was looking like it might delay shipping the OS. I was getting pinged constantly, and was expected to give daily updates in a giant cross-org shunning, the "bug review board" or BRB.

The second problem was, of course, that all the code looked fine. (Spoiler: it was. Sort of.) The relevant drivers were handling synchronization properly and appeared to be race-free, memory management looked fine, no uninitialized variables, etc. No problem, we'll just reproduce it then...

The third problem was that the bug was extremely hard to reproduce. With a single device it could take weeks to hit a single occurrence. So I needed a lot of devices, and every repro had to count.

At this point it was clear that I needed some USB hubs, so off to Fry's (RIP). Two giant USB hubs, one Toblerone bar, and an abundance of charity from QA later, I had ~15 devices hooked to a computer. With this battery of devices I was reproducing the issue once every few days.

Reproducing the bug reliably was a breakthrough, but root-causing the bug still felt like a dim prospect. The cores from the panics showed no smoking gun (our drivers' state looked fine), and my kernel mods to add simple lockless tracing seemed to suppress the bug, in true heisenbug fashion. And of course you're never sure if it actually suppressed the bug -- maybe you just didn't wait enough days?

~6 weeks had passed, filled with BRBs, all-nighters, working weekends, and testing tons of theories, all to no avail. On a whim I decided to revisit my lockless tracing strategy and remove a memory barrier. Alas! The bug triggered and I had tracing data!

Digging into the tracing data, it turned out the problem wasn't in our drivers at all, but was actually in the kernel (IOKit) itself, IOInterruptController specifically. The problem was that IOIC was setting a flag and then immediately enabling interrupts via a MMIO write. With this logic, it was possible for another core to service an interrupt (since they were just enabled via the MMIO write), but still observe the old value of the flag, because there was no barrier between setting the flag and enabling interrupts. (Hence why the barrier added by my original tracing suppressed the bug.) Because IOIC read the wrong flag value, it entered a state that prevented interrupts from being serviced, and our buffer would fill and we'd panic. The fix was to simply add a memory barrier to IOIC between setting the flag and enabling interrupts.

To this day I'm still mystified as to why this bug hadn't caused broken interrupts (+ mysterious behavior) or mass panics before then. There must've been some other change to xnu that exposed the bug somehow, but I'll probably never know.

rkangel4y ago

> To this day I'm still mystified as to why this bug hadn't caused broken interrupts (+ mysterious behavior) or mass panics before then. There must've been some other change to xnu that exposed the bug somehow, but I'll probably never know.

My favourite species of software bugs are the ones that when you find the source you realise the code is fundamentally broken, and you get to investigate how the hell it worked for so long!

fps-hero4y ago

Thank you for sharing, gosh that must have been horrific to track down.

There really is something about USB buses that cause the worst kinds of errors. I happen to know that the USB2 driver in the RPI 1/2/3/4 has a Linux kernel corruption bug which is completely masked by the use of a USB hub.

Why does it matter? Because the RPI 1, 2, 3 all hide the USB port behind a hub. Only the zero, and the a series have a naked port. Now, try searching the RPI forum for USB problems and start to notice a correlation.

The problem is that the hive mind decided that all USB errors must be power related, and given the complete dodgeyness of most RPI zero setups it was always assumed this was the culprit.

Unfortunately it isn’t. No amount of probing, decoupling, externally powering ever fixed the glitches, ah but yes, not using an official 3A RPI branded psi was definitely the issue, sorry Agilent your psu’s just aren’t up to task, probably the reason you had to “rebrand” in the first place. :S

We ended up retrofiting a USB hub in-line with a USB connector for a prototype, and we’ve since designed in the hub just for that one USB port used for USB data storage, which is brilliant at the moment because USB hubs ICs are unobtainium, so, we can’t make any more product, because of this software bug.

Every couple of months I would try the latest kernel, but all you needed to do was write to disk continuously and you would hit the bug in 4 hours max, 10 minutes on average. The best part is the kernel corruption kills the file system, we got a trace on a monitor (normally a headless system) but if you ssh’d in, you were dropped into and empty file system and you couldn’t run any tool to diagnose a problem, a simply tab competition would hang the shell. Fun times.

Never bother reporting the bug because I found hundreds of threads on the forums detailing similar issues. It is very uncool that to this day they still insist on using their bespoke driver instead of trying to mainline their performance fixes, otherwise everyone using the dwc2 ip would have befitted, and this bug would have been fixed with hundreds more eyeballs on the problem, not just the one USB guy at RPI towers.

akino_germany4y ago

Avoid high-throughput devices is even in their list of known USB issues, so they must have some idea: https://www.raspberrypi.com/documentation/computers/raspberr...

saagarjha4y ago

No apologies needed, this is why people visit Hacker News :)

mturmon4y ago

Great anecdote, thanks for sharing your dissertation ;)

gpderetta4y ago· 4 in thread

First of all, it is amazing that the author managed to analyze the patch in so much details, it probably is an effort comparable to the bug fix itself. Still I think the article is missing some bits. I would expect any core migration to require barriers (either implicit or explicit) on both the old and new core otherwise the process would risk seeing its own stores and loads out of order.

But in this case the barrier is predicated on the execution of some cache manipulation instruction, so I suspect things are more complicated. Maybe these specific cache manipulation instructions do not respect the usual architectural memory ordering and require some different set of barriers. Possibly they bypass cache coherence completely and require an actual flush of the cache. That is going to be very expensive and it make sense that it is only done only if the process was actually fiddling with these instructions. 'jmgao' else thread reported that tegra has coherency issues on migration, so it might be related.

saagarjha4y ago

> But in this case the barrier is predicated on the execution of some cache manipulation instruction, so I suspect things are more complicated.

Why do you think so? The explanation given seem reasonable to me…

gpderetta4y ago

As I sad, I would expect the barriers to be needed unconditionally on a core migration. The fact that there is a special flag that is set when (and only when) the cache control instructions are used seem to point to some special handling specifically for those instructions.

Edit: having read the page for the nth time, I think I finally understand your point. The code using the cache instructions had an explicit barrier already, but it would be executed on the wrong thread.

I know nothing about the arm memory model, but likely the dsb sy barrier is a stronger barrier than needed for intercore communication, and it is needed for IO serialisation, for example with an mapped PCI device.

So yes, the article is clear and likely correct, I just failed to understand it fully originally.

2 more replies

cwzwarich4y ago

…

gpderetta4y ago

After rereading the article, the "missing bit", which is actually tangentially touched in the article is that, the barrier is not needed to synchronize between the two core, but to synchronize with other hardware, for example the GPU (hence the note about graphic glitches). So the context switching code need to issue the barrier from the correct core. The Linux kernel for example always issue the additional I/O barrier on core migration.

veltas4y ago· 3 in thread

> And how do you even find / debug a bug like this?

As someone who has worked on cache code, I suspect it's quite possible they were just reviewing this code again and realised the potential hole. Or they were trying to track down some horrific bug and fixed this along the way (whether or not it caused it), reviewing anything to do with caching is probably worth doing because it's notoriously difficult to get right, especially with context switching involved.

Another possibility is that the bug is more deterministic than it looks, under the right conditions, and they managed to replicate it and analyse it in a debugger.

snarf214y ago

In Computer Science there are only two hard problems: Cache Invalidation, Naming Things and Off By One Errors.

retSava4y ago

Even low-probability bugs will surface often enough if you give it enough potential times to do so. There are >100 Mn Switch'es out there, and the interrupts happens at least tens to hundreds of times a second when in use, so plenty of opportunities :)

veltas4y ago

Yep but can they reproduce it? When we say "low probability" we're acting like it's truly random, but in reality they could have stumbled across steps that reproduce it very frequently.

2 more replies

dstick4y ago· 3 in thread

"In the fragile reality of Discworld, and with the gods who like to play games, a million-to-one chance succeeds nine times out of ten."

https://wiki.lspace.org/Million-to-one_chance

grumple4y ago

If you have millions of ops per day, a million-to-one chance of something means you'll see it every day! And it only takes a few noisy customers to bring these issues to light.

Agentlien4y ago

I saw this first hand when I was overseeing the crash reports during the release of an online AAA game. A game with millions of players.

A fairly frequent crash bug was caused by a line with a comment explaining that it could theoretically cause a crash but that risk would be one in a million.

1 more reply

grayclhn4y ago

To be super pedantic, for one million ops it's closer to a 63% chance every day:

Pr[something happens across 1_000_000 events]

= 1 - Pr[nothing happens across 1_000_000 events]

= 1 - Pr[nothing happens once]^1_000_000 ## assuming independence

= 1 - (1 - Pr[something happens once])^1_000_000

= 1 - (1 - 1/1_000_000)^1_000_000

≈ 1 - 0.378

= 0.632

It's still below 99% for 4 million ops.

1 more reply

richardfey4y ago· 3 in thread

Could this have been used for some exploit and that is why Nintendo prioritised it and fixed it?

TravelPiglet4y ago

Or, hopefully, they are releasing a new Switch with more cores and it manifested it self more often on that hardware. :)

Agentlien4y ago

As a player I feel that would be pretty cool. As a developer I would absolutely love it.

edit: this says something about priorities. It bothers me quite a lot how much I need to simplify the graphics for the switch versions of games I work on. It hardly bothers me at all when I play games on switch that the visual fidelity is lower on switch

fallat4y ago

...this sounds so plausible. So so plausible.

throwawaylinux4y ago· 2 in thread

> Makes you think, do Linux, Windows and Mac handle this properly? Honestly, I doubt it!

Rubbish. These kernels (well Linux and Windows) run on systems with hundreds even thousands of cores, on CPUs which are very weakly ordered, with a pretty reasonable level of reliability. A race like this will blow up immediately.

Linux handles this by requiring that a context switch operation includes a full memory barrier so switching off CPU0 has a barrier ordering prior stores on CPU0 with storing a field that implies the task can be migrated (it's not currently running), and switching on to CPU1 has a barrier ordering the load of that flag with subsequent loads from the task on CPU1.

EDIT: here - https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...

  * The basic program-order guarantee on SMP systems is that when a task [t]
  * migrates, all its activity on its old CPU [c0] happens-before any subsequent
  * execution on its new CPU [c1].

It's informally worded but "activity" basically means memory operations (but could include whacky arch and platform specific things to cover all bases), and "happens before" meaning observable from other CPUs, which is clear in context.

bluenose694y ago

Even if you're not following all the arguments involved, you can brighten your day by spending a few moments reading the documentation in this linux code (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...), which is a great example of how to document complex code.

gpderetta4y ago

Note that this specific bug seem to happen only after cache operations (i.e. something like CLFLUSH, CLZERO in x86 parlance). It is possible that these instructions on the Switch SoC require a different barrier either because of spec details or hardware bugs.

fhars4y ago· 1 in thread

Reminds me of this game: https://deadlockempire.github.io/ where you have to think about scheduling problems similar to this.

sneezey4y ago

TIL about this - I've had a few minutes of fun already ha!

jzer0cool4y ago· 1 in thread

What might be a few simple hello world projects to begin a journey understanding how to debug something like this? I suppose understanding of OS is important along with assembly. Given basic knowledge here, could someone list a few lessons to try and any toolsets?

q3k4y ago

Write a toy operating system: https://wiki.osdev.org/Main_Page

For example, start with https://wiki.osdev.org/Bare_Bones or https://wiki.osdev.org/Raspberry_Pi_Bare_Bones

You'll never build anything practical, but it's a great way to learn thing that you'd rarely have the opportunity to learn otherwise. Armed with that wide but shallow knowledge, you'll suddenly see many new opportunities to learn / do things that you wouldn't even have thought of before.

zinekeller4y ago

> This bug has existed since day zero, which means that it took 5 years (!) for Nintendo to track it down. Credits to whoever nameless employee at Nintendo found this bug! The attention to detail is incredible. And how do you even find / debug a bug like this? Makes you think, do Linux, Windows and Mac handle this properly? Honestly, I doubt it!

Considering the various kernel-level races in mainstream kernels (*BSD, Darwin, Linux and NT), I actually doubt that these kinds of bug are fully eliminated (only fixed in cases where such race has security implications).

kabdib4y ago

I spent a couple weeks finding a similar bug, a one instruction window where a hardware "wake up" register could get a stale value if an interrupt-and-reschedule happened at just the right instruction. The fix was to swap two instructions, so a register write happened in the correct order.

I still remember the moment of clarity when the very thorny, complicated problem resolved into something obvious and simple, with a trivial fix. Hard problems seldom resolve so easily. You don't get these very often, cherish them :-)

trinovantes4y ago

CPU interrupts were the bane of my existence back in my undergrad OS class

They were incredibly rare/difficult to replicate and reason through. Props to the unknown engineer that solved this

omegacharlie4y ago

> And how do you even find / debug a bug like this?

I would imagine it takes a hardware debbugger for breakpoints, inspecting CPU register state, etc.

offtopic: This is a post with link to the raw gist and another to gist.github.com

jeffybefffy5194y ago

I love these kinda of bugs, so simple yet so complex.

novok4y ago

I'm surprised that nintendo maintains their own OS for the switch console too, instead of using linux or other unix derivative : https://en.wikipedia.org/wiki/Nintendo_Switch_system_softwar...

j / k navigate · click thread line to collapse

88 comments

65 comments · 16 top-level

broodbucket4y ago· 17 in thread

>Makes you think, do Linux, Windows and Mac handle this properly? Honestly, I doubt it!

chc44y ago

For the record: Returning from an interrupt is fully serializing on x86, which I believe means that all modern operating systems on x86 will handle this properly.

https://pvk.ca/Blog/2019/01/09/preemption-is-gc-for-memory-r... is a very good blog post about exploiting this for a high-performance membarrier daemon.

throwawaylinux4y ago

Also a serializing operation is not a memory barrier. It serializes execution of operations in the pipeline, not necessarily coherency operations after completion.

2 more replies

saagarjha4y ago

1 more reply

ajross4y ago

jmgao4y ago

1 more reply

throwawaylinux4y ago

Memory barriers only concern interactions with other agents that access memory.

Any given thread of execution is always consistent with respect to itself, including when taking interrupts.

"Serialized" is also not the same as a barrier and is not really related to memory consistency. Serialization does only matter within a single thread of execution.

saagarjha4y ago

tedunangst4y ago

1 more reply

Someone4y ago

> I don't understand how this isn't a hardware bug

From what I understand, if the chip’s documentation would say “all interrupt handlers must start with a memory barrier”, this would be a software bug.

rustybolt4y ago

I might be missing something but don't see your point.

TuxSH4y ago

> and which configurations thereof, remember the Tegra in question has both A53 and A57 cores

Only the A57 cores are enabled at all, perhaps due to a hardware limitation. Consider the A53 missing for all intent and purposes.

svenpeter4y ago

Will Deacon replied on Twitter how at least Linux handles this correctly [1][2] and I assume that the same is true for XNU and Windows as well.

[1] https://twitter.com/WillDeacon/status/1506375874161086471

[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...

AceJohnny24y ago

Cache coherency on SoCs is one of the more hairy aspects, and one that's getting an increasing amount of attention from the software world.

Certainly for Mac it's a well-known topic, though I won't necessarily say it's all safe. Like I said: it's hairy.

(fun fact: when you have 10s-100s of millions of units out there, those "1 in a million" chances become all too frequent...)

not2b4y ago

Cache coherency for SoCs is usually proven with formal verification. It's a very good target for such techniques because the needed properties are straightforward to write down.

I'd be very surprised to see any respectable firm shipping a multicore chip where formal verification has not been done on the cache protocol.

Edit: this doesn't mean everyone does it right, there's at least one example in the comments about someone shipping buggy cache coherence.

klauspost4y ago

Ref: https://github.com/golang/go/issues/49233#issuecomment-96373...

KMag4y ago

Terry_Roll4y ago

I think it helps if the right tools exist in the first place.

throwaway53714y ago· 10 in thread

offtopic: I really wish reading preformatted text files on ios safari was good.. I have to export the file to Books in order to read it properly

terhechte4y ago

I just tried it out on iOS in Safari's reader mode, and it looks quite good. What don't you like about the output?

jez4y ago

broodbucket4y ago

No clue if that's a solution for you on iOS but it's a great feature.

zinekeller4y ago

> No clue if that's a solution for you on iOS but it's a great feature.

Unfortunately, iOS browsers tends to be just reskins of Safari (because it's required by Apple).

1 more reply

eqtn4y ago

iOS Firefox - https://i.imgur.com/1f5PElQ.jpeg

iOS Safari Reader mode - https://i.imgur.com/nDnBSAM.jpeg

iOS Safari - https://i.imgur.com/AhtWv9G.jpeg

kevincox4y ago

I think people need to use text files less. Or at least stop hard-wapping them.

opan4y ago

I love text files and kinda like hard wrapping as well.

Thrymr4y ago

Does that mean that a browser shouldn't be able to display a text file with explicit line breaks, though?

hoseja4y ago

Hard-wrapping is my bane. Especially in RFCs.

aspenmayer4y ago

https://gist.github.com/plutooo/2aadbd4a718e269df474079dd2e5...

throwawaybyeeee4y ago· 5 in thread

Reminds me of a similar bug that I worked on a few years ago that led to my single-line contribution to xnu (apologies for the dissertation):

We had increasing reports of devices panicking because the kernel stopped draining a buffer, causing the buffer to fill. This particular buffer should never fill, so if it does -> panic.

The third problem was that the bug was extremely hard to reproduce. With a single device it could take weeks to hit a single occurrence. So I needed a lot of devices, and every repro had to count.

rkangel4y ago

My favourite species of software bugs are the ones that when you find the source you realise the code is fundamentally broken, and you get to investigate how the hell it worked for so long!

fps-hero4y ago

Thank you for sharing, gosh that must have been horrific to track down.

The problem is that the hive mind decided that all USB errors must be power related, and given the complete dodgeyness of most RPI zero setups it was always assumed this was the culprit.

akino_germany4y ago

Avoid high-throughput devices is even in their list of known USB issues, so they must have some idea: https://www.raspberrypi.com/documentation/computers/raspberr...

saagarjha4y ago

No apologies needed, this is why people visit Hacker News :)

mturmon4y ago

Great anecdote, thanks for sharing your dissertation ;)

gpderetta4y ago· 4 in thread

saagarjha4y ago

> But in this case the barrier is predicated on the execution of some cache manipulation instruction, so I suspect things are more complicated.

Why do you think so? The explanation given seem reasonable to me…

gpderetta4y ago

So yes, the article is clear and likely correct, I just failed to understand it fully originally.

2 more replies

cwzwarich4y ago

…

gpderetta4y ago

veltas4y ago· 3 in thread

> And how do you even find / debug a bug like this?

Another possibility is that the bug is more deterministic than it looks, under the right conditions, and they managed to replicate it and analyse it in a debugger.

snarf214y ago

In Computer Science there are only two hard problems: Cache Invalidation, Naming Things and Off By One Errors.

retSava4y ago

veltas4y ago

Yep but can they reproduce it? When we say "low probability" we're acting like it's truly random, but in reality they could have stumbled across steps that reproduce it very frequently.

2 more replies

dstick4y ago· 3 in thread

"In the fragile reality of Discworld, and with the gods who like to play games, a million-to-one chance succeeds nine times out of ten."

https://wiki.lspace.org/Million-to-one_chance

grumple4y ago

If you have millions of ops per day, a million-to-one chance of something means you'll see it every day! And it only takes a few noisy customers to bring these issues to light.

Agentlien4y ago

I saw this first hand when I was overseeing the crash reports during the release of an online AAA game. A game with millions of players.

A fairly frequent crash bug was caused by a line with a comment explaining that it could theoretically cause a crash but that risk would be one in a million.

1 more reply

grayclhn4y ago

To be super pedantic, for one million ops it's closer to a 63% chance every day:

Pr[something happens across 1_000_000 events]

= 1 - Pr[nothing happens across 1_000_000 events]

= 1 - Pr[nothing happens once]^1_000_000 ## assuming independence

= 1 - (1 - Pr[something happens once])^1_000_000

= 1 - (1 - 1/1_000_000)^1_000_000

≈ 1 - 0.378

= 0.632

It's still below 99% for 4 million ops.

1 more reply

richardfey4y ago· 3 in thread

Could this have been used for some exploit and that is why Nintendo prioritised it and fixed it?

TravelPiglet4y ago

Or, hopefully, they are releasing a new Switch with more cores and it manifested it self more often on that hardware. :)

Agentlien4y ago

As a player I feel that would be pretty cool. As a developer I would absolutely love it.

fallat4y ago

...this sounds so plausible. So so plausible.

throwawaylinux4y ago· 2 in thread

> Makes you think, do Linux, Windows and Mac handle this properly? Honestly, I doubt it!

EDIT: here - https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...

  * The basic program-order guarantee on SMP systems is that when a task [t]
  * migrates, all its activity on its old CPU [c0] happens-before any subsequent
  * execution on its new CPU [c1].

bluenose694y ago

gpderetta4y ago

fhars4y ago· 1 in thread

Reminds me of this game: https://deadlockempire.github.io/ where you have to think about scheduling problems similar to this.

sneezey4y ago

TIL about this - I've had a few minutes of fun already ha!

jzer0cool4y ago· 1 in thread

q3k4y ago

Write a toy operating system: https://wiki.osdev.org/Main_Page

For example, start with https://wiki.osdev.org/Bare_Bones or https://wiki.osdev.org/Raspberry_Pi_Bare_Bones

zinekeller4y ago

kabdib4y ago

trinovantes4y ago

CPU interrupts were the bane of my existence back in my undergrad OS class

They were incredibly rare/difficult to replicate and reason through. Props to the unknown engineer that solved this

omegacharlie4y ago

> And how do you even find / debug a bug like this?

I would imagine it takes a hardware debbugger for breakpoints, inspecting CPU register state, etc.

offtopic: This is a post with link to the raw gist and another to gist.github.com

jeffybefffy5194y ago

I love these kinda of bugs, so simple yet so complex.

novok4y ago

I'm surprised that nintendo maintains their own OS for the switch console too, instead of using linux or other unix derivative : https://en.wikipedia.org/wiki/Nintendo_Switch_system_softwar...

j / k navigate · click thread line to collapse