Packets of Death

(blog.krisk.org)

734 points · quentusrex · 13y ago · 113 comments

83 comments · 32 top-level

guylhem13y ago· 8 in thread

That is great HN content!

Debugging deep down the rabbit hole, until you find a bug in the NIC EEPROM - and the disbelief many show when hearing a software message can bring down a NIC.

I for one would enjoy reading more content like this on HN than what qualifies at best as a Friday-night hack.

brazzy13y ago

> the disbelief many show when hearing a software message can bring down a NIC.

Shouldn't be a surprise to anyone. Firmware is just software, and it necessarily deals with raw bytes. Not really surprising that it can contain bugs that are triggered by certain byte patterns.

ajross13y ago

Indeed. But unlike "just software", firmware is never, ever fixable in source form except deep within the bowels of hardware manufacturers. Even when found, firmware bugs are generally ignored. I wonder if anyone has checked: has the fixed image been released to linux-firmware, or the Windows driver yet? Will it ever?

Just browse through the ./drivers tree of the kernel source some time and look at all the "quirks" and "workarounds" there. Recognize that virtually all of those could have been fixed in firmware, but weren't because no one cared.

It's just deeply depressing. This was a virtuoso debugging performance, but it didn't have to be that way if hardware companies were sane. But they aren't, and we all pay for it.

(edit: I just checked for myself, the Linux e1000e driver apparently doesn't support runtime firmware update via the kernel API, so linux-firmware wouldn't be expected to have this. I don't know what the process is for an affected end user to get a copy. I suspect there is none.)

voidlogic13y ago

It would be interesting to see what the actual error in the NIC firmware source was.

This invalidates my assumption that a shop like Intel probably uses formal verification in firmware development.

It's also scary to consider how many very important systems (nuclear/dam control, etc.), while they themselves might be formally verified, depend on services of lower-level software (OS, drivers) and hardware (firmware) that are not...

moconnor13y ago

In this light, the surprising thing is that we don't hear about firmware bugs more often. Have manufacturers been doing a great job of keeping things simple and testing exhaustively, driven by fear of the losses an incident and recall would cause?

curiousdannii13y ago

I know a dev on the OLPC project who told me they found bugs in the firmware of the SD cards they were using for their laptop 'harddrive'. I imagine it'd be a whole lot of fun fixing them.

noselasd13y ago

You may then enjoy these:

* http://www.youtube.com/watch?v=euMHlV6MNqs

* http://www.youtube.com/watch?v=8Q8EFwKVKdA

guylhem13y ago

(watched this and submitted it immediately to HN)

HIGHLY recommended to anyone who enjoyed the article.

It's a remote PHY injection at the hardware level - basically hacking the chip directly thanks to "special" content, i.e. putting a packet header within the packet.

sideproject13y ago

That is some serious debugging, supported by even more serious persistence by the author. Total respect.

engtech13y ago· 8 in thread

As someone who works with FPGAs/ASICs, this isn't that weird.

Everything gets serialized/deserialized these days, so there's all kinds of boundary conditions where you can flip just the right bit and get the data to be deserialized the wrong way.

What's more interesting is that it bypasses all of the checks to prevent this from happening.

Here is the wiki page on the INVITE OF DEATH which sounds like the problem you hit:

http://en.wikipedia.org/wiki/INVITE_of_Death

huhtenberg13y ago

> Everything gets serialized/deserialized these days, ... and get the data to be deserialized the wrong way.

Can you elaborate? I recognize the words, but not the meaning.

bigiain13y ago

Anybody else waiting for him to reply with something like:

"Oh yeah, I used to work at Intel - that nic's got a YAML parser in it"…

noselasd13y ago

Often there's just a pair of wires/pins into a chip that you use to control the chip, for a NIC, modem, radio, what have you - accompanying this is a protocol you use to communicate with the chip.

For a NIC, for example, not many things need to go wrong when you encapsulate a packet it should forward out on the wire for it to look like a control packet instead, triggering some undesired result on the chip. Or vice versa.

engtech13y ago

The INVITE of death was discovered on Feb 16th, 2009.

http://ims-bisf.nexginrc.org/OpenSBC-vul.html

jacquesm13y ago

It's the payload that triggers the bug, not a header!

Definitely not this bug; the one linked is not Intel-specific.

mikeash13y ago

INVITE of Death looks completely different. There, malformed packets would cause trouble in the VoIP software that was trying to parse them. Here, well-formed packets would sometimes cause trouble in an ethernet controller that shouldn't even be trying to parse them.

IheartApplesDix13y ago

Well, it's very weird if you understand anything about Network protocols. It's a layer of complexity that shouldn't be being touched by your NIC, so there shouldn't be a bug there because there shouldn't be code there.

bigiain13y ago

<hat type="conspiracy theorist">I wonder what _other_ data coming down the wire that nic is monitoring and executing code in response to?</hat>

wglb13y ago· 7 in thread

Very good detective work. However, a small suggestion, given:

> I’ve been working with networks for over 15 years and I’ve never seen anything like this. I doubt I’ll ever see anything like it again.

This is an excellent case for fuzz testing. My thinking is that you want to get your Ruby, EventMachine and Redis going and run a constant fuzz with all sorts of packets in your pre-shipping lab.

The idea is that you want to create a condition where you do see it, and the other handful of lockups that are there that you haven't yet seen.

laughinghan13y ago

Fuzz testing would have been very unlikely to help since any byte value at that position besides ASCII 1, 2, or 3 "inoculated" the NIC from the bug. There is a very excellent case to be made for fuzz testing, but this isn't it.

Given that, for all we know the relevant parties did conduct extensive fuzz testing and your condescension is misplaced.
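To put a rough number on that argument, here is a minimal sketch. It takes the description in this thread at face value: the first long packet after a cold boot decides the NIC's fate, three byte values at the magic position kill it, and everything else inoculates it. The function names are made up for illustration, and the 3-in-256 figure is the thread's claim, not something measured here.

```python
import math

# Assumption from this thread: of the 256 possible values at the magic
# offset, 3 (ASCII '1', '2', '3') kill the NIC and the rest inoculate it.
KILL_VALUES = 3
P_KILL = KILL_VALUES / 256


def p_detect(cold_boots: int) -> float:
    """Chance that at least one of `cold_boots` independent runs hits a kill value."""
    return 1.0 - (1.0 - P_KILL) ** cold_boots


def boots_needed(confidence: float) -> int:
    """Smallest number of cold boots giving at least `confidence` detection odds."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - P_KILL))


if __name__ == "__main__":
    print(f"per-boot odds of a kill: {P_KILL:.4f}")
    print(f"cold boots for 95% detection: {boots_needed(0.95)}")
```

Under these assumptions a random fuzz run only gets one meaningful attempt per cold boot, so the number of power cycles matters far more than the number of packets sent.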

wglb13y ago

No condescension is intended. Fuzz testing is often not what many folks think of in a situation like this.

It wouldn't surprise me at all if there are other issues that don't depend on these exact circumstances to get something to fall over.

Jabbles13y ago

Surely that's the manufacturer's job.

Since it's caused by a specific byte at a specific place, surely you'd only need to fuzz an average of 256 packets (of the required length) to find it... which suggests it wasn't done at all... zero...

wisty13y ago

Or it was done once, and the "inoculation" byte came first.

There were 4 buggy bytes. 3 would crash the card. 1 would fix the card ("inoculating" it). If they only fuzzed once, or fuzzed with the same dataset (random.seed(1) // seed chosen at random), then they wouldn't catch it.

Even then, it's pretty sloppy.

wglb13y ago

Well, my feeling is that if you are going to ship it, you better fuzz it. And if you are assembling something with this card, you are the manufacturer.

A true fuzzing run for something like this ought to run in the tens of millions of packets.

arnsholt13y ago

That's assuming you know the magical position. If you need to test all positions, it's 256 to the power of the number of bytes in the message.

fusiongyro13y ago

It is, but your clients are going to complain to you.

EvanAnderson13y ago· 6 in thread

I've always had mixed emotions about NICs that have hardware assisted offload features. I welcome the decrease in CPU utilization and increased throughput, but the NIC ends up being a complex system that very subtle bugs can lurk inside versus being a simple I/O device that a kernel driver controls.

If there's denial of service hiding in there I wonder about what other security bugs might be lurking. It's scary stuff, and pretty much impossible to audit yourself.

Edit:

Also, I'm a little freaked-out that the EEPROM on the NIC can be modified easily with ethtool. I would have hoped for some signature verification. I guess I'm hoping for too much.

Edit 2:

I wonder if this isn't the same issue described here: https://bugzilla.redhat.com/show_bug.cgi?id=632650

jevinskie13y ago

Be very afraid of PCI firmwares. You can insert rootkits there that have full access to RAM. An IOMMU can mitigate this threat.

EvanAnderson13y ago

It sounds like, in this case, the OP is talking about the EEPROM holding code executed by the embedded coprocessor on the NIC (or, at least, lookup tables that the coprocessor uses) rather than a PCI option ROM that will be executed by the host computer's CPU. Depending on how the access to the EEPROM is performed (i.e. if such access is facilitated by the co-processor versus being read out directly from the EEPROM) I'd think an attacker could even implement "stealth" functionality to allow the compromised EEPROM to appear to be benign when audited.

Depending on what functionality is being offloaded to the NIC (are there still NICs that do IPSEC and crypto offload?) there's the possibility for information disclosure vulnerabilities in the NIC itself. Yikes.

dfox13y ago

The EEPROM does not contain the actual firmware, but its configuration (or configuration of the hardware itself in the case of simple designs). Essentially any Fast Ethernet or newer NIC has some configuration EEPROM that is accessible by ethtool (which is what the tool is for, after all). Common use cases for ethtool are persistently changing the MAC address and fixing broken hardware (either to work around pre-existing HW bugs or to fix it after this EEPROM somehow got erased).

Generally the EEPROM does not contain anything like executable code, although it probably can contain patches for microcode on many NICs.

tgcyhv13y ago

Isn't the EEPROM "patch" with ethtool simply sending the inoculating packet (the one with the "40" value) through the network stack?

EvanAnderson13y ago

That's not the impression I'm getting from the article or the author's comments.

api13y ago

Makes me wonder if someday a bug (or "administrative function") in the firmware of a hardware device used at a major virtualization farm (EC2, Linode, VPS.NET, etc.) will be used to r00t the hardware nodes and then go wild all over everything there. Be afraid.

meshko13y ago· 4 in thread

I have mixed feelings about the write-up. I think it becomes clear pretty early on that the issue is in the NIC hardware, at which point it is time to stop wasting your time investigating a problem you can't fix and start contacting the vendor.

jerdfelt13y ago

In my experience dealing with a similar bug (see my other post in the thread), the vendors will immediately assume it's not their problem.

They spent a long time "showing" us that a different version of the Linux kernel didn't exhibit the problem so it must be a Linux kernel bug. Turned out the different version just sent data differently so it didn't trigger the same bug with the same data. Other data would have triggered it.

I wouldn't be surprised if the majority of "bugs" they receive reports on turn out to not be bugs in their hardware. There are probably parallels with reports of compiler bugs; most end up not being bugs in the compiler.

The unfortunate truth is that the responsibility of proving it's the vendor's bug falls on the customer.

I had to write a proof-of-concept "exploit" to show the problem was with their hardware, effectively troubleshooting most of the problem for them.

homosaur13y ago

THIS.

It's always someone else's testing procedures, someone else's hardware... The thing is though, most of the time it is. Tech support, at the lower levels especially, is used to dealing with people who have bad configurations or are using the products incorrectly. The annoyance comes in when you as a customer narrow a problem down but can't get anyone on the phone who can help you at that level.

EvanAnderson13y ago

The write-up gave me the feeling that the OP did just that. My experience has been that manufacturers will blow you off unless you can provide a reproducible test case. Certainly, his company doesn't sound like it has the volume necessary to threaten Intel w/ moving to a different NIC manufacturer unless the problem is resolved. With that in mind, I think the amount of work he did was just the right amount.

vacri13y ago

"Closed: Works for me"

jerdfelt13y ago· 3 in thread

I ran into a similar problem with an Intel motherboard about 10 years ago.

We had problems when some NFS traffic would end up getting stalled. Our NFS server would use UDP packets larger than the MTU and they would end up getting fragmented.

Turns out the NIC would not look at the fragmentation headers of the IP packet and always assume a UDP header was present. From time to time, the payload of the NFS packet would have user data that matched the UDP port number the NIC would scan for to determine if the packet should be forwarded to the BMC. This motherboard had no BMC but it was configured as if it did have one.

It would time out after a second or so but in the meantime drop a bunch of packets. The NFS server would retransmit the packet but since the payload didn't change, the NIC would reliably drop the rest of the fragments of the packet.

Of course Intel claimed it wasn't their bug ("it's a bug in the Linux NFS implementation") but they quickly changed their tune when I coded up a sample program that would send one packet a second and reliably cause the NIC to drop 99% of packets received.

While it turned out to be a fairly lame implementation problem on Intel's part (both by ignoring the fragmentation headers and the poor implementation of the motherboard) I have to say it was very satisfying to solve the mystery.
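A minimal sketch of the failure mode described above, not the NIC's actual logic. It contrasts a filter that always assumes a UDP header follows the IP header with one that first checks the IPv4 fragment offset. The port number 623 (the usual RMCP/IPMI port) and all function names are assumptions; the comment does not say which port the NIC scanned for.

```python
import struct

BMC_UDP_PORT = 623  # assumption: usual RMCP/IPMI port; the comment names no port
IPPROTO_UDP = 17


def build_ipv4(payload: bytes, frag_offset_units: int = 0,
               more_fragments: bool = False) -> bytes:
    """Build a minimal 20-byte IPv4 header (checksum left at 0) plus payload."""
    flags_frag = (0x2000 if more_fragments else 0) | (frag_offset_units & 0x1FFF)
    header = struct.pack(
        "!BBHHHBBH4s4s",
        0x45, 0, 20 + len(payload), 0, flags_frag, 64, IPPROTO_UDP, 0,
        bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2]),
    )
    return header + payload


def naive_is_bmc(packet: bytes) -> bool:
    """Buggy filter: assumes a UDP header always follows the IP header."""
    ihl = (packet[0] & 0x0F) * 4
    (dst_port,) = struct.unpack_from("!H", packet, ihl + 2)
    return packet[9] == IPPROTO_UDP and dst_port == BMC_UDP_PORT


def careful_is_bmc(packet: bytes) -> bool:
    """Only parse UDP ports when this is the first fragment (offset 0)."""
    (flags_frag,) = struct.unpack_from("!H", packet, 6)
    if flags_frag & 0x1FFF:  # non-zero offset: these bytes are mid-datagram data
        return False
    return naive_is_bmc(packet)
```

A later fragment whose user data happens to carry the bytes for port 623 at the right position fools `naive_is_bmc` but not `careful_is_bmc`, which is the kind of mismatch the comment describes.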

EvanAnderson13y ago

Reading about the OP's issue got me to a doc from Intel (http://www.intel.com/content/dam/doc/application-note/sideba...) re: the "NC Sideband Interface", which sounds like the place where the bug that bit you "lives". Reading over that doc made me shudder a few times, thinking about the complexity and, thus, potential bugs that could be lurking there. I wonder if the OP's bug was related, too.

Having the NIC inspecting incoming frames and potentially diverting them to the management controller sounds like a scary proposition. I'd almost rather just have dedicated Ethernet hardware for the management controller. The decrease in switch ports needed is certainly seductive, but I wonder if it's worth the risk.

(Do you happen to recall which Intel motherboard this bit you on? I was just getting out of whitebox Intel motherboard-based server builds about the time you're describing, but I'm just curious if only for the nostalgia.)

jevinskie13y ago

"IPMI operates independently of the OS and allows administrators to manage a system remotely even without an OS, system management software, and even if the monitored system is powered off (as long as it is connected to a power source). IPMI can also function after an OS has started, offering enhanced features when used with system management software."

Yikes! Sounds like system management mode in a BIOS!

jerdfelt13y ago

I don't remember which motherboard it was, sorry. Looking at my resume, this was back in the 2000-2001 time frame.

jacquesm13y ago· 3 in thread

Persistent bugger.

"With a modified HTTP server configured to generate the data at byte value (based on headers, host, etc) you could easily configure an HTTP 200 response to contain the packet of death - and kill client machines behind firewalls!"

That's worrisome, I'll bet there are lots of not-so-nice guys trying to figure out a way to do just that. There must be tons of server hardware out there with these cards in them.

mrb13y ago

I just set up my web server to serve the packet of death:

$ wget http://zorinaq.com/pub/intel-packet-of-death.txt

It has 0x32 at offset 0x47f regardless of the size of the IP and TCP headers. Try to run the wget AS SOON AS POSSIBLE AFTER HAVING COLD BOOTED the machine (it is the very first packet of 1152+ bytes that determines if the NIC will crash or be inoculated until the next cold boot; well... unless it is the "no-op" packet).

Edit: fixed link.

thechut13y ago

I read the whole thing and that is the line that stuck out most to me. This could be very scary. It could be used to bring down a webserver.

EvanAnderson13y ago

It sounds like the vulnerability could be used to bring down any machine you can send an arbitrary Ethernet frame to. (I immediately wonder if it works for broadcast frames? Sounds like a way to take down a LAN full of machines quickly if it does.)

Edit: Per http://www.kriskinc.com/intel-pod it does work on broadcast frames. Yikes!

0x013y ago· 2 in thread

So is it only the byte at 0x47f that matters? Could you just send a packet filled with 0x32 0x32 0x32 0x32 0x32 to trigger this? (Like, download a file full of 0x32s?) Or does it have to look like a SIP packet?

You'd think the odds of getting a packet with 0x32 in position 0x47f is almost 1/256 per packet? So why aren't these network cards falling over everywhere every few seconds?

wvenable13y ago

Probably because there is a 2/256 chance of getting sent the inoculation value. But it's a good question.

caf13y ago

Later in the article it states that any value other than 0x31, 0x32 or 0x33 acts as an "inoculation value", so that would be a 253/256 chance for each packet of at least 1151 bytes.
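For anyone who wants to check captured traffic against this description, here is a minimal sketch of a frame classifier. The offset 0x47f comes from the thread; treating it as an offset into the raw frame, and treating 0x31 to 0x33 as the suspect set, are assumptions, since commenters here disagree on exactly which values do what. Note that a frame needs at least 0x480 (1152) bytes for offset 0x47f to exist at all.

```python
MAGIC_OFFSET = 0x47F                  # offset quoted in the thread
SUSPECT_VALUES = {0x31, 0x32, 0x33}   # per the comment above; an assumption


def classify_frame(frame: bytes) -> str:
    """Label a captured frame according to the model described in the thread."""
    if len(frame) <= MAGIC_OFFSET:
        return "too-short"    # the magic offset does not exist in this frame
    if frame[MAGIC_OFFSET] in SUSPECT_VALUES:
        return "suspect"
    return "inoculating"


if __name__ == "__main__":
    sample = bytearray(1152)
    sample[MAGIC_OFFSET] = 0x32
    print(classify_frame(bytes(sample)))
```

This only inspects bytes that have already been captured; it sends nothing.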

corford13y ago· 2 in thread

My servers all have the affected cards (two per machine - yikes!) but so far I can't reproduce the bug (yay).

There are subtle differences between the offsets I get when I run "ethtool -e interface" versus those in the article that indicate an affected card (but they're quite close).

Mine are:

0x0010: ff ff ff ff 6b 02 69 83 43 10 d3 10 ff ff 58 a5

0x0030: c9 6c 50 31 3e 07 0b 46 84 2d 40 01 00 f0 06 07

0x0060: 00 01 00 40 48 13 13 40 ff ff ff ff ff ff ff ff

Output of "ethtool -i interface" (in case anyone wants to compare notes):

driver: e1000e

version: 1.5.1-k

firmware-version: 1.8-0

I tested both packet replays by broadcasting to all attached devices on a simple Gbit switch and no links dropped.
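Comparing dumps like the one above by eye is error-prone, so here is a minimal sketch of a parser and differ for rows in the `0xNNNN: aa bb ...` shape shown in this comment. The function names are hypothetical, and the sample only contains the three rows quoted above.

```python
SAMPLE = """\
0x0010: ff ff ff ff 6b 02 69 83 43 10 d3 10 ff ff 58 a5
0x0030: c9 6c 50 31 3e 07 0b 46 84 2d 40 01 00 f0 06 07
0x0060: 00 01 00 40 48 13 13 40 ff ff ff ff ff ff ff ff
"""


def parse_eeprom_dump(text: str) -> dict[int, bytes]:
    """Map each row offset to its bytes, for lines like '0x0010: ff ff ...'."""
    rows: dict[int, bytes] = {}
    for line in text.splitlines():
        offset, sep, data = line.partition(":")
        if not sep or not offset.strip().lower().startswith("0x"):
            continue  # skip headers and blank lines
        rows[int(offset.strip(), 16)] = bytes.fromhex(data.strip().replace(" ", ""))
    return rows


def diff_dumps(a: dict[int, bytes], b: dict[int, bytes]) -> list[tuple[int, int, int]]:
    """Return (absolute_offset, byte_in_a, byte_in_b) for every differing byte."""
    changes = []
    for row in sorted(a.keys() & b.keys()):
        for i, (x, y) in enumerate(zip(a[row], b[row])):
            if x != y:
                changes.append((row + i, x, y))
    return changes
```

Feeding it two saved dumps, one from a card known to be affected and one from a card that is not, would list the differing offsets directly.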

mrb13y ago

You need to shut down, boot up the server, and do a test right away. The very first packet of 1152 bytes or more that it receives after a cold boot determines if the NIC is going to be affected or "inoculated" (until next cold boot).

corford13y ago

Thanks mrb, I missed that a cold power up was needed. I'm going to try again now but it's a bit tricky as the affected machines are in a different country and I don't have access to full remote power cycling (I can only reset the machines). Hopefully, the data centre staff will be accommodating (after all, if my machines are affected, likely hundreds of their other clients are too as I'm using dedicated servers provided by them).

EDIT: it's difficult to tell definitively doing it remotely, but I still can't reproduce the bug after a cold boot.

altcognito13y ago· 2 in thread

http://en.wikipedia.org/wiki/Ping_of_death

huhtenberg13y ago

That was an OS-level bug, it's far less exciting.

altcognito13y ago

I'll agree it's more interesting in that the end result was a box that required a hard boot, but still these two issues aren't that distantly related: it affected routers and many, many OS platforms, so it's not as if it was related to some implementation detail that MS left out of Windows.

Correct me if I'm wrong (no, seriously) -- aren't both "packets of death" just poor handling of said malformed packets? Violations of their respective protocols? (TCP/SIP)

quentusrexOP13y ago· 2 in thread

It appears to work if you send the packet to the network broadcast address. Quick way to detect if any of the machines are vulnerable (they won't respond to the second ping).

baq13y ago

you conveniently omitted the part in which you walk to the racks and reboot them all.

wglb13y ago

Right! Except it sounds worse--sounded like you needed to cycle power to bring them back.

TapaJob13y ago· 1 in thread

Fantastic article, fantastic find. Well done.

As a telecoms engineer predominantly selling Asterisk for the last 4 years, with Asterisk experience extending back to 2006, it's shocking to see this finally put right. For so many years I have avoided the e1000 Intel controllers after a very public/embarrassing situation when a conferencing server behaved in a weird manner, disrupting core services. Not having the expertise the author has, I narrowed it down to the Ethernet controller, immediately replaced the server with IBM hardware with a Broadcom chipset, and resumed our services providing conferencing to some of the top FTSE100 companies.

Following this episode, I spent numerous days diagnosing the chipset, with many conference calls with Digium engineers debugging the server remotely. In the end: no solution, a recommendation to avoid the e1000 chipset, and we moved on.

TapaJob13y ago

brings back memories....

http://lists.debian.org/debian-isp/2009/06/msg00018.html

elasticdog13y ago· 1 in thread

Before actually testing this with the real payload, is there a better way of determining if you have a potentially vulnerable driver than something like this?

  # awk '/eth/ { print $1 }' <(ifconfig -a) | cut -d':' -f1 | uniq | while read interface; do echo -n "$interface "; ethtool -i $interface | grep driver; done
  eth0 driver: e1000e
  eth1 driver: e1000e

minaguib13y ago

This is not about the particular linux driver, but about a particular chipset, and even then, only sometimes...

The linux e1000e may support many chipsets, so the fact that it's in service on your box doesn't necessarily mean you're running the suspect chipset, or that it's vulnerable.

Check with lspci -v, and check with the concrete test using the cold boot+magic packet others and the OP have posted.
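As a rough first pass before the cold-boot test, here is a minimal sketch that scans `lspci -nn` output for a PCI vendor:device ID. The ID 8086:10d3 is the one commonly listed for the 82574L, but treat it as an assumption and confirm it against your own hardware. The sample lines in the usage block are illustrative, not taken from a real machine, and a match only says the chipset is present, not that the card is vulnerable.

```python
import re

# Assumption: 8086:10d3 is the vendor:device ID commonly listed for the 82574L.
SUSPECT_IDS = {"8086:10d3"}

_ID_RE = re.compile(r"\[([0-9a-f]{4}:[0-9a-f]{4})\]")


def suspect_slots(lspci_nn_output: str) -> list[str]:
    """Return PCI slot addresses whose [vendor:device] ID is in SUSPECT_IDS."""
    slots = []
    for line in lspci_nn_output.splitlines():
        ids = _ID_RE.findall(line.lower())
        if any(i in SUSPECT_IDS for i in ids):
            slots.append(line.split()[0])
    return slots


if __name__ == "__main__":
    # Illustrative input only; real output would come from running `lspci -nn`.
    example = (
        "02:00.0 Ethernet controller [0200]: Example NIC A [8086:10d3]\n"
        "03:00.0 Ethernet controller [0200]: Example NIC B [8086:1502]\n"
    )
    print(suspect_slots(example))
```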

drucken13y ago· 1 in thread

Intriguing.

Intel 82574L ethernet controller looks to be popular too. Intel, Supermicro, Tyan and Asus use it on multiple current motherboards and Asus notably on their WS (Workstation) variants of consumer motherboards, e.g. the Asus P8Z77 WS (socket LGA 1155) and Asus Z9PE-D8 WS (dual CPU, socket LGA 2011).

dfox13y ago

It's quite popular because while it has large amount of weird quirks (usually specific to silicon revision / configuration) it still works and in many cases better than other comparable chipsets.

jws13y ago· 1 in thread

Well this hurts. I have a critical machine with a dual NIC Intel motherboard. I had to abandon the 82579LM port because of unresolved bugs in the Linux drivers, and the other one is a 82574L, the one documented in this post.

I suppose I can send just the right ICMP echo packet to the router to make it send me back an inoculating frame.

cdvonstinkpot13y ago

Good luck

ChuckMcM13y ago

Makes me wonder if this is related to in-band management? One of the interesting things about working at NetApp, which had its own "OS", was that every driver was written by engineering. That allowed the full challenge of some of these devices to be experienced first hand.

One of the more painful summers resulted from a QLogic HBA which sometimes, for no apparent reason, injected a string of hex digits into the data it transmitted. There is a commemorative t-shirt of that bug with just the string of characters. It led NetApp to put in-block checksums into the file system so that corruption between the disk and memory, which was 'self inflicted' (and so passed various channel integrity checks), could be detected.

Here at Blekko we had a packet fragment that would simply vanish into the center switch. It would go in and never come out. We never got a satisfactory answer for that one. Keith, our chief architect, worked around it by randomizing the packet on a retransmit request.

The amount of code between your data and you that you can't control is, sadly, way larger than you probably would like.
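The in-block checksum idea can be illustrated with a minimal sketch. This is a generic CRC32-per-block scheme with hypothetical function names, not a description of how NetApp actually implemented it; the comment above does not give those details.

```python
import struct
import zlib


def seal_block(data: bytes) -> bytes:
    """Append a CRC32 of the payload so corruption downstream is detectable."""
    return data + struct.pack("!I", zlib.crc32(data))


def open_block(block: bytes) -> bytes:
    """Return the payload, or raise ValueError if the stored CRC does not match."""
    data, (stored,) = block[:-4], struct.unpack("!I", block[-4:])
    if zlib.crc32(data) != stored:
        raise ValueError("block checksum mismatch")
    return data
```

Because the checksum travels with the data from the moment it is written, bytes injected by an adapter along the way show up as a mismatch on read even when every link-level integrity check passed.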

cheeseprocedure13y ago

I've been unable to reproduce this on systems equipped with the controller in question. I'd love to see "ethtool -e ethX" output for a NIC confirmed to be vulnerable.

/edit Ah, I spoke too soon; the author has updated his page here with diffs between affected and unaffected EEPROMs:

http://www.kriskinc.com/intel-pod

lifeisstillgood13y ago

Can anyone remember the source of the quote :

  Sometimes bug fixing simply takes two people to lock themselves in a room and nearly kill themselves for two days.

Reminded me of this

quentusrexOP13y ago

Updated with more specific info: http://www.kriskinc.com/intel-pod

shawndumas13y ago

http://computer.yourdictionary.com/truck-roll

sc68cal13y ago

I'm not surprised - firmware for ethernet controllers has grown quite complex, with the addition of new features that allow the hardware to do more work on behalf of the kernel.

Could this be a bug in the code of the EEPROM that handles TCP offloading, or one of the other hardware features that are now becoming more common? (https://en.wikipedia.org/wiki/TCP_offload_engine)

devicenull13y ago

Wow, I've run into what seems to be the same problem with this controller before. We "fixed" it by upgrading the e1000 driver.

grego13y ago

I had something similar in my home network, but my network-fu is not good enough and I did not have the time to debug for days and weeks.

Basically one linux box with NVidia embedded gigabit controller could take down the whole segment. It would only happen after a random period, like after days when the box was busy. No two machines connected to the same switch would be able to ping each other any more after that. I suspected the switch, bad cables, etc. In the end I successfully circumvented the problem by buying a discrete gigabit ethernet card for the server in question.

noonespecial13y ago

Kielhofner is a pretty awesome guy. I met him a couple of times "back in the day" at Astricon conferences when he was hacking together Astlinux.

He was instrumental in taming the Soekris and Alix SBC boards of old and creating Asterisk appliances with them. If you've got a little Asterisk box running on some embedded-looking hardware somewhere, it doesn't matter whose name is on the sticker, it's got some Kielhofner in it.

I live about a mile from Star2Star. I ought to pop in one of these days and see what they're up to.

astangl13y ago

This seems much more serious than the much-ballyhooed Pentium FDIV bug. Hopefully Intel will be on the ball with notifying people and distributing the fix.

lukego13y ago

Cool!

I'm currently working on an open source project where we are chasing "hang really hard and need a reboot to come back" issues with exactly this same ethernet controller, the Intel 82574L. I wonder if it's related!

Our Github issue: https://github.com/SnabbCo/snabbswitch/issues/39

sriramnrn13y ago

Reminds me of my own adventures with systems hanging on PXE boot when a Symantec Ghost PreOS Image didn't boot up completely, and went on to flood the network with packets. See http://dynamicproxy.livejournal.com/46862.html

spitfire13y ago

This somehow reminds me of the slammer SQL worm. A simply formed single packet caused a tsunami over the internet.

Personally, I am not at all surprised that this sort of thing exists. I'm sure there are lots more defects out there to be found. Turing completeness is a cruel master.

viraptor13y ago

It's like a reverse example of a broken packet... You can see a number of interesting samples and stories in the museum of broken packets: http://lcamtuf.coredump.cx/mobp/

X413y ago

Congrats Sir, you've just discovered the Internet Kill-Switch!

The “red telephone,” used to shut down the entire Internet comes to mind.

You discovered how to immunize friends and kill enemies in CyberWars.

Do governments have an Internet kill switch?

Yes, see Egypt & Syria they're good examples. We know China is doing Cyberwars, they are beyond Kill-Switches.

Techcrunch: http://techcrunch.com/2011/03/06/in-search-of-the-internet-k...

Wiki: http://en.wikipedia.org/wiki/Internet_kill_switch

We know governments deploy hardware that they can control when needed. Smartphones are the best examples of government-issued backdoors, next to some Intel hardware (including NICs).

Garbage13y ago

The author mentioned a custom packet generator tool, "Ostinato". I met the author of this tool 2-3 months back. A lone guy working on this tool as a side project. Amazing work. :)

anabis13y ago

Great diligence! I had 1G hubs lock up with the Intel 82578DM. I was too lazy to track it down, so I just dropped the speed to 100M, which made it work.

j / k navigate · click thread line to collapse

113 comments

83 comments · 32 top-level

guylhem13y ago· 8 in thread

That is great HN content!

Debugging deep down the rabbit hole, until you find a bug in the NIC EEPROM - and the disbelief many show when hearing a software message can bring down a NIC.

I for one would enjoy reading more content like this on HN that what qualifies as best as a friday-night hack

brazzy13y ago

> the disbelief many show when hearing a software message can bring down a NIC.

Shouldn't be a surprise to anyone. Firmware is just software, and it necessarily deals with raw bytes. Not really surprising that it can contain bugs that are triggered by certain byte patterns.

ajross13y ago

It's just deeply depressing. This was a virtuoso debugging performance, but it didn't have to be that way if hardware companies were sane. But they aren't, and we all pay for it.

voidlogic13y ago

It would be interesting to see what the actual error in the NIC firmware source was.

This invalidates my assumption that a shop like Intel probably uses formal verification in firmware development.

2 more replies

moconnor13y ago

2 more replies

curiousdannii13y ago

I know a dev on the OLPC project who told me they found bugs in the firmware of the SD cards they were using for their laptop 'harddrive'. I imagine it'd be a whole lot of fun fixing them.

noselasd13y ago

You may then enjoy these:

* http://www.youtube.com/watch?v=euMHlV6MNqs

* http://www.youtube.com/watch?v=8Q8EFwKVKdA

guylhem13y ago

(watched this and submitted it immediately to HN)

HIGHLY recommended to anyone who enjoyed the article.

It's a remote PHY injection at the hardware level - basically hacking the chip directly thanks to "special" content, i.e. putting a packet header within the packet.

sideproject13y ago

That is some serious debugging, supported by even more serious persistence by the author. Total respect.

engtech13y ago· 8 in thread

As someone who works with FPGAs/ASICs, this isn't that weird.

Everything gets serialized/deserialized these days, so there's all kinds of boundary conditions where you can flip just the right bit and get the data to be deserialized the wrong way.

What's more interesting is that it bypasses all of the checks to prevent this from happening.

Here is the wiki page on the INVITE OF DEATH which sounds like the problem you hit:

http://en.wikipedia.org/wiki/INVITE_of_Death

huhtenberg13y ago

> Everything gets serialized/deserialized these days, ... and get the data to be deserialized the wrong way.

Can you elaborate? I recognize the words, but not the meaning.

bigiain13y ago

Anybody else waiting for him to reply with something like:

"Oh yeah, I used to work at Intel - that nic's got a YAML parser in it"…

noselasd13y ago

Often there's just a pair of wires/pins into a chip that you use to control it, whether it's a NIC, modem, radio, what have you. Accompanying this is a protocol you use to communicate with the chip.

engtech13y ago

The INVITE of death was discovered on Feb 16th, 2009.

http://ims-bisf.nexginrc.org/OpenSBC-vul.html

jacquesm13y ago

It's the payload that triggers the bug, not a header!

Definitely not this bug, the one linked is not intel specific.

mikeash13y ago

IheartApplesDix13y ago

bigiain13y ago

<hat type="conspiracy theorist">I wonder what _other_ data coming down the wire that nic is monitoring and executing code in response to?</hat>

wglb13y ago· 7 in thread

Very good detective work. However, a small suggestion, given:

> I’ve been working with networks for over 15 years and I’ve never seen anything like this. I doubt I’ll ever see anything like it again.

The idea is that you want to create a condition where you do see it, and the other handful of lockups that are there that you haven't yet seen.

laughinghan13y ago

Given that, for all we know the relevant parties did conduct extensive fuzz testing and your condescension is misplaced.

wglb13y ago

No condescension is intended. Fuzz testing is often not what many folks think of in a situation like this.

It wouldn't surprise me at all if there are other issues that don't depend on these exact circumstances to get something to fall over.

Jabbles13y ago

Surely that's the manufacturer's job.

wisty13y ago

Or it was done once, and the "inoculation" byte came first.

Even then, it's pretty sloppy.

wglb13y ago

Well, my feeling is that if you are going to ship it, you better fuzz it. And if you are assembling something with this card, you are the manufacturer.

A true fuzzing run for something like this ought to run in the tens of millions of packets.

arnsholt13y ago

That's assuming you know the magical position. If you need to test all positions, it's 256 to the power of the number of bytes in the message.
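To put rough numbers on the two cases being argued here, a quick sketch follows. The function names are made up for illustration, and the 1152-byte length is only an assumed example frame size, chosen because it is long enough to contain a zero-based offset of 0x47f.

```python
def single_byte_cases(length: int) -> int:
    """Test cases when one byte position at a time takes every value."""
    return 256 * length


def exhaustive_cases(length: int) -> int:
    """Test cases when every byte of the message varies independently."""
    return 256 ** length


if __name__ == "__main__":
    n = 1152  # assumed example length, not a figure from the article
    print(single_byte_cases(n))              # 294912
    # The exhaustive count is too large to print usefully; show its size in bits.
    print(exhaustive_cases(n).bit_length())  # 9217
```

Under those assumptions, a run in the tens of millions of packets comfortably covers the one-byte-at-a-time case, while the exhaustive case is out of reach for any message of realistic size.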

fusiongyro13y ago

It is, but your clients are going to complain to you.

EvanAnderson13y ago· 6 in thread

If there's denial of service hiding in there I wonder about what other security bugs might be lurking. It's scary stuff, and pretty much impossible to audit yourself.

Edit:

Also, I'm a little freaked-out that the EEPROM on the NIC can be modified easily with ethtool. I would have hoped for some signature verification. I guess I'm hoping for too much.

Edit 2:

I wonder if this isn't the same issue described here: https://bugzilla.redhat.com/show_bug.cgi?id=632650

jevinskie13y ago

Be very afraid of PCI firmwares. You can insert rootkits there that have full access to RAM. An IOMMU can mitigate this threat.

EvanAnderson13y ago

dfox13y ago

Generally the EEPROM does not contain anything like executable code, although it probably can contain patches for microcode on many NICs.

tgcyhv13y ago

Isn't the EEPROM "patch" with ethtool simply sending the inoculating packet (the one with the "40" value) through the network stack?

EvanAnderson13y ago

That's not the impression I'm getting from the article or the author's comments.

api13y ago

meshko13y ago· 4 in thread

jerdfelt13y ago

In my experience dealing with a similar bug (see my other post in the thread), the vendors will immediately assume it's not their problem.

The unfortunate truth is that the responsibility of proving it's the vendor's bug falls on the customer.

I had to write a proof-of-concept "exploit" to show the problem was with their hardware, effectively troubleshooting most of the problem for them.

homosaur13y ago

THIS.

EvanAnderson13y ago

vacri13y ago

"Closed: Works for me"

jerdfelt13y ago· 3 in thread

I ran into a similar problem with an Intel motherboard about 10 years ago.

We had problems when some NFS traffic would end up getting stalled. Our NFS server would use UDP packets larger than the MTU and they would end up getting fragmented.

EvanAnderson13y ago

jevinskie13y ago

Yikes! Sounds like system management mode in a BIOS!

jerdfelt13y ago

I don't remember which motherboard it was, sorry. Looking at my resume, this was back in the 2000-2001 time frame.

jacquesm13y ago· 3 in thread

Persistent bugger.

That's worrisome, I'll bet there are lots of not-so-nice guys trying to figure out a way to do just that. There must be tons of server hardware out there with these cards in them.

mrb13y ago

I just set up my web server to serve the packet of death:

$ wget http://zorinaq.com/pub/intel-packet-of-death.txt

Edit: fixed link.

thechut13y ago

I read the whole thing and that is the line that stuck out most to me. This could be very scary. It could be used to bring down a webserver.

EvanAnderson13y ago

Edit: Per http://www.kriskinc.com/intel-pod it does work on broadcast frames. Yikes!

0x013y ago· 2 in thread

You'd think the odds of getting a packet with 0x32 in position 0x47f are about 1/256 per packet? So why aren't these network cards falling over everywhere every few seconds?

wvenable13y ago

Probably because there is a 2/256 chance of getting sent the inoculation value. But it's a good question.

caf13y ago

Later in the article it states that any value other than 0x31, 0x32 or 0x33 acts as an "inoculation value", so that would be a 253/256 chance for each packet of at least 1151 bytes.
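The arithmetic in this comment can be sketched as below. This is only an illustration of the claim as stated in the thread: the helper names are hypothetical, the offset is assumed to be zero-based (so a frame needs at least 1152 bytes to have a byte there), and real traffic is not uniformly random, so the fraction is a back-of-the-envelope figure rather than a measured one.

```python
from fractions import Fraction

TRIGGER_OFFSET = 0x47F                # byte position named in the thread
NON_INOCULATING = {0x31, 0x32, 0x33}  # values said above NOT to inoculate


def classify(frame: bytes) -> str:
    """Label a frame by the byte at TRIGGER_OFFSET (hypothetical helper)."""
    if len(frame) <= TRIGGER_OFFSET:
        return "too short"
    if frame[TRIGGER_OFFSET] in NON_INOCULATING:
        return "not inoculating"
    return "inoculating"


def inoculation_probability() -> Fraction:
    """Chance that a uniformly random byte at the offset inoculates."""
    return Fraction(256 - len(NON_INOCULATING), 256)


if __name__ == "__main__":
    frame = bytearray(1152)           # all zero bytes, assumed example length
    print(classify(bytes(frame)))     # inoculating
    frame[TRIGGER_OFFSET] = 0x32
    print(classify(bytes(frame)))     # not inoculating
    print(inoculation_probability())  # 253/256
```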

corford13y ago· 2 in thread

My servers all have the affected cards (two per machine - yikes!) but so far I can't reproduce the bug (yay).

There are subtle differences between the offsets I get when I run "ethtool -e interface" versus those in the article that indicate an affected card (but they're quite close).

Mine are:

0x0010: ff ff ff ff 6b 02 69 83 43 10 d3 10 ff ff 58 a5

0x0030: c9 6c 50 31 3e 07 0b 46 84 2d 40 01 00 f0 06 07

0x0060: 00 01 00 40 48 13 13 40 ff ff ff ff ff ff ff ff

Output of "ethtool -i interface" (in case anyone wants to compare notes):

driver: e1000e version: 1.5.1-k firmware-version: 1.8-0

I tested both packet replays by broadcasting to all attached devices on a simple Gbit switch and no links dropped.
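For anyone else comparing notes, a small sketch of how two "ethtool -e" dumps could be diffed byte by byte. It assumes lines of the form "0xOFFSET: hex bytes" as quoted above; the function names are hypothetical, and the second dump is a synthetic copy with one byte changed by hand, not data from any real or affected card.

```python
def parse_dump(text: str) -> dict[int, bytes]:
    """Map each row offset to its bytes, skipping lines that are not hex rows."""
    rows = {}
    for line in text.splitlines():
        head, sep, tail = line.partition(":")
        head = head.strip().lower()
        if not sep or not head.startswith("0x"):
            continue
        try:
            rows[int(head, 16)] = bytes.fromhex("".join(tail.split()))
        except ValueError:
            continue  # not a well-formed hex row
    return rows


def diff_dumps(a: dict[int, bytes], b: dict[int, bytes]) -> list[tuple[int, int, int]]:
    """Return (absolute offset, byte in a, byte in b) for each differing byte."""
    changes = []
    for row in sorted(a.keys() & b.keys()):
        for i, (x, y) in enumerate(zip(a[row], b[row])):
            if x != y:
                changes.append((row + i, x, y))
    return changes


# Rows quoted in the comment above.
MINE = """\
0x0010: ff ff ff ff 6b 02 69 83 43 10 d3 10 ff ff 58 a5
0x0030: c9 6c 50 31 3e 07 0b 46 84 2d 40 01 00 f0 06 07
0x0060: 00 01 00 40 48 13 13 40 ff ff ff ff ff ff ff ff
"""

# SYNTHETIC: one byte altered by hand purely to exercise the diff.
OTHER = MINE.replace("58 a5", "58 00")

if __name__ == "__main__":
    for offset, x, y in diff_dumps(parse_dump(MINE), parse_dump(OTHER)):
        print(f"0x{offset:04x}: {x:02x} -> {y:02x}")  # 0x001f: a5 -> 00
```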

mrb13y ago

corford13y ago

EDIT: it's difficult to tell definitively doing it remotely but I still can't re-produce the bug after a cold boot.

altcognito13y ago· 2 in thread

http://en.wikipedia.org/wiki/Ping_of_death

huhtenberg13y ago

That was an OS-level bug, it's far less exciting.

altcognito13y ago

Correct me if I'm wrong (no, seriously) -- aren't both "packets of death" just poor handling of said malformed packets? Violations of their respective protocols? (TCP/SIP)

quentusrexOP13y ago· 2 in thread

It appears to work if you send the packet to the network broadcast address. Quick way to detect if any of the machines are vulnerable (they won't respond to the second ping).

baq13y ago

you conveniently omitted the part in which you walk to the racks and reboot them all.

wglb13y ago

Right! Except it sounds worse--sounded like you needed to cycle power to bring them back.

TapaJob13y ago· 1 in thread

Fantastic article, fantastic find. Well done.

TapaJob13y ago

brings back memories....

http://lists.debian.org/debian-isp/2009/06/msg00018.html

elasticdog13y ago· 1 in thread

Before actually testing this with the real payload, is there a better way of determining if you have a potentially vulnerable driver than something like this?

  # awk '/eth/ { print $1 }' <(ifconfig -a) | cut -d':' -f1 | uniq | while read interface; do echo -n "$interface "; ethtool -i $interface | grep driver; done
  eth0 driver: e1000e
  eth1 driver: e1000e

minaguib13y ago

This is not about the particular linux driver, but about a particular chipset, and even then, only sometimes...

The linux e1000e may support many chipsets, so the fact that it's in service on your box doesn't necessarily mean you're running the suspect chipset, or that it's vulnerable.

Check with lspci -v, and check with the concrete test using the cold boot+magic packet others and the OP have posted.

drucken13y ago· 1 in thread

Intriguing.

dfox13y ago

It's quite popular because, while it has a large number of weird quirks (usually specific to silicon revision / configuration), it still works, and in many cases works better than other comparable chipsets.

jws13y ago· 1 in thread

I suppose I can send just the right ICMP echo packet to a router to make it send me back an inoculating frame.

cdvonstinkpot13y ago

Good luck

ChuckMcM13y ago

The amount of code between your data and you that you can't control is, sadly, way larger than you probably would like.

cheeseprocedure13y ago

I've been unable to reproduce this on systems equipped with the controller in question. I'd love to see "ethtool -e ethX" output for a NIC confirmed to be vulnerable.

/edit Ah, I spoke too soon; the author has updated his page here with diffs between affected and unaffected EEPROMs:

http://www.kriskinc.com/intel-pod

lifeisstillgood13y ago

Can anyone remember the source of the quote :

  Sometimes bug fixing simply takes two people to lock themselves in a room and nearly kill themselves for two days.

Reminded me of this

quentusrexOP13y ago

Updated with more specific info: http://www.kriskinc.com/intel-pod

shawndumas13y ago

http://computer.yourdictionary.com/truck-roll

sc68cal13y ago

I'm not surprised - firmware for ethernet controllers has grown quite complex, with the addition of new features that allow the hardware to do more work on behalf of the kernel.

Could this be a bug in the code of the EEPROM that handles TCP offloading, or one of the other hardware features that are now becoming more common? (https://en.wikipedia.org/wiki/TCP_offload_engine)

devicenull13y ago

Wow, I've run into what seems to be the same problem with this controller before. We "fixed" it by upgrading the e1000 driver.

grego13y ago

I had something similar in my home network, but my network-fu is not good enough and I did not have the time to debug for days and weeks.

noonespecial13y ago

Kielhofner is a pretty awesome guy. I met him a couple of times "back in the day" at Astricon conferences when he was hacking together Astlinux.

I live about a mile from Star2Star. I ought to pop in one of these days and see what they're up to.

astangl13y ago

This seems much more serious than the much-ballyhooed Pentium FDIV bug. Hopefully Intel will be on the ball with notifying people and distributing the fix.

lukego13y ago

Cool!

Our Github issue: https://github.com/SnabbCo/snabbswitch/issues/39

sriramnrn13y ago

spitfire13y ago

This somehow reminds me of the slammer SQL worm. A simply formed single packet caused a tsunami over the internet.

Personally, I am not at all surprised that this sort of thing exists. I'm sure there are lots more defects out there to be found. Turing completeness is a cruel master.

viraptor13y ago

It's like a reverse example of a broken packet... You can see a number of interesting samples and stories in the museum of broken packets: http://lcamtuf.coredump.cx/mobp/

X413y ago

Congrats Sir, you've just discovered the Internet Kill-Switch!

The “red telephone,” used to shut down the entire Internet comes to mind.

You discovered how to immunize friends and kill enemies in CyberWars.

Do governments have an Internet kill switch?

Yes, see Egypt & Syria they're good examples. We know China is doing Cyberwars, they are beyond Kill-Switches.

Techcrunch: http://techcrunch.com/2011/03/06/in-search-of-the-internet-k...

Wiki: http://en.wikipedia.org/wiki/Internet_kill_switch

We know governments deploy hardware that they can control when needed. Smartphones are the best examples of government-issued backdoors, next to some Intel hardware (including NICs).

Garbage13y ago

The author mentioned a custom packet generator tool, "Ostinato". I met the author of this tool 2-3 months back. A lone guy working on this tool as a side project. Amazing work. :)

anabis13y ago

Great diligence! I had 1G hubs lock up with the Intel 82578DM. I was too lazy to track it down, so I just dropped the speed to 100M, which made it work.
