Also, massively over-dramatised. Yes, a bug worth finding and knowing about, but it’s not a time bomb - very few users are likely to be affected by this.
Knowing the nature of OS kernels, I’m guessing even just putting a Mac laptop to sleep would be enough to avoid this issue as it would reset the TCP stack - which may be why some people are reporting much longer uptimes without hitting this problem, since (iirc) uptime doesn’t reset on Macs just for a sleep? Only for a full reboot?
Anyway, all in all, yeah hopefully Apple fix this but it’s not something anyone needs to panic about.
I have a reasonably strong suspicion that I experienced this a week or two back, on a MacBook that doesn't go into sleep automatically and quite likely had 50-ish days of uptime.
It had all the symptoms described - TCP connections not working while I could still ping everywhere just fine, and all the other devices on the same network were fine. Switching WiFi networks and plugging in to ethernet didn't help. A reboot "fixed" it.
It’s not a disaster, but very annoying. At least now I can just schedule a reboot every 30 days at minimum to keep things running.
> We are actively working on a fix that is better than rebooting — a targeted workaround that addresses the frozen tcp_now without requiring a full system restart. Until then, schedule your reboots before the clock runs out.
Sometimes it just stops networking completely, turning the wifi adapter on/off brings it back just fine. It's also a good time to reboot =)
Remember the golden rule: if you can't be bothered to write it yourself, why should your audience be bothered to read it?
tcp_now = 4,294,960,000 (frozen at pre-overflow value)
The mistake in the blog post is that timer isn't wrapped, even though the post notes it should be: timer = 4,294,960,000 + 30,000 = 4,294,990,000, which wraps to 4,294,990,000 - 2^32 = 22,704
Therefore: TSTMP_GEQ(4294960000, 22704)
= 4294960000 - 22704
= 4294937296
= 4294937296 >= 0 ? → true! (not false)
This is a bug of course, but it would cause sockets in the TIME_WAIT state to be reaped any time tcp_gc() is called, regardless of whether 2*MSL has passed or not. This only happens, though, if tcp_now gets stuck after 4,294,937,296 ms from boot.
A bug similar to what the blog described can happen, however, if tcp_now gets stuck at least 30 seconds before it would have wrapped. Since tcp_now is only updated if there is TCP traffic, this can happen if there is no TCP traffic for at least 30 seconds before it would roll over (2^32 ms from boot).
It should be easy to prevent the latter from happening with some TCP traffic, though reaping TIME_WAIT connections early isn't great either.
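For anyone who wants to check the arithmetic, here is a minimal Python sketch of the comparison for the blog's numbers. It assumes TSTMP_GEQ(a, b) is defined as ((int)((a) - (b)) >= 0), as in BSD-derived TCP stacks; that definition is an assumption here, although the (int) cast does appear in the blog computation quoted elsewhere in this thread. The helper `to_int32` is a hypothetical name for the cast.

```python
# Assumption: TSTMP_GEQ(a, b) is ((int)((a) - (b)) >= 0) on 32-bit timestamps.
MOD = 2**32

def to_int32(value: int) -> int:
    """Reinterpret an unsigned 32-bit value as a signed 32-bit integer."""
    value %= MOD
    return value - MOD if value >= 2**31 else value

tcp_now = 4_294_960_000              # frozen value quoted from the blog post
timer = (tcp_now + 30_000) % MOD     # wraps to 22704

raw_diff = (tcp_now - timer) % MOD   # unsigned difference: 4294937296
signed_diff = to_int32(raw_diff)     # after the (int) cast: -30000

print(timer, raw_diff, signed_diff, signed_diff >= 0)
```

With the cast the test is -30000 >= 0; without it, it is 4294937296 >= 0. Which of the two matches the kernel depends on the macro's actual definition.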
says 51 days, which would be an interesting number of (milli)seconds
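For comparison, a quick conversion of the wrap point into days, assuming the counter in question is an unsigned 32-bit millisecond value as described in this thread:

```python
from datetime import timedelta

# Assumption: the counter is an unsigned 32-bit count of milliseconds.
wrap = timedelta(milliseconds=2**32)
wrap_days = wrap.total_seconds() / 86_400

print(wrap)                  # 49 days, 17:02:47.296000
print(round(wrap_days, 2))   # 49.71
```

So a 32-bit millisecond counter rolls over a bit before the 50-day mark, not at 51 days.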
You don't have to run the system for 50 days. You can simulate the environment and tick the clock faster. Many high reliability systems are tested this way.
If you wanted to see how time impacts the program, you'd prob change fns like calculate_tcp_clock to take uptime as an argument so that you could sanity check it.
The code that uses that value can be run in an environment where that value can be controlled.
I have written code that does this same thing and built a test harness for it.
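A minimal sketch of that approach in Python, not taken from any real harness: the clock function takes uptime as an argument, so a test can jump straight to the interesting moments instead of waiting 50 days. Here `calculate_tcp_clock` is a hypothetical stand-in for the kernel function of the same name, `first_expiry` is an invented helper, and the comparison assumes TSTMP_GEQ is ((int)((a) - (b)) >= 0).

```python
from typing import Optional

MOD = 2**32
TIME_WAIT_MS = 30_000
DAY_MS = 86_400_000

def calculate_tcp_clock(uptime_ms: int) -> int:
    """Hypothetical stand-in: derive a 32-bit ms timestamp from injected uptime."""
    return uptime_ms % MOD

def tstmp_geq(a: int, b: int) -> bool:
    """Wrap-safe a >= b, assuming ((int)((a) - (b)) >= 0) semantics."""
    diff = (a - b) % MOD
    return (diff - MOD if diff >= 2**31 else diff) >= 0

def first_expiry(armed_at_uptime_ms: int, step_ms: int = 1_000,
                 limit_ms: int = 120_000) -> Optional[int]:
    """Tick simulated uptime forward; return ms until the timer fires, or None."""
    timer = (calculate_tcp_clock(armed_at_uptime_ms) + TIME_WAIT_MS) % MOD
    for elapsed in range(0, limit_ms + 1, step_ms):
        now = calculate_tcp_clock(armed_at_uptime_ms + elapsed)
        if tstmp_geq(now, timer):
            return elapsed
    return None

# Arm a timer on day 1, and another a few seconds before the 2**32 ms wrap.
for start in (1 * DAY_MS, MOD - 7_296):
    print(start, first_expiry(start))
```

In this model the clock keeps advancing, so both timers fire after 30000 ms, including the one that straddles the wrap. Replacing `calculate_tcp_clock` with a version that stops advancing is how you would reproduce the stuck-clock case.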
these kernel versions:
Darwin Kernel Version 20.6.0: Thu Jul 6 22:12:47 PDT 2023; root:xnu-7195.141.49.702.12~1/RELEASE_ARM64_T8101 arm64
Darwin Kernel Version 17.7.0: Wed Apr 24 21:17:24 PDT 2019; root:xnu-4570.71.45~1/RELEASE_X86_64 x86_64
so... wonder what that's about?
tcp_now = 4,294,960,000 (frozen at pre-overflow value)
timer = 4,294,960,000 + 30,000 = 4,294,990,000
(exceeds uint32 max → wraps to a small number)
timer wraps to a small number, they say TSTMP_GEQ(4294960000, 4294990000)
they forgot to wrap it there, it should be TSTMP_GEQ(4294960000, small_number)
= (int)(4294960000 - 4294990000)
= (int)(-30000)
= -30000 >= 0 ? → false!
wrong!
There may be a short time period where this bug occurs, and if you get enough TCP connections to TIME_WAIT in that period, they could stick around, maybe. But I think the original post is completely overreacting and was probably written by an LLM, lol.
If tcp_now stops updating at <= 2^32 - 30000 milliseconds, then TSTMP_GEQ(tcp_now, timer) will always fail since timer is tcp_now + 30000 which won't wrap.
This does look like it is possible since calculate_tcp_clock() which updates tcp_now only runs when there's TCP traffic. So if at 49 days uptime you halted all TCP traffic and waited about a day, tcp_now would be stuck at the value before you halted TCP traffic.
In cases where tcp_now gets stuck at > 2^32 - 30000, it looks like TCP sockets in TIME_WAIT will end up being closed immediately instead of waiting 30 seconds, which isn't great either.
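The "only updated when there's TCP traffic" scenario can be modelled in a few lines of Python. This is a sketch of the description in this comment only, not of the actual XNU code: `LazyTcpClock` and `on_traffic` are hypothetical names, and the comparison assumes TSTMP_GEQ is ((int)((a) - (b)) >= 0).

```python
MOD = 2**32
TIME_WAIT_MS = 30_000
DAY_MS = 86_400_000

def tstmp_geq(a: int, b: int) -> bool:
    """Wrap-safe a >= b, assuming ((int)((a) - (b)) >= 0) semantics."""
    diff = (a - b) % MOD
    return (diff - MOD if diff >= 2**31 else diff) >= 0

class LazyTcpClock:
    """Hypothetical model: tcp_now is refreshed only when TCP traffic is seen."""

    def __init__(self) -> None:
        self.tcp_now = 0

    def on_traffic(self, uptime_ms: int) -> None:
        self.tcp_now = uptime_ms % MOD

clock = LazyTcpClock()
clock.on_traffic(49 * DAY_MS)                  # last TCP activity at 49 days
timer = (clock.tcp_now + TIME_WAIT_MS) % MOD   # TIME_WAIT deadline

# A day passes with no TCP traffic, so tcp_now has not moved.
stale_check = tstmp_geq(clock.tcp_now, timer)

# Traffic resumes at 50 days of uptime, past the 2**32 ms wrap.
clock.on_traffic(50 * DAY_MS)
resumed_check = tstmp_geq(clock.tcp_now, timer)

print(stale_check, resumed_check)
```

While the clock is stale the check cannot pass, since tcp_now is still 30000 ms short of the deadline; in this model it passes again once traffic refreshes the clock, even though the refresh lands after the wrap.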
https://github.com/apple-oss-distributions/xnu/blame/f6217f8...
This is a weird thing to cite if it's a macOS 26 bug. I quite regularly go over 50 days of uptime without issues so it makes sense for it to be a new bug, and maybe they had different bugs in the past with similar symptoms.
Current `uptime` on my work MacBook (macOS 15.7.4):
17:14 up 50 days, 22 mins, 16 users, load averages: 2.06 1.95 1.94
Am I supposed to be having issues with TCP connections right now? (I'm not.)
My personal iMac is at 279 days of uptime.
They almost never do live that long, for whatever reason, but they should.
At the very least, the writing takes way too long to get to a point.
The longest uptime I have had on any of my recent laptops is probably around 90 days but that’s because that laptop was sitting in my garage with wall power connected (probably bad for the battery) and some external storage connected and I’d remote into that machine over WireGuard now and then. When I did reboot that machine it was only out of habit that I accidentally clicked on reboot via a remote graphical session.
Most of the time my remote use of the laptop in the garage would be ssh sessions, but occasionally I’d use Remote Desktop. Right after I clicked reboot in the Remote Desktop session I realized the mistake I had just made: I have WireGuard set up to start after login. So after the reboot, I was temporarily unable to get back in. As I was in another country I couldn’t just walk over to the garage. But I do have family that could, so I instructed one of them over the phone on how to log in for me so that WireGuard would automatically start back up. You’d think this would happen only once, but I probably had to send family to the garage on my behalf maybe three or four times after making the same mistake again.
For the laptops that I actually carry around and plug and unplug things to etc, normal amount of time between reboots for me is somewhere between every 1 and 3 days. Cold boot is plenty fast anyway, so shutting it down after a day of work or when ejecting an external HDD or SSD doesn’t really cost me any noticeable amount of time.
That sounds... a bit paranoid? At least on Linux (Gnome), if I click to "safely remove drive" it actually powers off the drive and stops external mechanical drives from spinning. No useful syncing is going to happen anyway once a hard drive no longer spins. A modern OS should definitely be reliable enough that it can be trusted to properly unmount a drive.
> For the laptops that I actually carry around and plug and unplug things to etc, normal amount of time between reboots for me is somewhere between every 1 and 3 days. Cold boot is plenty fast anyway, so shutting it down after a day of work or when ejecting an external HDD or SSD doesn’t really cost me any noticeable amount of time.
I personally don't reboot my laptop that often, but it's not because of a boot taking too much time. It's because I like to keep state: open applications, open files, terminal emulator sessions, windows on particular virtual desktops, etc.
22:22:45 up 3748 days 21:20, 2 users, load average: 1.42, 1.36, 1.02
It's very funny, I think it's because my laptop battery died and when I replaced it, it had to update the time from 10 years ago? I'm not sure why, as the laptop is from mid-2012.
I thought I had a record going here with my Dell laptop, but I guess you win. After a certain point, I just decided to see how long I can make it go.
torp@machinename ~ % uptime
11:43 up 59 days, 1:22, 4 users, load averages: 2.87 2.69 2.70
Sleep is disabled on that machine and it definitely had networking working fine last night.
Mac Mini M2, Sequoia.
Incidentally my laptop says 75 days uptime, but that one does go to sleep.
That's what the wiki says anyway: [1], and a publication with his name is about huge pages [2]
[1] https://wiki.freebsd.org/AlanCox
[2] https://www.usenix.org/legacy/events/osdi02/tech/full_papers...
"His involvement with Linux began in the early 1990s when he was working on a project that required a stable networking solution. This led him to discover Linux, which was still in its infancy at the time.
Contributions to Linux Kernel
Cox's contributions to the Linux kernel are extensive and far-reaching. He is best known for his work on the Linux networking stack, which was critical in making Linux a viable option for server environments. Cox identified and addressed numerous issues in the kernel's TCP/IP implementation, enhancing its performance and reliability." [0]
"For those not familiar with the Linux kernel contributors, Alan Cox wrote large parts of the networking stack, was the maintainer of the 2.2 branch, and was commonly considered the "second in command" to Linus Torvalds at one point: http://en.wikipedia.org/wiki/Alan_Cox" [1]
"Alan started working on Version 0. There were bugs and problems he could correct. He put Linux on a machine in the Swansea University computer network, which revealed many problems in networking which he sorted out; later he rewrote the networking software. [2]
[0] https://machaddr.substack.com/p/kernel-chronicles-insights-a...
[1] https://news.ycombinator.com/item?id=8548738
[2] https://web.archive.org/web/20200923003028/https://www.swans...
Generally it feels like sometimes you boot into a stable "session" that can go on forever, but often enough you boot into a "session" where something goes wrong quickly and you have to reboot after a week or two. But I do experience the same with my Raspberry Pi. :)
guess i'm marked safe!
% netstat -an | grep TIME_WAIT | wc -l
850
All other systems with < 49.7 days uptime report low single to double digit numbers.
calc_tcp_overflow_time.fish: https://gist.github.com/daveorzach/64538f82a89fa24e5d134557c...
monitor_tcp_time_wait.fish: https://gist.github.com/daveorzach/0964a7a67c08c50043ff707cf...
God I wish Apple offered first party support for Linux on Mac computers.