Linux network performance parameters (opens in new tab)

(github.com)

458 pointsdreampeppers992y ago107 comments

107 comments

A random thing I ran into with the defaults (Ubuntu Linux):

- net.ipv4.tcp_rmem ~ 6MB

- net.core.rmem_max ~ 1MB

So.. the tcp_rmem value overrides by default, meaning that the TCP receive window for a vanilla TCP socket actually goes up to 6MB if needed (in reality - 3MB because of the halving, but let's ignore that for now since it's a constant).

But if I "setsockopt SO_RCVBUF" in a user-space application, I'm actually capped at a maximum 1MB, even though I already have 6MB. If I try to reduce it from 6MB to e.g. 4MB, it will result in 1MB. This seems very strange. (Perhaps I'm holding it wrong?)

(Same applies to SO_SNDBUF/wmem...)

To me, it seems like Linux is confused about the precedence order of these options. Why not have core.rmem_max be larger and the authoritative directive? Is there some historical reason for this?

the84722y ago

> (Same applies to SO_SNDBUF/wmem...)

If you want to limit the amount of excess buffered data you can lower TCP_NOTSENT_LOWAT instead, which caps the amount that is buffered beyond what's needed for the BDP.

pengaru2y ago

net.ipv4.tcp_rmem max is a limit for the auto-tuning the kernel performs

once you do SO_RCVBUF the auto-tuning is out of the picture for that socket, and net.core.rmem_max becomes the max.

It's pretty clearly documented @ Documentation/networking/ip-sysctl.rst

Edit: downvotes, really? smh

LukeShu2y ago

1. While your context about auto-tuning is accurate and valuable, it doesn't really address the fundamental strangeness that the parent post is commenting about: It's still strange that it can auto-tune to a higher value than you can manually tune it to.

2. It's always valuable to provide further references, but I'd guess that down-voters found the "It's pretty clearly documented" phrasing a little condescending? Perhaps "See the docs at [] for more information."?

3. "Please don't comment about the voting on comments. It never does any good, and it makes boring reading."

1 more reply

klabb32y ago

> once you do SO_RCVBUF the auto-tuning is out of the picture for that socket

Oh I didn’t realize this. That explains the switch in limits. However:

I would have liked to keep auto-tuning, but only change the max buffer size. It’s still weird to me that these are different modes with different limits and whatnot. In my case, I was parallelizing tcp and capping the max size would have been better, and instead varying the number of conns.

I gave up on it. Especially since I need cross platform user-space only, I don’t want to fiddle with these APIs that are all different and unpredictable. I guess it’s for the best anyway, to avoid as much per-platform hacks as possible.

> It's pretty clearly documented @ Documentation/networking/ip-sysctl.rst

I guess I need to step up my doc grepping game, cause it was quite hard to even find this on Google. I ran my own experiments to verify.

> Edit: downvotes, really? smh

Fwiw not me.

dekhn2y ago

And to add: the kernel autotunes better than you can, so leave that enabled unless you're Vint Cert, Jim Gettys, or Vern Paxton.

1 more reply

gjulianm2y ago

This is great, not just the parameters themselves but all the steps that a packet follows from the point it enters the NIC until it gets to userspace.

Just one thing to add regarding network performance: if you're working in a system with multiple CPUs (which is usually the case in big servers), check NUMA allocation. Sometimes the network card will be in one CPU while the application is executing on a different one, and that can affect performance too.

suprjami2y ago

Packagecloud have a great article series which goes into much more detail and with code study. If you really want to learn the network send and receive, these are the articles to read:

https://blog.packagecloud.io/monitoring-tuning-linux-network...

https://packagecloud.io/blog/illustrated-guide-monitoring-tu...

napkin2y ago

Just changing Linux's default congestion control (net.ipv4.tcp_congestion_control) to 'bbr' can make a _huge_ difference in some scenarios, I guess over distances with sporadic packet loss and jitter, and encapsulation.

Over the last year, I was troubleshooting issues with the following connection flow:

client host <-- HTTP --> reverse proxy host <-- HTTP over Wireguard --> service host

On average, I could not get better than 20% theoretical max throughput. Also, connections tended to slow to a crawl over time. I had hacky solutions like forcing connections to close frequently. Finally switching congestion control to 'bbr' gives close to theoretical max throughput and reliable connections.

I don't really understand enough about TCP to understand why it works. The change needed to be made on both sides of Wireguard.

drewg1232y ago

The difference is that BBR does not use loss as a signal of congestion. Most TCP stacks will cut their send windows in half (or otherwise greatly reduce them) at the first sign of loss. So if you're on a lossy VPN, or sending a huge burst at 1Gb/s on a 10Mb/s VPN uplink, TCP will normally see loss, and back way off.

BBR tries to find Bottleneck Bandwidth rate. Eg, the bandwidth of the narrowest or most congested link. It does this by measuring the round trip time, and increasing the transmit rate until the RTT increases. When the RTT increases, the assumption is that a queue is building at the narrowest portion of the path and the increase of RTT is proportional to the queue depth. It then drops rate until the RTT normalizes due to the queue draining. It sends at that rate for a period of time, and then slightly increases the rate to see if RTT increases again (if not, it means that the queuing that saw before was due to competing traffic which has cleared).

I upgraded from a 10Mb/s cable uplink to 1Gb/s symmetrical fiber a few years ago. When I did so, I was ticked that my upload speed on my corp. VPN remained at 5Mb/s or so. When I switched to RACK TCP (or BBR) on FreeBSD, my upload went up by a factor of 8 or so, to about 40Mb/s, which is the limit of the VPN.

heybrendan2y ago

You seem quite knowledgeable in this domain. Have you authored any blog posts to expand on this topic? I would welcome the chance to learn more from you.

1 more reply

fulafel2y ago

> Most TCP stacks will cut their send windows in half (or otherwise greatly reduce them) at the first sign of loss.

This was obsoleted by fast retransmit which was standardized in the 90s and ~everyone uses, right?

(loss generally is still used as a congestion signal, but first loss is usually not)

1 more reply

napkin2y ago

Thanks for the explanation!

suprjami2y ago

> BBR

Please stop. BBRv1 is broken and should not be used on the open internet.

This sort of copy-paste cargo-cult performance tuning (just set a magical value and things will be better) is the exact opposite of what TFA is about.

Thankfully Google are upstreaming BBRv3 so this will be over soon.

zamadatix2y ago

Yes, BBRv1 has fairness issues when used at scale vs certain other algorithms. No, that doesn't mean people finding it tunes their performance 5x in a particular use case with high latency and some loss should stop talking about how it helps in that scenario. YouTube ran it without the internet burning to the ground so using it in these niche kinds of cases with personal tuning almost certainly results in a net good even though the algorithm wasn't perfect. BBRv3 will make it scale to everyone with better fairness for sure but BBR is still much better behaving for fairness than almost any UDP stream anyways.

fragmede2y ago

The original cargo cultists built runways on islands to cause supplies to be dropped off. It didn't work. If someone copy and pasted something they don't fully understand off the Internet, but it works, can you really blame them for it, or call them cargo cultists?

1 more reply

Matthias2472y ago

> The change needed to be made on both sides of Wireguard.

Congestion control works from the sender of data to the receiver. You don't need to switch both sides if you are just interested in improving performance in one direction.

Besides that, I agree to what others said BBRv1. The cubic implementation in the Linux kernel works really nice for most applications.

doctorpangloss2y ago

Does performance tuning for Wi-Fi adapters matter?

On desktops, other than disabling features, can anything fix the problems with i210 and i225 ethernet? Those seem to be the two most common NICs nowadays.

I don't really understand why common networking hardware and drivers are so flawed. There is a lot of attention paid to RISC-V. How about start with a fully open and correct NIC? They'll shove it in there if it's cheaper than an i210. Or maybe that's impossible.

elabajaba2y ago

> Does performance tuning for Wi-Fi adapters matter?

If you're willing to potentially sacrifice 10-20% of (max local network) throughput you can drastically improve wifi fairness and improve ping times/reduce bufferbloat (random ping spikes will still happen on wifi though).

There's a huge thread https://forum.openwrt.org/t/aql-and-the-ath10k-is-lovely/590... that has stuff about enabling and tuning aqm, and some of the tradeoffs between throughput and latency.

jeffbee2y ago

i225 is just broken but I get excellent performance from i210. 1gb is hardly challenging on a contemporaneous CPU, and the i210 offers 4 queues. What's your beef with i210?

doctorpangloss2y ago

There are a lot of problems with the i210. Here’s a sample:

https://www.google.com/search?q=i210+proxmox+e1000e+disable

Most people don’t really use their NICs “all the time” “with many hosts.” The i210 in particular will hang after a few months of e.g. etcd cluster traffic on 9th and 10th gen Intel, which is common for SFFPCs.

On Windows, the ndis driver works a lot better. Many disconnects in similar traffic load as Linux, and features like receive side coalescing are broken. They also don’t provide proper INFs for Windows server editions, just because.

I assume Intel does all of this on purpose. I don’t think their functionally equivalent server SKUs are this broken.

Apparently the 10Gig patents are expiring very soon. That will make Realtek, Broadcom and Aquantia’s chips a lot cheaper. IMO, motherboards should be much smaller, shipping with BMC and way more rational IO: SFP+, 22110, Oculink, U.2, and PCIe spaced for Infinity Fabric & NVLink. Everyone should be using LVFS for firmware - NVMe firmware, despite being standardized to update, is a complete mess with bugs on every major controller.

I share all of this as someone with experience in operating commodity hardware at scale. People are so wasteful with their hardware.

1 more reply

trustingtrust2y ago

There are 3 revisions of i225 and Intel essentially got rid of it and launched i226. That one also seems to be problematic [1] . Why is it exponentially harder to make a 2.5gbps NIC when the 1gbps NIC (i210 and i211) has worked well for them. Shouldn't it be trivial to make it 2.5x? They seem to make good 10gbps NICs so I would assume 2.5gbps shouldn't need a 5th try from intel ?

[1] - https://shorturl.at/esCNP

2 more replies

suprjami2y ago

I have a motherboard with an i225 onboard.

I bought a PCIe I350. That's solved the problem.

teleforce2y ago

Great overview of the Linux network queues as provided in the Figure, should paste it on the wall somewhere.

Brendan's System Performance books provide nice coverage on Linux network performance and more [1]. It's already in the second edition, both are excellent books but the 2nd edition focuses mainly on Linux whereas the 1st edition also include Solaris.

There's also a more recent book on BPF Performance Tools by him [2].

[1] Systems Performance: Enterprise and the Cloud, 2nd Edition (2020)

https://www.brendangregg.com/systems-performance-2nd-edition...

[2] BPF Performance Tools:

https://www.brendangregg.com/bpf-performance-tools-book.html

zartstrom2y ago

I enjoyed skimming through the article. Very well researched and presented.

But can anybody help me out, who tunes linux network parameters on a regular basis?

raggi2y ago

This doc kinda needs to say "TCP" somewhere, as it's very focused on TCP concerns - which is useful, people are mostly using TCP. The default UDP tunings are awfully low and as such are notably missing.

leshow2y ago

Do you have any good resources for UDP tuning?

sophacles2y ago

I have gotten quite a bit of mileage out of this slide deck: https://events.static.linuxfound.org/sites/events/files/slid...

It's older so some details have changed over time, but the concepts are still relevant. It also has a lot of useful search terms to get you started.

BitPirate2y ago

The typical painpoint is the low maximum buffer sizes.

net.core.rmem_max net.core.wmem_max

e.g. wireguard-go will hit those limits of not executed with CAP_NET_ADMIN.

freedomben2y ago

Could anyone recommend a video or video series covering similar material?

There lots on networking in general, but I've had a hard time finding some on Linux specific implementation

8K832d7tNmiQ2y ago

I'm also seconding this, but from microcontroller perspective.

I want to try developing a simple tcp echo server for a microcontroller, but most examples just use the vendor's own tcp library and put no effort explaining how to manually setup and establish connection to the router.

hanikesn2y ago

Because implementing TCP not just as a toy is incredibly difficult with tripwires that even subject matter experts struggle to get it right without decades of in field testing.

patmorgan232y ago

Well you can always read the standard

cryptonector2y ago

Nothing about PMTUD?

toast02y ago

FWIW, I put together a PMTUD test site you might find interesting http://pmtud.enslaves.us/

cryptonector2y ago

Nice URL!

zamadatix2y ago

For TCP sockets I'd rather just MSS clamp on the internet gateway. On top of too many things just dropping PMTUD, enabling it results in a slower process while MSS clamping hijacks the initial TCP open messages directly.

cryptonector2y ago

There's PMTUD that doesn't depend on ICMP.

1 more reply

j / k navigate · click thread line to collapse

107 comments

klabb32y ago

A random thing I ran into with the defaults (Ubuntu Linux):

- net.ipv4.tcp_rmem ~ 6MB

- net.core.rmem_max ~ 1MB

(Same applies to SO_SNDBUF/wmem...)

To me, it seems like Linux is confused about the precedence order of these options. Why not have core.rmem_max be larger and the authoritative directive? Is there some historical reason for this?

the84722y ago

> (Same applies to SO_SNDBUF/wmem...)

If you want to limit the amount of excess buffered data you can lower TCP_NOTSENT_LOWAT instead, which caps the amount that is buffered beyond what's needed for the BDP.

pengaru2y ago

net.ipv4.tcp_rmem max is a limit for the auto-tuning the kernel performs

once you do SO_RCVBUF the auto-tuning is out of the picture for that socket, and net.core.rmem_max becomes the max.

It's pretty clearly documented @ Documentation/networking/ip-sysctl.rst

Edit: downvotes, really? smh

LukeShu2y ago

3. "Please don't comment about the voting on comments. It never does any good, and it makes boring reading."

1 more reply

klabb32y ago

> once you do SO_RCVBUF the auto-tuning is out of the picture for that socket

Oh I didn’t realize this. That explains the switch in limits. However:

> It's pretty clearly documented @ Documentation/networking/ip-sysctl.rst

I guess I need to step up my doc grepping game, cause it was quite hard to even find this on Google. I ran my own experiments to verify.

> Edit: downvotes, really? smh

Fwiw not me.

dekhn2y ago

And to add: the kernel autotunes better than you can, so leave that enabled unless you're Vint Cert, Jim Gettys, or Vern Paxton.

1 more reply

gjulianm2y ago

This is great, not just the parameters themselves but all the steps that a packet follows from the point it enters the NIC until it gets to userspace.

suprjami2y ago

Packagecloud have a great article series which goes into much more detail and with code study. If you really want to learn the network send and receive, these are the articles to read:

https://blog.packagecloud.io/monitoring-tuning-linux-network...

https://packagecloud.io/blog/illustrated-guide-monitoring-tu...

napkin2y ago

Over the last year, I was troubleshooting issues with the following connection flow:

client host <-- HTTP --> reverse proxy host <-- HTTP over Wireguard --> service host

I don't really understand enough about TCP to understand why it works. The change needed to be made on both sides of Wireguard.

drewg1232y ago

heybrendan2y ago

You seem quite knowledgeable in this domain. Have you authored any blog posts to expand on this topic? I would welcome the chance to learn more from you.

1 more reply

fulafel2y ago

> Most TCP stacks will cut their send windows in half (or otherwise greatly reduce them) at the first sign of loss.

This was obsoleted by fast retransmit which was standardized in the 90s and ~everyone uses, right?

(loss generally is still used as a congestion signal, but first loss is usually not)

1 more reply

napkin2y ago

Thanks for the explanation!

suprjami2y ago

> BBR

Please stop. BBRv1 is broken and should not be used on the open internet.

This sort of copy-paste cargo-cult performance tuning (just set a magical value and things will be better) is the exact opposite of what TFA is about.

Thankfully Google are upstreaming BBRv3 so this will be over soon.

zamadatix2y ago

fragmede2y ago

1 more reply

Matthias2472y ago

> The change needed to be made on both sides of Wireguard.

Congestion control works from the sender of data to the receiver. You don't need to switch both sides if you are just interested in improving performance in one direction.

Besides that, I agree to what others said BBRv1. The cubic implementation in the Linux kernel works really nice for most applications.

doctorpangloss2y ago

Does performance tuning for Wi-Fi adapters matter?

On desktops, other than disabling features, can anything fix the problems with i210 and i225 ethernet? Those seem to be the two most common NICs nowadays.

elabajaba2y ago

> Does performance tuning for Wi-Fi adapters matter?

There's a huge thread https://forum.openwrt.org/t/aql-and-the-ath10k-is-lovely/590... that has stuff about enabling and tuning aqm, and some of the tradeoffs between throughput and latency.

jeffbee2y ago

i225 is just broken but I get excellent performance from i210. 1gb is hardly challenging on a contemporaneous CPU, and the i210 offers 4 queues. What's your beef with i210?

doctorpangloss2y ago

There are a lot of problems with the i210. Here’s a sample:

https://www.google.com/search?q=i210+proxmox+e1000e+disable

I assume Intel does all of this on purpose. I don’t think their functionally equivalent server SKUs are this broken.

I share all of this as someone with experience in operating commodity hardware at scale. People are so wasteful with their hardware.

1 more reply

trustingtrust2y ago

[1] - https://shorturl.at/esCNP

2 more replies

suprjami2y ago

I have a motherboard with an i225 onboard.

I bought a PCIe I350. That's solved the problem.

teleforce2y ago

Great overview of the Linux network queues as provided in the Figure, should paste it on the wall somewhere.

There's also a more recent book on BPF Performance Tools by him [2].

[1] Systems Performance: Enterprise and the Cloud, 2nd Edition (2020)

https://www.brendangregg.com/systems-performance-2nd-edition...

[2] BPF Performance Tools:

https://www.brendangregg.com/bpf-performance-tools-book.html

zartstrom2y ago

I enjoyed skimming through the article. Very well researched and presented.

But can anybody help me out, who tunes linux network parameters on a regular basis?

raggi2y ago

leshow2y ago

Do you have any good resources for UDP tuning?

sophacles2y ago

I have gotten quite a bit of mileage out of this slide deck: https://events.static.linuxfound.org/sites/events/files/slid...

It's older so some details have changed over time, but the concepts are still relevant. It also has a lot of useful search terms to get you started.

BitPirate2y ago

The typical painpoint is the low maximum buffer sizes.

net.core.rmem_max net.core.wmem_max

e.g. wireguard-go will hit those limits of not executed with CAP_NET_ADMIN.

freedomben2y ago

Could anyone recommend a video or video series covering similar material?

There lots on networking in general, but I've had a hard time finding some on Linux specific implementation

8K832d7tNmiQ2y ago

I'm also seconding this, but from microcontroller perspective.

hanikesn2y ago

Because implementing TCP not just as a toy is incredibly difficult with tripwires that even subject matter experts struggle to get it right without decades of in field testing.

patmorgan232y ago

Well you can always read the standard

cryptonector2y ago

Nothing about PMTUD?

toast02y ago

FWIW, I put together a PMTUD test site you might find interesting http://pmtud.enslaves.us/

cryptonector2y ago

Nice URL!

zamadatix2y ago

cryptonector2y ago

There's PMTUD that doesn't depend on ICMP.

1 more reply

j / k navigate · click thread line to collapse