An argument for increasing TCP's initial congestion window (2024) (opens in new tab)

(jeclark.net)

108 pointscyb0rg010mo ago52 comments

52 comments

28 comments · 11 top-level

kev00910mo ago· 7 in thread

This article is stuck in a very wide valley of being perhaps somewhat familiar with the domain concepts but nowhere near deep enough to draw any of the conclusions being drawn. It is close enough to being completely wrong.

The primary tradeoff of initcwnd is setting a reasonable window before you've learned anything about the path. BBR has little say on this because it takes, in relative terms, quite a while to go through its phases. An early BBR session is therefore not really superior to other congestion controls because that is not the problem it is really focused on.

Jacking up the initcwnd, you start to risk tail loss, which is the worst kind of loss for a sliding window.. especially in the primordial connection. There are ways of trying to deal with all that but they are loss predictions.

If you are a big enough operator, maybe you have some a priori knowledge to jack this up for certain situations. But people are also reckless and do not understand the tradeoffs or overall fairness that the transport community tries to achieve.

As other comments have pointed out, QUIC stacks also replicate congestion control and other algorithms based on the TCP RFCs. These are usually much simpler and lacking features compared to the mainline Linux TCP stack. It's not a free lunch and doesn't obviate the problem space any transport protocol has to make tradeoffs on.

tlb10mo ago

Google has probably sent data to almost every /24 in the last hour. Probably 99% of their egress data goes to destinations where they've sent enough data recently to make a good estimate of bottleneck link speed and queue size.

Having to pick a particular initcwnd to be used for every new TCP connection is an architectural limitation. If they could collect data about each destination and start each TCP connection with a congestion window based on the recent history of transfers from any of their servers to that destination, it could be much better.

It's not a trivial problem to collect bandwidth and buffer size estimates and provide them to every server without delaying the connection, but it would be fun to build such a system.

toast010mo ago

> It's not a trivial problem to collect bandwidth and buffer size estimates and provide them to every server without delaying the connection, but it would be fun to build such a system.

Tons of fun. Sadly, I don't have access to enough clients to do it anymore.

But napkin architecture. Collect per connection stats and report on connection close (you can do a lot with tcp_info or something something quic). That goes into some big map/reduce whatever data pipeline.

The pipeline ends up with some recommended initial segment limit and a mss suggestion [1], you can probably fit both of those into 8-bits. For ipv4, you could probably just put them into a 16 MB lookup table... shift off the last octet of the address and that's your index into the table. For ipv6 it's trickier, the address space is too big; there's techniques though.

At google scale, they can probably regenerate this data hourly, but weekly would probably be plenty fast.

[1] This is it's own rant (and hopefully it's outdated) but mss on a syn+ack should really start at the lower of what you can accept and what the client told you they can. Instead, consensus has been to always send what you can accept. But path mtu doesn't always work, so a lot of services just send a reduced mss. If you have the infrastructure, it's actually pretty easy to tell if clients can send you full mtu packets or not... with per network data, you could have 4 reasonable options, reflect sender, reflect sender minus 8 (pppoe), reflect sender minus 20 (ipip tunnel), reflect sender minus 28 (ipip tunnel and pppoe). If you have no data for a network, random select.

namibj10mo ago

You're forgetting wireless access networks with varying signal strength (home wifi; phone in the basement).

1 more reply

hinkley10mo ago

Spiders that send too much traffic tend to get blocked, so they are already having to contend with some sort of coordination. Whatever system they’re using for that coordination (server affinity being the simplest) can also propagate the congestion windows.

youngtaff10mo ago

This!

They also miss the fact that even with an initcwnd of 10 the TLS negotiation isn’t going to consume it so the window starts growing long before content is actually sent.

Plus there’s no discussion of things like packet pacing

toast010mo ago

TLS without 0-rtt gets you one round trip of not too many packets, maybe 4 unless your certificates are really large. That helps your initial window for content, but not by a whole lot.

pas10mo ago

can you please explain how packet pacing factors into this?

1 more reply

commandersaki10mo ago· 6 in thread

I reckon bufferbloat is overhyped as a problem, it mattered to a small set of Internet connectivity in the 2010s and promptly went away as connectivity changed and improved, yet we continue to look at it like it was yesterdays problem.

toast010mo ago

Bufferbloat is alive and well. Try a t-mobile 5g home gateway. Oof.

I think cable modems have had a ton of improvement, and more fiber in our diet helps, but mobile can be tricky, and wifi is still highly variable (there's promising signs, but I don't know how many people update their access points)

commandersaki10mo ago

Is it though, or is it just a scapegoat or a red herring, especially in the case with a wireless medium? That's been my experience with quick claims to bufferbloat, it's usually something else at play. But again ymmv.

2 more replies

fulafel10mo ago

Internet connectivity improvement has slowed a lot. It was improving at a good clip in the 00's but then a lot of usage moved to mobile data which also caused investment to shift away from broadband speedups. If we had 00's rate of improvement, people would have 100G connections at home now.

(wifi also dampened bandwidth demand for a long time - it didn't make sense to pay for faster-than-wifi broadband)

BenjiWiebe10mo ago

A relative of mine runs a WISP (800+ customers). He's using LibreQoS to prevent bufferbloat (not its only feature) for his entire network.

bboygravity10mo ago

Someone hasn't travelled a lot outside of the house I see?

Wifi is still mostly shitty in most places in the world.

Then there are countries like Philippines with just all around slow internet everywhere.

commandersaki10mo ago

Never said there isn't connectivity or service quality issues inside or outside the home; just that bufferbloat specifically is a trope that should be put to rest.

wahern10mo ago· 3 in thread

> Google also developed QUIC, which is HTTP over UDP. There’s no longer any congestion window to deal with, so entire messages can be sent at once.

I don't think that's true. QUIC implementations typically use the same congestion control algorithms, including both CUBIC and BBR, at least nominally. The latest RFCs for those discuss use with both TCP and QUIC. Though, perhaps when used with QUIC they have more degrees of freedom to tune things.

01HNNWZ0MV43FF10mo ago

Also QUIC is not HTTP over UDP. HTTP/3 is HTTP over QUIC. QUIC is bidi streams and best-effort datagrams over UDP.

Hilift10mo ago

Vendors are pushing using QUIC and without other networking layers, thus the performance increase. For example, SMB3 over QUIC over the Internet, sans VPN.

dlenski10mo ago

Yeah, the article completely lost me at this point.

A somewhat-interesting, though simplistic, discussion of congestion control in TCP.

And then… "but don't worry, QUIC is magic, and it doesn't need any of this.”

No. No it's not. QUIC is not magic and it basically changes nothing in terms of congestion control.

rayanboulares10mo ago· 1 in thread

Just the day I discovered TCP Congestion Windows and spent the day tweaking and benchmarking between Vegas, Reno, Cubic and TCTP

IncreasePosts10mo ago

I've also tweaked and marked benches in Vegas and Reno, to my great shame.

namibj10mo ago

L4S[0] also helps a lot with sensing congestion before young connections have yet to suffer their first lost packet...

Basically it sqrt's your actual packet loss rate as far as feedback frequency/density is concerned, without actually even typically having to enact that loss. For example, you could get congestion feedback every 100th packet (as you would with 1% packet loss), during network conditions that would traditionally only have 0.01% packet loss. From [1]:

Unless an AQM node schedules application flows explicitly, the likelihood that the AQM drops a Not-ECT Classic packet (p_C) MUST be roughly proportional to the square of the likelihood that it would have marked it if it had been an L4S packet (p_L). That is:

p_C ~= (p_L / k)2

The constant of proportionality (k) does not have to be standardized for interoperability, but a value of 2 is RECOMMENDED. The term 'likelihood' is used above to allow for marking and dropping to be either probabilistic or deterministic.

[0]: https://www.rfc-editor.org/rfc/rfc9330.html [1]: https://www.rfc-editor.org/rfc/rfc9331#name-the-strength-of-...

procaryote10mo ago

This might be one of those induced demand situations like building more lanes on a highway, which generally makes traffic worse.

The actual limiting factor for how horribly bloated frontend code becomes is that at some point, it becomes so bad that it noticably impacts your business negatively, and you need to improve it.

Increasing the TCP window so it managed at least the basic asset delivery makes sense, but if you need to cold start regularly and you have hundreds of kb of javascript, perhaps fix your stuff?

howtofly10mo ago

IIRC, all latency-driven congestion control algorithms suffer from violent rtt variance, which happens frequently in wireless networks. How does BBR perform under such circumstances?

Hnrobert4210mo ago

How would this affect DDOS attacks? Would it make you more vulnerable?

egberts110mo ago

Someone needs to reread all of Sally Floyd, et al. researches from International Computer Science Institute of UC Berkeley.

She touched the TCP congestion algorithms of all TCP variants Tahoe, Reno, New Reno, Carson, Vegas, SACK, Westwood, Illinois, Hybla, Compound, HighSpeed, BIC, CUBIC, DCTCP, BBR, BCP, XCP, RCP.

And it all boils down to:

* how much propagation delay are there,

* how long are each packets, and

* whether there are sufficient storage space for “buffer bloats”.

Also, TCP congestion algorithms are neatly pegged as

* reactive (loss-based)

* proactive (delay-based)

* predictive (bandwidth estimation)

https://egbert.net/blog/articles/tcp-evolution.html

also citations:

DUAL (Wang & Crowcroft, 1992) https://www.cs.wustl.edu/~jain/cis788-95/ftp/tcpip_cong/

TCP Veno (Fu & Liew, 2003) https://www.ie.cuhk.edu.hk/wp-content/uploads/fileadmin//sta... https://citeseerx.ist.psu.edu/document?doi=003084a34929d8d2c...

TCP Nice (Venkataramani, Kokku, Dahlin, 2002) https://citeseerx.ist.psu.edu/document?doi=10.1.1.12.8742

TCP-LP (Low Priority TCP, Kuzmanovic & Knightly, 2003) https://www.cs.rice.edu/~ek7/papers/tcplp.pdf

Scalable TCP (Kelly, 2003) https://www.hep.ucl.ac.uk/~ytl/talks/scalable-tcp.pdf

H-TCP (Leith & Shorten, 2004) https://www.hamilton.ie/net/htcp/

FAST TCP (Jin, Wei, Low, 2004/2005) https://netlab.caltech.edu/publications/FAST-TCP.pdf

TCP Africa (King, Baraniuk, Riedi, 2005) https://www.cs.rice.edu/~ied/comp600/PROJECTS/Africa.pdf

TCP Libra (Marfia, Palazzi, Pau, Gerla, Sanadidi, Roccetti, 2007) https://www.cs.ucla.edu/NRL/hpi/tcp-libra/

YeAH-TCP (Yet Another High-speed TCP, Baiocchi, Castellani, Vacirca, 2007) https://dl.acm.org/doi/10.1145/1282380.1282391

TCP-Nice and other background CCAs https://en.wikipedia.org/wiki/TCP_congestion_control

TCP-FIT (Wang, 2016) https://www.sciencedirect.com/science/article/abs/pii/S10848...

jddunce10mo ago

I seem to remember this coming up a few times over the years and it’s always bad iirc.

cyb0rg0OP10mo ago

Google has a long history of performing networking research, making changes, and pushing those changes to the entire internet. In 2011, they published one of my favorite papers, which described their decision to increase the TCP initial congestion window from 1 to 10 on their entire infrastructure.

1 more reply

j / k navigate · click thread line to collapse

52 comments

28 comments · 11 top-level

kev00910mo ago· 7 in thread

tlb10mo ago

It's not a trivial problem to collect bandwidth and buffer size estimates and provide them to every server without delaying the connection, but it would be fun to build such a system.

toast010mo ago

> It's not a trivial problem to collect bandwidth and buffer size estimates and provide them to every server without delaying the connection, but it would be fun to build such a system.

Tons of fun. Sadly, I don't have access to enough clients to do it anymore.

At google scale, they can probably regenerate this data hourly, but weekly would probably be plenty fast.

namibj10mo ago

You're forgetting wireless access networks with varying signal strength (home wifi; phone in the basement).

1 more reply

hinkley10mo ago

youngtaff10mo ago

This!

They also miss the fact that even with an initcwnd of 10 the TLS negotiation isn’t going to consume it so the window starts growing long before content is actually sent.

Plus there’s no discussion of things like packet pacing

toast010mo ago

TLS without 0-rtt gets you one round trip of not too many packets, maybe 4 unless your certificates are really large. That helps your initial window for content, but not by a whole lot.

pas10mo ago

can you please explain how packet pacing factors into this?

1 more reply

commandersaki10mo ago· 6 in thread

toast010mo ago

Bufferbloat is alive and well. Try a t-mobile 5g home gateway. Oof.

commandersaki10mo ago

2 more replies

fulafel10mo ago

(wifi also dampened bandwidth demand for a long time - it didn't make sense to pay for faster-than-wifi broadband)

BenjiWiebe10mo ago

A relative of mine runs a WISP (800+ customers). He's using LibreQoS to prevent bufferbloat (not its only feature) for his entire network.

bboygravity10mo ago

Someone hasn't travelled a lot outside of the house I see?

Wifi is still mostly shitty in most places in the world.

Then there are countries like Philippines with just all around slow internet everywhere.

commandersaki10mo ago

Never said there isn't connectivity or service quality issues inside or outside the home; just that bufferbloat specifically is a trope that should be put to rest.

wahern10mo ago· 3 in thread

> Google also developed QUIC, which is HTTP over UDP. There’s no longer any congestion window to deal with, so entire messages can be sent at once.

01HNNWZ0MV43FF10mo ago

Also QUIC is not HTTP over UDP. HTTP/3 is HTTP over QUIC. QUIC is bidi streams and best-effort datagrams over UDP.

Hilift10mo ago

Vendors are pushing using QUIC and without other networking layers, thus the performance increase. For example, SMB3 over QUIC over the Internet, sans VPN.

dlenski10mo ago

Yeah, the article completely lost me at this point.

A somewhat-interesting, though simplistic, discussion of congestion control in TCP.

And then… "but don't worry, QUIC is magic, and it doesn't need any of this.”

No. No it's not. QUIC is not magic and it basically changes nothing in terms of congestion control.

rayanboulares10mo ago· 1 in thread

Just the day I discovered TCP Congestion Windows and spent the day tweaking and benchmarking between Vegas, Reno, Cubic and TCTP

IncreasePosts10mo ago

I've also tweaked and marked benches in Vegas and Reno, to my great shame.

namibj10mo ago

L4S[0] also helps a lot with sensing congestion before young connections have yet to suffer their first lost packet...

p_C ~= (p_L / k)2

[0]: https://www.rfc-editor.org/rfc/rfc9330.html [1]: https://www.rfc-editor.org/rfc/rfc9331#name-the-strength-of-...

procaryote10mo ago

This might be one of those induced demand situations like building more lanes on a highway, which generally makes traffic worse.

The actual limiting factor for how horribly bloated frontend code becomes is that at some point, it becomes so bad that it noticably impacts your business negatively, and you need to improve it.

Increasing the TCP window so it managed at least the basic asset delivery makes sense, but if you need to cold start regularly and you have hundreds of kb of javascript, perhaps fix your stuff?

howtofly10mo ago

IIRC, all latency-driven congestion control algorithms suffer from violent rtt variance, which happens frequently in wireless networks. How does BBR perform under such circumstances?

Hnrobert4210mo ago

How would this affect DDOS attacks? Would it make you more vulnerable?

egberts110mo ago

Someone needs to reread all of Sally Floyd, et al. researches from International Computer Science Institute of UC Berkeley.

She touched the TCP congestion algorithms of all TCP variants Tahoe, Reno, New Reno, Carson, Vegas, SACK, Westwood, Illinois, Hybla, Compound, HighSpeed, BIC, CUBIC, DCTCP, BBR, BCP, XCP, RCP.

And it all boils down to:

* how much propagation delay are there,

* how long are each packets, and

* whether there are sufficient storage space for “buffer bloats”.

Also, TCP congestion algorithms are neatly pegged as

* reactive (loss-based)

* proactive (delay-based)

* predictive (bandwidth estimation)

https://egbert.net/blog/articles/tcp-evolution.html

also citations:

DUAL (Wang & Crowcroft, 1992) https://www.cs.wustl.edu/~jain/cis788-95/ftp/tcpip_cong/

TCP Veno (Fu & Liew, 2003) https://www.ie.cuhk.edu.hk/wp-content/uploads/fileadmin//sta... https://citeseerx.ist.psu.edu/document?doi=003084a34929d8d2c...

TCP Nice (Venkataramani, Kokku, Dahlin, 2002) https://citeseerx.ist.psu.edu/document?doi=10.1.1.12.8742

TCP-LP (Low Priority TCP, Kuzmanovic & Knightly, 2003) https://www.cs.rice.edu/~ek7/papers/tcplp.pdf

Scalable TCP (Kelly, 2003) https://www.hep.ucl.ac.uk/~ytl/talks/scalable-tcp.pdf

H-TCP (Leith & Shorten, 2004) https://www.hamilton.ie/net/htcp/

FAST TCP (Jin, Wei, Low, 2004/2005) https://netlab.caltech.edu/publications/FAST-TCP.pdf

TCP Africa (King, Baraniuk, Riedi, 2005) https://www.cs.rice.edu/~ied/comp600/PROJECTS/Africa.pdf

TCP Libra (Marfia, Palazzi, Pau, Gerla, Sanadidi, Roccetti, 2007) https://www.cs.ucla.edu/NRL/hpi/tcp-libra/

YeAH-TCP (Yet Another High-speed TCP, Baiocchi, Castellani, Vacirca, 2007) https://dl.acm.org/doi/10.1145/1282380.1282391

TCP-Nice and other background CCAs https://en.wikipedia.org/wiki/TCP_congestion_control

TCP-FIT (Wang, 2016) https://www.sciencedirect.com/science/article/abs/pii/S10848...

jddunce10mo ago

I seem to remember this coming up a few times over the years and it’s always bad iirc.

cyb0rg0OP10mo ago

1 more reply

j / k navigate · click thread line to collapse