Employing QUIC Protocol to Optimize Uber’s App Performance

(eng.uber.com)

363 points · nhf · 7y ago · 57 comments

42 comments · 11 top-level

ctime · 7y ago · 8 in thread

This YouTube video does a great job illustrating how well HTTP/2 works in practice.

https://www.youtube.com/watch?v=QCEid2WCszM

A lesser-known downside of the HTTP/2-over-TCP solution was actually caused by one of the improvements: a single reusable (multiplexed) connection that could end up stalled or blocked due to network issues. This behavior could go unnoticed over legacy HTTP/1.1 connections because browsers open a huge number of connections (~20) to a host, so when one failed it wouldn't block everything.

youngtaff · 7y ago

It's an unrealistic test as pages aren't really made up of tiles of images

Generally images are the lowest priority download, so ensuring higher priority items get downloaded first is important and not all H2 implementations do it well

https://ishttp2fastyet.com

baroffoos · 7y ago

It's only unrealistic because so much tooling was built to avoid sending multiple files: JS tools for bundling every JS file into one, sprite sheets, using multiple domain names to get more concurrent connections. With HTTP/2 we could dump so much of this.

hueving · 7y ago

Priority doesn't help. A stalled tcp connection blocks everything on it.
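
To make the stall concrete, here is a toy model, not a real network simulation. It assumes packets are sent round-robin across streams at one per time unit and that exactly one packet is lost and arrives late; `completion_times` and all of its parameters are made up for illustration. With a single ordered byte stream (TCP-like) every stream waits for the retransmit, while with per-stream ordering (QUIC-like) only the stream that owned the lost packet does.

```python
def completion_times(num_streams, packets_per_stream, lost_index,
                     retransmit_delay, per_stream_ordering):
    """Return {stream_id: time the stream's last packet reaches the app}.

    Packets are sent round-robin across streams, one per time unit.
    The packet at position lost_index is lost once and arrives
    retransmit_delay time units late (an assumed, fixed penalty).
    """
    total = num_streams * packets_per_stream
    arrival = {
        i: i + (retransmit_delay if i == lost_index else 0)
        for i in range(total)
    }

    done = {}
    for stream in range(num_streams):
        last = stream + (packets_per_stream - 1) * num_streams
        if per_stream_ordering:
            # Only this stream's own packets gate delivery.
            gating = range(stream, total, num_streams)
        else:
            # One ordered byte stream: every earlier packet gates delivery.
            gating = range(last + 1)
        done[stream] = max(arrival[i] for i in gating)
    return done


# 4 streams, 3 packets each; the first packet of stream 1 is lost.
tcp_like = completion_times(4, 3, lost_index=1, retransmit_delay=50,
                            per_stream_ordering=False)
quic_like = completion_times(4, 3, lost_index=1, retransmit_delay=50,
                             per_stream_ordering=True)
print(tcp_like)   # {0: 51, 1: 51, 2: 51, 3: 51}
print(quic_like)  # {0: 8, 1: 51, 2: 10, 3: 11}
```

Stream priorities only change the send order inside this model; they do not change the fact that the ordered case waits on the one missing packet.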

zamadatix · 7y ago

Ironic considering YouTube itself is pretty heavily so.

cpitman · 7y ago

My team ran into a surprising behavior for HTTP2. Browsers decide whether to reuse a connection based not on the original domain the connection was made to, but on the domains that the returned certificate is signed for!

Our current load balancer doesn't support HTTP2 end-to-end (and we are doing gRPC), so we are load balancing TCP connections to the individual instances. And for certificates, we would use SANs to reduce the number of certificates being requested.

Put those two together, and browsers will assume that the first connection they make to serviceA.example.com can also be used for serviceB.example.com. Oops!

TLDR, certificates for HTTP2 need to be unique to each endpoint that terminates a browser connection.
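
A rough sketch of the reuse decision described above, under two assumptions: the client reuses a connection when the certificate covers the new hostname, and the new hostname resolves to the IP the connection is already on (a check browsers commonly add). `Connection`, `san_matches`, and `can_coalesce` are hypothetical names, and the IP comes from a documentation range; none of this is lifted from a real browser.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Connection:
    """An open HTTP/2 connection, reduced to what matters for reuse."""
    remote_ip: str
    cert_sans: frozenset  # names from the certificate's subjectAltName


def san_matches(san: str, host: str) -> bool:
    """Exact match, or a single-label wildcard such as *.example.com."""
    san, host = san.lower(), host.lower()
    if san.startswith("*."):
        # The wildcard stands in for exactly one leftmost label.
        _, _, parent = host.partition(".")
        return parent != "" and parent == san[2:]
    return san == host


def can_coalesce(conn: Connection, host: str, resolved_ips: set) -> bool:
    """True if a client applying the two assumed rules would reuse conn."""
    cert_ok = any(san_matches(s, host) for s in conn.cert_sans)
    ip_ok = conn.remote_ip in resolved_ips
    return cert_ok and ip_ok


# One certificate with SANs for both services, behind one TCP load balancer IP.
conn = Connection("203.0.113.10",
                  frozenset({"servicea.example.com", "serviceb.example.com"}))
print(can_coalesce(conn, "serviceB.example.com", {"203.0.113.10"}))
print(can_coalesce(conn, "serviceC.example.com", {"203.0.113.10"}))
```

In this model the first call allows reuse and the second does not, which is the "Oops" case: a TCP-level load balancer pinned the connection to a serviceA instance, yet the client believes it can send serviceB requests down it.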

merb · 7y ago

oh man, you basically are my hero. I just found out https://www.trullala.de/firefox-http2-ipv6-pitfall/ which basically describes the problem.

basically we have a gitlab instance at IP .8 / external at .210 and a nexus instance at .210 internally/externally, however we have IPv6 addresses pointed at gitlab.

In firefox sometimes you could end in the wrong location and I had no idea why that was happening. it just failed. And only in firefox.

(btw. the behavior of firefox is just stupid. https://bugzilla.mozilla.org/show_bug.cgi?id=1190136)

xyzzyz · 7y ago

SNI spec says:

If the server_name is established in the TLS session handshake, the client SHOULD NOT attempt to request a different server name at the application layer.

Yeah, looks like the browsers are allowed to do that, although it's not recommended.

KaiserPro · 7y ago

The downside, as you point out, disproportionately affects mobile and high-latency connections.

Any kind of packet loss kneecaps the performance of the whole thing, unlike HTTP/1.1.

internals · 7y ago · 5 in thread

What a great case study. Successfully shifting 80% of mobile traffic to QUIC for a 50% reduction in latency is amazing. QUIC and the ongoing work with multipath TCP/QUIC will be huge QoL improvements for mobile networking.

api · 7y ago

The other awesome thing about QUIC is that it encrypts almost everything including header information, making middlebox traffic shaping worthless and demoting middleboxes in general.

drewg123 · 7y ago

It also makes hardware offloads like TSO and LRO impossible, and increases cost-per-byte served by a factor of 4 or more. So if you have infinite CPU to throw at QUIC and/or low bandwidth or connection targets, it's great. If you are concerned at all about server-side efficiency, it's terrible.

FWIW, I work on the Netflix CDN, and specialize in server-side efficiency; we have had 100G flash CDN nodes for years serving at 90G+ in production. None of that would be possible with QUIC as it stands. I suspect our max B/W on these machines would drop from ~95Gb/s to 20Gb/s or less if we were to switch to QUIC.

nhf (OP) · 7y ago

Can't wait for the IETF standardization process to finish up! https://quicwg.org/base-drafts/draft-ietf-quic-transport.htm...

kevin_thibedeau · 7y ago

You'll just get more JavaScript piled on as managers insist on having X monetization feature deployed.

woah · 7y ago

This article is about mobile apps

sly010 · 7y ago · 4 in thread

I wish the mandated minimum MTUs of IP were just a bit bigger. Uber's traffic must be so transactional that they could really just use individual UDP packets for most messaging.

jacob019 · 7y ago

ipv4 - 576 bytes

ipv6 - 1280 bytes
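
For a sense of how much room that leaves per message, a quick back-of-the-envelope sketch. It assumes the two figures above are the datagram sizes you can count on, a 20-byte IPv4 header with no options, a 40-byte IPv6 header with no extension headers, and the 8-byte UDP header; the function names are made up.

```python
# Header sizes in bytes, assuming no IPv4 options or IPv6 extension headers.
IP_HEADER = {"ipv4": 20, "ipv6": 40}
UDP_HEADER = 8
# The minimum sizes quoted above.
MIN_DATAGRAM = {"ipv4": 576, "ipv6": 1280}


def max_udp_payload(version: str) -> int:
    """Application bytes that fit in one minimum-size datagram."""
    return MIN_DATAGRAM[version] - IP_HEADER[version] - UDP_HEADER


def fits_in_one_datagram(message: bytes, version: str) -> bool:
    """True if the message fits without relying on IP fragmentation."""
    return len(message) <= max_udp_payload(version)


print(max_udp_payload("ipv4"))  # 548
print(max_udp_payload("ipv6"))  # 1232
print(fits_in_one_datagram(b"x" * 600, "ipv4"))  # False
```

Any encryption or application framing would come out of those 548 or 1232 bytes as well.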

mruts · 7y ago

That should be sufficient for uber, I think.

IcePic · 7y ago

Aww, makes SLIP users sad..

toast0 · 7y ago

QUIC rfcs proclaim a minimum allowed packet size of 1280. Although proclaiming such things doesn't mean they're reality.

OrgNet · 7y ago · 4 in thread

This kind of latency improvement only matters if they are planning to do auto-pilot from the cloud? (that would be crazy, especially if they don't have a fallback)

eropple · 7y ago

Does it necessarily imply that, though? Applications that are snappy and responsive tend to improve users' impression of the app and probably improve user engagement (it was true for browsers when I worked in this realm, and I have no reason to doubt that it's true for mobile today).

Phlarp · 7y ago

I think the prevailing wisdom is that engagement suffers more from latency on mobile.

wenttomarket · 7y ago

Try requesting an Uber in Latam on a low end android device. You’ll revise this opinion quickly.

OrgNet · 7y ago

That has nothing to do with latency... latency is usually less than 1 second, no matter which phone you are using.

7ewis · 7y ago · 3 in thread

So is this essentially HTTP/3?

wmf · 7y ago

Yes. https://tools.ietf.org/html/draft-ietf-quic-http-20

tomchristie · 7y ago

HTTP/3 is slated as roughly HTTP/2 over QUIC but with some differences as a result. https://tools.ietf.org/html/draft-ietf-quic-http-20#appendix...

I was disappointed that the article didn't get into any details about the HTTP-level differences, but I think that's basically because they didn't need to know or care about the transport at that level: they were using Google's "cronet" library to dispatch the requests and Google Cloud load balancers as the QUIC termination point, so nothing on their end actually deals with the necessary differences from HTTP/2.

layoutIfNeeded · 7y ago

Why 3? Just call it HTTP-as-defined-by-Google. It worked for HTML!

the8472 · 7y ago · 3 in thread

Isn't TLP[0] supposed to fix the largest cause (tail losses) of this issue? It should result in retransmits far sooner than the 30 seconds they mention.

> Recently developed algorithms, such as BBR, model the network more accurately and optimize for latency. QUIC lets us enable BBR and update the algorithm as it evolves.

Again, this is available for TCP in recent Linux kernels[1]. And it's sender-side, so it should be unaffected by ancient Android devices.

Are they using ancient linux kernels on their load balancers? Or are the sysctl knobs for these features turned off in some distros?

[0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...

[1] https://kernelnewbies.org/Linux_4.9#BBR_TCP_congestion_contr...

lossolo · 7y ago

BBR in kernel works for TCP because it's used for TCP congestion control. QUIC is built on top of UDP and there is no congestion control used for UDP, you implement it in your application/protocol built on top of UDP. From what I remember QUIC implementation supports BBR and Cubic as congestion control mechanisms.
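
For the kernel TCP side of that comparison, the algorithm can also be picked per socket instead of system-wide. A minimal sketch, assuming Linux and a Python build that exposes `socket.TCP_CONGESTION`; `set_congestion_control` is a made-up helper, and whether "bbr" is accepted depends on the kernel and on which modules are loaded.

```python
import socket


def set_congestion_control(sock, name):
    """Try to select a congestion control algorithm for one TCP socket.

    Returns the algorithm name the kernel reports afterwards, or None if
    the option is unavailable on this platform or the kernel refuses it.
    """
    opt = getattr(socket, "TCP_CONGESTION", None)
    if opt is None:
        return None  # not Linux, or the constant is not exposed
    try:
        sock.setsockopt(socket.IPPROTO_TCP, opt, name.encode())
        # The kernel returns a NUL-padded name of at most 16 bytes.
        raw = sock.getsockopt(socket.IPPROTO_TCP, opt, 16)
    except OSError:
        return None  # e.g. algorithm not loaded or not permitted
    return raw.split(b"\x00", 1)[0].decode()


with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    print(set_congestion_control(s, "bbr") or "bbr not available here")
```

Nothing equivalent exists for a UDP socket, which is why a QUIC stack has to carry its own congestion controller in user space.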

v5c6 · 7y ago

Correct. QUIC also implements TLP. Main advantage is deployability since this is a user-space solution.

toast0 · 7y ago

If you need it on the client side, it's basically never going to happen on Android, between manufacturers that never update and Google not seeming to care (they never enabled path MTU probing, even though it's been in all the kernels they shipped, and Apple uses it aggressively to great success in terrible networks).

esaym · 7y ago · 2 in thread

I recently moved and got internet with Spectrum. A 200/10 service yet my upload speeds were rarely above 5mbit. This was a business account with some web and dev servers behind it. I didn't even try to call customer service...

With a little more testing using UDP, I could see I was getting very spotty packetloss (<0.5%). I'd never tried changing the TCP algo before but I knew random packetloss is normally interpreted as congestion and hence causes a speed backoff.
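
To see why even a fraction of a percent of random loss can hurt a loss-based algorithm, here is a rough estimate using the well-known Mathis et al. approximation, throughput ≈ (MSS / RTT) · (C / √p) with C = √(3/2). The 1460-byte MSS and 30 ms RTT are assumptions for illustration, not measurements from this link, and the model only describes Reno-style congestion control.

```python
import math


def mathis_throughput_bps(mss_bytes, rtt_seconds, loss_rate):
    """Approximate steady-state throughput (bits/s) of Reno-style TCP.

    Mathis et al. approximation: (MSS / RTT) * (C / sqrt(p)), C = sqrt(3/2).
    """
    c = math.sqrt(3 / 2)
    return (mss_bytes * 8 / rtt_seconds) * (c / math.sqrt(loss_rate))


# Assumed values: 1460-byte MSS, 30 ms RTT, 0.5% random loss.
estimate = mathis_throughput_bps(1460, 0.030, 0.005)
print(f"{estimate / 1e6:.2f} Mbit/s")  # about 6.74 Mbit/s
```

Under those assumptions the ceiling lands well below a 10 Mbit/s upload. Throughput in this model scales with 1/√p, so it takes a fourfold drop in loss to double it.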

I tried all of the ones available at the time but the one that stood out not only in performance but also simplicity was TCP-Illinois[0]. The stats provided by `ss -i` also seemed the most accurate with TCP-Illinois. I force enable it on every machine I come across now.

[0] https://en.wikipedia.org/wiki/TCP-Illinois

nullwasamistake · 7y ago

I highly recommend BBR congestion control if your router supports it.

lmns · 7y ago

TCP congestion control is end-to-end. Routers don't need any support for it.

ssvss · 7y ago · 1 in thread

I thought DDoS prevention was difficult with UDP compared to TCP. Is that not the case anymore? Does Cloudflare provide DDoS prevention for QUIC/UDP?

mirashii · 7y ago

UDP DDoSes are hard to defend against because they're volumetric in nature. Generally, they rely on UDP's statelessness, poorly configured networks, and various applications that respond with significantly more data than requested over UDP. These types of DDoSes are useful against any service that's running on TCP or UDP, and Cloudflare has protected against them all along.

jefftk · 7y ago · 1 in thread

I'm surprised the "alternatives considered" section doesn't have a "write something custom for core functionality using UDP". I would be curious to read why they decided not to go that way, given their scale and the potential gains from not using a general-purpose protocol.

(Something like: run the entire standard journey, from opening the app to requesting a car, over something custom, and then leave the rest of the app on TCP.)

telotortium · 7y ago

I'm assuming their thought process went like this: "we want to optimize a workflow which today goes over HTTP RPC/REST for mobile networks where the retry behavior of TCP at the low level is suboptimal. QUIC already exists, and we don't have to figure out our own solutions to firewalls, security, HTTP semantics. Also our webapps use the same or similar API endpoints, so the less effort spent writing new routing, monitoring, etc., the better. Oh, it works really well. Awesome, avoided 6-24 months debugging a custom protocol over mobile networks around the world that we don't control."

panarky · 7y ago

Experiment 1

While we used the NGINX reverse proxy to terminate TCP, it was challenging to find an openly available reverse proxy for QUIC. We built a QUIC reverse proxy in-house using the core QUIC stack from Chromium and contributed the proxy back to Chromium as open source.

Experiment 2

Once Google made QUIC available within Google Cloud Load Balancing, we repeated the same experiment setup with one modification: instead of using NGINX, we used the Google Cloud load balancers to terminate the TCP and QUIC connections...

Since the Google Cloud load balancers terminate the TCP connection closer to users and are well-tuned for performance, the resulting lower RTTs significantly improved the TCP performance.

m3kw9 · 7y ago

TCP was built for the internet long ago, and even though changes have been added, the architecture of the protocol makes it hard to do anything drastic. Because UDP is so simple, you can basically create a new protocol on top of it, inside the payload, and emulate TCP if you wanted to.
