GLB: GitHub's open source load balancer (opens in new tab)

(githubengineering.com)

478 pointscsteinbe7y ago62 comments

62 comments

39 comments · 13 top-level

KenanSulayman7y ago· 6 in thread

Why would one use this over HAProxy?

HAProxy is an Layer 7 (i.e. HTTP, for the most part) load balancer and only handles the use case of spreading load across multiple backends. A single instance binds to a single IP. There's no redundancy; lose the HAProxy and you lose traffic.

For true redundancy, you need a layer above that handles the distribution of traffic to multiple redundant load balancer instances, and GLB does that via ECMP (Equal-Cost Multi-Path) routing. Github supposedly uses HAProxy as their L7 load balancer.

All of this thoroughly explained in the article.

llama0527y ago

I thought best practice for HAproxy was to run two HAProxy's in parallel with VRRP or DNS load balancing?

Does that not achieve the same outcome? I've used HAProxy for layer 4 in the past without any issue this way.

2 more replies

KenanSulayman7y ago

I have very successfully in the past and still do use HAProxy as level 4 LB. It's one of the fastest to my knowledge. I have used HAProxy as entry to big Mesos clusters without any issue before.

One example of using HAProxy as L4 LB instead of letting it do the termination is when it is proxying TLS traffic from and to multiple backends. Or Websocket. Or even as bastion LB for SSH should one bastion go down.

1 more reply

parthdesai7y ago

>GLB Director does not replace services like haproxy and nginx, but rather is a layer in front of these services (or any TCP service) that allows them to scale across multiple physical machines without requiring each machine to have unique IP addresses.

It's right there in 2nd paragraph.

amanzi7y ago

From the article:

KenanSulayman7y ago

Yes, I get that. But in Mesos, for instance, you could use overlay-networks to get one virtual IP per HAProxy cluster (i.e. 3 HAProxy instances). They are then round-robin'd on each server in the whole cluster so that you have a transparent LB with a singular IP.

1 more reply

justinsaccount7y ago· 5 in thread

Google: https://cloudplatform.googleblog.com/2016/03/Google-shares-s...

The design of all 3 is very similar.

lmb7y ago

I haven't looked at what Google has released, but there are big differences between GLB and Katran. (Not affiliated with any of those companies)

In terms of technology, Katran uses XDP and IPIP tunnels, both upstream in the Linux kernel. GLB uses DPDK, which allows processing raw packets in user space, and Geberic UDP encapsulation + a custom iptables module. Neither DPDK nor the module are upstream.

There are architectural differences as well. Katran is much closer to a classic load balancer, and uses connection tracking at the load balancer to know where to send packets for established flows. GLB has no per-flow state at the load balancer, which gives it the very nice property that load balancers can be added and removed from an ECMP group without disturbing existing connections. There is an academic paper about a system called Beamer, which most likely influenced GLB (or maybe the other way round?). It's a good read, and relatively short.

Finally, Katran is really a C++ library you could build a load balancer on, while GLB comes with batteries included.

I think GLB looks nice, hats off to GitHub for open sourcing it.

justinsaccount7y ago

> uses connection tracking at the load balancer to know where to send packets for established flows

Kind of, from katran post:

> Each L4LB also stores the backend choice for each 5-tuple as a lookup table to avoid duplicate computation of the hash on future packets. This state is a pure optimization and is not necessary for correctness

> ..

> Katran uses an extended version of the Maglev hash to select the backend server

1 more reply

notyourday7y ago

It is unfortunate that Fastly did not open source faild. That's the most elegant solution to draining that I have ever seen.

2 more replies

tehnerd7y ago

katran has example of how to use this library. which is (example) more like ipvsadm + kernel.

ded317y ago

The Github folks could have used XDP similar as with Katran, I think their solution is very elegant in that the same node would still be able to process other workloads/jobs which is one of the reasons why Facebook decided against DPDK:

https://atscaleconference.com/videos/networking-scale-2018-l...

jsiepkes7y ago· 5 in thread

Looks really cool! Though a simpler solution for most people will probably be OpenBSD's CARP protocol to share a single virtual IP between multiple boxes (with for example relayd). ECMP routing can get complex fast.

vbernat7y ago

CARP and the likes rely on the presence of an L2 layer which means you get limited scale or increased risk of a global outage. An L3 design scales very well and is resilient (notably because of the distributed control plane and the inability to create a loop).

drewpc7y ago

Also note that CARP, VRRP, HSRP, and GLBP are all layer 3 load balancing protocols (that rely on layer 2 magic to work) whereas this is a layer 4 load balancer. Meaning I can connect to a single IP and a single port (80) and get consistently and equitably load balanced to exactly one server (of many) at another IP address.

SEJeff7y ago

Or VRRP with the open source keepalived, which has been around for a decade+ and works wonderfully on Linux.

gbrayut7y ago

That is exactly what Stack Overflow uses: keepalived to manage a virtual IP between two decent sized baremetal HAProxy servers (w/ bonded 10G nics). Works great and combined with DNS or Anycast based load balancing can scale pretty damn well. Definitely worth investigating as a KISS approach. To quote a recent Atwood tweet:

"if I have learned anything in my career, it is the shocking effectiveness of building ... literally the stupidest thing that could work. (And then iterating on it for a decade.)"

https://twitter.com/codinghorror/status/1026332543153389569

1 more reply

ahoka7y ago

Unfortunately it is not an option on some cloud providers, because you don't get true layer 2.

emmericp7y ago· 2 in thread

Cool use of SR-IOV, I like it. We've done a few (academic) experiments with SR-IOV for flow bifurcation and we've wondered why no one seems to use it like this. The performance was quite good: neglible performance difference between PF and a single VF and only 5-10% when running multiple >= 8 VFs (probably cache contention somewhere in our specific setup).

You seem to be running this on X540 NICs, aren't you running into limitations for the VFs. Mostly the number of queues which I believe is limited to 2 per VF in the ixgbe family. I wonder whether the AF_XDP DPDK driver could be used instead if SR-IOV isn't available or feasible for some reason.

A more detailed look at performance would have been cool. I might try it myself if I find some time (or a student) :)

theojulienne7y ago

We found that we could achieve 10G line rate with just the queues available to the VF, the NIC didn't seem to be a bottleneck providing DPDK was processing packets faster than line rate. It's worth noting that other traffic on the PF was/is minimal in our setup.

We tested this using DPDK pktgen on a identically-configured node (GLB Director and pktgen both using DPDK on a VF with flow bifurcation, on 2 separate machines on the same rack/switch), with GLB Director essentially acting as a reflector back to the pktgen node. pktgen was able to generate enough 40 byte TCP packets to saturate 10G with 2 TX cores/queues, and GLB Director was able to process those packets and encapsulate them with a sizeable set of binds/tables with 3 cores doing work (encapsulation) and 1 core doing RX/distribution/TX.

emmericp7y ago

Yeah, 10G just isn't that much nowadays. And bigger NICs have more features in the VFs.

I've just built a quick test setup:

* two directly connected servers

* 6 core 2.4 GHz CPU

* XL710 40G NICs

* My packet generator MoonGen: https://github.com/emmericp/MoonGen with a quick & dirty modification to l3-tcp-syn-flood.lua to change dst mac

Got these results for 1-5 worker threads in Mpps: 3.84, 6.65, 10.17, 11.57, 11.3.

~10 Mpps is about 10G line rate for the encapsulated packets; this seems a little bit slower than I expected and it looks like I might be hitting the bottleneck of the distributor at 4 worker threads. Didn't look into anything in detail here (spent maybe 30 minutes for setup + tests), but we've done some VXLAN stuff in the past which I recall being faster.

bogomipz7y ago· 2 in thread

I am having trouble understanding this passage. I'm wondering is someone could help me understand this as it seems like an important design detail:

>"Another benefit to using UDP is that the source port can be filled in with a per-connection hash so that they are flow within the datacenter over different paths (where ECMP is used within the datacenter), and received on different RX queues on the proxy server’s NIC (which similarly use a hash of TCP/IP header fields)."

A source port in the UDP header still needs to be be just that a port number no? Or are they actually stuffing a hash value in to that UDP header field? How would the receiving IP stack no how to understand a value other than a port number in that field?

gbrayut7y ago

Just a guess, but by using UDP transport for the encapsulated data and configuring the module on the proxy to accept UDP on a wide range of ports, you can pick any port you want (not just the destination port of the TCP stream in the encapsulated packet). And if you are using ECMP with a known hashing algorithm you can then use that UDP port to explicitly spread packets across the RX queues on the proxy servers (gaining better performance).

bogomipz7y ago

Thanks I think that might be what they mean - hash the source port for better distribution across the RX queues on the destination proxy. Cheers.

llama0527y ago· 2 in thread

Maybe I don't see the use case since I'm not at that scale, but it seems like a lot of added complexity for what appears to be hacking around using other load balancing solutions as a Layer-4 option?

Edit: It's a question, if you downvote please let me know why it's a better solution.

vbernat7y ago

The goal is to avoid disruption during topology changes. If you have long-lived connections, this is important to keep them alive. The explain this a bit more here: https://github.com/github/glb-director/blob/master/docs/deve...

I have also written about this in a past article: https://vincent.bernat.im/en/blog/2018-multi-tier-loadbalanc... (which may or may not be easier to understand)

ngrilly7y ago

I read your link, and it's easy to understand because it's really well explained :-)

q3k7y ago· 1 in thread

This basically looks like an open source Maglev [1]. Awesome!

[1] - https://static.googleusercontent.com/media/research.google.c...

vbernat7y ago

This is similar but not totally. Linux has an open source version of Maglev as a scheduler for LVS since 4.18 (to be released). GLB uses a specific consistent hashing algorithm selecting two servers and a module to let the first server redirect the flow to the second in case it doesn't know about it. This helps minimize disruption even more than with Maglev.

toomuchtodo7y ago· 1 in thread

Previous discussion (September 2016): https://news.ycombinator.com/item?id=12558053

Chris9117y ago

This new post is about the newly released GLB Director. The title should be changed.

koolhead177y ago· 1 in thread

Will it have some special love in azure ecosystem?

tootie7y ago

Azure already offers load balancing. I'm not sure how much differentiates all the products out there now. I've never seen a load balancer be a bottleneck in any system I've worked on.

weberc27y ago· 1 in thread

Where is the source code? Skimmed but didn’t see a link.

geospeck7y ago

https://github.com/github/glb-director

Drdrdrq7y ago

Off-topic: I love the GLB icon, pure genius!

subleq7y ago

This is the first I've heard of Rendezvous hashing. It seems superior in every respect to the ring-based consistent hashing I've heard much more about. Why is the ring-based method more common?

bogomipz7y ago

The article states:

>"Each server has a bonded pair of network interfaces, and those interfaces are shared between DPDK and Linux on GLB director servers."

What's the distinction between DPDK and Linux here? It wasn't clear to me why SR-IOV is needed in this design. Does DPGK need to "own" the entire NIC device is that? In other words using DPDK and regular kernel networking are mutually exclusive option on the NIC? Is that correct?

j / k navigate · click thread line to collapse

62 comments

39 comments · 13 top-level

KenanSulayman7y ago· 6 in thread

Why would one use this over HAProxy?

atombender7y ago

All of this thoroughly explained in the article.

llama0527y ago

I thought best practice for HAproxy was to run two HAProxy's in parallel with VRRP or DNS load balancing?

Does that not achieve the same outcome? I've used HAProxy for layer 4 in the past without any issue this way.

2 more replies

KenanSulayman7y ago

I have very successfully in the past and still do use HAProxy as level 4 LB. It's one of the fastest to my knowledge. I have used HAProxy as entry to big Mesos clusters without any issue before.

1 more reply

parthdesai7y ago

It's right there in 2nd paragraph.

amanzi7y ago

From the article:

KenanSulayman7y ago

1 more reply

justinsaccount7y ago· 5 in thread