A brief guide to Kubernetes networking

(ergomake.dev)

90 points · thewizl · 3y ago · 19 comments

19 comments · 6 top-level

jmillikin · 3y ago · 5 in thread

For folks interested in the Kubernetes networking model, I recommend looking elsewhere. This post appears to be content marketing wrapped in a thin shell of introductory tutorial for an approach that isn't used by non-trivial deployments.

First, if you want to do a networking tutorial, start with something simple -- an HTTP server and curl, for example. You want to be able to tcpdump the traffic to understand what's going on under the hood. The first half of the blog post is some massively complex magic -- CRDs? An operator? Just why? -- and there's no reason to run an Elasticsearch instance just to test out packet routing.

Second, you almost certainly don't want to use kube-proxy. It uses (abuses?) iptables/nftables in a way that will make your sysadmins cry tears of blood. For small deployments, every major cloud provider (AWS, Azure, GCP, etc) has a CNI plugin that lets you allocate pod IPs out of a dedicated NAT prefix. For larger or bare-metal deployments, either use IPv6 natively (if available) or 6to4 (if on an IPv4-only network). I wrote a tutorial on the 6to4 approach[0], but honestly if you have someone on staff who is familiar with the Linux kernel network configs they'll probably have a better idea of how to set it up to work with your system.
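
For readers who haven't met 6to4 addressing: the scheme (RFC 3056) places a 32-bit IPv4 address directly after the `2002::/16` prefix, so every IPv4 address gets its own IPv6 `/48`. Below is a minimal Python sketch of just that address arithmetic. The function name `sixtofour_prefix` is made up for illustration, and the sketch says nothing about how the linked tutorial configures tunnels or CNI plugins.

```python
import ipaddress

def sixtofour_prefix(ipv4: str) -> ipaddress.IPv6Network:
    """Return the 6to4 /48 prefix (2002:V4ADDR::/48) for an IPv4 address."""
    v4 = ipaddress.IPv4Address(ipv4)
    # 0x2002 fills the top 16 bits; the IPv4 address fills the next 32 bits.
    base = (0x2002 << 112) | (int(v4) << 80)
    return ipaddress.IPv6Network((base, 48))

# 192.0.2.1 is a documentation address, used here only as an example.
print(sixtofour_prefix("192.0.2.1"))  # 2002:c000:201::/48
```

Each node could then carve pod addresses out of its own `/48` without coordinating with other nodes, which is the appeal of the approach on an IPv4-only network.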

Third, you probably want to avoid getting super-magical with your DNS. Approaches like that described in the article (coredns configured to directly resolve non-namespaced Kubernetes service names) have poor performance once you get beyond toy-sized clusters, and having to hunt down all the places your code does a single-name lookup is not fun. Instead, configure a "normal" DNS server (or equivalent non-DNS address resolver) to read Kubernetes-announced endpoints in bulk (with caching, etc), and use hostnames like `myservice.mynamespace.mycluster.yourproddomain.com`, which lets you (1) figure out where your packets are getting routed to, and (2) provision mTLS certificates to pods that let them authenticate themselves as a given service identity. Yes, it's longer, but your future self (or future underlings) will thank you.

[0] https://john-millikin.com/stateless-kubernetes-overlay-netwo...

thewizl (OP) · 3y ago

Thanks for the comment. We totally agree with your suggested approach.

We went with the example and setup in the blog post because most people aren’t familiar with the very basics. We therefore wanted a plausible example (which is why we deployed ES instead of just a couple of busyboxes) and kept the examples self-contained.

Btw, really liked your link. Great blog post.

deathanatos · 3y ago

> Second, you almost certainly don't want to use kube-proxy. It uses (abuses?) iptables/nftables […] every major cloud provider ([…] Azure […])

…yeah… the IP "pool" for Services on Azure does come from a pool of NAT IPs, but it's still an imaginary pool? (It's not a real IP in the cloud's network — which is also more NAT IPs.) AFAICT, it's not different than how the article describes. kube-proxy runs in the cluster, even. (This is w/ the Azure CNI.)

In fact, the Service IP pool is so imaginary, if you accidentally allocate the same CIDR in the real Vnet, AKS gets all sorts of mad at you, and withholds upgrades until you repent. Support, similarly, won't do anything, no matter how obviously unrelated your actual problem is.
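
A quick way to catch this kind of overlap before the platform does is to compare the Service CIDR against every address range in the virtual network. This is a minimal Python sketch using only the standard library; the function name and all CIDR values are hypothetical, and it does not query Azure for anything.

```python
import ipaddress
from typing import List

def find_overlaps(service_cidr: str, vnet_cidrs: List[str]) -> List[str]:
    """Return the VNet CIDRs that overlap the cluster's Service CIDR."""
    service_net = ipaddress.ip_network(service_cidr)
    return [
        cidr for cidr in vnet_cidrs
        if ipaddress.ip_network(cidr).overlaps(service_net)
    ]

# Hypothetical ranges, for illustration only.
vnet = ["10.0.128.0/20", "10.1.0.0/16", "192.168.0.0/24"]
print(find_overlaps("10.0.0.0/16", vnet))  # ['10.0.128.0/20']
```

Running a check like this in CI against the network definitions is cheaper than discovering the conflict at upgrade time.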

Contrast to GCP's GKE: service IPs show up in a VPC, IIRC as "secondary" IPs for a subnet. That does happen closer to how I think you're describing it. (GKE's model is night and day better. I hate finding out that, somehow, somewhere, in our past, someone created an overlapping network in Azure. It's a PITA to deal with, every time it happens.)

(Also, how much I'd love to jettison everything that has anything to do with NAT for IPv6.)

Also, IIRC, if you debug an AKS node, you'll see the iptables "abuse" you mention. And yes, it makes me cry tears of blood. I think IPVS is supposed to straighten some of that out and make me cry fewer tears of blood?

> provision mTLS certificates to pods that let them authenticate themselves as a given service identity.

You can do this with CoreDNS as your resolver, using the normal unadorned service names. (CoreDNS really doesn't matter here.) It just changes what name you need to stick in the subjectAltNames. You'll need to "be your own CA" for this, of course. But cert-manager makes that pretty easy to do. (No more difficult than getting a cert from a real CA, and if anything … potentially easier.)

I've not had a problem with CoreDNS? Ours seems to cope fine at about 500 requests / min. The bigger problem is stuff in the cluster being really irresponsible about how many DNS queries they make…

jmillikin · 3y ago

Thank you for the correction regarding Azure! I knew they had a CNI plugin, but didn't realize it has the limitations you describe (I've only used GCP and AWS).

  > I've not had a problem with CoreDNS? Ours seems to cope fine at about
  > 500 requests / min. The bigger problem is stuff in the cluster being
  > really irresponsible about how many DNS queries they make…

My expectation for DNS is that name lookups should function at thousands of requests per second. Serving that kind of QPS implies caching, and thus some sort of update notification -- for example watching the EndpointSlices resource for changes, and pushing them directly into the DNS server's state.

(As you might notice from my article + employment history, my expectations around cluster sizes differ from the community norm.)
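
The "cache plus update notification" idea can be sketched as an in-memory map that lookups read from and that change events write to. This Python sketch is an illustration only: the event dictionaries are a hypothetical simplification of what a watch on EndpointSlices would deliver, and no real Kubernetes client or DNS server API is used.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Endpoint = Tuple[str, int]  # (IP, port)

class EndpointCache:
    """In-memory name -> endpoints map, updated by pushed change events."""

    def __init__(self) -> None:
        # Keyed by service, then by slice name, so one slice can be replaced
        # or deleted without touching the service's other slices.
        self._slices: Dict[str, Dict[str, Set[Endpoint]]] = defaultdict(dict)

    def apply(self, event: dict) -> None:
        """Apply one hypothetical watch event (ADDED, MODIFIED, or DELETED)."""
        service, slice_name = event["service"], event["slice"]
        if event["type"] == "DELETED":
            self._slices[service].pop(slice_name, None)
        else:
            self._slices[service][slice_name] = set(event["endpoints"])

    def resolve(self, service: str) -> List[Endpoint]:
        """Answer a lookup from memory, with no API server round trip."""
        merged: Set[Endpoint] = set()
        for endpoints in self._slices.get(service, {}).values():
            merged |= endpoints
        return sorted(merged)

cache = EndpointCache()
cache.apply({"type": "ADDED", "service": "my-namespace/hello-world",
             "slice": "hello-world-abc", "endpoints": [("10.8.0.4", 8080)]})
cache.apply({"type": "MODIFIED", "service": "my-namespace/hello-world",
             "slice": "hello-world-abc",
             "endpoints": [("10.8.0.4", 8080), ("10.8.1.7", 8080)]})
print(cache.resolve("my-namespace/hello-world"))
```

Because every lookup is a dictionary read, the query rate is bounded by the resolver process rather than by the Kubernetes API server, and staleness is bounded by how quickly events are pushed.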

Already__Taken · 3y ago

So do you make a point of changing SVC.cluster.local to your own TLD?

jmillikin · 3y ago

This might be a longer answer than you expected/wanted; sorry.

Let's say you have a bunch of machines, mounted in racks, installed in datacenters. You might have a naming convention like `m3r5dls1.acme-prod.com` to identify "machine #3 in rack #5 in datacenter #1 in The Dalles, Oregon". The acme-prod.com TLD is used to avoid cross-contamination between internal and public-facing identities. You'll have remote management services running on those machines, which identify themselves with an mTLS certificate issued to the machine hostname.

For services it's a little more complex, because there are at least three common use cases:

1. You have a service like `my-namespace/hello-world`, and you'd like to be able to send it RPCs. The clients don't care where the service backends are located, they just need to be able to resolve that service name to a set of (IP, port) pairs.

2. The same service as above, but now there's a locality requirement -- you want to send RPCs to a service in a specific place. Maybe for latency (closer = faster), maybe cost (local datacenter = no egress = cheaper), maybe regulatory ("all customers in India must have their data fetched from the Mumbai datacenter").

3. You have a specific instance of a service, for example when tracking the progress of a long-running operation. In Kubernetes this would be a specific pod.

It would also be nice if the services' mTLS certs were issued by the same CA as machine certs, since that allows you to write tools that send RPCs without having to do their own parsing of the destination address.

The `{service_name}.cluster.local` format is not flexible enough to accommodate these goals, but if you're willing to use longer service identities then there's no problem:

1. `{service_name}.{namespace}.any.acme-prod.com` can resolve to any instance of that service, regardless of location.

2. `{service_name}.{namespace}.dls1.acme-prod.com` can resolve to any instance of that service in the datacenter `dls1`.

3. `{pod_id}.{service_name}.{namespace}.dls1.acme-prod.com` resolves to a specific instance.

Now you have consistent identities and universal mTLS authentication, so you can write authorization policies that grant services specific permissions they need to do their work.
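
The three hostname formats above can be built and parsed mechanically, which is what makes them usable in certificates and authorization policies. Here is a minimal Python sketch, assuming the `acme-prod.com` example domain from this comment and that no label contains a dot. The class and function names, and the pod ID used in the example, are hypothetical.

```python
from typing import NamedTuple, Optional

DOMAIN = "acme-prod.com"  # example domain from the comment above

class ServiceIdentity(NamedTuple):
    service: str
    namespace: str
    location: str              # "any" or a datacenter such as "dls1"
    pod: Optional[str] = None  # set only for a specific instance

    def hostname(self) -> str:
        labels = [self.service, self.namespace, self.location, DOMAIN]
        if self.pod is not None:
            labels.insert(0, self.pod)
        return ".".join(labels)

def parse_identity(hostname: str) -> ServiceIdentity:
    """Parse a hostname in one of the three formats back into its parts."""
    suffix = "." + DOMAIN
    if not hostname.endswith(suffix):
        raise ValueError(f"not under {DOMAIN}: {hostname}")
    labels = hostname[: -len(suffix)].split(".")
    if len(labels) == 3:
        service, namespace, location = labels
        return ServiceIdentity(service, namespace, location)
    if len(labels) == 4:
        pod, service, namespace, location = labels
        return ServiceIdentity(service, namespace, location, pod)
    raise ValueError(f"unexpected number of labels: {hostname}")

anywhere = ServiceIdentity("hello-world", "my-namespace", "any")
print(anywhere.hostname())  # hello-world.my-namespace.any.acme-prod.com
print(parse_identity("pod-0.hello-world.my-namespace.dls1.acme-prod.com"))
```

An authorization policy can then match on the parsed fields (for example, any caller whose `service` and `namespace` match) instead of on string prefixes.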

ianpurton · 3y ago · 4 in thread

Does anyone know how they make those hand drawn style kubernetes diagrams?

potamic · 3y ago

Am I the only one who dislikes these hand drawn diagrams? At least the text parts of it. Pod looks like Pool, watch* looks like watcher and rewrite looks like rent. Look at the diagrams with both print as well as hand drawn text next to each other and tell me which one is easier to scan.

Objectively, there is no argument against print text being superior to handwritten text in every way: legibility, density, and universality. If you only have a real hand-drawn diagram, any lack of readability is excusable, but going out of your way to intentionally degrade digital content to the level of a hand-drawn sketch should be inexcusable. This is purely style over substance, and that should have no place in technical content.

thewizl (OP) · 3y ago

Hey Ian,

Author here.

I’ve used Excalidraw and downloaded an external library with the k8s icons.

Also played around with non-default colors to help with info hierarchy.

Hope that helps!

If you need further info, my Twitter account is linked at the bottom of the post. Feel free to DM me there.

Aicy · 3y ago

Thanks. Which external library did you use?

cgeier · 3y ago

Could be excalidraw

[1] https://excalidraw.com/

Already__Taken · 3y ago · 2 in thread

Honestly, I learnt a lot about k8s networking from reading how Cilium's bgp replacement for kube-proxy works. Really nice docs and some good diagrams.

stargrazer · 3y ago

s/bgp/ebpf/

bgp: Border Gateway Protocol

ebpf: extended Berkeley Packet Filter

Already__Taken · 3y ago

Not sure what you're getting at. Cilium is built upon ebpf and xdp, but that doesn't really have anything to do with using bgp to announce networking directly instead of proxy traffic redirection. Those technologies seem to be used to simplify what needs to be announced from the cluster's point of view while keeping traffic flowing.

Confusingly, there's also ebgp and ibgp to read up on. This is a dyslexic's dream stack.

ilovecaching · 3y ago · 1 in thread

This is definitely just an ad for ergomake; there's barely any networking knowledge here at all.

quijoteuniv · 3y ago

Yes, fortunately there are HNers who provide the right link to the (clickbait) title. But using HN for these «smart» ads leaves a bad taste and most likely backfires. Enough spam already.

lifty · 3y ago · 1 in thread

Question to the author: how does ergomake do multi-tenancy on K8s?

thewizl (OP) · 3y ago

Great comment.

We’re writing a blog post about that right now. Should be up next week.

If you’re curious and would be up for a review, I’d be happy to send it to you first so you can read earlier!

revskill · 3y ago

Wait, if the documentation fails to explain core technologies so that a 5-year-old can understand them, it's a scam.
