First, if you want to do a networking tutorial, start with something simple -- an HTTP server and curl, for example. You want to be able to tcpdump the traffic to understand what's going on under the hood. The first half of the blog post is some massively complex magic -- CRDs? An operator? Just why? -- and there's no reason to run an Elasticsearch instance just to test out packet routing.
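As a rough sketch of the "something simple" meant here, a tiny HTTP server in Python could look like the following. The handler name, the response body, and the loopback address are assumptions for illustration, not anything taken from the blog post.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Answers every GET with a fixed plain-text body."""

    def do_GET(self):
        body = b"hello from the pod\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Silence per-request logging so the packet capture is the only output.
        pass


def start_server(port=0):
    """Start the server on a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), HelloHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch(port):
    """Issue one GET against the local server, roughly what curl would do."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request("GET", "/")
        resp = conn.getresponse()
        return resp.status, resp.read()
    finally:
        conn.close()
```

From an interactive Python session, `start_server(8080)` followed by `curl -v http://127.0.0.1:8080/` in one terminal and `tcpdump -i lo -A port 8080` in another shows the whole exchange in plain text. Once that makes sense on one machine, moving the server into a pod changes only the addresses.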
Second, you almost certainly don't want to use kube-proxy. It uses (abuses?) iptables/nftables in a way that will make your sysadmins cry tears of blood. For small deployments, every major cloud provider (AWS, Azure, GCP, etc) has a CNI plugin that lets you allocate pod IPs out of a dedicated NAT prefix. For larger or bare-metal deployments, either use IPv6 natively (if available) or 6to4 (if on an IPv4-only network). I wrote a tutorial on the 6to4 approach[0], but honestly if you have someone on staff who is familiar with the Linux kernel network configs they'll probably have a better idea of how to set it up to work with your system.
Third, you probably want to avoid getting super-magical with your DNS. Approaches like that described in the article (coredns configured to directly resolve non-namespaced Kubernetes service names) have poor performance once you get beyond toy-sized clusters, and having to hunt down all the places your code does a single-name lookup is not fun. Instead, configure a "normal" DNS server (or equivalent non-DNS address resolver) to read Kubernetes-announced endpoints in bulk (with caching, etc), and use hostnames like `myservice.mynamespace.mycluster.yourproddomain.com`, which lets you (1) figure out where your packets are getting routed to, and (2) provision mTLS certificates to pods that let them authenticate themselves as a given service identity. Yes, it's longer, but your future self (or future underlings) will thank you.
[0] https://john-millikin.com/stateless-kubernetes-overlay-netwo...
The reason we went with the example and setup in the blog post is that most people aren’t familiar with the very basics. Therefore, we wanted a plausible example (which is why we deployed ES instead of just a couple of busyboxes) and kept the examples self-contained.
Btw, really liked your link. Great blog post.
…yeah… the IP "pool" for Services on Azure does come from a pool of NAT IPs, but it's still an imaginary pool? (It's not a real IP in the cloud's network — which is also more NAT IPs.) AFAICT, it's not different than how the article describes. kube-proxy runs in the cluster, even. (This is w/ the Azure CNI.)
In fact, the Service IP pool is so imaginary, if you accidentally allocate the same CIDR in the real Vnet, AKS gets all sorts of mad at you, and withholds upgrades until you repent. Support, similarly, won't do anything, no matter how obviously unrelated your actual problem is.
Contrast to GCP's GKE: service IPs show up in a VPC, IIRC as "secondary" IPs for a subnet. That does happen closer to how I think you're describing it. (GKE's model is night and day better. I hate finding out that, somehow, somewhere, in our past, someone created an overlapping network in Azure. It's a PITA to deal with, every time it happens.)
(Also, how much I'd love to jettison everything that has anything to do with NAT for IPv6.)
Also, IIRC, if you debug an AKS node, you'll see the iptables "abuse" you mention. And yes, it makes me cry tears of blood. I think IPVS is supposed to straighten some of that out and make me cry fewer tears of blood?
> provision mTLS certificates to pods that let them authenticate themselves as a given service identity.
You can do this with CoreDNS as your resolver, using the normal unadorned service names. (CoreDNS really doesn't matter here.) It just changes what name you need to stick in the subjectAltNames. You'll need to "be your own CA" for this, of course. But cert-manager makes that pretty easy to do. (No more difficult than getting a cert from a real CA, and if anything … potentially easier.)
I've not had a problem with CoreDNS? Ours seems to cope fine at about 500 requests / min. The bigger problem is stuff in the cluster being really irresponsible about how many DNS queries they make…
> I've not had a problem with CoreDNS? Ours seems to cope fine at about
> 500 requests / min. The bigger problem is stuff in the cluster being
> really irresponsible about how many DNS queries they make…
My expectation for DNS is that name lookups should function at thousands of requests per second. Serving that kind of QPS implies caching, and thus some sort of update notification -- for example watching the EndpointSlices resource for changes, and pushing them directly into the DNS server's state. (As you might notice from my article + employment history, my expectations around cluster sizes differ from the community norm.)
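A minimal sketch of that "watch and push" idea, assuming events arrive as plain dicts shaped like Kubernetes watch events for EndpointSlices (`type`, plus an `object` with `metadata` and `endpoints`). The `EndpointCache` class, its method names, and the `{service}.{namespace}.{zone}` lookup format are hypothetical; a real resolver would sit in front of this table and a real watcher would feed it.

```python
from collections import defaultdict

# Label that ties an EndpointSlice to its owning Service.
SERVICE_LABEL = "kubernetes.io/service-name"


class EndpointCache:
    """In-memory name -> addresses table, kept current by watch events."""

    def __init__(self, zone):
        self.zone = zone
        # (namespace, service) -> {slice name -> set of ready IPs}
        self._slices = defaultdict(dict)

    def apply(self, event):
        """Fold one ADDED / MODIFIED / DELETED event into the table."""
        obj = event["object"]
        meta = obj["metadata"]
        key = (meta["namespace"], meta["labels"][SERVICE_LABEL])
        if event["type"] == "DELETED":
            self._slices[key].pop(meta["name"], None)
            return
        ips = set()
        for endpoint in obj.get("endpoints") or []:
            ready = (endpoint.get("conditions") or {}).get("ready")
            if ready is False:
                continue  # an unset "ready" is treated as ready
            ips.update(endpoint["addresses"])
        # A service can span several slices, so store each slice separately.
        self._slices[key][meta["name"]] = ips

    def resolve(self, hostname):
        """Answer from memory; no API server round trip per lookup."""
        suffix = "." + self.zone
        if not hostname.endswith(suffix):
            return []
        labels = hostname[: -len(suffix)].split(".")
        if len(labels) != 2:
            return []
        service, namespace = labels
        merged = set()
        for ips in self._slices.get((namespace, service), {}).values():
            merged |= ips
        return sorted(merged)
```

Lookups never touch the API server, so the query rate is bounded by the resolver rather than by Kubernetes, and staleness is bounded by how quickly watch events are delivered.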
Let's say you have a bunch of machines, mounted in racks, installed in datacenters. You might have a naming convention like `m3r5dls1.acme-prod.com` to identify "machine #3 in rack #5 in datacenter #1 in The Dalles, Oregon". The dedicated acme-prod.com domain is used to avoid cross-contamination between internal and public-facing identities. You'll have remote management services running on those machines, which identify themselves with an mTLS certificate issued to the machine hostname.
For services it's a little more complex, because there are at least three common use cases:
1. You have a service like `my-namespace/hello-world`, and you'd like to be able to send it RPCs. The clients don't care where the service backends are located, they just need to be able to resolve that service name to a set of (IP, port) pairs.
2. The same service as above, but now there's a locality requirement -- you want to send RPCs to a service in a specific place. Maybe for latency (closer = faster), maybe cost (local datacenter = no egress = cheaper), maybe regulatory ("all customers in India must have their data fetched from the Mumbai datacenter").
3. You have a specific instance of a service, for example when tracking the progress of a long-running operation. In Kubernetes this would be a specific pod.
It would also be nice if the services' mTLS certs are issued by the same CA as machine certs, since that allows you to write tools that send RPCs without having to do their own parsing of the destination address.
The `{service_name}.cluster.local` format is not flexible enough to accommodate these goals, but if you're willing to use longer service identities then there's no problem:
1. `{service_name}.{namespace}.any.acme-prod.com` can resolve to any instance of that service, regardless of location.
2. `{service_name}.{namespace}.dls1.acme-prod.com` can resolve to any instance of that service in the datacenter `dls1`.
3. `{pod_id}.{service_name}.{namespace}.dls1.acme-prod.com` resolves to a specific instance.
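The three name shapes above can be sketched as a small build/parse helper. This is only an illustration: the `ServiceName` type, the `parse` function, and the choice to reject anything that is not three or four labels under `acme-prod.com` are assumptions, not part of any existing tool.

```python
from dataclasses import dataclass
from typing import Optional

DOMAIN = "acme-prod.com"  # example domain from the naming scheme above


@dataclass(frozen=True)
class ServiceName:
    service: str
    namespace: str
    location: str                 # "any", or a datacenter such as "dls1"
    pod_id: Optional[str] = None  # set only for a specific instance

    def hostname(self):
        """Render the identity as a DNS name, most specific label first."""
        parts = [self.service, self.namespace, self.location, DOMAIN]
        if self.pod_id is not None:
            parts.insert(0, self.pod_id)
        return ".".join(parts)


def parse(hostname):
    """Split a service hostname back into its identity fields."""
    suffix = "." + DOMAIN
    if not hostname.endswith(suffix):
        raise ValueError(f"{hostname!r} is not under {DOMAIN}")
    labels = hostname[: -len(suffix)].split(".")
    if len(labels) == 3:
        service, namespace, location = labels
        return ServiceName(service, namespace, location)
    if len(labels) == 4:
        pod_id, service, namespace, location = labels
        return ServiceName(service, namespace, location, pod_id)
    # Machine names such as m3r5dls1.acme-prod.com have one label and land here.
    raise ValueError(f"{hostname!r} is not a service name")
```

Because the location and namespace are ordinary labels in the name, the same string works as a DNS query, as a certificate subjectAltName, and as the subject of an authorization rule.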
Now you have consistent identities and universal mTLS authentication, so you can write authorization policies that grant services specific permissions they need to do their work.
Objectively, there is no argument against printed text being superior to handwriting in every way: legibility, density, and universality. If you only have a real hand-drawn diagram, any lack of readability can of course be excused, but going out of your way to intentionally degrade digital content to the level of a hand drawing should be inexcusable. This is purely style over substance, and that should have no place in technical content.
Author here.
I’ve used Excalidraw and downloaded an external library with the k8s icons.
Also played around with non-default colors to help with info hierarchy.
Hope that helps!
If you need further info, my Twitter account is linked at the bottom of the post. Feel free to DM me there.
BGP: Border Gateway Protocol
eBPF: extended Berkeley Packet Filter
Confusingly, there's also eBGP and iBGP to read up on. This is a dyslexic's dream stack.
We’re writing a blog post about that right now. Should be up next week.
If you’re curious and would be up for a review, I’d be happy to send it to you first so you can read earlier!