Sorry for so many questions, but you made a big deal about how manual "DNS curation" was a bad thing, then glossed over the solution.
The greatest public DNS feature since sliced bread is Joyent's new CNS. Tag instances and they are available instantly through a CNAME. It's like the public equivalent of running Hashicorp's Consul. Freaking fantastic and makes me really glad I've stuck with the JPC for my infrastructure.
Ideally we hope to provide some followup posts that go deeper into technical detail about key pieces of the stack (DNS, initramfs framework, job broker, GCP usage, etc).
It sounds like they went to another method for service discovery, then created DNS entries from a DB, either dynamically by registering records in a zone or via a trigger fired on DB update. Either way, it sounds like they moved the scary stuff to another level/service in the stack.
Also, linters exist for DNS and can be automated even with manual edits. Jenkins + Gerrit make easy work of this.
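To make the above concrete, here's a minimal sketch of what "generate from a DB, lint before publishing" could look like. Everything here is hypothetical for illustration: the row fields, the zone name, and the checks are made up, not from the article.

```python
# Hypothetical sketch: render DNS records from service rows (e.g. pulled
# from a DB on an update trigger) and lint them before publishing.
import re

# One DNS label: 1-63 chars, lowercase alphanumerics and hyphens,
# no leading/trailing hyphen.
HOSTNAME_RE = re.compile(r"^(?!-)[a-z0-9-]{1,63}(?<!-)$")

def render_records(rows, zone="example.internal."):
    """Turn DB rows into zone-file A-record lines."""
    return [f"{row['name']}.{zone} 300 IN A {row['ip']}" for row in rows]

def lint(rows):
    """Cheap pre-merge checks: valid labels, no duplicate names."""
    errors = []
    seen = set()
    for row in rows:
        if not HOSTNAME_RE.match(row["name"]):
            errors.append(f"bad label: {row['name']}")
        if row["name"] in seen:
            errors.append(f"duplicate: {row['name']}")
        seen.add(row["name"])
    return errors

rows = [{"name": "api-1", "ip": "10.0.0.5"},
        {"name": "api-1", "ip": "10.0.0.6"},   # duplicate, caught by lint
        {"name": "Bad_Name", "ip": "10.0.0.7"}]
print(lint(rows))
```

A CI job (Jenkins triggered from Gerrit, say) would run something like this and refuse the change on a non-empty error list, which keeps even manual zone edits safe.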
One has to wonder why they would opt for this. The entire story is a textbook example of where using a cloud would have been immensely better. Instead of leveraging mature public cloud offerings, they chose a path that evidently required huge amounts of developer time and caused a tremendous amount of pain and wasted time for downstream developers, only to scrap it in the end when they finally realized there's no point in trying to re-implement AWS/GCE. Think clouds are expensive? I'd love to quantify the developer-hours wasted by this decision to use physical servers and see how they would stack up against even a very expensive AWS bill.
Would be good to hear their perspective on this.
I believe we started building this platform when AWS was very new, and we hadn't seen a compelling reason to transition from it to the cloud until now. There are a couple of posts with more details behind our decision to go to GCP, but primarily it was to leverage their data tooling.
It's strange to me that this is still so common. My theory is that the "one machine, one port" philosophy is still built into a lot of software (monitoring, the ELB, etc.). Another is that this is simply the philosophy we've always known.
Take a look at Kubernetes. Everything is accessible via localhost:<some port>. That breaks most home-built and enterprise orchestration and monitoring tools spectacularly, even though it's a much simpler model (everything is a port, not an IP:port combo).
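For anyone who hasn't seen it, this is roughly what that model looks like in a pod spec. Names and images below are placeholders; the point is just that both containers share one network namespace, so they reach each other on localhost and only the ports have to be unique:

```yaml
# Hypothetical two-container pod: shared network namespace, so the
# sidecar sees the app at localhost:8080 with no IP to discover.
apiVersion: v1
kind: Pod
metadata:
  name: app-with-sidecar
spec:
  containers:
    - name: app
      image: example/app:latest        # placeholder image
      ports:
        - containerPort: 8080          # sidecar reaches this as localhost:8080
    - name: metrics-sidecar
      image: example/exporter:latest   # placeholder image
      ports:
        - containerPort: 9100          # must not collide with 8080
```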
Density is much easier to accomplish on larger machines with more cores, which are elastic in the face of bursty residents. They are also generally cheaper per unit of compute/memory.
IPv6 is practically built for containers, and, to Kubernetes's credit, they architected with that in mind. (Learned from BNS.) Weirdly, what I'm saying here was the original idea behind ports in the first place. There just aren't enough of them, particularly when half your space is shared with client sockets.
I want a world where v4 is pretty much just my control plane into the v6 cluster, since I'll die before IPv4 does. Google, and far more importantly Amazon, need to come up with a v6 story in their cloud offerings already; AWS has had a decade. This isn't just blind advocacy any more: the orchestration and software side is starting to rebuild entire parts of the OSI stack because the network side of our industry is stuck, with no sign of moving no matter how dire the v4 situation gets.
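To put rough numbers on "there just aren't enough of them": a quick back-of-the-envelope, assuming Linux's default ephemeral range (32768-60999, the `ip_local_port_range` default) is given over to client sockets.

```python
# Back-of-the-envelope: how much of the 16-bit port space is actually
# left for listening services on a typical Linux host?
total_ports = 2**16              # 65536 ports, 0-65535
ephemeral = 60999 - 32768 + 1    # Linux default ip_local_port_range
reserved = 1024                  # well-known ports, usually root-only
usable = total_ports - ephemeral - reserved
print(usable)                    # a bit over half the space is gone

# Versus IPv6, where a single /64 subnet has 2**64 addresses,
# so every workload can have its own address and bind port 80.
print(2**64 > total_ports)
```

Roughly half the port space is already spoken for before you run a single service, which is why address-per-workload (the IPv6 model) scales where port-per-workload doesn't.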
However, given Spotify's business position, our priority has yet to shift from providing engineers compute capacity as fast as possible to optimising our usage of said compute capacity. It's now all somewhat of a moot point as we move away from our own hardware into Google's cloud.
It's also harder to separate two or more processes that 'grew up' together in the same container/machine/vm.
Let me fix that for you: stop gendering your servers.