Someone decided they had to have a public cloud, so they built one, but they want to keep clients away with a 3-meter pole.
My AWS account manager is someone I am 100% certain would roll in the mud with me if necessary. He would sleep on the floor with us in a crisis if we asked.
Our Google Cloud representatives make me sad, because I can see that they are even less loved and supported by Google than we are. It's sad seeing someone trying to convince their company to sell, and actually doing a good job providing service. It's like they are set up to fail.
The Microsoft guys are just bulletproof: they excel at selling, providing good service, and squeezing all the money out of your pockets while leaving you utterly convinced it's for your own good. They also have a very strange cloud… thing.
As for Railway going metal, well, I have some 15 years of experience with it. I'll never, NEVER, EVER return to it. It's just not worth it. But I guess you'll have to discover that for yourselves. This is the way.
You'll soon discover what in the freaking world Google is having so much trouble with. Just make sure you really, really love and really, really want to sell a service to people, instead of building Borgs and artificial brains, and you'll do 100x better.
I've been on 4-hour screenshares with AWS engineers working through infrastructure issues in the past, and we only spend $100k/yr.
Even at the $100k/yr spend level, AWS regularly reaches out with offers to try new services they’re launching for free. We’ve said “sure” a couple times, and AWS shows up with 4-6 people on their end of the call (half of them engineers).
In the past 10 years, we’ve had maybe 2-3 emergency issues per year, and every time I’m able to get a really smart person on a call within 5 minutes.
This is the #1 thing I’d be concerned about losing if we did colo or bare metal with cheaper providers.
When I was at AWS, our team used to (religiously / proactively) keep track of customers having multiple complaints, especially repeat complaints (all of which manifested as some form of downtime for them). Regardless of their spend, these customers ended up getting the "white glove" treatment, which is otherwise reserved for (potential) top spenders (though engineers are mostly oblivious to the numbers).
This is on top of the fact that some account managers & support engineers may indeed escalate (quite easily at that) to push product eng teams to really and immediately pay down the tech debt that's hurting their customers.
2. More
3. AWS has this idea of “customer obsession.” They will spend an absurd amount of time trying to understand your business and make sense of it.
I assume that's written into the contract somewhere and not a kickback, right?
Unless the company is yours, or it's a private company that can raise a compliance issue... Any other gifts?
At the risk of going off on a tangent, this is something I rarely see discussed but is perhaps one of the main problems with Azure. The whole cloud service feels like something someone oblivious to cloud computing would design if all they knew was renting bare metal servers. It's cloud computing in a way that completely defeats the whole concept of cloud computing.
Then find out it's not good at all and go "oh well, I guess we'll polish it over in the UI" (not knowing that no serious scale works with a UI).
If I can't have AWS I'll make do with GCP. But if someone wants to go full Azure, I'll find work elsewhere. Screw that. Life is too short to work with bad technology.
The most telling example was how, IIRC, Terraria was a launch highlight for Stadia to show off awesome indies; then somehow their magic systems locked down the developer's account, and despite internal pressure from Stadia devrel people he didn't get it back in time, until the developer just cancelled development of the Stadia port. https://www.reddit.com/r/Games/comments/lf7iie/terraria_on_s...
Would save me months of lead time.
Personal experience: Google Cloud support treated us quite well even when called by a small 3-person team doing minuscule spend. At another company, Microsoft treated us very well, but our spend could probably have been tracked by nationwide power-grid monitoring of their datacenters.
And AWS lied about features and ultimately never responded back.
I figure the account managers talking to high level management about contracting mandatory multi-million spend on AWS know how to talk with said management.
But in the end, when it came to actually developing and delivering products for others, we were left in the dust.
To make it funnier, part of what made it so hard was that the feature they lied to us about was supposed to be critical for making sure the UX for end users was really stellar.
I'm quite happy I'm not using AWS; in my case (HPC), spot instances just don't work.
- Some EMC guys came to install a storage device for us to test... and tripped over each other and knocked out an entire rack of servers like a comedy skit. (They uh... didn't win the contract.)
- Some poor guy driving a truck had a heart attack and the crash took our DFW datacenter offline. (There were bollards to prevent this sort of scenario, but the cement hadn't been poured in them yet.)
- At one point we temporarily laser-beamed bandwidth across the street to another building
- There was one day we knocked out windows and purchased box fans because servers were literally catching on fire.
Data center science has... well, improved since the earlier days. We worked with Facebook on the Open Compute Project, which had some very forward-looking infra concepts at the time.

- A key microwave link kept going down with intermittent packet errors way down in the data link layer. A short investigation discovered that a tree across the road had come into leaf, and a branch was blowing into the line of sight of the kit on our building. A step-ladder, a saw and 10 minutes later, we restored connectivity.
- Our main (BGP-ified) router out of the DC - no, there wasn't a redundant device - kept rebooting. A quick check showed the temp in the DC was so high, and cooling so poor, that the *inlet* air temp was over 60C. We pointed some fans at it as a temporary measure.
- In a similar vein, a few weeks later the air con in another room just gave up and started spewing water over the Nortel DMS-100 (we were a dial-in ISP with our own switch). Wasn't too happy to be asked to help mop it up (thought the water could potentially be live), but what to do?
After that experience I spent time on a small, remote island where the main link to the internet was a 1 MB/sec link via GS satellite (ping times > 500 ms), and where the locals dialled in over a microwave phone network rated to 9600 baud, but somehow 56k modems worked... One fix I realised I needed: a Solaris box was missing a critical .so, there were no local backups or install media, so I phoned my mate back in the UK and asked him to whack a copy up on an FTP server so I could get the box back online.

And a few years after that I also got to commission a laser beam link over Manchester's Oxford Road (at the time, the busiest bus route in Europe), to link up an office to a University campus. Fun times.
It was all terrific fun, but I'm so glad I now only really do software.
I don't blame you, a lot of us had to do things outside the box. Could be worse though, I saw a post on r/sysadmin yesterday where a poor guy got a support ticket to spray fox urine outside near the generators.
You say that, but...
> There was one day we knocked out windows and purchased box fans because servers were literally catching on fire
This happened to Equinix's CH1 datacenter in Chicago in Jan '24 (minus the literal fire part). It took down Azure ExpressRoute.
Apparently it got too cold and the CRACs couldn't take it? I'm told they had all the doors and windows open trying to keep things cold enough, but alas. As the CRAC goes, so go the servers.
It was also 115 degrees (F) ambient temp inside CH1. Techs were dipping in and out 5-10 minutes at a time to avoid heat stroke.
absolutely do not miss those days
Yikes, that escalated quickly. I'm glad you escaped the Switch Grim Reaper and my condolences to the families of the rest :(
Pointing the fans in or out?
I'm a bit surprised Meta doesn't offer a cloud provider yet to compete with AWS/GCP, especially considering how much R&D they've put into their infra.
Con: interacting with internal stakeholders is waaaaay different from doing the same for the general public paying you. See also: every mention of GCP that ever shows up in these threads
Plus all their SDKs would be written in php :-P
One of the stories was learning that stuff on top gets hotter than stuff on bottom.
This is, like, basic stuff here, guys. I've never understood the hiring practices in these projects.
Ah yes, or a collection of R2D2 portable air conditioners, with the tails draped out through the window.
Or a coolant leak that no one noticed until the sub-floor was completely full and the floor panels started to float!
I really enjoyed this post, mostly because we had similar adventures when setting up the infrastructure for Blekko. For Blekko, a company that had a lot of "east west" network traffic (that is traffic that goes between racks vs to/from the Internet at large) having physically colocated services without competing with other servers for bandwidth was both essential and much more cost effective than paying for this special case at SoftLayer (IBM's captive cloud).
There are some really cool companies that will build an enclosure for your cold aisle; basically it ensures all the air coming out of the floor goes into your servers' intakes and not anywhere else. It also keeps non-cold air from being entrained from the sides into your servers.
The calculations for HVAC 'CRAC' capacity in a data center are interesting too. In the first CoLo facility we had a 'ROFOR' (right of first refusal) on expanding into the floor area next to our cage, but when it came time to expand the facility had no more cooling capacity left so it was meaningless.
Once you've done this exercise, looking at the Oxide solution will make a lot more sense to you.
You’re an infrastructure company. You gotta own the metal that you sell or you’re just a middleman for the cloud, and always at risk of being undercut by a competitor on bare metal with $0 egress fees.
Colocation and peering for $0 egress is why Cloudflare has a free tier, and why new entrants could never compete with them by reselling cloud services.
In fact, for hyperscalers, bandwidth price gouging isn’t just a profit center; it’s a moat. It ensures you can’t build the next AWS on AWS, and creates an entirely new (and strategically weaker) market segment of “PaaS” on top of “IaaS.”
With this, we can slash that in half, lower storage costs, remove "per seat" pricing, etc.
Super exciting
https://github.com/netbox-community/netbox/issues?q=is%3Aiss...
Note how they want NetBox to be the one source of truth: "NetBox functions as the source of truth for your network infrastructure."
Your individual situation dictates what is important, but had NetBox targeted being a central repository, instead of insisting that no other system can be authoritative for certain items, it could be a different story.
We have learned that trying to centralize complexity and control doesn't work; heck, we knew that almost immediately after the Clinger-Cohen Act passed, and even ITIL and TOGAF fully call this out now. I expect this to be targeted by consultants over the next few years.
You need a central, consistent way to find state, to remove any questions or doubt about where to find the authoritative information; but generally, if you aspire to scale, grow, or adapt to new changes, you really need to avoid having some centralized, prescriptive, god-box system like this.
This is the usual case of "We need X and Y does X", but ignoring that Y also does Z,M,Q and washes dishes and you really don't need those things.
Sometimes building what you need is the easiest solution, especially when what you need is CRUD in front of a DB...
In this case railway will need to care about a lot of extra information beyond just racks, IP addresses and physical servers.
It feels more like an OSS tool for managing university campus scale infra, which is completely fine if that is the problem you have but for commercial scale infrastructure unfortunately there isn't a good OOTB DCIM option right now.
Wishing all the best to this team, seems like fun!
Cable management and standardization was extremely important (like you couldn't get by with shitty practices). At one place where we were deploying hundreds of servers per week, we had a menu of what ops people could choose if the server was different than one of the major clusters. We essentially had 2 chassis options, big disk servers which were 2u or 1u pizza boxes. You then could select 9/36/146gb SCSI drives. Everything was dual processor with the same processors and we basically had the bottom of the rack with about 10x 2u boxes and then the rest was filled with 20 or more 1u boxes.
If I remember correctly, we had gotten such an awesome deal on the price for power because we used facility racks in the cage or something, since I think they threw in the first 2x 30 amp (240V) circuits for free when you used their racks. IIRC we had a 10-year deal on that and there was no metering on them, so we just packed each rack as much as we could. We would put 2x 30s on one side and 2x 20s on the other side. I have to think that the DC was barely breaking even because of how much heat we put out and power we consumed. Maybe they were making up for it in connection / peering fees.
I can't remember the details, will have to check with one of my friends that worked there around that time.
Take Netflix. While almost everything is in the cloud the actual delivery of video is via their own hardware. Even at their size I doubt this business would be economically feasible if they were paying someone else for this.
Something I've seen often (some numbers changed because...)
20 PB Egress at $0.02/GB = $400,000/month
20 PB is roughly 67 Gbps 95th Percentile
It's not hard to find 100 Gbps flat rate for $5,000/month
Yes this is overly simplistic, and yes there's a ton more that goes into it than this. But the delta is significant.
For some companies $4,680,000/year doesn't move the needle, for others this could mean survival.
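For anyone who wants to sanity-check the arithmetic, here's a rough back-of-the-envelope version. It's only a sketch: the $10k/month flat-rate figure assumes something like two 100 Gbps ports for headroom and redundancy, which is my assumption, not the parent's.

```python
# Rough egress cost comparison: metered cloud egress vs. flat-rate transit.
# All figures are illustrative; real quotes vary by region, commit, and provider.

PB = 10**15  # bytes

egress_bytes_per_month = 20 * PB
cloud_price_per_gb = 0.02  # $/GB

cloud_monthly = egress_bytes_per_month / 10**9 * cloud_price_per_gb
print(f"Cloud egress: ${cloud_monthly:,.0f}/month")    # ~$400,000

# Average rate implied by 20 PB/month (95th percentile will sit a bit higher).
seconds_per_month = 30 * 24 * 3600
avg_gbps = egress_bytes_per_month * 8 / seconds_per_month / 10**9
print(f"Average rate: {avg_gbps:.0f} Gbps")            # ~62 Gbps average

# Assumption: two flat-rate 100G ports (~$5k each) for headroom + redundancy.
flat_rate_monthly = 2 * 5_000
delta_yearly = (cloud_monthly - flat_rate_monthly) * 12
print(f"Annual delta: ${delta_yearly:,.0f}")           # ~$4,680,000
```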
Did you standardize on layout at the rack level? What poka-yoke processes did you put into place to prevent mistakes?
What does your metal->boot stack look like?
Having worked for two different cloud providers and built my own internal clouds with PXE booted hosts, I too find this stuff fascinating.
Also, take utmost advantage of a new DC when you are bringing it up to try out all the failure scenarios you can think of, and the ones you can't, through randomized fault injection.
I'm going to save this for when I'm asked to cut the three paras on power circuit types.
Re: standardising layout at the rack level; we do now! We only figured this out after site #2. It makes everything so much easier to verify. And yeah, validation is hard; we're doing it manually thus far; we want to play around with scraping LLDP data but our switch software stack has a bug :/. It's an evolving process: the more we work with different contractors, the more edge cases we unearth and account for. The biggest improvement is that we have built an internal DCIM that templates a rack design and exports an interactive "cabling explorer" for the site techs, including detailed annotated diagrams of equipment showing port names, etc... The elevation screenshot in the post is part of that tool.
> What does your metal->boot stack look like?
We've hacked together something on top of https://github.com/danderson/netboot/tree/main/pixiecore that serves a debian netboot + preseed file. We have some custom temporal workers to connect to Redfish APIs on the BMCs to puppeteer the contraption. Then a custom host agent to provision QEMU VMs and advertise assigned IPs via BGP (using FRR) from the host.
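For anyone curious what "puppeteering the BMCs" over Redfish can look like in practice, here's a minimal sketch of the standard Redfish calls for a one-shot PXE boot and power cycle. This is not Railway's actual worker code; the BMC address, credentials, and system ID are placeholders.

```python
# Sketch: force a one-time PXE boot via a BMC's Redfish API, then reset the host.
# BMC address, credentials, and the system ID "1" are hypothetical placeholders.
import requests

BMC = "https://10.0.0.42"           # hypothetical BMC address
AUTH = ("admin", "password")        # hypothetical credentials
SYSTEM = f"{BMC}/redfish/v1/Systems/1"

session = requests.Session()
session.auth = AUTH
session.verify = False              # many BMCs ship self-signed certs

# 1. Ask the machine to PXE boot on its next (single) boot.
session.patch(SYSTEM, json={
    "Boot": {
        "BootSourceOverrideEnabled": "Once",
        "BootSourceOverrideTarget": "Pxe",
    },
}).raise_for_status()

# 2. Power-cycle it so the override takes effect; the PXE server (e.g. pixiecore)
#    then answers the netboot request and hands out the kernel + preseed.
session.post(f"{SYSTEM}/Actions/ComputerSystem.Reset", json={
    "ResetType": "ForceRestart",
}).raise_for_status()
```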
Re: new DCs for failure scenarios, yeah we've already blown breakers etc... testing stuff (that's how we figured out our phase balancing was off). Went in with a thermal camera on another. A site in AMS is coming up next week and the goal for that is to see how far we can push a fully loaded switch fabric.
The edge cases are the gold btw, collect the whole set and keep them in a human and machine readable format.
I'd also go through and, using a color-coded set of cables, insert bad cables (one at a time at first) while the system is doing an aggressive all-to-all workload, and see how quickly you can identify the faults.
It is the gray failures that will bring the system down, often several at once, since a single failure will go undetected for months and then finally tip things past an inflection point at a later time.
Are your workloads ephemeral and/or do they live-migrate? Or will physical hosts have long uptimes? It is nice to be able to rebaseline the hardware before and after host kernel upgrades so you can detect any anomalies.
You would be surprised how large a systemic performance degradation major cloud providers have seen build up over months because "all machines are the same": high precision but low absolute accuracy. It is nice to run the same benchmarks on bare metal and then again under virtualization.
I am sure you know, but you are running a multivariate longitudinal experiment, science the shit out of it.
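A tiny sketch of what that rebaselining could look like in practice. Everything here (the file name, the 5% threshold, the idea of tagging runs by kernel version) is purely illustrative; the benchmark itself is left to the caller.

```python
# Illustrative sketch only: keep per-host benchmark results keyed by kernel
# version and flag drift against the host's own pre-upgrade baseline.
import json, platform, statistics, time
from pathlib import Path

HISTORY = Path("bench-history.jsonl")   # hypothetical location

def record(seconds: float) -> None:
    """Append one benchmark run, tagged with the running kernel version."""
    entry = {"kernel": platform.release(), "ts": time.time(), "seconds": seconds}
    with HISTORY.open("a") as f:
        f.write(json.dumps(entry) + "\n")

def regressed(old_kernel: str, new_kernel: str, threshold: float = 0.05) -> bool:
    """Compare median runtimes before/after a kernel upgrade on this host."""
    runs = [json.loads(line) for line in HISTORY.read_text().splitlines()]
    med = lambda k: statistics.median(r["seconds"] for r in runs if r["kernel"] == k)
    return med(new_kernel) > med(old_kernel) * (1 + threshold)
```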
It's written for Cumulus Linux, but it should be adaptable to other NOSes with some work: https://github.com/CumulusNetworks/ptm
You give it a graphviz dot file, and it uses LLDP to ensure that reality matches that file.
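If ptm doesn't fit your NOS, the underlying idea is simple enough to hand-roll: declare the intended cabling and diff it against what LLDP reports. A rough sketch in Python; the port and neighbor names are invented examples, and collecting the "observed" map (from lldpctl, SNMP, gNMI, ...) is deployment-specific and left out.

```python
# Hand-rolled topology check: intended cabling vs. observed LLDP neighbors.
# Port and neighbor names below are invented examples.

# intended cabling: local port -> (neighbor system name, neighbor port)
EXPECTED = {
    "swp1": ("spine1", "Ethernet1"),
    "swp2": ("spine2", "Ethernet1"),
    "swp3": ("storage-01", "eth0"),
}

def check(observed: dict[str, tuple[str, str]]) -> list[str]:
    """Return a list of human-readable cabling problems."""
    problems = []
    for port, want in sorted(EXPECTED.items()):
        got = observed.get(port)
        if got is None:
            problems.append(f"{port}: no LLDP neighbor (dead link or unplugged?)")
        elif got != want:
            problems.append(f"{port}: expected {want}, saw {got}")
    for port in sorted(set(observed) - set(EXPECTED)):
        problems.append(f"{port}: unexpected neighbor {observed[port]} (undocumented cable?)")
    return problems

if __name__ == "__main__":
    # Example run with one mis-patched cable:
    seen = {"swp1": ("spine1", "Ethernet1"), "swp2": ("spine1", "Ethernet2")}
    print("\n".join(check(seen)) or "topology matches")
```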
To put this cost into perspective, you can buy two brand new 32 port 100G switches from Arista for the same amount of money. In North America, you can get 100G WAN circuits (managed Wavelength) for less than $5K/month. If it's a local metro you can also get dark fiber for less and run whatever speed you want.
This lets one get closer to the metal (e.g. all your data is on your specific disk, rather than an abstracted block storage, such as EBS, not shared with other users, cheaper, etc) without having to worry about the staff that installs the hardware or where/how it fits in a rack.
For us, this was a way to get 6x performance for 1/6 of the cost. (Excluding, of course our time, but we enjoyed it!)
Our prod is on AWS but we plan to move everything else and it's expected to save at least a quarter of a million dollars per year
Could be worth adding an autodiscovery <link> tag to the <head> so that RSS readers can autodiscover the feed. A random link I found on Google: https://www.petefreitag.com/blog/rss-autodiscovery/
>The calculations aren’t as simple as summing watts though, especially with 3-phase feeds — Cloudflare has a great blogpost covering this topic.
What's written in the Cloudflare blogpost linked in the article holds true only if you can use a delta config (as done in the US to obtain 208V), as opposed to the wye config used in Europe. The latter does not give a substantial advantage: there's no sqrt(3) boost to power distribution efficiency, and you end up just adding up the watts of three independent single-phase circuits (cf. https://en.m.wikipedia.org/wiki/Three-phase_electric_power).
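To put numbers on where the sqrt(3) enters the math on each side of the Atlantic, here's the arithmetic for a 30 A feed. The 30 A figure is just an example and breaker derating is ignored; the point is about the formula, not about which region gets more watts per feed.

```python
# Where the sqrt(3) shows up (and doesn't): usable power per feed at 30 A.
from math import isclose, sqrt

I = 30  # amps per phase

# US: loads connected phase-to-phase at 208 V. A three-phase feed delivers
# sqrt(3) * V_LL * I, i.e. a sqrt(3) "boost" over one 208 V single-phase circuit.
single_us = 208 * I                 # ~6.2 kW
three_us = sqrt(3) * 208 * I        # ~10.8 kW

# Europe: loads connected phase-to-neutral at 230 V. A three-phase wye feed is
# simply three independent 230 V single-phase circuits summed; the sqrt(3) is
# already baked into the 400 V line-to-line figure, so there's no extra factor.
single_eu = 230 * I                 # ~6.9 kW
three_eu = 3 * 230 * I              # ~20.7 kW
assert isclose(three_eu, sqrt(3) * 400 * I, rel_tol=0.01)

print(f"US: {single_us/1e3:.1f} kW single-phase vs {three_us/1e3:.1f} kW three-phase")
print(f"EU: {single_eu/1e3:.1f} kW single-phase vs {three_eu/1e3:.1f} kW three-phase")
```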
But, I'd need to start off small, probably per-cabinet UPSes and transfer switches, smaller generators. I've built up cabinets and cages before, but never built up the exterior infrastructure.
If it turns out to be any of “location, location, location” then getting a partially kitted out building may not help you.
Did they get independent data into the building via different routes? How’s the power?
Could be the data was coming in through a route that sees frequent construction. I knew a guy who ran the IT dept for a university and he discovered that the excavation crews found it was cheaper to maybe have to pay a fine for cutting data lines than it was to wait for them to be marked accurately. He spent a lot of time being stressed out.
Location is fairly good, as far as data centers go. It's got relatively good network connectivity, I believe, but I don't have specifics about entrances and diversity. It is close to one of the big fiber rings around the city, I believe the ring is pulled into the facility. I don't know if they had telco fiber in, or backhauled it via the fiber ring.
Power is probably good, but not great -- I'd doubt it's fed from multiple substations. There was, at one point, some generator bays.
While I could use data center space in town, it'd be hard to convince my work to move, partly as we just signed a 3 year agreement for hosting 60 miles away, partly just because of the cost of a move. It probably should remain a pipe dream.
Perhaps I am reading this wrong, as you appear to be fiber-heavy and do have space on the ladder rack for copper, but if you are commingling the two, be careful. A possible future iteration would be a smaller Panduit FiberRunner setup + a wire rack.
Co-mingling copper and fiber, especially through the large spill-overs, works until it doesn't.
Depending on how adaptive you need to be with technology changes, you may run into this in a few years.
A 4x6 runner encourages a lot of people to put extra cable up there, and sharing a spout with Cat6, CX-#, PDU serial, etc... will almost always end badly for some chunk of fiber. After those outages it also encourages people to "upgrade in place". When you are walking to your cage, look at the older cages: notice the loops sticking out of the tops of the trays, and the switches that look like porcupines because someone caused an outage and old cables were left in place.
Congrats on your new cage.
Turns out we are building like mad and we are still not building enough.
I remember that 30 years ago most people used one single computer for the whole family at home, and half of the people I knew didn't have proper internet access (and this is from a western perspective; the rest of the world was even far less digitized).
Now look at how many networked computers are around you: your phone, one or multiple TVs, laptops/desktops, smart home appliances.
And this is just looking at the very small sample size of a normal household. Add to that things like the digitalization of factories and the digitalization of the rest of the world (internet access has grown massively in the developing world in the past 15 years).
We have far more computers than a decade ago, and far more people have them as well, and it shows very little sign of stopping.
IPv6, for instance, supports an absolutely unfathomable address space (which people often seem to think is overkill), but looking at past growth, I think having such a large address space is a wise choice.
Another thing people seem not to notice is that a lot of older DCs are being phased out, mainly because these facilities are repurposed telephone exchanges and far less suitable for more power-hungry computing.
Yes, that was already taken into account. We rushed past 4B active smartphone users in 2020, with an additional 1B users who have access to 4G / a smartphone but aren't using any data, 1B people who can't afford it or are using a feature phone, and around 1B people outside the age range for using a smartphone (children, babies, etc.). That is around 7B people already, and anything after is a long tail of the new generation outpacing the older generation. Tablet usage has levelled off. The PC market hasn't had new growth outside China and India. COVID managed to speed up a lot of this digitalisation. I wrote about how the growth of AWS in 2023 was roughly equal to doubling its 2016 self, i.e. in 2023 AWS was building out the entire size of AWS circa 2016. That is insane. And yet we are still building more.
>Another thing which people seem to not notice is that a lot of older DC's are being phased out,
That is something I was not aware of, but I will certainly keep a look out. It got me thinking: is building out a new DC easier and cheaper than repurposing an older DC not designed for high compute density? While, as I said, we could increase compute, RAM and storage density per rack by 10-20x, we have also increased power usage by 4-5x. Not only electricity usage but cooling and design also require additional thought.
"we kicked off a Railway Metal project last year. Nine months later we were live with the first site in California".
seems inconsistent with:
"From kicking off the Railway Metal project in October last-year, it took us five long months to get the first servers plugged in"
The article was posted today (Jan 2025), was it maybe originally written last year and the project has been going on for more than a year, and they mean that the Railway Metal project actually started in 2023?
Timeline-wise:
- We decided to go for it and spend the $$$ in Oct '23
- Convos/planning started ~ Jan '24
- Picked the vendors we wanted by ~ Feb/Mar '24
- Lead times, etc... meant everything was ready for us to go fit the first gear, mostly by ourselves, at the start of May (that's the 5 months)
- We did the "proper" re-install around June, followed closely by the second site in ~ Sep, around when we started letting our users on it as an open beta
- Sep-Dec we just doubled down on refining software/automation and process while building out successive installs
Lead times can be mind-numbing. We have certain switches from Arista that have a 3-6 month lead time. Servers are built to order, so again 2+ months depending on stock. And obviously holidays mean a lot of stuff shuts down around December.
Sometimes you can swap stuff around to get better lead-times, but then the operational complexity explodes because you have this slightly different component at this one site.
I used to be an EEE, and I thought supply chain there was bad. But with DCs I think it's sometimes worse, because you don't directly control some parts of your BoM/supply chain (especially with build-to-order servers).
The advantage at cloud scale is a lot of constant signal around capacity delivery, demand etc. so you can build mathematical models to best work out when to start placing orders, and for what.
Railway: No, your margin is my opportunity.
We can lower that once we’re fully on metal
We provide a small PaaS-like hosting service, kinda similar to Railway (but more niche). We recently re-evaluated our choice of AWS as infra provider (since $$$), but will now stick with it [1].
We started with colocation 20 years ago. For a tiny provider it was quite a hassle (but also an experience). We just had too many single points of failure, and we found ourselves dealing with physical servers way too often. We also struggled to phase out and replace hardware.
Without reading all the comments thoroughly: for me, being on infra that runs on green energy is important. I think it's also a trend with customers; there's even a service for this [2]. I don't see it mentioned here.
[1] https://blog.fortrabbit.com/infra-research-2024
[2] https://www.thegreenwebfoundation.org/
Amazingly, few companies who run their own DCs could build anything comparable to EC2, even at a smaller scale. When I worked in those companies, I sorely missed EC2. I was wondering if there are any robust open-source alternatives to EC2's control-plane software to manage bare metal and offer VMs on top of it. That would be awesome for companies that build their own DCs.
At this scale, why did you opt for a spine-and-leaf design with 25G switches and a dedicated 32×100G spine? Did you explore just collapsing it and using 1-2 32×100G switches per rack, then employing 100G>4×25G AOC breakout cables and direct 100G links for inter-switch connections and storage servers?
Have you also thought about creating a record on PeeringDB? https://www.peeringdb.com/net/400940
By the way, I’m not convinced I’d recommend a UniFi Pro for anything, even for out-of-band management.
When we started, we didn't have much of an idea about what the rack needed to look like. So we chose a combination of things we thought we could pull off. We're mostly software and systems folks, and there's a dearth of information out there on what to do. Vendors tend to gravitate towards selling BGP+EVPN+VXLAN or whatever "enterprise" reference designs, so we kinda YOLO'ed Gen 1. We decided to spend extra money if it could get us to a working setup sooner. When the clock is in cloud spend, there's uh... lots of opportunity cost :D.
A lot of the chipset and switch choices were bets, and we had to pick and choose what we gambled on - and what we could get our hands on. The main bets this round were eBGP to the hosts with BGP unnumbered, and SONiC switches - this lets us do a lot of networking with our existing IPv6/WireGuard/eBPF overlay and a Debian-based switch OS + FRR (so fewer things to learn). And of course figuring out how to operationalise the install process and get stuff running on the hardware as soon as possible.
Now that we've got a working design, we'll start iterating a bit more on the hardware choices and network design. I'd love for us to write about it when we get through it. Plus I think we owe the internet a rant on networking in general.
Edit: Also, we don't use UniFi Pro / Ubiquiti gear anywhere?
It was my first job out of university. I will never forget the awesome experience of walking into the datacenter and starting to plug in cables and stuff.
2. Geographically distanced backups, if the primary fails. Without this you are already in trouble. What happens if the building burns down?
3. Hooking up with "local" ISPs. That seems OK, as long as an ISP failing is easily and automatically dealt with.
4. I am a bit confused about what happens at the edge. On the one hand it seems like you have one datacenter with ISPs doing the routing; in other places I get the impression you have compute close to the edge. Which is it?
2. In the diagram you can see site 1 and site 2.
3. Yes, routers automatically deal with ISP failures.
Meta-comment: it's getting really hard to find hosting services that provide true unlimited bandwidth. I want to do video upload/download in our app, and I'm struggling to find providers of managed servers that are willing to give me a fixed price for 10/100 Gbps ports.
A 10G port should be in the range of $2k per month, I believe? I don't mind paying that much.
I've used their postgres offering for a small project (crucially it was accessible from the outside) and not only was setting it up a breeze, cost was also minimal (I believe staying within the free tier). I haven't used the rest of the platform, but my interaction with them would suggest it would probably be pretty nice.
Is it a good or a bad thing to have the same customer support across the board?
Like, if Terraform had a nice UI?
If you've heard of serverless, this is one step further: infraless.
Give us your code, we will spin it up, keep it up, automate rollouts, service discovery, cluster scaling, monitoring, etc.
Like Vercel but not just for front end
I've been using railway since 2022 and it's been great. I host all my personal projects there and I can go from code to a url by copy-pasting my single dockerfile around.
LOL?
Using PXE to bootstrap an installer kernel (only a few MB) over TFTP, which then fetches the rest of the OS over HTTP, is quick, and you can preseed/kickstart a machine in minutes.
What would you say are your biggest threats?
Power seems to be the big one, especially when AI and electric vehicle demand drive up kWh prices.
Networking seems another one. I'm out of the loop, but it seems to me like the internet is still stuck at 2010 network capacity concepts like "10Gb". If networking had progressed as compute power has (e.g. NVMe disks can provide 25GB/s), 100Gb would be the default server interface? And the ISP uplink would be measured in terabits?
How is the diversity in datacenter providers? In my area, several datacenters were acquired and my instinct would be that: the "move to cloud" has lost smaller providers a lot of customers, and the industry consolidation has given suppliers more power in both controlling the offering and the pricing. Is it a free market with plenty of competitive pricing, or is it edging towards enshittification?
High end network interfaces are entering the 800Gbps interface era right now.
Also, in 2010, 10Gbps network connectivity to end hosts was NOT common (it was common for router uplinks and interconnects though).
Network interfaces have not scaled as nicely because getting fast enough lasers to handle speeds higher than 100Gbps has been a challenge, and getting to higher speeds basically means doing wavelength-division multiplexing over multiple channels across a single fiber.
Also, the density of connections per fiber has increased massively because the cost of DWDM equipment has come down significantly.
Tons of Colocation available nearly everywhere in the US, and in the KCMO area, there are even a few dark datacenters available for sale!
Cool project nonetheless. Bit jealous actually :P
So, while we could have bought something off the shelf, that would have been suboptimal from a specs perspective. Plus then we'd have to source supply chain etc.
By owning not just the servers but the whole supply chain, we have redundancy at every layer, from the machine, to the parts on site (for failures), to the supply chain (refilling those spare parts/expanding capacity/etc)
Application Developer
DevOps Engineer
Site Reliability Engineer
Storage Engineer
Good luck, hope you pay them well.
Still kudos going this path in the cloud-centric time we live in.
They will vary by country, by state, or even by county; setting up a DC in the Bay Area versus one in Ohio or Utah is a very different endeavor with different design considerations.
One of the better was the dead possum in the drain during a thunderstorm.
>So do we throw the main switch before we get electrocuted? Or do we try to poke enough holes in it that it gets flushed out? And what about the half million in servers that are going to get ruined?
Sign up to my patreon to find out how the story ended.
We had one provider give us a great price and then bait and switch at the last moment to tell us that there is some other massive installation charge that they didn't realize we had to pay.
Switch Connect/Core is based off the old Enron business that Rob (CEO) bought...
https://www.switch.com/switch-connect/ https://www.switch.com/the-core-cooperative/
The cynic in me says this was written by sales/marketing people targeted specifically at a whole new generation of people who've never laid hands on the bare metal or racked a piece of equipment or done low voltage cabling, fiber cabling, and "plug this into A and B power AC power" cabling.
By this, I mean people who've never done anything that isn't GCP, Azure, AWS, etc. Many terminologies related to bare metal infrastructure are misused by people who haven't been around in the industry long enough to have been required to DIY all their own infrastructure on their own bare metal.
I really don't mean any insult to people reading this who've only ever touched the software side, but if a document is describing the general concept of hot aisles and cold aisles to an audience in such a way that it assumes they don't know what those are, it's at a very introductory/beginner level of understanding the OSI layer 1 infrastructure.
I wanted to start off with the 101 content to see if people found it approachable/interesting. He's got like reams and reams of 201, 301, 401
Next time I'll stay out of the writing room!
When the original AWS instances came out, it would take about two years of on-demand pricing to pay for the same hardware on-prem. Now it's between two weeks for ML-heavy instances and six months for medium CPU instances.
It just doesn't make sense to use the cloud for anything past prototyping, unless you want Bezos to have a bigger yacht.
Step 1: sign a lease at an apartment
HN people are smart
For people who have taken empty lots and constructed new data centers (ie, the whole building) on them from scratch, the phrase "building a datacenter" involves a nonzero amount of concrete.
OP seems to have built out a data hall - which is still a cool thing in its own right! - but for someone like me who's interested in "baking an apple pie from scratch", the mismatch between the title and the content was slightly disappointing.
I think you're failing to understand the meaning and the point of "building your own datacenter".
Yes, you can talk about your office all you'd like, much like OP can talk about their server farm and their backend infrastructure.
What you cannot talk about is your own office center. You do not own it. You rent office space. You only have a small fraction of the work required to operate an office, because you effectively offloaded the hard part to your landlord.
Cloudflare has also historically used “datacenter” to refer to their rack deployments.
All that said, for the purpose of the blog post, “building your own datacenter” is misleading.
cloud.google.com/about/locations lists all the locations where GCE offers service, which is a superset of the large facilities that someone would call a "Google Datacenter". I liked to refer to the distinction mostly as Google concrete (we built the building) or not. Ultimately, even in locations that are shared colo spaces, or rented, it's still Google putting custom racks there, integrating into the network and services, etc. So from a customer perspective, you should pick the right location for you. If that happens to be in a facility where Google poured the concrete, great! If not, it's not the end of the world.
P.S., I swear the certification PDFs used to include this information (e.g., https://cloud.google.com/security/compliance/iso-27018?hl=en) but now these are all behind "Contact Sales" and some new Certification Manager page in the console.
Edit: Yes! https://cloud.google.com/docs/geography-and-regions still says:
> These data centers might be owned by Google and listed on the Google Cloud locations page, or they might be leased from third-party data center providers. For the full list of data center locations for Google Cloud, see our ISO/IEC 27001 certificate. Regardless of whether the data center is owned or leased, Google Cloud selects data centers and designs its infrastructure to provide a uniform level of performance, security, and reliability.
So someone can probably use web.archive.org to get the ISO-27001 certificate PDF from whenever the last time it was still up.
Even where they do lease wholesale space, you'd be hard-pushed to find examples of more than one of them in a single building. If you count only Microsoft, Google, and AWS, I'm not sure I can think of a single example off the top of my head. It's only really possible if you start including players like IBM or Oracle in that list.
I think you're conflating things.
Those hypothetical hyperscalers can advertise their availability zones and deployment regions, but they do not claim they built the data centers. They provide a service, but they do not make broad claims on how they built infrastructure.
TFA explains what they're doing; they literally write this:
"In general you have three main choices: Greenfield buildout (...), Cage Colocation (getting a private space inside a provider's datacenter enclosed by mesh walls), or Rack colocation...
We chose the second option"
I don't know how much clearer they can be.