The reality nowadays is: the on-prem staff is covered by the colo fees, which are split between everyone coloing in the location and are reasonably affordable. The software-level work above that has been massively simplified over the past 15 years, and is now roughly on par with the volume of work it would take to run workloads in the cloud (do you think managing IAM and Terraform is free?).
No, but I would argue that a SaaS offering, where the storage system is maintained for you, actually requires fewer maintenance hours than hosting 30 PB in a colo.
In Terraform you define the S3 bucket and run terraform apply. Afterwards the company's credit card is the limit. Setting up and operating 30 PB yourself is an entirely different story.
Every real-world colocation or self-hosting project I've ever been around has underestimated its downtime and rate of problems by at least an order of magnitude. The amount of time lost to driving to the datacenter, waiting for replacement parts to arrive, and scrambling to patch over unexpected failure modes is always much higher than expected.
There is a false sense of security that comes in the early days of the project when you think you've gotten past the big issues and developed a system that's reliable enough. The real test is always 1-2 years later when teams have churned, systems have grown, and the initial enthusiasm for playing with hardware has given way to deep groans whenever the team has to draw straws to see who gets to debug the self-hosted server setup this time or, worse, drive to the datacenter again.
I don't have this experience at all. Our colo handled almost all of the work. The only time I ever went to the server farm was to build out whole new racks. The colo even handled replacing servers for us at a good cost.
Our reliability came from software not hardware, though of course we had hundreds of spares sitting by, the defense in depth (multiple datacenters, each datacenter having 2 'brains' which could hotswap, each client multiply backed up on 3-4 machines)...
Servers going down was fairly commonplace, and servers dying was commonplace too. I think once we had a whole rack outage when the switch died, and we flipped it to the backup.
Yes these things can be done and a lot cheaper than paying AWS.
I think another big crux here is that there isn't really any notion of cluster-wide downtime, aside from e.g. a full datacenter power outage (which we've had ig, and now have UPSes in each rack kindly provided and installed by our datacenter). On the software/network level the storage isn't really coordinated in any manner, so failures of one machine only reflect as a degradation to the total theoretical bandwidth for training. This means that there's generally no scrambling and we can just schedule maintenance at our leisure. Last time I drew straws for maintenance I clocked a 30min round-trip to walk over and plug a crash cart into each of the 3 problematic machines to reboot and re-initialize and that was it.
Again having it right by the office is super nice, we'll need to really trust our kvm setup before considering anything offsite.
For server issues, again, pretty easy. Just use iKVM/IPMI and iPXE to diagnose a faulty server. And again, using "remote hands" from the colo provider can help fix problems if your staff does not have the skills.
I think that anyone with actual experience of operating thousands of physical disks in datacenters would challenge this assumption.
'Networking was a substantial cost and required experimentation. We did not use DHCP as most enterprise switches don’t support it and we wanted public IPs for the nodes for convenient and performant access from our servers. While this is an area where we would have saved time with a cloud solution, we had our networking up within days and kinks ironed out within ~3 weeks.'
Where does the switch choice come into whether you DHCP? Wth would you want public IPs.
So anyone can download 30 PB of data with ease of course.
Possibly to avoid needing NAT (or VPN) gateway that can handle 100Gbps.
Those IPs you end up assigning manually could be private ones or routable ones. If private, authorized traffic could be bridged onto the network by anything, such as a random computer with 2 NICs, one of which is connected eventually to the Internet and one of which is on the local network.
If public, a firewall can control access just as well as using NAT can.
And I think this would be a banger for IPv6 if they really "need" public IPs.
DHCP is really only worth it when your hosts are truly dynamic (i.e. not controlled by you). Otherwise it's a lot easier to handle IP allocation as part of the asset lifecycle process.
Heck even my house IoT network is all static IPs because at the small scale it's much more robust to not depend on my home router for address assignment - replacing a smart bulb is a big enough event, so DHCP is solely for bootstrapping in that case.
At the enterprise level unpacking a server and recording the asset IDs etc is the time to assign IP addresses.
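To make the "assign addresses as part of the asset lifecycle" idea concrete, here is a minimal sketch that hands out fixed addresses in inventory order. The asset tags, the 192.0.2.0/24 documentation subnet, and the function name are all made up for illustration; nothing here is from the original setup.

```python
import ipaddress
from itertools import islice


def assign_static_ips(asset_tags, subnet, reserved=1):
    """Map each asset tag to a fixed address from `subnet`, in inventory order.

    `reserved` skips that many usable addresses at the start of the subnet
    (for example the gateway). All names here are hypothetical.
    """
    network = ipaddress.ip_network(subnet)
    hosts = islice(network.hosts(), reserved, None)
    allocation = {}
    for tag in asset_tags:
        ip = next(hosts, None)
        if ip is None:
            raise ValueError(f"subnet {subnet} is too small for the inventory")
        allocation[tag] = str(ip)
    return allocation


# Hypothetical inventory recorded while unpacking the servers.
inventory = ["rack1-node01", "rack1-node02", "rack1-node03"]
allocation = assign_static_ips(inventory, "192.0.2.0/24")
for tag, ip in allocation.items():
    print(tag, ip)
```

The resulting mapping could then be baked into whatever provisions the machines, so each address is set once at install time and never negotiated again.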
It gets set approximately once when the server's automated Ubuntu installation runs, and I never think about it.
> Where does the switch choice come into whether you DHCP?
Perhaps from home routers which include I've.
> Wth would you want public IPs.
Why wouldn't you? They have a firewall.
Heh, that's a clever solution to the problem of managing storage through the full 10 year disk lifecycle.
I think following Backblaze's hard disk stats is enough at this point.
[0] https://www.backblaze.com/cloud-storage/resources/hard-drive...
I should also note personally for home cluster use, I learned quickly that used drives didn’t seem to make sense. Too much performance variability.
1. Fail in the first X amount of time
2. Fail towards the end of their rated lifespan
So buying used drives doesn't seem like the worst idea to me. You've already filtered out the drives that would fail early.
Disclaimer: I have no idea what I'm talking about
e.g. have someone show up to the datacenter with a grocery list of slot indices and a cart of fresh drives every few months.
I started my career when on-prem was the norm and remember so much trouble. When you have long-lived hardware, eventually, no matter how hard you try, you just start to treat it as a pet and state naturally accumulates. Then, as the hardware starts to be not good enough, you need to upgrade. There's an internal team that presents the "commodity" interface, so you have to pick out your new hardware from their list and get the cost approved (it's a lot harder to just spend a little more and get a little more). Then your projects are delayed by them racking the new hardware and you properly "un-petting" your pets so they can respawn on the new devices, etc.
Anyways, when cloud came along, I was like, yeah we're switching and never going back. Buuut, come to find out that's part of the master plan: it's a no-brainer good deal until you and everyone in your org/company/industry forgets HTF to rack their own hardware, and then it starts to go from no-brainer to brainer. And basically unless you start to pull back and rebuild that muscle, it will go from brainer to no-brainer bad deal. So thanks for building this muscle!
IIRC the wisdom of the time cloud started becoming popular was to always be on-prem and use cloud to scale up when demand spiked. But over time temporarily scaling up became permanent, and devs became reliant on instantly spawning new machines for things other than spikes in demand and now everyone defaults to cloud and treats it as the baseline. In the process we lost the grounding needed to assess the real cost of things and predictably the cost difference between cloud and on-prem has only widened.
I've heard that before but was never able to make sense of it. Overflowing into the cloud seems like a nightmare to manage, wouldn't overbuilding on-prem be cheaper than paying your infra team to straddle two environments?
I would never dream of running Docker in production. It seems so overly complicated. Also, since day one, I could never understand using a public registry for mission critical stuff. When I was learning Docker, I would unplug the network cable so I wouldn’t accidentally push my container online somewhere with all my data.
I totally get the concept at scale. I also get the concept of just shipping an application in a container. I also get the self-hosting concept of "just give me the container so I don’t have to think about how it all works."
However, there is the complexity of building the container, cleanup, deleting entries, environment variables, no SSH availability (even on Railway in the beginning), and the ambiguity about where your container needs to be to even get it somewhere: public registry or private registry?
Certainly most of it is my lack of knowledge from not sticking with it.
Just give me a VM and some firewall rules. Cloning VMs can be automated in so many different ways.
/rant
Docker is a monster that you have to treat as a pet. You've still got to pet it through stages of updating, monitoring, snapshots and networking. When the internal system breaks it's no different to a server collapsing.
Snapshots are a haircut for the monster, useful but can make things worse.
Sometimes you can afford not to have a triple-redundant 1000GB network, or a simple single machine with RAID may have acceptable downtime.
They've spent years optimizing everything so their monthly costs are going to be much lower than what these guys managed on their first attempt. They'll be using less energy. They run their own internet backbones and infrastructure, they design their own hardware and source components directly from the best suppliers, they have exclusive deals with energy providers, etc. Everything these guys did, AWS does way better. And yet they charge 40x more. AWS at cost price would probably be 60-80x less than what they charge; if not more. Cloudflare undercuts them a bit because they are smaller but they can do the same things. So do MS, Google, and everybody else.
This market is ripe for disruption. There should not be a need to shovel hundreds of billions per year into AWS revenue for the industry. The same business operating at 20% margins would be a game changer. And most of this stuff is commodity stuff. Why is there not more competition in this space driving pricing down aggressively? What's keeping competitors off the market?
I think it's very hard to make a claim that the market is not price-competitive. The problem is that most decision makers don't actually prioritize price, they prioritize support and the larger ecosystem. It's easy to find engineers with experience on the big 3 clouds and they will be able to pick up where the previous engineers left off. No CTO goes to sleep worrying about whether being vendor-locked into AWS will result in catastrophic business failure tomorrow due to catastrophic hardware failure. There is a larger ecosystem (observability, FinOps tooling, cloud security tooling, managed databases) that is virtually guaranteed to support AWS, sometimes supports GCP and Azure, and almost never supports any of the other clouds.
It's questionable whether the current situation is really due to companies like Oracle and IBM being unable to compete on price and make strategic partnership deals to build out ecosystem support for their clouds. I think it's more likely that AWS/GCP/Azure "won" the cloud market, and that if regulators were worth a damn, they'd start to address the market concentration instead of ignoring it.
Honestly, I don't understand this line of thinking. You're not alone in it either, but it always reads so naive.
It's as if you could magically hand-wave it, as if it'd be so easy to get a competitor who can do it just as well as and/or cheaper than AWS. But somehow these statements seem to ignore that you can't. Otherwise DigitalOcean would've, or Cloudflare, or <name-your-linode-rackspace-startup>.
The industry funnels billions to become vendor locked in because AWS is simply that good - bar none.
If you look at https://ir.aboutamazon.com/news-release/news-release-details... , it says in 2024,
> AWS segment sales increased 19% year-over-year to $107.6 billion.
> AWS segment operating income was $39.8 billion, compared with operating income of $24.6 billion in 2023.
So about 59% margin, relative to costs. Everybody undercutting AWS is likely doing so at close enough to 20% margins that it makes no sense to fund a startup in the same space.
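For anyone checking the arithmetic, here is a small sketch using only the two figures quoted above. It treats everything that is not operating income as cost, which is a simplification, and the variable names are mine.

```python
# Figures quoted above from Amazon's 2024 results (USD billions).
aws_sales = 107.6
aws_operating_income = 39.8

# Simplification: everything that is not operating income counts as cost.
aws_costs = aws_sales - aws_operating_income

# Income relative to costs (the "59%" figure in the comment above).
markup_on_costs = aws_operating_income / aws_costs

# Income relative to sales (the more usual "operating margin").
operating_margin = aws_operating_income / aws_sales

print(f"costs:            ${aws_costs:.1f}B")
print(f"markup on costs:  {markup_on_costs:.1%}")
print(f"operating margin: {operating_margin:.1%}")
```

That works out to roughly 58.7% relative to costs, or about 37.0% relative to sales, so which "margin" is meant matters when comparing against a 20% figure.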
I would be shocked if S3 had less than a 100% margin at sticker price.
These guys had one of the most important prerequisites to make going on-prem easy - they didn't care about cross-site or reliability.
keep in mind that cost there is likely to also be including personnel. probably a significant fraction if you consider how many employees amazon's aws division has.
A previous project I worked on had relatively little traffic, and AWS costs were rather insane for that.
I proposed exploring OVH or DO to probably get costs down to 2 digits per month. Upper management would hear nothing of it - AWS was what they wanted, costs be damned. They were more protecting their own jobs than making a technical decision, I think.
> Upper management would hear nothing of it
No one knows how to plan ahead any more. It's all "agile" and hardware (and budgeting for it) isn't something most in management are capable of doing any more.
Then there is also justifying the CapEx on a 5 year amortization schedule... the thing is even if you borrow that money at the current rate (7 percent) you can still come out far ahead of AWS... It's a lot of math, and a lot of accounting (and the accountability that comes with it).
Your average CTO just doesn't have these skills.
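The math itself is not that scary. Below is a rough sketch of the standard level-payment loan formula for the 5 year, 7 percent case mentioned above. The hardware price is a made-up placeholder, not a number from the article, and real financing terms will differ.

```python
def monthly_payment(principal: float, annual_rate: float, years: int) -> float:
    """Level monthly payment on a fully amortizing loan."""
    n = years * 12            # number of monthly payments
    r = annual_rate / 12      # monthly interest rate
    if r == 0:
        return principal / n
    return principal * r / (1 - (1 + r) ** -n)


# HYPOTHETICAL input: replace with your own quote.
hardware_capex = 500_000.0
annual_rate = 0.07            # the "current rate" from the comment above
years = 5                     # amortization schedule from the comment above

payment = monthly_payment(hardware_capex, annual_rate, years)
total_interest = payment * years * 12 - hardware_capex

print(f"monthly payment: ${payment:,.2f}")
print(f"total interest:  ${total_interest:,.2f}")
```

Add the monthly colo and staffing costs to that payment and compare the sum against the monthly cloud bill for the same capacity. Whether you come out ahead depends entirely on those inputs.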
The cloud providers have definitely optimized their pricing for maximum profit extraction. Costs are high, and in many cases they're not high enough to actually warrant changing infrastructure to cheaper alternatives.
Sticking with AWS / Azure / GCP carries other benefits, too. You're more likely to find engineers who are experienced with those cloud platforms over, say, OVH.
The majority of my server wrangling work was spent dealing with OS updates and, most annoyingly, OpenStack. But that's something you can't escape even if you run your stuff in the cloud...
b) They seem tolerant of failures so it's not going to be anything like 5hrs/week of physical maintenance. It will be bursty though (e.g. box died, time to replace it...) but assuming they have spares of everything sitting around / already racked it shouldn't be a big deal.
(for Hetzner in particular it was a massive pain when we were trying to get CPU quotas with them for other data operations, and we prob don't want to have it in Europe, but it's been pretty easy to negotiate good quotes on similar deals locally now that we've shown we can do it ourselves)
They'd most likely claim abuse and delete your data wholesale without notice
However, as they noted they use 100gbps capacity from their ISP.
:) just kidding but also curious where besides torrents
Also, it's some of the most boring footage, in overwhelming amounts; it's about the least desirable thing for humans to sit and watch every minute of.
Why send a human to do a machine's job?
obv as we took on this project the delta between our cluster and the next-best option got smaller, in part bc the ability to host it ourselves gives us negotiating leverage, but managed bucket products are fundamentally overspecced for simple pretraining dumps. glacier does a nice job fitting the needs of archival storage for a good cost, but there's nothing similar for ML needs atm.
Even at 10PB, the storage commit discounts won't be anywhere near 50%. Probably more like 10-20%, if that.
Tell me it’s only someone _like_ Rewind and not actually them! Quoting from the Privacy page they link in their header.
So my guess is either CCTV type of footage where there’s large gaps of motion / high GOP / big codec gains - or something like desktop recordings which are generally very low bit rate even though they can be high res. At that bitrate I can’t imagine it’s something like YouTube video. (Unrelated to the bitrate maybe it’s something like all older public domain videos). I would love to have an idea of what type of videos they are using (just out of curiosity)
Is it a public benefit corp?
What are y'all building?
We're working on pretraining computer action models from the ground up, hence the pretraining data cluster. We're a public benefit corp because we think it's important for AGI to be built in the public's interest + are planning on automating a lot of the work done on computers!
It goes without saying any data pre-processing needs to be done before writing, at the storage site, or on the training GPUs.
I guess worth noting that we do have a bunch of 4090s in the colo and it's been super helpful for e.g. calculating embeddings and such for data splits.
They got used drives. I wonder if they did any testing? I've gotten used drives that were DOA, which showed up in tests - SMART tests, short and long, then writing pseudorandom data to verify capacity.
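For the last step (writing pseudorandom data and reading it back), here is a minimal sketch of the idea. It is not what anyone in this thread actually ran: the function names and block size are my own, it needs Python 3.9+ for `randbytes`, and a real burn-in would target the block device and make sure the read-back is not served from the page cache.

```python
import os
import random

BLOCK = 1 << 20  # 1 MiB per write; an arbitrary choice for this sketch


def write_pattern(path: str, total_bytes: int, seed: int) -> None:
    """Fill `path` with a reproducible pseudorandom byte stream."""
    rng = random.Random(seed)
    with open(path, "wb") as f:
        remaining = total_bytes
        while remaining > 0:
            n = min(BLOCK, remaining)
            f.write(rng.randbytes(n))
            remaining -= n
        f.flush()
        os.fsync(f.fileno())  # push the data out of Python and OS buffers


def verify_pattern(path: str, total_bytes: int, seed: int):
    """Regenerate the same stream and compare block by block.

    Returns the byte offset of the first block that differs, or None if
    everything matches.
    """
    rng = random.Random(seed)
    with open(path, "rb") as f:
        offset = 0
        while offset < total_bytes:
            n = min(BLOCK, total_bytes - offset)
            expected = rng.randbytes(n)
            if f.read(n) != expected:
                return offset
            offset += n
    return None
```

Because the stream is regenerated from the seed, nothing has to be kept in memory, and a drive that reports more capacity than it really has shows up as a mismatch or a short read at some offset.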
w.r.t. testing the main thing we did was try to buy a bit from each supplier a month or two ahead of time, so by the time we were doing the full build that rack was a known variable. We did find one drive lot which was super sketchy and just didn't include it in the bulk orders later. diversity in suppliers helps a lot with tail risk
A lot of older toploaders from vendors like Dell are not tool-free. If you bought vendor drives and one fails, you RMA it and move on. However if you want to replace failed drives in the field, or want to go it alone from the start with refurbished drives... you'll be doing a lot of screwing. They're quite fragile and the plastic snaps easily. It's pretty tedious work.
https://www.theserverstore.com/supermicro-superstorage-ssg-6...
You can get a DC guy but then he doesn't have much to do post setup and if you contract that you're paying mondo dollars anyway to get it right and it's a market for lemons (lots of bullshitters out there who don't know anything).
Learned this lesson painfully.
this has been incredibly nice for our first hardware project, if we ever expand substantially then we'd def care more about the colo costs.
They probably still won’t budge on the egress fees.
There it is, the answer to how to mitigate brand damage when risking distance between themselves and some of those 9s.
At that rate I wonder if it makes sense to do a massive solar panel and battery installation. They're already hosting all of their compute and storage on prem, so why not bring electricity generation on prem as well?
Unless they go out of business in 3 years that seems unlikely to me. Is this a one-off model where they train once and it never needs to be updated?
Most venture-backed startups would have just paid the AWS or Cloudflare tax. I certainly hope your VCs appreciate how efficient you are being with their capital :)
So anyway you basically pretend you resold the drives today. Here they are assuming in 3 years time no one will pay anything for the drives. Somewhat reasonable to be honest since the setup's bespoke and you'll only get a fraction of the value of 3 year old drives if you resold them.
I'm curious about the process of getting colo space. Did you use a broker? Did you negotiate, and if so, how large was the difference in price between what you initially were quoted and what you ended up paying?
On the product side we're trying to orient more towards 'productive work assistant' rather than the default pull of audio models towards being an 'ai friend'.
They apparently have zero need for redundancy in their use case, and the failure rate won't be high enough to take out a significant percentage of their capacity.
Plus, the TCO is already way under the cloud equiv. so might as well spend a little more to get something much newer and more reliable
> One thing we discovered very quickly was that [world cup] goals scored showed up in our monitoring graphs. This was very cool because not only is it neat to see real-world events show up in your systems, but this gave our team an excuse to watch soccer during meetings. We weren’t “watching soccer during meetings”, we were “proactively monitoring our systems’ performance.”
https://discord.com/blog/how-discord-stores-trillions-of-mes...
It is linked as evidence for Discord using "less than a petabyte" of storage for messages. My best guess is that they multiplied node size and count from this post, which comes out to 708 TB for the old cluster and 648 in the new setup (presumably it also has some space to grow)
So how many actual man hours for 2400 drives?
What do you plan to do if you start getting corruption and bitrot? The complexity of S3 comes with a lot of hard guarantees for data integrity.
I did go through this a couple of years ago and we ended up with Ceph as well. Combined with reusing existing hardware that was very suboptimal for Ceph in several ways, it was a pretty bad experience, and in the end for our use case AWS was able to offer good enough pricing that the performance and reliability of S3 was a better deal than managing it ourselves.
If I were to do it again, I would make sure that I have the hardware setup that is ideal (plenty of SSDs for metadata, every spinning disk directly addressed as a single OSD, sound network topology and fast enough NICs) and probably use Rook instead of cephadm. The monitoring, configuration and documentation side of Ceph is however still quite sad; it was really hard to figure out why something is slow and how to tune things to be faster.
That said, if the Enterprise options are performing better or you at least get good support for tuning and optimizing then the alternatives could be well worth consideration.
I do feel sorry for the friends that got suckered into doing a bunch of grunt work for free though
I think everyone who showed up for a couple hours as part of the party had a good time tho, and the engraved hard drives we were giving out weren't cheap :p
- your storage in one place
- you own all backup,
-- off site backup (hot or cold)
- uptime worries
- maintenance drives
-- how many can fail before it is a problem
- maintenance machines
-- how many can fail before it is a problem
- maintenance misc/datacenter
- What to do if the electricity is cut off suddenly
-- do you have a backup provider?
-- diesel generators?
-- giant batteries?
-- Will the backup power also run cooling?
- natural disaster
-- earthquake
-- flooding
-- heatwave
- physical security
- employee training / (esp. if many quit)
- backup for networking (and power for it)
- employees on call 24/7
- protection against hacking
+++++
I agree that a lot of cloud providers overcharge by a lot, but doing it all yourself gives you a lot of headaches.
co-hosting would seem like a valuable partial mitigator.
Things like "protection from hacking" also don't come from AWS.
While disk failure rates are already explored in other threads here, there is one related thing that caught my interest. Disk failure in such a setup is not just the cost of a new disk plus the replacement cost (someone has to go there and change it!). There is also the inconvenience of dealing with failing requests. OK, you are willing to lose 5% of your dataset. But are your '200 lines of code' robust enough to handle such cases? What if a disk didn't fail but started to be veeeeery slow? Can your training process efficiently skip such bad objects? Do you have enough transparency to understand how much data you have already lost? Is it still below 5%? And so on and so forth.
I feel like this article was written right after they built this setup and before, let's say, 6 months of usage. Because I'm pretty sure their costs will go much higher than they calculated here. Especially if they start including hidden costs, like the work needed to be done on the training side.
Yes, the cost of self-hosting will most probably still be less than AWS (AWS is not cheap). But it might start to be comparable with the storage solutions of small ('neo') cloud providers if you buy GPUs there.
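To illustrate the kind of handling I mean, here is a rough sketch of a reader that skips objects that fail or time out and keeps count of the loss. It is not the article's code: the `fetch` callable, the key names, and `LossTracker` are invented, and it assumes `fetch` enforces its own timeout and raises `TimeoutError` or another `OSError` when a disk is dead or too slow.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator, List


@dataclass
class LossTracker:
    """Counts skipped objects so the loss rate is always visible."""
    budget: float = 0.05                      # the 5% mentioned above
    attempted: int = 0
    lost_keys: List[str] = field(default_factory=list)

    def record(self, key: str, ok: bool) -> None:
        self.attempted += 1
        if not ok:
            self.lost_keys.append(key)

    @property
    def loss_rate(self) -> float:
        return len(self.lost_keys) / self.attempted if self.attempted else 0.0

    @property
    def over_budget(self) -> bool:
        return self.loss_rate > self.budget


def read_tolerant(keys: Iterable[str],
                  fetch: Callable[[str], bytes],
                  tracker: LossTracker) -> Iterator[bytes]:
    """Yield the objects that can be read; skip and record the rest."""
    for key in keys:
        try:
            data = fetch(key)
        except (TimeoutError, OSError):       # dead disk or too-slow disk
            tracker.record(key, ok=False)
            continue
        tracker.record(key, ok=True)
        yield data


# Tiny demo with a fake fetch that pretends one object sits on a slow disk.
def fake_fetch(key: str) -> bytes:
    if key == "clip-b":
        raise TimeoutError("disk too slow")
    return key.encode()


tracker = LossTracker()
objects = list(read_tolerant(["clip-a", "clip-b", "clip-c", "clip-d"],
                             fake_fetch, tracker))
print(len(objects), tracker.loss_rate, tracker.over_budget)
```

The point of the tracker is the transparency question: the training loop can log `loss_rate` and alert when `over_budget` flips, instead of silently training on a shrinking dataset.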
Still fun for someone wanting to stick a computer in a DC though.
Networking is surprisingly hard but we also settled for the cheapo life QSFP instead of the new Cisco switches that do 800 Gbps that are coming. Great writeup.
One that would be fun is about the mechanics of layout and cabling and that sort of thing. Learning all that manually was a pain in the ass. It's not just written down somewhere and I should have done it when I was doing it but now I no longer am doing it and so can't provide good photos.
...but only if you compute the TCO. The bandwidth, peering, service contracts, available power, cooling, networking, rack capacity, half-decent smart hands, spare gear, etc, etc. The disks won't be the majority of your bill, and the logistics are difficult. It can still be cheaper than $CLOUD, but you have to deal with all the cost and complexity that comes with DIY, so do your homework first.
To adapt the old enterprise information infrastructure saying for startups:
"Nobody Ever Got Fired for Buying eBay"