Building the heap: racking 30 petabytes of hard drives for pretraining

(si.inc)

412 points · nee1r · 8mo ago · 274 comments

194 comments · 53 top-level

ttfvjktesd8mo ago· 12 in thread

The biggest part that is always missing in such comparisons is the employee salaries. In the calculation they give $354k of total cost per year. But now add the cost of staff in SF to operate that thing.

827a8mo ago

The biggest part missing from the opposing side is: Their view is very much rooted in the pre-Cloud hardware infrastructure world, where you'd pay sysadmins a full salary to sit in a dark room to monitor these servers.

The reality nowadays is: the on-prem staff is covered in the colo fees, which is split between everyone coloing in the location and reasonably affordable. The software-level work above that has massively simplified over the past 15 years, and effectively rivals the volume of work it would take to run workloads in the cloud (do you think managing IAM and Terraform is free?)

ttfvjktesd8mo ago

> do you think managing IAM and Terraform is free?

No, but I would argue that a SaaS offering, where the whole storage system is maintained for you, actually requires fewer maintenance hours than hosting 30 PB in a colo.

In terraform you define the S3 bucket and run terraform apply. Afterwards the company's credit card is the limit. Setting up and operating 30 PB yourself is an entirely different story.

g413n8mo ago

yeah colo help has been great, we had a power blip and without any hassle they covered the cost and installation of UPSes for every rack, without us needing to think abt it outside of some email coordination.

Aurornis8mo ago

Small startup teams can sometimes get away with datacenter management being a side task that gets done on an as-needed basis at first. It will come with downtime and your stability won't be anywhere near as good as Cloudflare or AWS no matter how well you plan, though.

Every real-world colocation or self-hosting project I've ever been around has underestimated its downtime and rate of problems by at least an order of magnitude. The amount of time lost to driving to the datacenter, waiting for replacement parts to arrive, and scrambling to patch over unexpected failure modes is always much higher than expected.

There is a false sense of security that comes in the early days of the project when you think you've gotten past the big issues and developed a system that's reliable enough. The real test is always 1-2 years later when teams have churned, systems have grown, and the initial enthusiasm for playing with hardware has given way to deep groans whenever the team has to draw straws to see who gets to debug the self-hosted server setup this time or, worse, drive to the datacenter again.

calvinmorrison8mo ago

> The amount of time lost to driving to the datacenter, waiting for replacement parts to arrive, and scrambling to patch over unexpected failure modes is always much higher than expected.

I don't have this experience at all. Our colo handled almost all work. the only time i ever went to the server farm was to build out whole new racks. Even replacing servers the colo handled for us at good cost.

Our reliability came from software not hardware, though of course we had hundreds of spares sitting by, the defense in depth (multiple datacenters, each datacenter having 2 'brains' which could hotswap, each client multiply backed up on 3-4 machines)...

servers going down were fairly commonplace, servers dying were commonplace. i think once we had a whole rack outage when the switch died, and we flipped it to the backup.

Yes these things can be done and a lot cheaper than paying AWS.

g413n8mo ago

fwiw our first test rack has been up for about a year now and the full cluster has been operational for training for the past ~6 months. having it right down the block from our office has been incredibly helpful, I am a bit worried abt what e.g. Fremont would look like if we expand there.

I think another big crux here is that there isn't really any notion of cluster-wide downtime, aside from e.g. a full datacenter power outage (which we've had ig, and now have UPSes in each rack kindly provided and installed by our datacenter). On the software/network level the storage isn't really coordinated in any manner, so failures of one machine only reflect as a degradation to the total theoretical bandwidth for training. This means that there's generally no scrambling and we can just schedule maintenance at our leisure. Last time I drew straws for maintenance I clocked a 30min round-trip to walk over and plug a crash cart into each of the 3 problematic machines to reboot and re-initialize and that was it.

Again having it right by the office is super nice, we'll need to really trust our kvm setup before considering anything offsite.

rtp4me8mo ago

For drive issues, this is easy. Have a stack of replacements on hand and just open a "remote-hands" ticket with the CoLo provider to swap out the drive. This can usually be done in 1-2hrs from opening the ticket.

For server issues; again, pretty easy. Just use iKVM/IPMI and iPXE to diagnose a faulty server. Again, using "remote-hands" from the CoLo provider can help fix problems if your staff does not have the skills.

kabdib8mo ago

I've built and maintained similar setups (10PB range). Honestly, you just shove disks into it, and when they fail you replace them. You need folks around to handle things like controller / infrastructure failure, but hopefully you're paying them to do other stuff, too.

g413n8mo ago

someone has to go and power-cycle the machines every couple months it's chill, that's the point of not using ceph

ttfvjktesd8mo ago

You are under the assumption that only Ceph (and similar complex software) requires staff, whereas plain 30 PB can be operated basically just by rebooting from time to time.

I think that anyone with actual experience of operating thousands of physical disks in datacenters would challenge this assumption.

datadrivenangel8mo ago

Assuming that they end up hiring a full-time ops person at 500k annually in total costs (250k base for a data center wizard), then that's 42k extra a month, or ~$70k a month all in. Still 200k per month lower than their next best offering.
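
As a rough check of the arithmetic in this comment, here is a minimal sketch. The $354k/year figure is the one quoted in the top comment of this thread; the ~$270k/month price for the "next best offering" is not stated anywhere in the thread and is back-derived here from the "~$70k" and "200k per month lower" figures, so treat it as an assumption.

```python
# Rough monthly cost comparison using figures quoted in this thread.
CLUSTER_COST_PER_YEAR = 354_000  # total yearly cost quoted from the article
OPS_HIRE_PER_YEAR = 500_000      # hypothetical fully loaded ops hire
# Assumption: implied by "~$70k" plus "200k per month lower"; not stated in the thread.
NEXT_BEST_PER_MONTH = 270_000

cluster_monthly = CLUSTER_COST_PER_YEAR / 12
ops_monthly = OPS_HIRE_PER_YEAR / 12
total_monthly = cluster_monthly + ops_monthly
savings_monthly = NEXT_BEST_PER_MONTH - total_monthly

print(f"cluster only:       ${cluster_monthly:,.0f}/month")
print(f"ops hire:           ${ops_monthly:,.0f}/month")
print(f"cluster + ops hire: ${total_monthly:,.0f}/month")
print(f"gap to next best:   ${savings_monthly:,.0f}/month")
```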

paxys8mo ago

So the drives are never going to fail? PSUs are never going to burn out? You are never going to need to procure new parts? Negotiate with vendors?

trebligdivad8mo ago· 11 in thread

The networking stuff seems....odd.

'Networking was a substantial cost and required experimentation. We did not use DHCP as most enterprise switches don’t support it and we wanted public IPs for the nodes for convenient and performant access from our servers. While this is an area where we would have saved time with a cloud solution, we had our networking up within days and kinks ironed out within ~3 weeks.'

Where does the switch choice come into whether you DHCP? Wth would you want public IPs.

mystifyingpoi8mo ago

It really feels like they wanted 30 PB of storage accessible over HTTP and literally nothing else. No redundancy, no NAT, dead simple nginx config + some code to track where to find which file on the filesystem. I like that.

matt-p8mo ago

This was not written by a network person, quite clearly. Hopefully it's just a misunderstanding, otherwise they do need someone with literally any clue about networks.

g413n8mo ago

yeah misunderstanding we'll update the post-- separately it's true that we aren't network specialists and the network wrangling was prob disproportionately hard for us/ shouldn't have taken so long.

giancarlostoro8mo ago

> Wth would you want public IPs.

So anyone can download 30 PB of data with ease of course.

pclmulqdq8mo ago

They didn't seem to want to use a router. Purpose-built 100 Gbps routers are a bit expensive, but you can also turn a computer into one.

flumpcakes8mo ago

Many switches are L3 capable, making them in effect a router. Considering their internet lines appear to be hooked up to their 100 Gbps switch, I'd guess this is one of the L3 ones.

buzer8mo ago

> Wth would you want public IPs.

Possibly to avoid needing NAT (or VPN) gateway that can handle 100Gbps.

xp848mo ago

No DHCP doesn't mean public IPs nor impact the need for NAT, it just means the hosts have to be explicitly configured with IP addresses, default gateways if they need egress, and DNS.

Those IPs you end up assigning manually could be private ones or routable ones. If private, authorized traffic could be bridged onto the network by anything, such as a random computer with 2 NICs, one of which is connected eventually to the Internet and one of which is on the local network.

If public, a firewall can control access just as well as using NAT can.

bombcar8mo ago

I don't know what they're doing, but Mikrotik can perhaps route that → https://mikrotik.com/product/ccr2216_1g_12xs_2xq#fndtn-testr... and is about the cost of their used thing.

And I think this would be a banger for IPv6 if they really "need" public IPs.

XorNot8mo ago

I mean generally above a certain size of deployment DHCP is much more trouble than it's worth.

DHCP is really only worth it when your hosts are truly dynamic (i.e. not controlled by you). Otherwise it's a lot easier to handle IP allocation as part of the asset lifecycle process.

Heck even my house IoT network is all static IPs because at the small scale it's much more robust to not depend on my home router for address assignment - replacing a smart bulb is a big enough event, so DHCP is solely for bootstrapping in that case.

At the enterprise level unpacking a server and recording the asset IDs etc is the time to assign IP addresses.
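
A minimal sketch of what "IP allocation as part of the asset lifecycle" could look like. The asset IDs, the subnet, and the `assign_static_ips` helper are all hypothetical and only illustrate the idea of deriving fixed addresses from an inventory instead of handing them out over DHCP.

```python
import ipaddress


def assign_static_ips(asset_ids, subnet, first_host_offset=10):
    """Map each asset ID to a fixed address in `subnet`, in sorted asset order."""
    hosts = list(ipaddress.ip_network(subnet).hosts())
    # Skip the first few host addresses, e.g. to leave room for gateways.
    usable = hosts[first_host_offset - 1:]
    ordered = sorted(asset_ids)
    if len(ordered) > len(usable):
        raise ValueError("subnet too small for the asset list")
    return {asset: str(ip) for asset, ip in zip(ordered, usable)}


# Hypothetical asset IDs recorded when the servers were unpacked.
inventory = ["rack01-node02", "rack01-node01", "rack02-node01"]
plan = assign_static_ips(inventory, "10.20.0.0/24")
for asset, ip in plan.items():
    print(asset, ip)
```

In practice the mapping would be written down once and kept with the asset record, since re-deriving it from a sorted list would shift addresses whenever a new asset sorts into the middle.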

Symbiote8mo ago

I have static, public IPs across 80 or so servers.

It gets set approximately once when the server's automated Ubuntu installation runs, and I never think about it.

> Where does the switch choice come into whether you DHCP?

Perhaps from home routers, which include it.

> Wth would you want public IPs.

Why wouldn't you? They have a firewall.

g413n8mo ago· 9 in thread

No mention of disk failure rates? curious how it's holding up after a few months

dylan6048mo ago

I've mentioned this story before, but we had massive drive failures when bringing up multiple disk arrays. We got them racked on a Friday afternoon, and then I wrote a quick and dirty shell script to read/write data back and forth between them over the weekend, set to kick in after they finished striping the RAID arrays. By quick and dirty I mean there was no logging, just a bunch of commands saved as .sh. Came in on Monday to find massive failures in all of the arrays, but no insight into whether they failed during the stripe or during the stress test. It was close to a 50% failure rate. Turned out to be a bad batch from the factory. Multiple customers of our vendor were complaining. All the drives were replaced by the manufacturer. It just delayed the storage being available to production. After that, not one of them failed in the next 12 months before I left for another job.

jeffrallen8mo ago

> next 12 months before I left for another job

Heh, that's a clever solution to the problem of managing storage through the full 10 year disk lifecycle.

bayindirh8mo ago

The disk failure rates are very low when compared to a decade ago. I used to change more than a dozen disks every week a decade ago. Now it's an eyebrow raising event which I seldom see.

I think following Backblaze's hard disk stats is enough at this point.

gordonhart8mo ago

Backblaze reports an annual failure rate of 1.36% [0]. Since their cluster uses 2,400 drives, they would likely see ~32 failures a year (extra ~$4,000 annual capex, almost negligible).

[0] https://www.backblaze.com/cloud-storage/resources/hard-drive...
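
The estimate above is easy to reproduce. This sketch uses only the numbers quoted in the comment (1.36% AFR, 2,400 drives, ~$4,000 a year); the per-drive replacement cost is not stated and is simply what those numbers imply. It also assumes Backblaze's fleet-wide rate carries over to this cluster's drives, which may not hold.

```python
# Expected yearly drive failures from an annualized failure rate (AFR).
DRIVES = 2_400
ANNUAL_FAILURE_RATE = 0.0136        # 1.36%, as quoted from Backblaze
ANNUAL_REPLACEMENT_BUDGET = 4_000   # USD, as quoted in the comment

expected_failures = DRIVES * ANNUAL_FAILURE_RATE
implied_cost_per_drive = ANNUAL_REPLACEMENT_BUDGET / expected_failures

print(f"expected failures per year: {expected_failures:.1f}")
print(f"implied cost per replacement drive: ${implied_cost_per_drive:.0f}")
```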

cjaackie8mo ago

They mentioned the cluster uses used enterprise drives. I can see the desire to save money, but agree that is going to be one expensive mistake down the road.

I should also note personally for home cluster use, I learned quickly that used drives didn’t seem to make sense. Too much performance variability.

jms558mo ago

If I remember correctly, most drives either:

1. Fail in the first X amount of time

2. Fail towards the end of their rated lifespan

So buying used drives doesn't seem like the worst idea to me. You've already filtered out the drives that would fail early.

Disclaimer: I have no idea what I'm talking about

guywithahat8mo ago

Used drives make sense if maintaining your home server is a hobby. It's fun to diagnose and solve problems in home servers, and failing drives give me a reason to work on the server. (I'm only half-joking, it's kind of fun)

g413n8mo ago

in a datacenter context failure rates are just a remote-hands recurring cost so it's not too bad with front-loaders

e.g. have someone show up to the datacenter with a grocery list of slot indices and a cart of fresh drives every few months.

ClaireBookworm8mo ago

good point

jimmytucson8mo ago· 9 in thread

Just wanted to say, thanks for doing this! Now the old rant...

I started my career when on-prem was the norm and remember so much trouble. When you have long-lived hardware, eventually, no matter how hard you try, you just start to treat it as a pet and state naturally accumulates. Then, as the hardware starts to be not good enough, you need to upgrade. There's an internal team that presents the "commodity" interface, so you have to pick out your new hardware from their list and get the cost approved (it's a lot harder to just spend a little more and get a little more). Then your projects are delayed by them racking the new hardware and you properly "un-petting" your pets so they can respawn on the new devices, etc.

Anyways, when cloud came along, I was like, yeah we're switching and never going back. Buuut, come to find out that's part of the master plan: it's a no-brainer good deal until you and everyone in your org/company/industry forgets HTF to rack their own hardware, and then it starts to go from no-brainer to brainer. And basically unless you start to pull back and rebuild that muscle, it will go from brainer to no-brainer bad deal. So thanks for building this muscle!

g413n8mo ago

we're in a pretty unique situation in that very early on we fundamentally can't afford the hyperscaler clouds to cover operations, so we're forced to develop some expertise. turned out to be reasonably chill and we'll prob stick with it for the foreseeable future, but we have seen a little bit of the state-creep you mention so tbd.

nodja8mo ago

Yeah from memory on-prem was always cheaper, it just removed a lot of logistic obstacles and made everything convenient under one bill.

IIRC the wisdom of the time cloud started becoming popular was to always be on-prem and use cloud to scale up when demand spiked. But over time temporarily scaling up became permanent, and devs became reliant on instantly spawning new machines for things other than spikes in demand and now everyone defaults to cloud and treats it as the baseline. In the process we lost the grounding needed to assess the real cost of things and predictably the cost difference between cloud and on-prem has only widened.

luhn8mo ago

> IIRC the wisdom of the time cloud started becoming popular was to always be on-prem and use cloud to scale up when demand spiked.

I've heard that before but was never able to make sense of it. Overflowing into the cloud seems like a nightmare to manage, wouldn't overbuilding on-prem be cheaper than paying your infra team to straddle two environments?

matt-p8mo ago

Docker is amazing for forcing the machines not to be pets. Seriously, a racked server is just another k3s or k8s node (or whatever) and doesn't get the choice or ability of being petted. It's so nice. You could maybe have said the same about VMs, but not really: the VM just became the pet. OK, you could at least image/snapshot it, but it's not the same.

iJohnDoe8mo ago

It’s interesting everyone having different experiences and those experiences drive what they do.

I would never dream of running Docker in production. It seems so overly complicated. Also, since day one, I could never understand using a public registry for mission critical stuff. When I was learning Docker, I would unplug the network cable so I wouldn’t accidentally push my container online somewhere with all my data.

I totally get the concept at scale. I also get the concept of just shipping an application in a container. I also get the concept of self-hosting of just give me the container so I don’t have to think about how it all works.

However, the complexity of building the container, cleanup, deleting entries, environment variables, no SSH availability, even on Railway in the beginning, ambiguous where your container needs to be to even get it somewhere. Public registry or private registry.

Certainly most of it is my lack of knowledge of not sticking with it.

Just give me a VM and some firewall rules. Cloning VMs can be automated in so many different ways.

/rant

doublerabbit8mo ago

I've found Docker is as much of a monstrous pet.

Docker is a monster that you have to treat as a pet. You've still got to pet it through stages of updating, monitoring, snapshots and networking. When the internal system breaks it's no different to a server collapsing.

Snapshots are a haircut for the monster, useful but can make things worse.

theideaofcoffee8mo ago

I'm not op, but thanks for this. Like I mentioned in another comment, the wholesale move to the cloud has caused so many skills to become atrophied. And it's good that someone is starting to exercise that skill again, like you said. The hyperscalers are mostly to blame for this, the marketing FUD being that you can't possibly do it yourself, there are too many things to keep track of, let us do it (while conveniently leaving out how eye-wateringly expensive they are in comparison).

tempest_8mo ago

The other thing the cloud does not let you do is make trade offs.

Sometimes you can afford not to have triple redundant 1000GB network or a simple single machine with raid may have acceptable down time.

ares6238mo ago

Wanna see us do it again?

jillesvangurp8mo ago· 9 in thread

AWS obviously does the same but better. That's why they are so rich. Their cost is a tiny percentage of their revenue. They buy cheap servers, and then run lots of vms on them. Each of which delivers 10s/100s of $ per month. That server pays for itself in revenue within weeks/months. And it will be in service until it stops working which could be over five years. Same with storage, networking, gpus, etc.

They've spent years optimizing everything so their monthly costs are going to be much lower than what these guys managed on their first attempt. They'll be using less energy. They run their own internet backbones and infrastructure, they design their own hardware and source components directly from the best suppliers, they have exclusive deals with energy providers, etc. Everything these guys did, AWS does way better. And yet they charge 40x more. AWS at cost price would probably be 60-80x less than what they charge; if not more. Cloudflare undercuts them a bit because they are smaller but they can do the same things. So do MS, Google, and everybody else.

This market is ripe for disruption. There should not be a need to shovel hundreds of billions per year into AWS revenue for the industry. The same business operating at 20% margins would be a game changer. And most of this stuff is commodity stuff. Why is there not more competition in this space driving pricing down aggressively? What's keeping competitors off the market?

solatic8mo ago

There's a long tail of smaller clouds: Digital Ocean, Oracle, IBM, Linode/Akamai, not to mention server providers like OVH and Hetzner, and not to mention Chinese clouds like Tencent and Alibaba (which both have US regions), not to mention PaaS providers who run on their own hardware like Fly.io.

I think it's very hard to make a claim that the market is not price-competitive. The problem is that most decision makers don't actually prioritize price, they prioritize support and the larger ecosystem. It's easy to find engineers with experience on the big 3 clouds and they will be able to pick up where the previous engineers left off. No CTO goes to sleep worrying about whether being vendor-locked into AWS will result in catastrophic business failure tomorrow due to catastrophic hardware failure. There is a larger ecosystem - observability, FinOps tooling, cloud security tooling, managed databases, that are virtually guaranteed to support AWS, sometimes support GCP and Azure, and almost never support any of the other clouds.

It's questionable whether the current situation is really due to companies like Oracle and IBM being unable to compete on price and make strategic partnership deals to build out ecosystem support for their clouds. I think it's more likely that AWS/GCP/Azure "won" the cloud market, and that if regulators were worth a damn, they'd start to address the market concentration instead of ignoring it.

irjustin8mo ago

> This market is ripe for disruption. There should not be a need to shovel hundreds of billions per year into AWS revenue for the industry. The same business operating at 20% margins would be a game changer.

Honestly, I don't understand this line of thinking. You're not alone in it either, but it always reads so naive.

Like you can magically hand wave or it'd be so easy and get a competitor who can do it just as well and/or cheaper than AWS. But somehow these statements seem to ignore that you can't. Otherwise Digital Ocean would've, or cloudflare or <name-your-linnode-rackspace-startup>.

The industry funnels billions to become vendor locked in because AWS is simply that good - bar none.

vasco8mo ago

Even Google kept considering killing GCP for years. Meta is going to burn a bunch more billions on cartoons that play in your glasses before they get into cloud hosting and the other existing competitors have huge issues crossing the chasm of "everything is an API" that AWS has.

computably8mo ago

> AWS at cost price would probably be 60-80x less than what they charge; if not more.

If you look at https://ir.aboutamazon.com/news-release/news-release-details... , it says in 2024,

> AWS segment sales increased 19% year-over-year to $107.6 billion.

> AWS segment operating income was $39.8 billion, compared with operating income of $24.6 billion in 2023.

So about 59% margin, relative to costs. Everybody undercutting AWS is likely doing so at close enough to 20% margins that it makes no sense to fund a startup in the same space.
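
The 59% figure is operating income divided by costs, not by revenue. A quick sketch using only the two numbers quoted above shows both ways of expressing it:

```python
# AWS segment figures for 2024 as quoted above, in billions of USD.
revenue = 107.6
operating_income = 39.8

costs = revenue - operating_income
margin_on_revenue = operating_income / revenue  # conventional operating margin
markup_on_costs = operating_income / costs      # "margin relative to costs"

print(f"costs:             ${costs:.1f}B")
print(f"margin on revenue: {margin_on_revenue:.1%}")
print(f"markup on costs:   {markup_on_costs:.1%}")
```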

rafaelmn8mo ago

AWS is everything from GPU farms and AI services to S3.

I would be shocked if S3 had less than a 100% margin at sticker price.

These guys had one of the most important prerequisites to make going on-prem easy - they didn't care about cross-site or reliability.

znpy8mo ago

> AWS segment operating income was $39.8 billion

keep in mind that cost there is likely to also be including personnel. probably a significant fraction if you consider how many employees amazon's aws division has.

silisili8mo ago

Inertia. It's the new 'nobody gets fired for buying..'

A previous project I worked on had relatively little traffic, and AWS costs were rather insane for that.

I proposed exploring OVH or DO to probably get costs down to 2 digits per month. Upper management would hear nothing of it - AWS was what they wanted, costs be damned. They were more protecting their own jobs than making a technical decision, I think.

zer00eyz8mo ago

AWS costs are insane for every project.

> Upper management would hear nothing of it

No one knows how to plan ahead any more. It's all "agile" and hardware (and budgeting for it) isn't something most in management are capable of doing any more.

There is also then justifying the CapEx on a 5 year amortization schedule... the thing is even if you borrow that money at current rate (7 percent) you can still come out far ahead of AWS... It's a lot of math, and a lot of accounting (and the accountability that comes with it).
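
A minimal sketch of the "lot of math" for the financing side, assuming a standard fixed-payment loan with monthly compounding. The 7 percent rate and the 5-year term come from the comment; the principal is a hypothetical placeholder, not a figure from the thread.

```python
def monthly_payment(principal, annual_rate, months):
    """Fixed monthly payment on an amortized loan (standard annuity formula)."""
    if annual_rate == 0:
        return principal / months
    r = annual_rate / 12  # assumed monthly compounding
    return principal * r / (1 - (1 + r) ** -months)


capex = 500_000  # hypothetical hardware spend, not from the thread
payment = monthly_payment(capex, 0.07, 60)
total_interest = payment * 60 - capex

print(f"monthly payment: ${payment:,.2f}")
print(f"interest over 5 years: ${total_interest:,.2f}")
```

The monthly payment is the number to set against a recurring cloud bill for the same capacity.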

Your average CTO just doesn't have these skills.

Maxion8mo ago

Unless the savings would be more than 100k Eur / 300k++ USD (i.e. the total cost of one employee), it's not really worth it. Even then, moving to new infrastructure carries high risk for business disruptions which can cause an even bigger dip in revenue.

The cloud providers have definitely optimized their pricing for maximum profit extraction. Costs are high, but in many cases not high enough to actually warrant changing infrastructure to cheaper alternatives.

Sticking with AWS / Azure / GCP carries other benefits, too. You're more likely to find engineers who are experienced with those cloud platforms over, say, OVH.

mschuster918mo ago· 7 in thread

Shows how crazy cheap on prem can be. tips hat

nee1rOP8mo ago

tips hat back

stackskipton8mo ago

Not included is overhead of dealing with maintenance. S3/R2 generally don’t require OPS type dedicated to care and feeding. This type of setup will likely require someone to spend 5 hours a week dealing with it.

nee1rOP8mo ago

True, this is a large reason why we chose to have the datacenter a couple blocks away from the office.

mschuster918mo ago

I once had about three racks full of servers under my control, admittedly they weren't a ton of disks, but still the hardware maintenance effort was pretty much negligible over a few years (until it all went to the cloud).

The majority of server wrangling work I spent dealing with OS updates and, most annoyingly, OpenStack. But that's something you can't escape even if you run your stuff in the cloud...

hanikesn8mo ago

Why 5h a week? Just for hardware?

dpe828mo ago

a) 5hrs/week is negligible compared to that potential AWS bill.

b) They seem tolerant of failures so it's not going to be anything like 5hrs/week of physical maintenance. It will be bursty though (eg. box died, time to replace it...) but assuming they have spares of everything sitting around / already racked it shouldn't be a big deal.

buckle80178mo ago

And this is actually relatively expensive.

yread8mo ago· 6 in thread

You could get pretty close to a cost of $1/TB/month using Hetzner's SX135 with 8x22TB, so 140TB in raidz1, for 240 EUR. Maybe you get a better rate if you rent 200 of them. Someone else takes care of a lot of risks and you can sleep well at night

g413n8mo ago

yeah it's totally plausible that we go with something like this in the future. We have similar offers where we could separate out either the financing, the build-out, or both and just do the software.

(for Hetzner in particular it was a massive pain when we were trying to get CPU quotas with them for other data operations, and we prob don't want to have it in Europe, but it's been pretty easy to negotiate good quotes on similar deals locally now that we've shown we can do it ourselves)

mx7zysuj4xew8mo ago

You cannot use hetzner for anything serious.

They'd most likely claim abuse and delete your data wholesale without notice

fapjacks8mo ago

100% this. Hetzner has no problems completely blowing away whatever you've got running for arbitrary reasons. And their support is incredibly bad.

nodja8mo ago

I don't think Hetzner provides locations in SF. Those 100GBit connections don't do much if they need to connect outside the city the rest of the equipment is in, but maybe peering has gotten better and my views are outdated.

fuzzylightbulb8mo ago

You're good. The speed of light through a glass fiber is still just as slow as it ever was.

lostmsu8mo ago

Your math does not math. It is more like $2/TB/month with minimal redundancy.
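
The per-terabyte figures in this thread can be checked roughly as follows. This sketch uses only the numbers quoted at the top of the thread (8x22TB, raidz1, 240 EUR a month, 140TB) and the 30 PB target from the article title. It assumes raidz1 costs exactly one drive of capacity and ignores filesystem overhead, setup fees, and currency conversion.

```python
import math

DRIVES_PER_SERVER = 8
DRIVE_TB = 22
PRICE_EUR_PER_MONTH = 240
QUOTED_USABLE_TB = 140   # figure given in the comment
TARGET_TB = 30_000       # 30 PB

raw_tb = DRIVES_PER_SERVER * DRIVE_TB
usable_tb = (DRIVES_PER_SERVER - 1) * DRIVE_TB  # raidz1: one drive of parity

eur_per_usable_tb = PRICE_EUR_PER_MONTH / usable_tb
eur_per_quoted_tb = PRICE_EUR_PER_MONTH / QUOTED_USABLE_TB
servers_needed = math.ceil(TARGET_TB / usable_tb)

print(f"raw: {raw_tb} TB, usable after raidz1: {usable_tb} TB")
print(f"EUR per TB per month: {eur_per_usable_tb:.2f} to {eur_per_quoted_tb:.2f}")
print(f"servers for 30 PB: {servers_needed}")
```

Under these assumptions the price lands between the two claims made in this thread, so which one is closer depends on the exchange rate and on how much redundancy beyond raidz1 is wanted.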

nharada8mo ago· 5 in thread

So how do they get this data to the GPUs now...? Just run it over the public internet to the datacenter?

nee1rOP8mo ago

yeah, exactly! we have a 100G uplink, and then we use nginx secure links that we then just curl from the machines using HTTP. (funnily HTTPS adds overhead so we just pre-sign URLs)
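
For readers unfamiliar with nginx secure links, here is a minimal sketch of how a pre-signed URL could be generated on the client side. It follows the scheme from the nginx `secure_link` module documentation (base64url-encoded MD5 with padding stripped), but the hashed expression, the `md5` and `expires` argument names, the secret, and the host are assumptions for illustration, not the authors' actual configuration.

```python
import base64
import hashlib
import time


def sign_url(base_url, path, secret, ttl_seconds, now=None):
    """Build an expiring link for nginx's secure_link module.

    Assumed server config (hypothetical):
        secure_link $arg_md5,$arg_expires;
        secure_link_md5 "$secure_link_expires$uri <secret>";
    """
    expires = int(now if now is not None else time.time()) + ttl_seconds
    payload = f"{expires}{path} {secret}".encode()
    digest = hashlib.md5(payload).digest()
    # nginx expects URL-safe base64 with the trailing '=' padding removed.
    token = base64.urlsafe_b64encode(digest).rstrip(b"=").decode()
    return f"{base_url}{path}?md5={token}&expires={expires}"


url = sign_url("http://storage.example", "/shard/000001.tar",
               "not-a-real-secret", ttl_seconds=3600, now=1_700_000_000)
print(url)
```

The training machines would then fetch such a URL with a plain HTTP client such as curl, with nginx rejecting links whose hash does not match or whose expiry has passed.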

g413n8mo ago

7.5k for zayo 100gig so that's like half of the MRC

bayindirh8mo ago

They can rent a dark fiber for themselves for that distance, and it'll be cheap.

However, as they noted they use 100gbps capacity from their ISP.

nee1rOP8mo ago

We want to get darkfiber from the datacenter to the office. I love 100Gbps

geor9e8mo ago

Does San Francisco really still have dark fiber? That 90s bubble sure did overshoot demand.

azinman28mo ago· 5 in thread

Where does one acquire 90M hours of video without being YouTube?

Barbing8mo ago

Anywhere as long as you can avoid “legal/practice/business slog”. Success is defined by $1.5b settlements.

:) just kidding but also curious where besides torrents

hengheng8mo ago

My guess is automated surveillance, which is also where this whole play has to be headed.

fuzzfactor8mo ago

Seems like that would be a good niche, not only for avoiding massive copyright considerations.

Also, it's some of the most boring footage where there's overwhelming amounts that's about the least desirable thing for humans to sit and watch every minute of.

Why send a human to do a machine's job?

Hobadee8mo ago

pr0n

NitpickLawyer8mo ago

"I swear I'm seeding those just so I get my ratio up" :)

coleca8mo ago· 4 in thread

For a workload of that size you would be able to negotiate private pricing with AWS or any cloud provider, not just CloudFlare. You can get a private pricing deal on S3 with as little as half a PB. Not saying that your overall expenses would be cheaper w/a CSP than DIY, but it's not exactly an apples to apples comparison of taking full retail prices for the CSPs against eBayed equipment and free labor (minus the cost of the pizza).

g413n8mo ago

egress costs are the crux for AWS and they didn't budge when we tried to negotiate that with them, it's just entirely unusable for AI training otherwise. I think the cloudflare private quote is pretty representative of the cheaper end of managed object-bucket storage.

obv as we took on this project the delta between our cluster and the next-best option got smaller, in part bc the ability to host it ourselves gives us negotiating leverage, but managed bucket products are fundamentally overspecced for simple pretraining dumps. glacier does a nice job fitting the needs of archival storage for a good cost, but there's nothing similar for ML needs atm.

epistasis8mo ago

What sort of deal are you talking about? Would it be 50% or more?

master_crab8mo ago

You can get way higher than 50% discounts with AWS (or any cloud) depending upon the scale of the buy.

oasisbob8mo ago

Not for that minimum 0.5PB volume.

Even at 10PB, the storage commit discounts won't be anywhere near 50%. Probably more like 10-20%, if that.

not--felix8mo ago· 4 in thread

But where do you get 90 million hours worth of video data?

_1tem8mo ago

And not just any video data, they specifically mentioned screen recordings for agentic computer uses. A very specific kind of video. My guess is they have a partnership with someone like Rewind.ai

Barbing8mo ago

“For your privacy, your screen and audio recordings are stored locally and NEVER leave your Mac.”

Tell me it’s only someone _like_ Rewind and not actually them! Quoting from the Privacy page they link in their header.

bobbob19218mo ago

Assuming my calculation is accurate, 90,000,000 hours of video using around 30 PB comes to an average bit rate of about 760k. (Hard to guess though bc I doubt they’re using up all the space they provision day1)

So my guess is either CCTV type of footage where there’s large gaps of motion / high GOP / big codec gains - or something like desktop recordings which are generally very low bit rate even though they can be high res. At that bitrate I can’t imagine it’s something like YouTube video. (Unrelated to the bitrate maybe it’s something like all older public domain videos). I would love to have an idea of what type of videos they are using (just out of curiosity)
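A quick way to sanity-check the bitrate estimate above, as a minimal sketch. It assumes the full 30 PB holds exactly 90 million hours, and it shows both decimal petabytes and binary pebibytes because the thread does not say which is meant; the function name is only illustrative.

```python
def average_bitrate_kbps(total_bytes: float, hours: float) -> float:
    """Average bitrate in kilobits per second for a given volume of video."""
    total_bits = total_bytes * 8
    seconds = hours * 3600
    return total_bits / seconds / 1000

PB = 10**15   # decimal petabyte (assumption)
PIB = 2**50   # binary pebibyte (assumption)

print(round(average_bitrate_kbps(30 * PB, 90e6)))   # 741
print(round(average_bitrate_kbps(30 * PIB, 90e6)))  # 834
```

Either way the result lands in the same ballpark as the figure quoted above, which is low for general web video and plausible for screen recordings.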

conception8mo ago

Arrr matey

Scramblejams8mo ago· 3 in thread

Fun piece, thanks to the author. But for vicarious thrills like this, more pictures are always appreciated!

echelon8mo ago

If the authors chime in, I'd like to ask what "Standard Intelligence PBC" does.

Is it a public benefit corp?

What are y'all building?

nee1rOP8mo ago

We did want more pictures!! Recently bought a Sony A7III to capture more fun moments like this.

We're working on pretraining computer action models from the ground up, hence the pretraining data cluster. We're a public benefit corp because we think it's important for AGI to be built in the public's interest + are planning on automating a lot of the work done on computers!

kid648mo ago

Many colos disallow photography.

boulos8mo ago· 3 in thread

It's quite cheap to just store data at rest, but I'm pretty confused by the training and networking set up here. It sounds like from other comments that you're not going to put the GPUs in the same location, so you'll be doing all training over X 100 Gbps lines between sites? Aren't you going to end up totally bottlenecked during pretraining here?

cornholio8mo ago

30PB / 100Gbps comes down to about a month, 4 links would give you a week, so that seems quite acceptable for a training run, especially since you can overlap the initial loading of the array with the first training, i.e. train as data becomes available.

It goes without saying any data pre-processing needs to be done before writing, at the storage site, or on the training GPUs.
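The transfer-time estimate above can be reproduced with a few lines. This is a rough sketch that assumes decimal petabytes and links running at full line rate with no protocol overhead, so real transfers would take somewhat longer.

```python
def transfer_days(total_bytes: float, link_gbps: float, links: int = 1) -> float:
    """Days needed to move total_bytes over `links` parallel links at line rate."""
    total_bits = total_bytes * 8
    bits_per_second = link_gbps * 1e9 * links
    return total_bits / bits_per_second / 86_400

print(f"{transfer_days(30e15, 100):.1f} days")           # 27.8 days
print(f"{transfer_days(30e15, 100, links=4):.1f} days")  # 6.9 days
```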

g413n8mo ago

yeah we just have the 100gig link, atm that's about all the gpu clusters can pull but we'll prob expand bandwidth and storage as we scale.

I guess worth noting that we do have a bunch of 4090s in the colo and it's been super helpful for e.g. calculating embeddings and such for data splits.

mwambua8mo ago

How did you arrive at the decision of not putting the GPU machines in the colo? Were the power costs going to be too high? Or do you just expect to need more physical access to the GPU machines vs the storage ones?

pronoiac8mo ago· 3 in thread

I wonder if they'll go with "toploaders" - like Backblaze Storage Pods - later. They have better density and faster setup, as they don't have to screw in every drive.

They got used drives. I wonder if they did any testing? I've gotten used drives that were DOA, which showed up in tests - SMART tests, short and long, then writing pseudorandom data to verify capacity.
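The "write pseudorandom data to verify capacity" step mentioned above could look something like the sketch below. It is only an illustration: the function names are made up, it works on any file path, and pointing it at a raw block device would destroy whatever is on it. SMART self-tests themselves are usually run separately with a tool such as smartmontools' `smartctl`.

```python
import hashlib
import os

BLOCK_SIZE = 1 << 20  # 1 MiB per block

def pseudorandom_block(seed: bytes, index: int, size: int = BLOCK_SIZE) -> bytes:
    """Deterministic pseudorandom bytes for block `index`, reproducible from the seed."""
    return hashlib.shake_256(seed + index.to_bytes(8, "big")).digest(size)

def write_pattern(path: str, num_blocks: int, seed: bytes) -> None:
    """Fill the target with a seed-derived pattern and force it out to the device."""
    with open(path, "wb") as f:
        for i in range(num_blocks):
            f.write(pseudorandom_block(seed, i))
        f.flush()
        os.fsync(f.fileno())

def verify_pattern(path: str, num_blocks: int, seed: bytes) -> list[int]:
    """Return the indices of blocks that do not read back as written."""
    bad = []
    with open(path, "rb") as f:
        for i in range(num_blocks):
            if f.read(BLOCK_SIZE) != pseudorandom_block(seed, i):
                bad.append(i)
    return bad
```

Because every block is different and derived from its index, a drive that misreports its capacity and silently wraps writes around shows up as mismatched blocks. On a real drive the read-back should bypass the page cache (for example by testing more data than fits in RAM), otherwise the check may only exercise memory.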

g413n8mo ago

yeah we're very interested in trying toploaders, we'll do a test rack next time we expand and switch to that if it goes well.

w.r.t. testing the main thing we did was try to buy a bit from each supplier a month or two ahead of time, so by the time we were doing the full build that rack was a known variable. We did find one drive lot which was super sketchy and just didn't include it in the bulk orders later. diversity in suppliers helps a lot with tail risk

joshvm8mo ago

"don't have to screw in every drive" is relative, but at least tool-less drive carriers are a thing now.

A lot of older toploaders from vendors like Dell are not tool-free. If you bought vendor drives and one fails, you RMA it and move on. However if you want to replace failed drives in the field, or want to go it alone from the start with refurbished drives... you'll be doing a lot of screwing. They're quite fragile and the plastic snaps easily. It's pretty tedious work.

tempest_8mo ago

Used Supermicro machines of this generation are very cheap (all things considered)

https://www.theserverstore.com/supermicro-superstorage-ssg-6...

urbandw311er8mo ago· 3 in thread

Well done! I love the honest write up and the “can do” attitude. Must have been a lot of fun too. Out of interest why do you think you made the mistake of buying 20x more drives than you needed instead of the denser storage that you mention? Was there a reason you opted for this?

g413n8mo ago

I think <2x more drives than needed, not 20x (24 vs 14TB), but the racks holding the drives could've been denser. Around the same cost in any case and our colo doesn't charge for space, so it's not a big deal and we were just going with what we were familiar with, but something to try.

urbandw311er8mo ago

Oops sorry, my bad! Great to read all about it - good luck with the project.

Tepix8mo ago

He did mention that it would have been a higher up-front cost.

fragmede8mo ago· 3 in thread

My question isn't why do it yourself. A quick back of the envelope math shows AWS being much more expensive. My question is why San Francisco? It's one of the most expensive real estate markets in the US (#2 residential, #1 commercial), and electricity is expensive. $0.71/kWh peak residential rate! A jaunt down 280 to San Jose's gonna be cheaper, at the expense of having to take that drive to get hands on. But I'm sure you can find someone who's capable of running a DC that lives in San Jose and needs a job so the SF team doesn't have to commute down to South Bay. Now obviously there's something to be said for having the rack in the office, I know of at least two (three, now) in San Francisco, it just seems like a weird decision if you're already worrying about money to the point of not using AWS.

hnav8mo ago

Article says their recurring cost is $17.5k, they'll spend at least that amount in terms of human time tending to their cluster if they have to drive to it. It's also a question of magnitudes, going from $0.5m/mo to $0.05m/mo (hard costs plus the extra headaches of dealing with cluster) is an order of magnitude, even if you could cut another order of magnitude it wouldn't be as impactful.

renewiltord8mo ago

Problem when you self-roll this is that you inevitably make mistakes and the cycle time of going down and up ruins everything. Access trumps everything.

You can get a DC guy but then he doesn't have much to do post setup and if you contract that you're paying mondo dollars anyway to get it right and it's a market for lemons (lots of bullshitters out there who don't know anything).

Learned this lesson painfully.

g413n8mo ago

it's not just in sf it's across the street from our office

this has been incredibly nice for our first hardware project, if we ever expand substantially then we'd def care more about the colo costs.

drnick18mo ago· 3 in thread

Everyone should give AWS the middle finger and start doing this. Beyond cost, it's a matter of sovereignty over one's computing and data.

twoodfin8mo ago

If this is a real market, I’d expect AWS to introduce S3 Junkyard with a similar durability and cost structure.

They probably still won’t budge on the egress fees.

g413n8mo ago

we would be so down to buy s3 junkyard tbh we were going around begging various storage clouds to offer us this before giving up and building it ourselves

Barbing8mo ago

>S3 Junkyard

There it is, the answer to how to mitigate brand damage when risking distance between themselves and some of those 9s.

landryraccoon8mo ago· 3 in thread

Their electricity costs are $10K per month or about $120K per year. At an interest rate of 7% that's $1.7M of capital tied up in power bills.

At that rate I wonder if it makes sense to do a massive solar panel and battery installation. They're already hosting all of their compute and storage on prem, so why not bring electricity generation on prem as well?
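The $1.7M figure above appears to treat the power bill as a perpetuity, i.e. the lump sum that would fund $120K per year forever at 7%. A minimal sketch of that reading, alongside a finite 3-year horizon for comparison (the 3-year horizon is an assumption added here for contrast):

```python
def perpetuity_value(annual_cost: float, rate: float) -> float:
    """Present value of paying annual_cost every year forever."""
    return annual_cost / rate

def annuity_value(annual_cost: float, rate: float, years: int) -> float:
    """Present value of paying annual_cost at the end of each year for `years` years."""
    return annual_cost * (1 - (1 + rate) ** -years) / rate

print(round(perpetuity_value(120_000, 0.07)))  # 1714286
print(round(annuity_value(120_000, 0.07, 3)))  # roughly 315 thousand
```

Which number is the relevant one depends entirely on how long the cluster is expected to keep drawing power.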

datadrivenangel8mo ago

At 120K per year over the three year accounting life of the hardware, that's 360k... how do you get to 1.7M?

landryraccoon8mo ago

It seems unlikely to me that they'll never have to retrain their model to account for new data. Is the assumption that their power usage drastically drops after 3 years?

Unless they go out of business in 3 years that seems unlikely to me. Is this a one-off model where they train once and it never needs to be updated?

moffkalast8mo ago

Let's just say we're not seeing all of these sudden private nuclear reactor investments for no reason.

yodon8mo ago· 3 in thread

Any startup that has enough money to casually buy a two-letter domain name has too much money, period. Kind of like counting the number of Aeron chairs at startups of old. Not a good sign.

638mo ago

Looks like unregistered two letter .inc domains are going for $2300/yr. Certainly <5% of the cost of a single developer.

lmm8mo ago

It's a .inc domain name, are those worth anything?

dangoodmanUT8mo ago

it's .inc... those aren't expensive

miniman13378mo ago· 3 in thread

Used Disks, No DR, not exactly a real shoot out.

nee1rOP8mo ago

True, though this is specifically for pretraining data (S3 wouldn't sell us used disk + no DR storage).

Sanzig8mo ago

I do appreciate the scrappiness of your solution. Used drives for a storage cluster is like /r/homelab on steroids. And since it's pretraining data, I suppose data integrity isn't critical.

Most venture-backed startups would have just paid the AWS or Cloudflare tax. I certainly hope your VCs appreciate how efficient you are being with their capital :)

p_ing8mo ago

You're in a seismically active part of the world. Will the venture last in a total loss scenario?

zparky8mo ago· 3 in thread

$125/disk, 12k/mo depreciation cost which i assume means disk failures, so ~100 disks/mo or 1200/yr, which is half of their disks a year - seems like a lot.

AnotherGoodName8mo ago

It's an accounting term. You need to report the value of assets of your company each reporting cycle. This allows you to report company profit more accurately since the 2400 drives are likely not worth what the company originally paid. It's stated as a tax write-off but people get confused with that term (they think X written off == X less tax paid). It's better to correctly state it as a way to more accurately report profit (which may end up with less company tax paid but obviously not 1:1 since company tax is not 100%).

So anyway you basically pretend you resold the drives today. Here they are assuming in 3 years time no one will pay anything for the drives. Somewhat reasonable to be honest since the setup's bespoke and you'll only get a fraction of the value of 3 year old drives if you resold them.
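As a rough illustration of the accounting described above, here is straight-line depreciation using the figures quoted in this thread ($125 per disk, 2400 drives, a 3-year life, $12k/month). The zero resale value is the assumption stated above; how the article actually arrived at $12k/month is not shown here, so the last line is only what that figure would imply.

```python
def straight_line_monthly(cost: float, life_months: int, salvage: float = 0.0) -> float:
    """Monthly depreciation when value falls evenly from cost to salvage."""
    return (cost - salvage) / life_months

drives_capex = 2400 * 125                               # 300000 spent on drives alone
print(round(straight_line_monthly(drives_capex, 36)))   # 8333 per month for the drives
print(12_000 * 36)                                      # 432000 total hardware implied by $12k/month
```

The gap between the two suggests the $12k/month covers more than the drives (chassis, servers, networking), which is consistent with it being an accounting charge and not a failure rate.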

zparky8mo ago

oh i see, thanks! i might be too used to reading backblaze reports :p

devanshp8mo ago

no, we wanted to be conservative by depreciating somewhat more aggressively than that. we have much closer to 5% yearly disk failure rates.

jonas218mo ago· 2 in thread

Nice writeup. All of the technical detail is great!

I'm curious about the process of getting colo space. Did you use a broker? Did you negotiate, and if so, how large was the difference in price between what you initially were quoted and what you ended up paying?

nee1rOP8mo ago

We reached out to almost every colocation space in SF/some in Fremont to get quotes. There wasn't a difference between the quote price and what we ended up paying, though we did negotiate terms + one-time costs.

toomuchtodo8mo ago

Please consider posting the quotes, even if you have to redact colo names.

intalentive8mo ago· 2 in thread

“Solve computer use” and previous work is audio conversation model. How do these go together? Is the idea to replace keyboard and mouse with spoken commands? a la Star Trek

g413n8mo ago

just general research work. Once the recipes are efficient enough the modality is a smaller detail.

On the product side we're trying to orient more towards 'productive work assistant' rather than the default pull of audio models towards being an 'ai friend'.

nerpderp828mo ago

Make me transparent aluminum!

supermatt8mo ago· 2 in thread

Where does one get “90 million hours of video data”?

hmcamp8mo ago

I’m also curious about this. I don’t recall seeing that mentioned in the article

supermatt8mo ago

Its in the first sentence: "We built a storage cluster in downtown SF to store 90 million hours worth of video data."

tarasglek8mo ago· 2 in thread

i am still confused what their software stack is, they dont use ceph but bought netapp, so they use nfs?

OliverGuy8mo ago

The NetApps are just disk shelves, can plug it into a SAS controller and use whatever software stack you please.

tarasglek8mo ago

but they have multiple head nodes, so its some distributed setup or just active/passive type thing?

pighive8mo ago· 2 in thread

HDDs - are never one time costs. Do datacenters also offer ordering and replacing HDDs?

epistasis8mo ago

With 30PB it's likely they will simply let capacity fall as drives fail.

They apparently have zero need for redundancy in their use case, and the failure rate won't be high enough to take out a significant percentage of their capacity.

Symbiote8mo ago

They offer replacing, yes, but normally expect you to order the new one. (Usually covered by a warranty, sent next business day.)

OutOfHere8mo ago· 2 in thread

Is it correct that you have zero data redundancy? This may work for you if you're just hoarding videos from YouTube, but not for most people who require an assurance that their data is safe. Even for you, it may hurt proper benchmarking, reproducibility, and multi-iteration training if the parent source disappears.

nee1rOP8mo ago

Definitely much less redundancy, this was definitely a tradeoff we made for pretraining data and cost.

Sanzig8mo ago

Did you do any kind of redundancy at least (eg: putting every 10 disks in RAID 5 or RAID Z1)? Or I suppose your training application doesn't mind if you shed a few terabytes of data every so often?

huxley_marvit8mo ago· 2 in thread

damn this is cool as hell. estimate on the maintenance cost in person-hours/month?

nee1rOP8mo ago

Around 2-5 hours/month, mostly powercycling the servers and replacing hard drives

Symbiote8mo ago

You should be able to power cycle the servers from their management interfaces.

(But I have the luxury of everything being bought new from HP, so the interfaces are similar.)

OliverGuy8mo ago· 2 in thread

Aren't those netapp shelves pretty old at this point? See a lot of people recommending against them even for homelab type uses. You can get those 60 drive SuperMicro JBODs for pretty cheap now, and those aren't too old, would have been my choice.

Plus, the TCO is already way under the cloud equiv. so might as well spend a little more to get something much newer and more reliable

g413n8mo ago

yeah it's on the wishlist to try

bobbob19218mo ago

Thanks to op for actually replying to the various comments here - really appreciate that (and for the initial post of course!)

synack8mo ago· 2 in thread

IPMI is great and all, but I still prefer serial ports and remote PDUs. Never met a BMC I could trust.

toast08mo ago

Serial over IPMI, plus ipmi power control is pretty good when it works. Supermicro X10 and newer was pretty nice. X9 and X8 not as nice; it's not helpful when the serial over ipmi drops during reboot and doesn't come back in a reasonable amount of time, and then the graphical mode needs ancient java webstart with os and platform specific jni, oof.

jeffrallen8mo ago

Try Lenovo. Their BMCs Don't Suck (tm).

leejaeho8mo ago· 2 in thread

how long do you think it'll be before you fill all of it and have to build another cluster LOL

nee1rOP8mo ago

Already filled up and looking to possibly copy and paste :)

giancarlostoro8mo ago

So, others have asked, and I'm curious myself: are you sourcing the videos yourselves or from third parties?

lucb1e8mo ago· 1 in thread

The linked Discord post is also interesting and fun to read. Most of the post is more serious but this is one of the small gems:

> One thing we discovered very quickly was that [world cup] goals scored showed up in our monitoring graphs. This was very cool because not only is it neat to see real-world events show up in your systems, but this gave our team an excuse to watch soccer during meetings. We weren’t “watching soccer during meetings”, we were “proactively monitoring our systems’ performance.”

https://discord.com/blog/how-discord-stores-trillions-of-mes...

It is linked as evidence for Discord using "less than a petabyte" of storage for messages. My best guess is that they multiplied node size and count from this post, which comes out to 708 TB for the old cluster and 648 in the new setup (presumably it also has some space to grow)

g413n8mo ago

yeah we weren't sure about putting that number esp whether it includes all the image attachments, but in any case it's at least around the right reference class for the largest text data operations.

archmaster8mo ago· 1 in thread

Had the pleasure of helping rack drives! Nothing more fun than an insane amount of data :P

nee1rOP8mo ago

Thanks for helping!!!

htrp8mo ago· 1 in thread

>We threw a hard drive stacking party in downtown SF and got our friends to come, offering food and custom-engraved hard drives to all who helped. The hard drive stacking started at 6am and continued for 36 hours (with a break to sleep), and by the end of that time we had 30 PB of functioning hardware racked and wired up.

So how many actual man hours for 2400 drives?

g413n8mo ago

around 250

RagnarD8mo ago· 1 in thread

I love this story. This is true hacking and startup cost awareness.

nee1rOP8mo ago

Thanks!! :)

Onavo8mo ago· 1 in thread

> We kept this obsessively simple instead of using MinIO or Ceph because we didn’t need any of the features they provided; it’s much, much simpler to debug a 200-line program than to debug Ceph, and we weren’t worried about redundancy or sharding. All our drives were formatted with XFS.

What do you plan to do if you start getting corruption and bitrot? The complexity of S3 comes with a lot of hard guarantees for data integrity.

g413n8mo ago

our training stack doesn't make strong assumptions about data integrity, it's chill

jmakov8mo ago· 1 in thread

Wonder why everybody's first pick is CEPH which is known for being hard to optimize vs e.g. SeaweedFS

jnsaff28mo ago

If I'd have to guess then I would think that Ceph is the only one who is truly open source and does not feature gate important parts to paid enterprise users.

I did go through this a couple of years ago and we ended up with Ceph as well. Combine this with reusing existing hardware that was very suboptimal for Ceph in several ways, it was a pretty bad experience and in the end for our use case AWS was able to offer a good enough pricing that the performance and reliability of S3 was a better deal than managing it ourselves.

If I would do it again then I would make sure that I have the hardware setup that is ideal (plenty of SSD's for metadata, every spinning disk directly addressed as a single OSD, sound network topology and fast enough NIC's) and probably use Rook instead of cephadm. The monitoring, configuration and documentation side of Ceph is however still quite sad, it was really hard to figure out why something is slow and how to tune things faster.

That said, if the Enterprise options are performing better or you at least get good support for tuning and optimizing then the alternatives could be well worth consideration.

Havoc8mo ago· 1 in thread

Cool write-up.

I do feel sorry for the friends that got suckered into doing a bunch of grunt work for free though

g413n8mo ago

yeah that's why we started paying people near the second half- not super clearly stated in the blogpost, but the novelty definitely wore off with plenty of drives left to stack, so we switched strategies to get it done in time.

I think everyone who showed up for a couple hours as part of the party had a good time tho, and the engraved hard drives we were giving out weren't cheap :p

akreal8mo ago· 1 in thread

How is/was the data written to disks? Something like rsync/netcat?

nee1rOP8mo ago

We use the same nginx rust server to do file writes, it's done via web requests

g413n8mo ago· 1 in thread

the doodles are great

nee1rOP8mo ago

Thanks! Lots of hard work went into them.

ThinkBeat8mo ago· 1 in thread

So now you have all

- your storage in one place

- you own all backup,

-- off site backup (hot or cold)

- uptime worries

- maintenance drives

-- how many can fail before it is a problem

- maintenance machines

-- how many can fail before it is a problem

- maintenance misc/datacenter

- What to do if the electricity is cut off suddenly

-- do you have a backup provider?

-- diesel generators?

-- giant batteries?

-- Will the backup power also run cooling?

- natural disaster

-- earthquake

-- flooding

-- heatwave

- physical security

- employee training / (esp. if many quit)

- backup for networking (and power for it)

- employees on call 24/7

- protection against hacking

+++++

I agree that a lot of cloud providers overcharge by a lot, but doing it all yourself gives you a lot of headaches.

co-hosting would seem like a valuable partial mitigator.

pclmulqdq8mo ago

Most of these come from your colo provider (including a good backup power and networking story), and you can pay remote hands for a lot of the rest.

Things like "protection from hacking" also don't come from AWS.

Zvez8mo ago

That's basically scaled up story of 'I store my files on my computer and it is 10x cheaper than using dropbox'

While disk failure rates have already been explored in other threads here, there is one related thing that caught my interest. A disk failure in such a setup is not just the cost of a new disk plus the replacement cost (someone has to go there and change it!). It is also the inconvenience of dealing with failing requests. OK, you are willing to lose 5% of your dataset. But are your '200 lines of code' robust enough to handle such cases? What if a disk didn't fail but started to be veeeeery slow? Can your training process efficiently skip such bad objects? Do you have enough transparency to understand how much data you have already lost? Is it still below 5%? And so on and so forth.

I feel like this article was written right after they built this setup and before, let's say, 6 months of usage. Because I'm pretty sure their costs will go much higher than they calculated here, especially if they start including hidden costs, like the work that needs to be done on the training side.

Yes, the cost of self-hosting will most probably still be less than AWS (AWS is not cheap). But it might start to be comparable with the storage solutions of small ('neo') cloud providers if you buy GPUs there.
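The questions above about slow or failed disks can be made concrete with a small sketch of a loader that skips unreadable or slow objects and keeps count of what was lost. This is a hypothetical illustration, not the cluster's actual code: `fetch`, `LossStats`, and `iter_readable` are names invented here, and `fetch` stands in for whatever call reads one object from storage.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator

@dataclass
class LossStats:
    ok: int = 0
    failed: int = 0
    timed_out: int = 0

    @property
    def lost_fraction(self) -> float:
        total = self.ok + self.failed + self.timed_out
        return (self.failed + self.timed_out) / total if total else 0.0

def iter_readable(keys: Iterable[str], fetch: Callable[[str], bytes],
                  stats: LossStats, timeout_s: float = 5.0,
                  workers: int = 8) -> Iterator[bytes]:
    """Yield objects that read successfully within timeout_s; count and skip the rest."""
    # Objects are requested one at a time; extra workers only exist so that a
    # read stuck on a dying disk does not block the ones that come after it.
    pool = ThreadPoolExecutor(max_workers=workers)
    try:
        for key in keys:
            future = pool.submit(fetch, key)
            try:
                data = future.result(timeout=timeout_s)
            except FutureTimeout:   # very slow disk: give up on this object
                stats.timed_out += 1
                continue
            except OSError:         # hard I/O error: disk or file is gone
                stats.failed += 1
                continue
            stats.ok += 1
            yield data
    finally:
        pool.shutdown(wait=False)
```

Tracking `lost_fraction` over time is one way to answer "is it still below 5%?". A production version would also need to remember which disks keep timing out so it stops asking them at all.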

renewiltord8mo ago

The cost difference is huge. Modern compute is just so much bigger than one would think. Hurricane Electric is incredibly cheap too. And Digital Realty in the city are pretty good. The funny thing is that the Monkeybrains guys will make room for you at $75/amp but that isn't competitive when a 9654 based system pulls 2+ amps at peak.

Still fun for someone wanting to stick a computer in a DC though.

Networking is surprisingly hard but we also settled for the cheapo life QSFP instead of the new Cisco switches that do 800 Gbps that are coming. Great writeup.

One that would be fun is about the mechanics of layout and cabling and that sort of thing. Learning all that manually was a pain in the ass. It's not just written down somewhere and I should have done it when I was doing it but now I no longer am doing it and so can't provide good photos.

0xbadcafebee8mo ago

For massive amounts of high-performance storage, the cloud is absolutely the most expensive option, by far. Even just 100+TB is ridiculously expensive on any cloud provider. If your company revolves around large amounts of data, it can make sense to keep it on-prem...

...but only if you compute the TCO. The bandwidth, peering, service contracts, available power, cooling, networking, rack capacity, half-decent smart hands, spare gear, etc, etc. The disks won't be the majority of your bill, and the logistics are difficult. It can still be cheaper than $CLOUD, but you have to deal with all the cost and complexity that comes with DIY, so do your homework first.

miltonlost8mo ago

And how much did the training data cost?

speransky8mo ago

Why re-invent the wheel instead of using the Lustre filesystem? It's easy enough to deploy at such a small scale. POSIX interface, multiple clients, supports high-speed networking...

ClaireBookworm8mo ago

great write up, really appreciate the explanations / showing the process

neilv8mo ago

As a fan of eBay for homelab gear, I appreciate the can-do scrappiness of doing it for a startup.

To adapt the old enterprise information infrastructure saying for startups:

"Nobody Ever Got Fired for Buying eBay"

ThrowawayTestr8mo ago

DIY is always cheaper than paying someone else. Great write-up.

Hobadee8mo ago

While I don't completely disagree with all the downsides of Ceph, it also sounds like they haven't heard of Croit. I set it up at my last company and it's amazing. It takes like 99% of the Ceph headaches away, plus you get Ceph experts to talk to.

alchemist1e98mo ago

Would have been much easier and probably cheaper to buy gear from 45drives.

winterrx8mo ago

Great piece, thanks for the write up.

lisbbb8mo ago

Garbage in, garbage out, 30 petabyte edition

calvinmorrison8mo ago

> The amount of time lost to driving to the datacenter, waiting for replacement parts to arrive, and scrambling to patch over unexpected failure modes is always much higher than expected.

servers going down were fairly commonplace, servers dying were commonplace. i think once we had a whole rack outage when the switch died, and we flipped it to the backup.

Yes these things can be done and a lot cheaper than paying AWS.

g413n8mo ago

Again having it right by the office is super nice, we'll need to really trust our kvm setup before considering anything offsite.

rtp4me8mo ago

kabdib8mo ago

g413n8mo ago

someone has to go and power-cycle the machines every couple months it's chill, that's the point of not using ceph

ttfvjktesd8mo ago

You are under the assumption that only Ceph (and similar complex software) requires staff, whereas plain 30 PB can be operated basically just by rebooting from time to time.

I think that anyone with actual experience of operating thousands of physical disks in datacenters would challenge this assumption.

datadrivenangel8mo ago

paxys8mo ago

So the drives are never going to fail? PSUs are never going to burn out? You are never going to need to procure new parts? Negotiate with vendors?

trebligdivad8mo ago· 11 in thread

The networking stuff seems....odd.

Where does the switch choice come into whether you DHCP? Wth would you want public IPs.

mystifyingpoi8mo ago

matt-p8mo ago

This was not written by a network person, quite clearly. Hopefully it's just a misunderstanding, otherwise they do need someone with literally any clue about networks.

g413n8mo ago

yeah misunderstanding we'll update the post-- separately it's true that we aren't network specialists and the network wrangling was prob disproportionately hard for us/ shouldn't have taken so long.

giancarlostoro8mo ago

> Wth would you want public IPs.

So anyone can download 30 PB of data with ease of course.

pclmulqdq8mo ago

They didn't seem to want to use a router. Purpose-built 100 Gbps routers are a bit expensive, but you can also turn a computer into one.

flumpcakes8mo ago

Many switches are L3 capable, making them in effect a router. Considering their internet lines appear to be hooked up to their 100 Gbps switch, I'd guess this is one of the L3 ones.

buzer8mo ago

> Wth would you want public IPs.

Possibly to avoid needing NAT (or VPN) gateway that can handle 100Gbps.

xp848mo ago

No DHCP doesn't mean public IPs nor impact the need for NAT, it just means the hosts have to be explicitly configured with IP addresses, default gateways if they need egress, and DNS.

If public, a firewall can control access just as well as using NAT can.

bombcar8mo ago

I don't know what they're doing, but Mikrotik can perhaps route that → https://mikrotik.com/product/ccr2216_1g_12xs_2xq#fndtn-testr... and is about the cost of their used thing.

And I think this would be a banger for IPv6 if they really "need" public IPs.

XorNot8mo ago

I mean generally above a certain size of deployment DHCP is much more trouble than it's worth.

DHCP is really only worth it when your hosts are truly dynamic (i.e. not controlled by you). Otherwise it's a lot easier to handle IP allocation as part of the asset lifecycle process.

At the enterprise level unpacking a server and recording the asset IDs etc is the time to assign IP addresses.

Symbiote8mo ago

I have static, public IPs across 80 or so servers.

It gets set approximately once when the server's automated Ubuntu installation runs, and I never think about it.

> Where does the switch choice come into whether you DHCP?

Perhaps from home routers which include it.

> Wth would you want public IPs.

Why wouldn't you? They have a firewall.

g413n8mo ago· 9 in thread

No mention of disk failure rates? curious how it's holding up after a few months

dylan6048mo ago

jeffrallen8mo ago

> next 12 months before I left for another job

Heh, that's a clever solution to the problem of managing storage through the full 10 year disk lifecycle.

bayindirh8mo ago

The disk failure rates are very low when compared to a decade ago. I used to change more than a dozen disks every week a decade ago. Now it's an eyebrow raising event which I seldom see.

I think following Backblaze's hard disk stats is enough at this point.

gordonhart8mo ago

Backblaze reports an annual failure rate of 1.36% [0]. Since their cluster uses 2,400 drives, they would likely see ~32 failures a year (extra ~$4,000 annual capex, almost negligible).

[0] https://www.backblaze.com/cloud-storage/resources/hard-drive...
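The failure estimate above is simple to reproduce. The sketch below uses the 2,400 drives and $125 per disk quoted in this thread, and compares the Backblaze rate with the roughly 5% yearly rate the authors report elsewhere in the comments; treating replacement cost as just the disk price is a simplifying assumption.

```python
def yearly_failures(drives: int, annual_failure_rate: float) -> float:
    """Expected number of drive failures per year."""
    return drives * annual_failure_rate

DRIVES = 2400
COST_PER_DISK = 125  # dollars, figure quoted in the thread

for label, afr in [("Backblaze 1.36%", 0.0136), ("reported ~5%", 0.05)]:
    n = yearly_failures(DRIVES, afr)
    print(label, round(n, 1), "drives/yr,", round(n * COST_PER_DISK), "dollars/yr")
# Backblaze 1.36% 32.6 drives/yr, 4080 dollars/yr
# reported ~5% 120.0 drives/yr, 15000 dollars/yr
```

Even at the higher rate the replacement spend stays small next to the monthly hosting bill; the labor to swap the drives is the part this leaves out.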

cjaackie8mo ago

They mentioned the cluster being built from used enterprise drives. I can see the desire to save money, but I agree that is going to be one expensive mistake down the road.

I should also note personally for home cluster use, I learned quickly that used drives didn’t seem to make sense. Too much performance variability.

jms558mo ago

If I remember correctly, most drives either:

1. Fail in the first X amount of time

2. Fail towards the end of their rated lifespan

So buying used drives doesn't seem like the worst idea to me. You've already filtered out the drives that would fail early.

Disclaimer: I have no idea what I'm talking about

guywithahat8mo ago

g413n8mo ago

in a datacenter context failure rates are just a remote-hands recurring cost so it's not too bad with front-loaders

e.g. have someone show up to the datacenter with a grocery list of slot indices and a cart of fresh drives every few months.

ClaireBookworm8mo ago

good point

jimmytucson8mo ago· 9 in thread

Just wanted to say, thanks for doing this! Now the old rant...

g413n8mo ago

nodja8mo ago

Yeah, from memory on-prem was always cheaper; the cloud just removed a lot of logistical obstacles and made everything convenient under one bill.

luhn8mo ago

> IIRC the wisdom of the time cloud started becoming popular was to always be on-prem and use cloud to scale up when demand spiked.

matt-p8mo ago

iJohnDoe8mo ago

It’s interesting everyone having different experiences and those experiences drive what they do.

Certainly most of it is my lack of knowledge of not sticking with it.

Just give me a VM and some firewall rules. Cloning VMs can be automated in so many different ways.

/rant

doublerabbit8mo ago

I've found Docker is something of a monstrous pet.

Snapshots are a haircut for the monster, useful but can make things worse.

theideaofcoffee8mo ago

tempest_8mo ago

The other thing the cloud does not let you do is make trade offs.

Sometimes you can afford not to have a triple-redundant 1000GB network, or a simple single machine with RAID may have acceptable downtime.

ares6238mo ago

Wanna see us do it again?

jillesvangurp8mo ago· 9 in thread

solatic8mo ago

irjustin8mo ago

Honestly, I don't understand this line of thinking. You're not alone in it either, but it always reads so naive.

The industry funnels billions to become vendor locked in because AWS is simply that good - bar none.

vasco8mo ago

computably8mo ago

> AWS at cost price would probably be 60-80x less than what they charge; if not more.

If you look at https://ir.aboutamazon.com/news-release/news-release-details... , it says in 2024,

> AWS segment sales increased 19% year-over-year to $107.6 billion.

> AWS segment operating income was $39.8 billion, compared with operating income of $24.6 billion in 2023.

So about 59% margin, relative to costs. Everybody undercutting AWS is likely doing so at close enough to 20% margins that it makes no sense to fund a startup in the same space.
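
Note that the 59% here is operating income divided by implied costs (revenue minus operating income), which is a markup on cost rather than the more commonly quoted operating margin on revenue. A quick check using only the two figures quoted above:

```python
# Figures quoted above, in billions of USD (2024 AWS segment).
revenue_bn = 107.6
operating_income_bn = 39.8

# Implied costs: whatever part of revenue did not become operating income.
implied_costs_bn = revenue_bn - operating_income_bn

markup_on_cost = operating_income_bn / implied_costs_bn
margin_on_revenue = operating_income_bn / revenue_bn

print(f"markup on cost:    {markup_on_cost:.1%}")
print(f"margin on revenue: {margin_on_revenue:.1%}")
```

The first number is the "about 59%" in the comment; the second is roughly 37%.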

rafaelmn8mo ago

AWS is everything from GPU farms and AI services to S3.

I would be shocked if S3 had less than a 100% margin at sticker price.

These guys had one of the most important prerequisites to make going on-prem easy - they didn't care about cross-site or reliability.

znpy8mo ago

> AWS segment operating income was $39.8 billion

keep in mind that the cost there likely also includes personnel, probably a significant fraction if you consider how many employees amazon's aws division has.

silisili8mo ago

Inertia. It's the new 'nobody gets fired for buying..'

A previous project I worked on had relatively little traffic, and AWS costs were rather insane for that.

zer00eyz8mo ago

AWS costs are insane for every project.

> Upper management would hear nothing of it

No one knows how to plan ahead any more. It's all "agile", and hardware (and budgeting for it) isn't something most in management are capable of doing any more.

Your average CTO just doesn't have these skills.

Maxion8mo ago

Sticking with AWS / Azure / GCP carries other benefits, too. You're more likely to find engineers who are experienced with those cloud platforms over, say, OVH.

mschuster918mo ago· 7 in thread

Shows how crazy cheap on prem can be. tips hat

nee1rOP8mo ago

tips hat back

stackskipton8mo ago

nee1rOP8mo ago

True, this is a large reason why we chose to have the datacenter a couple blocks away from the office.

mschuster918mo ago

The majority of my server wrangling time was spent dealing with OS updates and, most annoyingly, OpenStack. But that's something you can't escape even if you run your stuff in the cloud...

hanikesn8mo ago

Why 5h a week? Just for hardware?

dpe828mo ago

a) 5hrs/week is negligible compared to that potential AWS bill.

buckle80178mo ago

And this is actually relatively expensive.

yread8mo ago· 6 in thread

g413n8mo ago

mx7zysuj4xew8mo ago

You cannot use hetzner for anything serious.

They'd most likely claim abuse and delete your data wholesale without notice

fapjacks8mo ago

100% this. Hetzner has no problems completely blowing away whatever you've got running for arbitrary reasons. And their support is incredibly bad.

nodja8mo ago

fuzzylightbulb8mo ago

You're good. The speed of light through a glass fiber is still just as slow as it ever was.

lostmsu8mo ago

Your math does not math. It is more like $2/TB/month with minimal redundancy.

nharada8mo ago· 5 in thread

So how do they get this data to the GPUs now...? Just run it over the public internet to the datacenter?

nee1rOP8mo ago

yeah, exactly! we have a 100G uplink, and then we use nginx secure links that we then just curl from the machines using HTTP. (funnily HTTPS adds overhead so we just pre-sign URLs)
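
For readers unfamiliar with the approach: nginx's secure_link module checks an MD5 token and an expiry timestamp carried in the URL, so a link can be handed to a GPU machine and fetched with plain curl. A minimal sketch of generating such a URL, assuming a server configured with `secure_link $arg_md5,$arg_expires;` and `secure_link_md5 "$secure_link_expires$uri SECRET";`. The hostname, path, secret, and exact hash expression below are hypothetical; the actual configuration is not described in the thread.

```python
import base64
import hashlib
import time


def sign_url(base_url, uri, secret, ttl_seconds, now=None):
    """Build a URL carrying md5/expires query args in nginx secure_link style."""
    issued_at = int(now if now is not None else time.time())
    expires = issued_at + ttl_seconds
    # This string must match the secure_link_md5 expression on the server.
    # Assumed here: "$secure_link_expires$uri <secret>".
    payload = f"{expires}{uri} {secret}".encode()
    digest = hashlib.md5(payload).digest()
    # nginx expects the binary digest as base64url with padding removed.
    token = base64.urlsafe_b64encode(digest).decode().rstrip("=")
    return f"{base_url}{uri}?md5={token}&expires={expires}"


# Hypothetical values, for illustration only.
url = sign_url(
    "http://storage.example.internal",
    "/data/shard-0001.tar",
    "not-a-real-secret",
    ttl_seconds=3600,
    now=1_700_000_000,
)
print(url)
```

The token only proves the link was issued by someone holding the secret and has not expired; without TLS the payload itself still travels unencrypted.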

g413n8mo ago

7.5k for zayo 100gig so that's like half of the MRC

bayindirh8mo ago

They can rent a dark fiber for themselves for that distance, and it'll be cheap.

However, as they noted they use 100gbps capacity from their ISP.

nee1rOP8mo ago

We want to get darkfiber from the datacenter to the office. I love 100Gbps

geor9e8mo ago

Does San Francisco really still have dark fiber? That 90s bubble sure did overshoot demand.

azinman28mo ago· 5 in thread

Where does one acquire 90M hours of video without being YouTube?

Barbing8mo ago

Anywhere as long as you can avoid “legal/practice/business slog”. Success is defined by $1.5b settlements.

:) just kidding but also curious where besides torrents

hengheng8mo ago

My guess is automated surveillance, which is also where this whole play has to be headed.

fuzzfactor8mo ago

Seems like that would be a good niche, not only for avoiding massive copyright considerations.

Also, it's some of the most boring footage there is, in overwhelming amounts, and about the least desirable thing for humans to sit and watch every minute of.

Why send a human to do a machine's job?

Hobadee8mo ago

pr0n

NitpickLawyer8mo ago

"I swear I'm seeding those just so I get my ratio up" :)

coleca8mo ago· 4 in thread

g413n8mo ago

epistasis8mo ago

What sort of deal are you talking about? Would it be 50% or more?

master_crab8mo ago

You can get way higher than 50% discounts with AWS (or any cloud) depending upon the scale of the buy.

oasisbob8mo ago

Not for that minimum 0.5PB volume.

Even at 10PB, the storage commit discounts won't be anywhere near 50%. Probably more like 10-20%, if that.

not--felix8mo ago· 4 in thread

But where do you get 90 million hours worth of video data?

_1tem8mo ago

And not just any video data, they specifically mentioned screen recordings for agentic computer uses. A very specific kind of video. My guess is they have a partnership with someone like Rewind.ai

Barbing8mo ago

“For your privacy, your screen and audio recordings are stored locally and NEVER leave your Mac.”

Tell me it’s only someone _like_ Rewind and not actually them! Quoting from the Privacy page they link in their header.

bobbob19218mo ago

conception8mo ago

Arrr matey

Scramblejams8mo ago· 3 in thread

Fun piece, thanks to the author. But for vicarious thrills like this, more pictures are always appreciated!

echelon8mo ago

If the authors chime in, I'd like to ask what "Standard Intelligence PBC" does.

Is it a public benefit corp?

What are y'all building?

nee1rOP8mo ago

We did want more pictures!! Recently bought a Sony A7III to capture more fun moments like this.

kid648mo ago

Many colos disallow photography.

boulos8mo ago· 3 in thread

cornholio8mo ago

It goes without saying any data pre-processing needs to be done before writing, at the storage site, or on the training GPUs.

g413n8mo ago

yeah we just have the 100gig link, atm that's about all the gpu clusters can pull but we'll prob expand bandwidth and storage as we scale.

I guess worth noting that we do have a bunch of 4090s in the colo and it's been super helpful for e.g. calculating embeddings and such for data splits.

mwambua8mo ago

pronoiac8mo ago· 3 in thread

I wonder if they'll go with "toploaders" - like Backblaze Storage Pods - later. They have better density and faster setup, as they don't have to screw in every drive.

g413n8mo ago

yeah we're very interested in trying toploaders, we'll do a test rack next time we expand and switch to that if it goes well.

joshvm8mo ago

"don't have to screw in every drive" is relative, but at least tool-less drive carriers are a thing now.

tempest_8mo ago

Used Supermicro machines of this generation are very cheap (all things considered)

https://www.theserverstore.com/supermicro-superstorage-ssg-6...

urbandw311er8mo ago· 3 in thread

g413n8mo ago

urbandw311er8mo ago

Oops sorry, my bad! Great to read all about it - good luck with the project.

Tepix8mo ago

He did mention that it would have been a higher up-front cost.

fragmede8mo ago· 3 in thread

hnav8mo ago

renewiltord8mo ago

Problem when you self-roll this is that you inevitably make mistakes and the cycle time of going down and up ruins everything. Access trumps everything.

Learned this lesson painfully.

g413n8mo ago

it's not just in sf it's across the street from our office

this has been incredibly nice for our first hardware project, if we ever expand substantially then we'd def care more about the colo costs.

drnick18mo ago· 3 in thread

Everyone should give AWS the middle finger and start doing this. Beyond cost, it's a matter of sovereignty over one's computing and data.

twoodfin8mo ago

If this is a real market, I’d expect AWS to introduce S3 Junkyard with a similar durability and cost structure.

They probably still won’t budge on the egress fees.

g413n8mo ago

we would be so down to buy s3 junkyard tbh we were going around begging various storage clouds to offer us this before giving up and building it ourselves

Barbing8mo ago

>S3 Junkyard

There it is, the answer to how to mitigate brand damage when risking distance between themselves and some of those 9s.

landryraccoon8mo ago· 3 in thread

Their electricity costs are $10K per month or about $120K per year. At an interest rate of 7% that's $1.7M of capital tied up in power bills.
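
The $1.7M figure appears to come from valuing the power bill as a perpetuity: a payment C made every year forever, discounted at rate r, has present value C / r. A sketch of that reading next to a finite-horizon version (the 3-year horizon is an assumption added for comparison, not something this comment states):

```python
def perpetuity_pv(annual_cost: float, rate: float) -> float:
    """Present value of paying annual_cost every year, forever."""
    return annual_cost / rate


def annuity_pv(annual_cost: float, rate: float, years: int) -> float:
    """Present value of end-of-year payments over a fixed number of years."""
    return annual_cost * (1 - (1 + rate) ** -years) / rate


annual_power = 120_000  # $10K per month, from the comment above
rate = 0.07

forever = perpetuity_pv(annual_power, rate)
three_years = annuity_pv(annual_power, rate, 3)  # assumed horizon
print(f"perpetuity: ${forever:,.0f}")
print(f"3 years:    ${three_years:,.0f}")
```

Which number is appropriate depends on whether the power bill is treated as a permanent obligation or as a cost over the hardware's lifetime.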

datadrivenangel8mo ago

At 120K per year over the three year accounting life of the hardware, that's 360k... how do you get to 1.7M?

landryraccoon8mo ago

It seems unlikely to me that they'll never have to retrain their model to account for new data. Is the assumption that their power usage drastically drops after 3 years?

Unless they go out of business in 3 years that seems unlikely to me. Is this a one-off model where they train once and it never needs to be updated?

moffkalast8mo ago

Let's just say we're not seeing all of these sudden private nuclear reactor investments for no reason.

yodon8mo ago· 3 in thread

Any startup that has enough money to casually buy a two-letter domain name has too much money, period. Kind of like counting the number of Aeron chairs at startups of old. Not a good sign.

638mo ago

Looks like unregistered two letter .inc domains are going for $2300/yr. Certainly <5% of the cost of a single developer.

lmm8mo ago

It's a .inc domain name, are those worth anything?

dangoodmanUT8mo ago

it's .inc... those aren't expensive

miniman13378mo ago· 3 in thread

Used Disks, No DR, not exactly a real shoot out.

nee1rOP8mo ago

True, though this is specifically for pretraining data (S3 wouldn't sell us used disk + no DR storage).

Sanzig8mo ago

I do appreciate the scrappiness of your solution. Used drives for a storage cluster is like /r/homelab on steroids. And since it's pretraining data, I suppose data integrity isn't critical.

Most venture-backed startups would have just paid the AWS or Cloudflare tax. I certainly hope your VCs appreciate how efficient you are being with their capital :)

p_ing8mo ago

You're in a seismically active part of the world. Will the venture last in a total loss scenario?

zparky8mo ago· 3 in thread

$125/disk, 12k/mo depreciation cost which i assume means disk failures, so ~100 disks/mo or 1200/yr, which is half of their disks a year - seems like a lot.
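
Spelled out, the reasoning above converts the monthly depreciation figure into an implied number of replaced disks. A small sketch using the numbers in the comment and the 2,400-drive count mentioned elsewhere in the thread; it assumes depreciation equals replacement spend, which the replies correct:

```python
price_per_disk = 125           # USD, from the comment above
monthly_depreciation = 12_000  # USD, from the comment above
total_disks = 2400             # drive count quoted elsewhere in the thread

# If every depreciation dollar bought a replacement disk:
disks_per_month = monthly_depreciation / price_per_disk
disks_per_year = disks_per_month * 12
fraction_of_fleet = disks_per_year / total_disks

print(f"{disks_per_month:.0f} disks/month, {disks_per_year:.0f} disks/year")
print(f"{fraction_of_fleet:.0%} of the fleet per year")
```

The exact values are slightly below the rounded "~100 disks/mo" and "half" in the comment.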

AnotherGoodName8mo ago

zparky8mo ago

oh i see, thanks! i might be too used to reading backblaze reports :p

devanshp8mo ago

no, we wanted to be conservative by depreciating somewhat more aggressively than that. we have much closer to 5% yearly disk failure rates.

jonas218mo ago· 2 in thread

Nice writeup. All of the technical detail is great!

nee1rOP8mo ago

toomuchtodo8mo ago

Please consider posting the quotes, even if you have to redact colo names.

intalentive8mo ago· 2 in thread

“Solve computer use” and previous work is audio conversation model. How do these go together? Is the idea to replace keyboard and mouse with spoken commands? a la Star Trek

g413n8mo ago

just general research work. Once the recipes are efficient enough the modality is a smaller detail.

On the product side we're trying to orient more towards 'productive work assistant' rather than the default pull of audio models towards being an 'ai friend'.

nerpderp828mo ago

Make me transparent aluminum!

supermatt8mo ago· 2 in thread

Where does one get “90 million hours of video data”?

hmcamp8mo ago

I’m also curious about this. I don’t recall seeing that mentioned in the article

supermatt8mo ago

It's in the first sentence: "We built a storage cluster in downtown SF to store 90 million hours worth of video data."

tarasglek8mo ago· 2 in thread

i am still confused about what their software stack is. they don't use ceph but bought netapp, so they use nfs?

OliverGuy8mo ago

The NetApps are just disk shelves, can plug it into a SAS controller and use whatever software stack you please.

tarasglek8mo ago

but they have multiple head nodes, so it's some distributed setup or just an active/passive type thing?

pighive8mo ago· 2 in thread

HDDs - are never one time costs. Do datacenters also offer ordering and replacing HDDs?

epistasis8mo ago

With 30PB it's likely they will simply let capacity fall as drives fail.

They apparently have zero need for redundancy in their use case, and the failure rate won't be high enough to take out a significant percentage of their capacity.
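
To put rough numbers on "let capacity fall": a sketch that assumes no drives are replaced, no redundancy, and a constant annual failure rate. The two rates are the figures quoted elsewhere in this thread (1.36% from Backblaze's report, about 5% from the team); real failure curves are not constant, so treat this as an order-of-magnitude estimate.

```python
def remaining_capacity_pb(initial_pb: float, annual_failure_rate: float, years: int) -> float:
    """Capacity left if failed drives are never replaced (constant AFR assumed)."""
    return initial_pb * (1 - annual_failure_rate) ** years


for afr in (0.0136, 0.05):
    left = remaining_capacity_pb(30.0, afr, 3)
    print(f"AFR {afr:.2%}: {left:.1f} PB left after 3 years")
```

Even at the higher rate, most of the original 30 PB would still be online after three years under these assumptions.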

Symbiote8mo ago

They offer replacing, yes, but normally expect you to order the new one. (Usually covered by a warranty, sent next business day.)

OutOfHere8mo ago· 2 in thread

nee1rOP8mo ago

Definitely much less redundancy; this was a tradeoff we made for pretraining data and cost.

Sanzig8mo ago

Did you do any kind of redundancy at least (eg: putting every 10 disks in RAID 5 or RAID Z1)? Or I suppose your training application doesn't mind if you shed a few terabytes of data every so often?

huxley_marvit8mo ago· 2 in thread

damn this is cool as hell. estimate on the maintenance cost in person-hours/month?

nee1rOP8mo ago

Around 2-5 hours/month, mostly powercycling the servers and replacing hard drives

Symbiote8mo ago

You should be able to power cycle the servers from their management interfaces.

(But I have the luxury of everything being bought new from HP, so the interfaces are similar.)
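
As a sketch of what that can look like in practice, the snippet below builds an `ipmitool` command for a remote chassis power cycle. It assumes the BMCs speak IPMI over LAN and that `ipmitool` is installed on the machine running the script; the hostname and username are hypothetical, and `-E` tells ipmitool to read the password from the `IPMI_PASSWORD` environment variable.

```python
import subprocess


def power_cycle_command(bmc_host: str, user: str) -> list[str]:
    """Return the ipmitool argv for a remote power cycle over IPMI-over-LAN."""
    return [
        "ipmitool",
        "-I", "lanplus",   # IPMI v2.0 LAN interface
        "-H", bmc_host,    # BMC address (hypothetical in the example below)
        "-U", user,
        "-E",              # password taken from the IPMI_PASSWORD env variable
        "chassis", "power", "cycle",
    ]


def power_cycle(bmc_host: str, user: str) -> subprocess.CompletedProcess:
    """Run the command; raises CalledProcessError if ipmitool reports failure."""
    return subprocess.run(power_cycle_command(bmc_host, user), check=True)


# Only print the command here; calling power_cycle() would reboot a real host.
print(" ".join(power_cycle_command("bmc-rack01-node03.example.internal", "admin")))
```

With mixed used hardware, as in this build, whether every BMC behaves well over IPMI is a separate question.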

OliverGuy8mo ago· 2 in thread

Plus, the TCO is already way under the cloud equiv. so might as well spend a little more to get something much newer and more reliable

g413n8mo ago

yeah it's on the wishlist to try

bobbob19218mo ago

Thanks to op for actually replying to the various comments here - really appreciate that (and for the initial post of course!)

synack8mo ago· 2 in thread

IPMI is great and all, but I still prefer serial ports and remote PDUs. Never met a BMC I could trust.

toast08mo ago

jeffrallen8mo ago

Try Lenovo. Their BMCs Don't Suck (tm).

leejaeho8mo ago· 2 in thread

how long do you think it'll be before you fill all of it and have to build another cluster LOL

nee1rOP8mo ago

Already filled up and looking to possibly copy and paste :)

giancarlostoro8mo ago

So, others have asked, and I'm curious myself: are you sourcing the videos yourselves or from third parties?

lucb1e8mo ago· 1 in thread

The linked Discord post is also interesting and fun to read. Most of the post is more serious but this is one of the small gems:

https://discord.com/blog/how-discord-stores-trillions-of-mes...

g413n8mo ago

yeah we weren't sure about putting that number esp whether it includes all the image attachments, but in any case it's at least around the right reference class for the largest text data operations.

archmaster8mo ago· 1 in thread

Had the pleasure of helping rack drives! Nothing more fun than an insane amount of data :P

nee1rOP8mo ago

Thanks for helping!!!

htrp8mo ago· 1 in thread

So how many actual man hours for 2400 drives?

g413n8mo ago

around 250

RagnarD8mo ago· 1 in thread

I love this story. This is true hacking and startup cost awareness.

nee1rOP8mo ago

Thanks!! :)

Onavo8mo ago· 1 in thread

What do you plan to do if you start getting corruption and bitrot? The complexity of S3 comes with a lot of hard guarantees for data integrity.

g413n8mo ago

our training stack doesn't make strong assumptions about data integrity, it's chill

jmakov8mo ago· 1 in thread

Wonder why everybody's first pick is Ceph, which is known for being hard to optimize, vs e.g. SeaweedFS

jnsaff28mo ago

If I had to guess, I would think that Ceph is the only one that is truly open source and does not feature-gate important parts behind paid enterprise tiers.

That said, if the Enterprise options are performing better or you at least get good support for tuning and optimizing then the alternatives could be well worth consideration.

Havoc8mo ago· 1 in thread

Cool write-up.

I do feel sorry for the friends that got suckered into doing a bunch of grunt work for free though

g413n8mo ago

I think everyone who showed up for a couple hours as part of the party had a good time tho, and the engraved hard drives we were giving out weren't cheap :p

akreal8mo ago· 1 in thread

How is/was the data written to disks? Something like rsync/netcat?

nee1rOP8mo ago

We use the same nginx rust server to do file writes, it's done via web requests

g413n8mo ago· 1 in thread

the doodles are great

nee1rOP8mo ago

Thanks! Lots of hard work went into them.

ThinkBeat8mo ago· 1 in thread

So now you have all

- your storage in one place

- you own all backup,

-- off site backup (hot or cold)

- uptime worries

- maintenance drives

-- how many can fail before it is a problem

- maintenance machines

-- how many can fail before it is a problem

- maintenance misc/datacenter

- What to do if the electricity is cut off suddenly

-- do you have a backup provider?

-- diesel generators?

-- giant batteries?

-- Will the backup power also run cooling?

- natural disaster

-- earthquake

-- flooding

-- heatwave

- physical security

- employee training / (esp. if many quit)

- backup for networking (and power for it)

- employees on call 24/7

- protection against hacking

+++++

I agree that a lot of cloud providers overcharge by a lot, but doing it all yourself gives you a lot of headaches.

co-hosting would seem like a valuable partial mitigator.

pclmulqdq8mo ago

Most of these come from your colo provider (including a good backup power and networking story), and you can pay remote hands for a lot of the rest.

Things like "protection from hacking" also don't come from AWS.

Zvez8mo ago

That's basically a scaled-up version of the 'I store my files on my computer and it is 10x cheaper than using dropbox' story.

Yes, the cost of self-hosting will most probably still be less than AWS (AWS is not cheap). But it might start to be comparable with the storage solutions of small ('neo') cloud providers if you buy GPUs there.

renewiltord8mo ago

Still fun for someone wanting to stick a computer in a DC though.

Networking is surprisingly hard but we also settled for the cheapo life QSFP instead of the new Cisco switches that do 800 Gbps that are coming. Great writeup.

0xbadcafebee8mo ago

miltonlost8mo ago

And how much did the training data cost?

speransky8mo ago

Why re-invent the wheel instead of using the Lustre filesystem? It's easy enough to deploy on such a small filesystem. POSIX interface, multiple clients, supports high-speed networking...

ClaireBookworm8mo ago

great write up, really appreciate the explanations / showing the process

neilv8mo ago

As a fan of eBay for homelab gear, I appreciate the can-do scrappiness of doing it for a startup.

To adapt the old enterprise information infrastructure saying for startups:

"Nobody Ever Got Fired for Buying eBay"

ThrowawayTestr8mo ago

DIY is always cheaper than paying someone else. Great write-up.

Hobadee8mo ago

alchemist1e98mo ago

Would have been much easier and probably cheaper to buy gear from 45drives.

winterrx8mo ago

Great piece, thanks for the write up.

lisbbb8mo ago

Garbage in, garbage out, 30 petabyte edition
