1.7 petabytes and 850M files lost, and how we survived it

(csc.fi)

92 points · beck5 · 10y ago · 33 comments

zimpenfish · 10y ago · 5 in thread

"The directory is intended for temporary storage of results before staging them into a more permanent location [...] During the three years that the filesystem has been in operation, it has accumulated 1.7 Petabytes of data in 850 million objects."

There needs to be some law about how temporary directories always end up containing vitally important data.

sevensor · 10y ago

What was interesting to me about this was that they had decided not to enforce a deletion policy on /wrk, because they had so much space and the filesystem hadn't ever failed. But a rolling deletion policy would have gone a long way to containing the damage by encouraging the users to move their data to a filesystem optimized for reliability instead of availability. Still, I appreciate the heroics involved in restoring the data.

ople · 10y ago

Author here. We have had an automated deletion policy on our previous filesystems but opted out this time: There are users that have temporary files that they want to persist on the /wrk and we have plenty of capacity. We definitely learned our lesson, though. :)

jabl · 10y ago

The (inevitable?) consequence of deletion policies, which typically delete based on mtime, tends to be users putting touch scripts in cron, making the metadata servers even more of a bottleneck than they already are. Been there, done that.

Perhaps the solution is some netflix-like chaos monkey that randomly deletes files..;) Or for each user over its soft quota, delete the oldest files until under the quota. Or something like that..
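
The quota idea above ("for each user over its soft quota, delete the oldest files until under the quota") could be sketched roughly as follows. This is only an illustration, not anything CSC runs: the `FileInfo` record, the `select_for_deletion` function, and the choice of mtime as the age criterion are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileInfo:
    """Hypothetical record for one file owned by a user."""
    path: str
    size: int     # bytes
    mtime: float  # modification time, seconds since the epoch


def select_for_deletion(files, soft_quota):
    """Pick the oldest files whose removal brings usage down to soft_quota.

    Returns the files to delete, oldest first. Nothing is deleted here;
    the caller decides what to do with the list.
    """
    usage = sum(f.size for f in files)
    doomed = []
    # Walk from oldest to newest and stop as soon as the user fits the quota.
    for f in sorted(files, key=lambda f: f.mtime):
        if usage <= soft_quota:
            break
        doomed.append(f)
        usage -= f.size
    return doomed
```

Because only users over quota are touched, there is less reason to run touch scripts in cron, although ordering by mtime can still be gamed by anyone who stays over quota.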

lmkg · 10y ago

It's not just files. Anything originally designed as a temporary stop-gap has a habit of becoming permanent. I first learned this about portable classrooms, then laws, then organizational practices. By comparison, the temp directory isn't so surprising.

Aaargh20318 · 10y ago

There are few things as permanent as a temporary solution.

The problem with a temporary solution is that it makes the problem go away, so suddenly there is no longer any incentive to fix it properly.

ghubbard · 10y ago · 4 in thread

Current HN Title: 1.7 petabytes and 850M files lost, and how we survived it.

Article title: The largest unplanned outage in years and how we survived it

Article overview: A month ago CSC's high-performance computing services suffered the largest unplanned outage in years. In total approximately 1.7 petabytes and 850 million files were recovered.

Although technically correct, the HN title is misleading.

distances · 10y ago

To be fair, giving some numbers in the title makes the link much more interesting. With the original title this piece probably wouldn't have made it to the front page, as it doesn't even hint that this is about a scientific computing center.

It was an interesting read, so thumbs up for the dramatization.

tnorthcutt · 10y ago

1.7 petabytes and 850M files lost --- 1.7 petabytes and 850 million files were recovered

Given that the latter statement is from the article, how is the former "technically correct"?

biot · 10y ago

Imagine an article "One web server lost and how we survived it" simply said "Our load balancer automatically removed that server from the pool and we let the other 15 web servers pick up the load. We didn't have to do anything." This is different from "Oh crap, we only had one web server and we absolutely had to do a lengthy recovery process to get it back online."

dantillberg · 10y ago

Yeah; I read the headline and presumed that they had been using a resilient data duplication scheme that allowed them to recover from the catastrophic loss of e.g. an entire datacenter's worth of data.

hga · 10y ago · 3 in thread

Lots of fun; while backing up the filesystem prior to wiping and rebuilding it, they ran out of IOPS to do it in a reasonable time frame, so after considering other options:

One obvious solution would be to use a ramdisk, a virtual disk that actually resides in the memory of a node. The problem was that even our biggest system had 1.5TB of memory while we needed at least 3TB.

As a workaround we created ramdisks on a number of Taito cluster compute nodes, mounted them via iSCSI over the high-speed InfiniBand network to a server and pooled them together to make a sufficiently large filesystem for our needs.

A hack they weren't at all sure would work, but it did nicely.
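
As a back-of-the-envelope check on the pooling trick, the number of compute nodes needed is just the required capacity divided by the ramdisk each node can spare, rounded up. The article excerpt gives the 3TB target and the 1.5TB largest system; the 48 GB per-node ramdisk size used in the note below is purely a hypothetical figure, as is the function name.

```python
def nodes_needed(required_gb, ramdisk_gb_per_node):
    """Smallest number of nodes whose pooled ramdisks reach required_gb.

    Both arguments are whole gigabytes. Ignores any overhead from iSCSI,
    the pooling layer, or the filesystem built on top.
    """
    if ramdisk_gb_per_node <= 0:
        raise ValueError("ramdisk size per node must be positive")
    # Ceiling division on integers.
    return -(-required_gb // ramdisk_gb_per_node)
```

With the hypothetical 48 GB per node, `nodes_needed(3000, 48)` gives 63 nodes, while two of the 1.5TB machines would have covered the bare 3TB with no headroom.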

powercf · 10y ago

Couldn't they add 1.5TB of swap to their 1.5TB of memory system and run a ramdisk on that? I'm curious what performance would look like, but given 2-3k IOPS for the on-disk solution, and 20k IOPS for the in-memory I would naively expect at least 11k IOPS for random access, which should have been fast enough without the headache of clustering?

ople · 10y ago

Author here: We considered that, but as the access pattern was likely pretty much random, the performance would have been terrible. Due to the break we had nearly 1000 clustered servers sitting idle, so it was reasonably quick to do the ramdisk trick.
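
One way to see why the swap-backed ramdisk looks worse than the naive 11k figure: 11k is the arithmetic mean of 2k and 20k IOPS, but when accesses are served one after another the slow ones dominate the total time, so the sustained rate is a weighted harmonic mean instead. The sketch below assumes uniformly random access with half the data in RAM and half in swap, and reuses the 2-3k and 20k IOPS figures quoted in the question; the function name and the model itself are illustrative assumptions, not measurements.

```python
def effective_iops(ram_fraction, ram_iops, disk_iops):
    """Sustained IOPS when ram_fraction of accesses hit RAM and the rest hit disk.

    Assumes operations are served one at a time, so the average time per
    operation is the weighted sum of the two per-operation latencies.
    """
    seconds_per_op = ram_fraction / ram_iops + (1.0 - ram_fraction) / disk_iops
    return 1.0 / seconds_per_op


for disk_iops in (2_000, 3_000):
    print(disk_iops, round(effective_iops(0.5, 20_000, disk_iops)))
```

Under these assumptions the result is roughly 3,600 to 5,200 IOPS, only modestly better than the disk alone and well short of 11k.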

hga · 10y ago

My guess is the "headache of clustering" wasn't a big one for them, they do this for a living, and by that time in the process/downtime, they wanted the job done ASAP.

ajford · 10y ago · 3 in thread

Out of curiosity, why weren't they running the metadata drive in a mirroring raid? If you have PB of data, wouldn't it make sense to spend the ~$100 for a second 3TB drive to mirror your metadata?

Or was the inode problem not a local disk problem but a problem in the Lustre fs? I couldn't quite tell from the article.

pinewurst · 10y ago

It's almost a certainty that the MDS (metadata server) was situated on a mirrored RAID (prob RAID10). I'm guessing that the RAID system itself (software MDRAID or some HW array, DDN or something like a NetApp E-Series) corrupted the data under the FS that the MDS used, which I'm also assuming was XFS.

Lustre, for those who don't know it, is a cluster meta-filesystem, with separate metadata and object servers, each sitting on top of host file systems/RAID/storage.

ople · 10y ago

The metadata target (MDT) in the MDS is actually "ldiskfs" which is an enhanced version of ext4. One possibility may be to use ZFS in the future as the support in Lustre seems to be quite stable now.

It seems pretty impossible to find out the exact root cause in retrospect as the system was running for a long time without apparent issue. Any ideas are welcome though.

ople · 10y ago

It was filesystem-level corruption in Lustre. The underlying disk arrays and other hardware have comprehensive redundancy.

gnufx · 10y ago · 2 in thread

I'm surprised that the copying bottleneck seems to have been entirely at the target rather than the source. Is that because there were multiple copies of the source?

I've had to employ the horrible hack of iscsi from compute nodes, raided and re-exported, but it's not what I'd have tried to use first. The article doesn't mention the possibility of just spinning up a parallel filesystem on compute node local disks (assuming they have disks); I wonder if that was ruled out. I don't have a good feeling for the numbers, but I'd have tried OrangeFS on a good number of nodes initially.

By the way, it's been pointed out that RAM disk is relatively slow, if in the context of data rates rather than metadata <http://mvapich.cse.ohio-state.edu/static/media/publications/....

ople · 10y ago

The reading of the metadata required quite a lot of random access. We were fairly sure that if a high-end array and controller with fast disks is struggling with it, then a traditional clustered solution with slower node local disks would not fare much better. Thus we tried to find the solution which yields the highest possible IOPS.

gnufx · 10y ago

I misunderstood the bottleneck, not having had to do that. (Distributed metadata for the parallel filesystem could actually be tuned to be memory resident.)

beezle · 10y ago · 1 in thread

I bookmarked this for whenever I think I'm having a really bad day...

ople · 10y ago

Hehe.. In retrospect the whole team was in fairly good spirit although the situation was stressful. A lot of this was due to the top management giving the time and space for the specialists to do their thing and the very understanding response from the customers once we explained the situation.

pinewurst · 10y ago

It should be noted that this is about a Lustre filesystem hosted on DDN hardware. It's unclear whether the failed controller contributed to the file system corruption, but Lustre is quite capable of accelerating local entropy all by itself. It was designed/spec-ed at LLNL as huge file, high performance, short term scratch/swap and even after 15 years isn't especially reliable or fit for use outside that domain.
