There needs to be some law about how temporary directories always end up containing vitally important data.
Perhaps the solution is some Netflix-style chaos monkey that randomly deletes files ;) Or, for each user over their soft quota, delete the oldest files until they are back under the quota. Or something like that.
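For what it's worth, the "delete the oldest files until under the quota" idea could be sketched roughly like this. It is only an illustration, not anything CSC runs: the function name `trim_to_quota`, the use of mtime as the definition of "oldest", and the `dry_run` flag are all assumptions of mine.

```python
import os
from pathlib import Path


def trim_to_quota(root, quota_bytes, dry_run=True):
    """Pick the oldest files (by mtime) under `root` until usage fits the quota.

    Hypothetical helper: returns the list of paths selected for deletion and
    only actually deletes them when dry_run is False.
    """
    files = []
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = Path(dirpath) / name
            try:
                st = path.lstat()  # do not follow symlinks
            except OSError:
                continue  # file vanished while we were scanning
            files.append((st.st_mtime, st.st_size, path))

    total = sum(size for _mtime, size, _path in files)
    removed = []
    # Oldest modification time first.
    for _mtime, size, path in sorted(files, key=lambda f: f[0]):
        if total <= quota_bytes:
            break
        if not dry_run:
            path.unlink()
        total -= size
        removed.append(path)
    return removed
```

Running it with `dry_run=True` first just lists what would go, which seems prudent given how often "temporary" files turn out to be vitally important.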
The problem with a temporary solution is that it makes the problem go away, so suddenly there is no longer any incentive to fix it properly.
Article title: The largest unplanned outage in years and how we survived it
Article overview: A month ago CSC's high-performance computing services suffered the largest unplanned outage in years. In total approximately 1.7 petabytes and 850 million files were recovered.
Although technically correct, the HN title is misleading.
It was an interesting read, so thumbs up for the dramatization.
Given that the latter statement is from the article, how is the former "technically correct"?
One obvious solution would be to use a ramdisk, a virtual disk that actually resides in the memory of a node. The problem was that even our biggest system had 1.5TB of memory while we needed at least 3TB.
As a workaround we created ramdisks on a number of Taito cluster compute nodes, mounted them via iSCSI over the high-speed InfiniBand network to a server and pooled them together to make a sufficiently large filesystem for our needs.
A hack they weren't at all sure would work, but it did nicely.
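As a back-of-the-envelope check on how many nodes such a pooled ramdisk needs, here is a tiny sketch. The article only gives the roughly 3TB target; the per-node ramdisk size below is a made-up figure for illustration, and `nodes_needed` is a hypothetical helper name.

```python
def nodes_needed(target_gib, ramdisk_per_node_gib):
    """Smallest number of nodes whose ramdisks together reach the target size."""
    if ramdisk_per_node_gib <= 0:
        raise ValueError("per-node ramdisk size must be positive")
    # Ceiling division using integers only.
    return -(-target_gib // ramdisk_per_node_gib)


# Assumption: each node can spare 128 GiB for a ramdisk (not from the article).
print(nodes_needed(3 * 1024, 128))
```

Any redundancy or filesystem overhead on top of the pooled devices would push the count higher.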
Or was the inode problem not a local disk problem but a problem in the Lustre fs? I couldn't quite tell from the article.
Lustre, for those who don't know it, is a cluster meta-filesystem, with separate metadata and object servers, each sitting on top of host file systems/RAID/storage.
It seems pretty impossible to find out the exact root cause in retrospect as the system was running for a long time without apparent issue. Any ideas are welcome though.
I've had to employ the horrible hack of iSCSI from compute nodes, RAIDed and re-exported, but it's not what I'd have tried first. The article doesn't mention the possibility of just spinning up a parallel filesystem on compute-node local disks (assuming they have disks); I wonder if that was ruled out. I don't have a good feeling for the numbers, but I'd have tried OrangeFS on a good number of nodes initially.
By the way, it's been pointed out that a RAM disk is relatively slow, albeit in the context of data rates rather than metadata <http://mvapich.cse.ohio-state.edu/static/media/publications/....