* first post: http://blogs.intellique.com/tech/2010/12/22#dedupe
* detailed setup and benchmark results: http://blogs.intellique.com/tech/2011/01/03#dedupe-config
After more than 9 months running lessfs, I recommend it.
Really good paper which describes in detail how the deduplication works.
- Can dedup be integrated into the VFS layer, like unionfs is shooting for, or does it have to be integrated with the underlying filesystem?
- Is online dedup possible, and does the answer change when running on SSDs?
- What's the best granularity (block-level? inode-level? block extent-level?) and how badly can it randomize the i/o. I imagine one would have to do a lot of real-world benchmarking to find this out.
- Are there possible privacy issues (e.g. finding out through i/o patterns whether someone else already has a given block or file stored), and how would one deal with them?
I don't think it would be useful; I'm just interested in the level of duplication in "standard" data.
"I was just toying around with a simple userspace app to see exactly how much I would save if I did dedup on my normal system, and with 107 gigabytes in use, I'd save 300 megabytes."
It's a relatively small amount. Then again - you're storing 300MB of exactly the same blocks of data... Unless they're manual backup files, this looks like a big waste to me.
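For anyone curious, an estimate like the one quoted above is easy to reproduce with a short script that hashes fixed-size blocks and counts how many bytes land in blocks it has already seen. This is only a rough sketch: the 4 KiB block size, SHA-256, and the command-line handling are my own choices, not whatever the quoted app used, and block alignment means a real filesystem's dedup would report a somewhat different number.

```python
#!/usr/bin/env python3
# Rough dedup-savings estimator: hash every fixed-size block under a root
# directory and report how many bytes sit in blocks already seen elsewhere.
# BLOCK_SIZE and the default root path are assumptions for illustration.
import hashlib
import os
import sys

BLOCK_SIZE = 4096  # bytes; pick whatever granularity you want to model

def estimate(root):
    seen = set()
    total = 0
    duplicate = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, 'rb') as f:
                    while True:
                        block = f.read(BLOCK_SIZE)
                        if not block:
                            break
                        total += len(block)
                        digest = hashlib.sha256(block).digest()
                        if digest in seen:
                            duplicate += len(block)
                        else:
                            seen.add(digest)
            except OSError:
                continue  # unreadable or special files are simply skipped
    return total, duplicate

if __name__ == '__main__':
    total, duplicate = estimate(sys.argv[1] if len(sys.argv) > 1 else '.')
    print(f"scanned {total / 2**30:.1f} GiB, "
          f"duplicate blocks: {duplicate / 2**20:.1f} MiB")
```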
I then decided to disable the dedup, because it comes at a cost: the checksum data (which would mostly be living on the SSD read cache I had attached) was occupying SSD space worth roughly three times as much money as the conventional disk space the duplicate data was occupying.
I noticed that the opendedup site (linked from the article) claims a much lower volume of checksum data relative to the number of files, perhaps an order of magnitude less than I observed with ZFS, but they seem to achieve that by using a fixed 128KB block size, which brings along its own waste. (ZFS uses a variable block size.) I haven't actually done the numbers here, but I wouldn't be at all surprised to find that for my data, the 128KB block size would cost as much disk space as what dedup was saving me. (YMMV, of course.)
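To make that trade-off concrete, here is the sort of back-of-envelope arithmetic involved, reusing the 107 GB / 300 MB figures from the quote above for scale. Every other constant (bytes of dedup-table metadata per block, the $/GiB prices, the 128 KB block size) is an assumption plugged in purely for illustration, not a measurement from either system.

```python
# Back-of-envelope for the dedup-table-on-SSD trade-off described above.
# All constants below are illustrative assumptions, not measured values.
unique_blocks = 107 * 2**30 // (128 * 2**10)  # ~107 GB of data in 128 KB blocks
entry_bytes   = 320                            # assumed dedup-table metadata per block
table_bytes   = unique_blocks * entry_bytes

saved_bytes   = 300 * 2**20                    # the 300 MB of duplicates from the quote

ssd_per_gib   = 2.00                           # assumed $/GiB for the SSD holding the table
hdd_per_gib   = 0.05                           # assumed $/GiB for the disks holding the data

table_cost  = table_bytes / 2**30 * ssd_per_gib
saved_value = saved_bytes / 2**30 * hdd_per_gib

print(f"dedup table ~{table_bytes / 2**20:.0f} MiB, ~${table_cost:.2f} worth of SSD")
print(f"duplicates saved 300 MiB, ~${saved_value:.2f} worth of disk")
```

With numbers like these the table ends up roughly as large as the savings, so once it has to live on SSD it can easily cost more than the duplicate data it removes, which matches the experience described above.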
I'm puzzled why people in general aren't more worried about data corruption due to hash collisions...
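The usual answer is the birthday bound: with a 256-bit hash like the SHA-256 that ZFS uses for dedup, the probability of an accidental collision stays astronomically small even at petabyte scale. A quick sketch of the arithmetic (the 1 PiB data size and 4 KiB block size are just assumptions for illustration):

```python
# Birthday-bound estimate of accidental hash collisions in a dedup table.
# With an n-bit hash and k distinct blocks, P(collision) <= k*(k-1) / 2**(n+1).
from math import log2

hash_bits = 256                  # e.g. SHA-256
blocks    = 2**50 // 2**12       # assumed: 1 PiB of data in 4 KiB blocks = 2**38 blocks

p_upper = blocks * (blocks - 1) / 2**(hash_bits + 1)
print(f"collision probability upper bound ~ 2^{log2(p_upper):.0f}")
# prints roughly 2^-181 -- far smaller than the chance of undetected
# disk or memory corruption handing back the wrong block anyway
```

And for anyone who still doesn't trust the math, ZFS offers dedup=verify, which does a byte-for-byte comparison whenever two block hashes match.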
I would recommend using zfs-fuse. You don't have the FUSE -> file on a filesystem -> hard disk indirection (thus more speed), and additionally you get all the cool ZFS features! If you need even more speed, there is a ZFS kernel module for Linux and a dedup patch for btrfs; I don't think those are production ready, though.
I wonder if there is a way to improve on that?