* first post: http://blogs.intellique.com/tech/2010/12/22#dedupe
* detailed setup and benchmark results: http://blogs.intellique.com/tech/2011/01/03#dedupe-config
After more than 9 months running lessfs, I recommend it.
A really good paper that describes in detail how the deduplication works.
- Can dedup be integrated into the VFS layer, like unionfs is shooting for, or does it have to be integrated with the underlying filesystem?
- Is online dedup possible, and does the answer change when running on an SSD?
- What's the best granularity (block-level? inode-level? block extent-level?), and how badly can it randomize the I/O? I imagine one would have to do a lot of real-world benchmarking to find this out.
- Are there possible privacy issues (e.g. finding out through I/O patterns whether someone else has a given block or file stored), and how should they be dealt with?
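The granularity question can be made concrete with a toy content-defined chunker. The sketch below is entirely my own (function name, window size, and cut condition are illustrative assumptions, not any real dedup implementation): it cuts a chunk boundary wherever a rolling window checksum hits a magic value, so identical content yields identical chunks even when it is shifted within a file, which is what makes variable-size (extent-like) dedup more robust than fixed blocks:

```python
import random


def chunk_boundaries(data, window=16, mask_bits=6):
    """Toy content-defined chunking: cut a boundary wherever the checksum
    of the last `window` bytes has its low `mask_bits` bits all zero.
    Real systems use Rabin fingerprints or buzhash, but the idea is the
    same: boundaries depend on content, not on file offsets."""
    boundaries, start = [], 0
    for i in range(window, len(data) + 1):
        if i - start >= window:
            s = sum(data[i - window:i])          # naive window checksum
            if s & ((1 << mask_bits) - 1) == 0:  # "magic" low bits -> cut
                boundaries.append(i)
                start = i
    boundaries.append(len(data))
    return boundaries


random.seed(0)
blob = bytes(random.randrange(256) for _ in range(8192))
cuts = chunk_boundaries(blob)
print(len(cuts), "chunks from", len(blob), "bytes")
```

Insert a few bytes at the front of a file and fixed blocks all shift and re-hash differently; with content-defined cuts, everything after the next boundary still matches.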
I don't think it would be useful; I'm just interested in the level of "standard" data duplication.
"I was just toying around with a simple userspace app to see exactly how much I would save if I did dedup on my normal system, and with 107 gigabytes in use, I'd save 300 megabytes."
It's a relatively small amount. Then again - you're storing 300MB of exactly the same blocks of data... Unless they're manual backup files, this looks like a big waste to me.
I then decided to disable dedup, because it comes at a cost: the checksum data (which would mostly live on the SSD read cache I had attached) was taking up about three times as much money's worth of SSD space as the duplicate data was taking up in conventional disk space.
I noticed that the opendedup site (linked from the article) claims a much lower volume of checksum data relative to the number of files, perhaps an order of magnitude less than I observed with ZFS, but they seem to achieve that by using a fixed 128KB block size, which brings along its own waste. (ZFS uses a variable block size.) I haven't actually done the numbers here, but I wouldn't be at all surprised to find that for my data, the 128KB block size would cost as much disk space as dedup was saving me. (YMMV, of course.)
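The block-size trade-off is easy to put rough numbers on. In this back-of-envelope sketch, the 320-byte per-entry cost and the 100 GiB dataset are my own assumptions (roughly in line with figures commonly quoted for ZFS DDT entries, not measurements):

```python
GIB = 1 << 30
dataset = 100 * GIB   # assumed dataset size
entry_bytes = 320     # assumed per-block dedup-table entry cost

for block_kib in (8, 128):
    blocks = dataset // (block_kib * 1024)
    table = blocks * entry_bytes
    print(f"{block_kib:>3} KiB blocks: {blocks:,} entries, "
          f"~{table / GIB:.2f} GiB of checksum table")
```

Going from 8 KiB to 128 KiB blocks means 16x fewer entries and a 16x smaller table, which is about the order-of-magnitude difference described; the flip side is that 128 KiB granularity cannot see any duplicate smaller than 128 KiB.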
I would recommend using zfs-fuse. You avoid the FUSE -> file on a filesystem -> hard disk indirection (and thus gain speed), and additionally you get all the cool ZFS features! If you need even more speed, there is a ZFS kernel module for Linux and a dedup patch for btrfs, but I don't think those are production-ready yet.
I wonder if there is a way to improve on that?