- Our system needs to cope with millions of directories. Millions of directories for inotify mean a lot of structures in the kernel. For large numbers it can also mean gigabytes of ram. Add to that the mapping in userspace of file descriptors for rename operations.
- Using inotify would take a lot of time at startup to recurse into millions of subdirectories.
- inotify will not automatically watch new directories. You have to list all the files right after the creation of a directory and watch the subdirectories, etc. Not a problem at all, but way simpler with FUSE.
- If your system is not able to keep up with inotify events, you miss events because the kernel cannot buffer all the events. For us it's better to slow down the system but never miss an event.
- inotify is attached to the original filesystem. That means it's hard (not impossible) to handle loops in an active/active replication setup. Whereas with SFS, replication is done on straight to the original filesystem.
- If the inotify application crashes, you lose events because software keeps writing to the filesystem. If SFS crashes the mountpoint is unwritable by the application which reports an error and can switch to a different storage.
Some choices as you can read depends on the requirements, our requirements where not met by inotify.
https://lwn.net/Articles/604686/
https://lwn.net/Articles/605128/
Unfortunately the fanotify article is yet to come!
(Having said that, I certainly appreciate your pain trying to use inotify to track write events on an entire filesystem)
Our cases have also had very limited semantics (uploads that adds new files or modify files, but no deletions ever) that made things easier.
This seems a lot cleaner...
Though these days we mostly use glusterfs and most of those old apps have thankfully been retired or cleaned up.
The best introduction to orifs is their paper, which is linked from the above site.
I've longingly looked at Intermezzo/Coda a long time ago but it never went anywhere; played with block level replication (lvmsync), but it doesn't allow concurrent use; in the end, the only solution I had to fall back on was rsync (which needs to iterate over the entire directory structure, crazy expensive) and git and/or unison (both of which can't cope with many GBs).
I'm going to give orifs a try.
The only real issues I've found with unison are:
Works exclusively in terms of replica pairs. Still, I'd rather have a limited solid tool instead of a flaky perfect one.
Lack of inotify support (hacky half/embedded python stuff after the original author lost interest doesn't count).
3x disk access time for copying new files (hash source, transfer, hash dest). This could be reduced to 2x for similar consistency guarantees, but is really only an issue when creating new replicas, and can be fixed by copying the data first.
And for use as a 'filesystem', having to come up with file organization / wrapper scripts to choose what is replicated where, while still being sure what is a "full replica".
btsync (http://getsync.com) works well with at least 500GB of photo-sized files.
git-annex works well for large collections. I am using this for my own photo collection, which is about 450GB. I like it because you can do partial checkout.
I haven't tried it with a large collection but git-annex assistant should work well too, if you're interested in automatic sync.
I'm putting together the new https://neocities.org fileserver stuff right now, so I'll definitely be looking into this.
The current plan is to use hourly rsyncs, and then implement this (or some flavor of it): http://code.google.com/p/lsyncd/
RE inotify vs FUSE: The former is event-driven from an API, the latter I believe uses lower level blocks. Which one is better here is entirely debatable. Gluster uses a similar approach this does I believe. I'm not an expert on unix file APIs, so take all of this with a grain of salt.
The biggest reason we can't use Gluster replication is that if you request a file when replicating, it goes to ask all the servers if the file is on them, instead of just failing because it's not on the local system. That's fine for many things, but it's an instant DDoS if someone runs siege on the server and just blasts you with random bunk 404 requests. You can't cache your way out of that one. Apparently the performance for instant request for lots of small files can be pretty slow too.
SSHFS (and rsync using SSH) blow S3 out of the water on performance for remote filesystem work. The difference is pretty insane.
For your type of setup, you'd combine the "distribute" and "replicate" translators. The "distribute" translator spreads files out over sub-volumes by hashed path. Unless you specifically enable "lookup-unhashed" it will not forward requests for files that are not found to any other sub-volumes than the one matching the hash.
The underlying volumes would then be "replicate" and you'd there decide how many replicas you want.
There are other cluster translators than distribute too, that provides more complex distribution methods, but there you do run into overheads with determining which subvolume holds the files.
You'd also usually layer the io-cache translator in (which you can add on both the clients and the servers, and which provides in memory caching), so while there are possible pathological cases, Gluster is pretty resilient.
Also Gluster will replicate directly between servers to heal, but read/writes using the fuse fs will have the client talk directly to each backend and write in parallel to each member of the replicate set for that file, so there normally is no separate replication unless a server is/was down or a disk has crashed: normally, once a write is reported as complete, the file is on all replicas.
But frankly almost regardless of file storage solution, your best bet is still to put Nginx or another suitable reverse proxy in front of the file servers, so you can e.g. cache the hottest requested files in faster/more expensive ways (memcached, or on SSDs)
Would you run your fileservers on a publically accessible network? If not, then the above attack would require the attackers have access to your private, internal network and at that point it could be argued that you've already lost.
Of course, it's good to design defense-in-depth, so that you could survive a hit on your internal network as well. But I'm not sure how much sleep I'd lose over the possibility of a DDOS attack against a private network.
On the other hand, if your servers are on a private network, I'm not sure why you'd use SFS over something like NFS.
From Vagrant's documentation:
Vagrant can use rsync as a mechanism to sync a folder to the
guest machine. This synced folder type is useful primarily in
situations where other synced folder mechanisms are not
available, such as when NFS or VirtualBox shared folders aren't
available in the guest machine.
The rsync synced folder does a one-time one-way sync from the
machine running to the machine being started by Vagrant.
The disadvantage of the above is that it's a one-time, one-way sync. SFS would overcome this limitation, if I'm not mistaken.Though I would love to see FUSE+rsync (like this article mentions) as a default/standard option in Vagrant!
[1] https://github.com/mitchellh/vagrant/issues/3062
[2] http://www.midwesternmac.com/blogs/jeff-geerling/nfs-rsync-a...
Our requirements were two servers, the storage is several terabytes, and losing data because of an unknown bug in the filesystem wasn't an option.
Such filesystems are quorum-based and certainly need more than two servers with terabytes of data. More than what we could afford. Also if you lose quorum, the system hangs to ensure atomicity. But we don't need such properties.
Also after some testing, those filesystem resulted about 5-10% slower than traditional filesystems, with few nodes of course. They are designed to scale with many nodes.
Because of no locks, the last write wins according to rsync update semantics based on file timestamp + checksum. In case two files have the same mtime, rsync compares the checksum to decide which one wins.
In addition, I don't think lsyncd has been designed for an active/active setup, rather for a master-slave setup.
Any recommendations?
There doesn't seem to be anything else close to rsync's robustness or completeness of functionality - on Windows or Linux.
VM-1 - local NFS server that runs DBSD and Hammer filesystem with many nice features (auto snapshots, etc.) Will be fast, especially if the VM-1 is on the same physical host-node as my worker VMs.
VM-2 is remote, and receives the DBSD filesystems I send it. All snapshots etc. from the original FS are retained. If the connection is interrupted, the Hammer sync code will figure it out and restart from the latest successful transaction.
Even better, could it grab data as required from other machines?