A corrupted file can manifest itself in many ways. But ultimately, it has to manifest itself as a business logic error, e.g. you increased the balance on one entity but not on another, causing the sum of balances to change (a corruption).
Thus, any discussion of file corruption on a file system that doesn't support transactions requires every application to use a competent database underneath (at least SQLite).
And even with transactional support in the file system, or with a database, you still need every application to have correct business logic that performs its transactions correctly.
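Here is a minimal sketch of the balance example, assuming a hypothetical two-row `accounts` table in SQLite. Both updates run inside one transaction, so if the operation fails between the debit and the credit, the debit is rolled back and the sum of balances does not change.

```python
import sqlite3

def open_ledger() -> sqlite3.Connection:
    # In-memory database with a hypothetical two-account ledger totalling 100.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 0)])
    conn.commit()
    return conn

def transfer(conn: sqlite3.Connection, src: str, dst: str, amount: int,
             fail_midway: bool = False) -> None:
    # `with conn:` commits if the block succeeds and rolls back if an exception escapes.
    with conn:
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src))
        if fail_midway:
            # Simulate a crash between the two halves of one business operation.
            raise RuntimeError("simulated failure")
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst))

def total(conn: sqlite3.Connection) -> int:
    return conn.execute("SELECT SUM(balance) FROM accounts").fetchone()[0]

conn = open_ledger()
transfer(conn, "a", "b", 30)
try:
    transfer(conn, "a", "b", 50, fail_midway=True)
except RuntimeError:
    pass
print(total(conn))  # 100
```

The database only guarantees atomicity for what the application groups into one transaction. If `transfer` had committed after the debit, a crash would still leave the ledger inconsistent.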
All in all, correct backup cannot be solved universally without knowing every application. The best we can do is to probabilistically avoid obvious issues, e.g. by using FS-level snapshots.
Whereas non-snapshot backups don't have that property; the material in the backup is not necessarily identical to any past state of the filesystem that existed. It's something like a past state plus random roll-backs of files, or portions of files, from multiple previous states.
Guess which of these situations applications are much more likely to be able to recover from (if any at all)?
When people write crash recovery code, they typically assume that the world simply stopped, not that it stopped, and then some data was randomly rewound to unspecified older states.
I don't think that you can reasonably defend against data that is restored from a backup, where the oldest part of the backup is an hour older than the newest.
A backup is like a raster-scan image of a fast-moving object. What should be a rectangular train car looks like a parallelogram: it's not a picture of any scene that existed. Now imagine that the raster lines are randomly sampled (not a progressive scan, or even interlaced), and try to recover a sane image.
If your application relies on persisted state spread across multiple files (for example, an append-only log and a database snapshot), it can make reasonable assumptions about when each file's state is committed, such as that the database snapshot is `fsync`ed before the append-only log is started. Ideally, even on a filesystem without transaction support, this order is preserved in time.
However, it may not be preserved by a "file-to-file" backup system, because the backup can copy the append-only log file before the database snapshot. That could result in an "ABA" problem, where the append-only log in the backup corresponds to an older database snapshot than the one next to it, potentially causing issues.
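One way an application could at least detect this mismatch is sketched below. It assumes a hypothetical layout in which the snapshot records a generation number and the first line of the log names the generation it extends. Recovery refuses to replay a log whose base generation does not match the snapshot, which is exactly the pairing a file-by-file copy can produce.

```python
import json
import os
import tempfile

def _fsync_write(path: str, mode: str, text: str) -> None:
    with open(path, mode) as f:
        f.write(text)
        f.flush()
        os.fsync(f.fileno())

def write_snapshot(dirpath: str, generation: int, state: dict) -> None:
    # Write a temp file, fsync it, then atomically replace the old snapshot.
    # (A fully durable version would also fsync the directory after the rename.)
    tmp = os.path.join(dirpath, "snapshot.json.tmp")
    _fsync_write(tmp, "w", json.dumps({"generation": generation, "state": state}))
    os.replace(tmp, os.path.join(dirpath, "snapshot.json"))

def start_log(dirpath: str, generation: int) -> None:
    # Only called after the snapshot is durable; the header names the snapshot it extends.
    _fsync_write(os.path.join(dirpath, "log.jsonl"), "w",
                 json.dumps({"base_generation": generation}) + "\n")

def append(dirpath: str, key: str, value) -> None:
    _fsync_write(os.path.join(dirpath, "log.jsonl"), "a",
                 json.dumps({"key": key, "value": value}) + "\n")

def recover(dirpath: str) -> dict:
    with open(os.path.join(dirpath, "snapshot.json")) as f:
        snap = json.load(f)
    with open(os.path.join(dirpath, "log.jsonl")) as f:
        header, *entries = [json.loads(line) for line in f]
    if header["base_generation"] != snap["generation"]:
        # The log and the snapshot were copied from different points in time.
        raise ValueError("log does not extend this snapshot")
    state = dict(snap["state"])
    for entry in entries:
        state[entry["key"]] = entry["value"]
    return state

with tempfile.TemporaryDirectory() as d:
    write_snapshot(d, 1, {"x": 1})
    start_log(d, 1)
    append(d, "y", 2)
    print(recover(d))  # {'x': 1, 'y': 2}
```

Detection is not repair. After a mismatched restore, the application still has to choose between the snapshot alone and a manual reconciliation.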
That being said, is this (multi-file persisted state) a common scenario for applications? (I believe SQLite handles this particular ordering issue fine.) I am not sure; I just want to call out that filesystem snapshots solve a very particular problem.
If your application relies on only one file for its state, or its files are independent of each other (.xlsx, .docx, .markdown, .mp4 or .jpg files), it is a non-issue. And if you need to back up a database, you'd better do that with the tools the database provides.
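For SQLite, Python's standard `sqlite3` module exposes SQLite's online backup API as `Connection.backup()`. The sketch below uses hypothetical file paths and copies a consistent image of a live database instead of reading the file byte by byte.

```python
import sqlite3

def backup_sqlite(src_path: str, dst_path: str) -> None:
    # Connection.backup() copies pages through SQLite itself rather than reading
    # the file directly. With the default pages=-1 the whole copy is one step; with
    # smaller steps, SQLite restarts the copy if another connection modifies the
    # source between steps, so the result should be a consistent image.
    src = sqlite3.connect(src_path)
    dst = sqlite3.connect(dst_path)
    try:
        src.backup(dst)
    finally:
        dst.close()
        src.close()
```

Usage would be something like `backup_sqlite("app.db", "app-backup.db")`, followed by backing up `app-backup.db` with whatever file-level tool you already use.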
The backup process opens some N-byte file and starts copying it. Some portion of the file is backed up until byte K. At that point the application writes a transaction whereby some of the data is written above K, and some below K. The backup continues and backs up the half of the transaction above K, combining that with the old data below K from before the transaction happened.
If the application structures its scattered write transactions such that they write in increasing offset order, then maybe it's okay.
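The scenario above can be simulated in a few lines. This is a toy model with a hypothetical 8-byte "file" and K = 4. It copies a byte array front to back in chunks while an "application" transaction rewrites both halves once the copy has passed K. The resulting backup matches neither the old state nor the new one.

```python
def naive_backup(live: bytearray, chunk: int, between_chunks=None) -> bytes:
    # Copy the "file" front to back in fixed-size chunks, like a file-level backup,
    # calling between_chunks(offset) after each chunk to let the application run.
    out = bytearray()
    offset = 0
    while offset < len(live):
        out += live[offset:offset + chunk]
        offset += chunk
        if between_chunks is not None:
            between_chunks(offset)
    return bytes(out)

old_state = b"AAAABBBB"
live = bytearray(old_state)

def transaction(offset: int) -> None:
    # Hypothetical transaction that lands once the backup has reached K = 4 and
    # rewrites data on both sides of K.
    if offset == 4:
        live[0:4] = b"aaaa"
        live[4:8] = b"bbbb"

backup = naive_backup(live, chunk=4, between_chunks=transaction)
print(backup)  # b'AAAAbbbb': neither the old state nor the new one
```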
The idea is that you'll only encounter corruptions that the application could already hit due to crashes or power outages (and hence hopefully supports recovering from). For example, with naive reads, you might:
read the first half of the file
context switch to the application
application writes to the first half of the file
application fsyncs previous writes
application writes to the second half of the file
context switch back to you
read the second half of the file
and end up with data in the second half of the file that the application normally only writes after it's sure that the corresponding data has been written to the first half.

I don't need to back up my Discord, and most likely I will be able to simply restore any iTerm configuration that I have in less time than setting up and keeping file system snapshots running.
Any DEVONthink databases, Excel spreadsheets, or invoices/documents that I care about I can simply back up after I am done working on them. I usually work on one or two at a time. When my laptop dies, I can probably just remember what needed to be done and redo the work.
Restoring a file system snapshot would usually be much more hassle than filling in a single invoice again or downloading it from some provider again.
For web applications there is usually an SLA that specifies how much data can be lost, like 1 hour or 30 minutes, but no one will realistically guarantee "no data lost", because imagine doing a full snapshot of a 1 TB drive; I suppose it takes at least an hour anyway.
> because imagine doing a full snapshot of a 1 TB drive; I suppose it takes at least an hour anyway.
On copy-on-write filesystems, snapshots are nearly instantaneous.
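Here is a toy model of why copy-on-write snapshots are cheap. It is not how any particular filesystem implements them. Blocks are never overwritten in place, so a "snapshot" only has to freeze the table of block pointers and copies no data. Real CoW filesystems like ZFS go further and essentially just record the current root of the block tree.

```python
from typing import Optional

class CowVolume:
    """Toy copy-on-write volume: blocks are immutable, the table maps index -> block id."""

    def __init__(self, nblocks: int):
        self.blocks = {0: b""}      # block 0 is a shared empty block
        self.next_id = 1
        self.table = [0] * nblocks

    def write(self, index: int, data: bytes) -> None:
        # Never overwrite in place: store the data in a new block and repoint the table.
        self.blocks[self.next_id] = data
        self.table[index] = self.next_id
        self.next_id += 1

    def snapshot(self) -> tuple:
        # Freeze the current pointer table; no block data is copied.
        return tuple(self.table)

    def read(self, index: int, snap: Optional[tuple] = None) -> bytes:
        table = self.table if snap is None else snap
        return self.blocks[table[index]]

vol = CowVolume(nblocks=4)
vol.write(0, b"v1")
snap = vol.snapshot()
vol.write(0, b"v2")
print(vol.read(0), vol.read(0, snap))  # b'v2' b'v1'
```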
> time than setting up and keeping file system snapshots running
As shown in the code snippet, pretty much all it takes is a few snapshot commands around the backup command and changing the source directory to the snapshot mount point.
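Here is a sketch of that idea, assuming ZFS and Borg 1.x (`REPO::ARCHIVE` syntax). The dataset, mount point, repository and tag are hypothetical. The sketch builds the command sequence: snapshot, back up from the read-only `.zfs/snapshot/<tag>` directory, then destroy the snapshot. It only executes the commands when `dry_run` is false.

```python
import subprocess

def snapshot_backup_plan(dataset: str, mountpoint: str, repo: str, tag: str) -> dict:
    # All names are illustrative; adjust to your pool layout and Borg repository.
    snap = f"{dataset}@{tag}"
    return {
        "snapshot": ["zfs", "snapshot", snap],
        # Run borg from inside the read-only snapshot directory so the archived
        # paths are relative ("."), not prefixed with .zfs/snapshot/<tag>.
        "backup": ["borg", "create", f"{repo}::{tag}", "."],
        "backup_cwd": f"{mountpoint.rstrip('/')}/.zfs/snapshot/{tag}",
        "destroy": ["zfs", "destroy", snap],
    }

def run(plan: dict, dry_run: bool = True) -> None:
    if dry_run:
        print(" ".join(plan["snapshot"]))
        print(f"(cd {plan['backup_cwd']} && {' '.join(plan['backup'])})")
        print(" ".join(plan["destroy"]))
        return
    subprocess.run(plan["snapshot"], check=True)
    try:
        subprocess.run(plan["backup"], cwd=plan["backup_cwd"], check=True)
    finally:
        # Drop the snapshot whether or not the backup succeeded.
        subprocess.run(plan["destroy"], check=True)

run(snapshot_backup_plan("tank/home", "/home", "/mnt/backup/borg", "nightly"))
```

The repository path should be absolute, because borg runs with the snapshot directory as its working directory.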
I also think about snapshots differently.
For me a snapshot is an actual copy stored on a different hard drive or other medium, so the copy operation is never going to be instant.
The scenario in the article is power loss, so that is also not something I worry about that much. Mostly I worry about hardware failure, where I would not be able to turn on my laptop/server again. For power loss, a UPS or my laptop's battery is my go-to solution.
Not that I've gotten around to writing that script for myself :c
But the automatic snapshots are much easier c:
If you don't back up your Discord, what will you do when Discord changes how it works or stops working?
Easy when you work on simple documents. I wouldn't want my accountant doing that.
Now, that being said, if you have the ability to set up backups where you can minimize the file system activity, you might be slightly better off doing it, but the ROI is probably fairly low unless it's extremely trivial to set that up for everything that's running.
The term of art is "crash consistent," and any ACID database must preserve all committed state across events such as power loss. Such a database is correctly backed up by copying a simultaneous point-in-time snapshot across all involved volumes.
Not all databases are truly ACID. Lots of software relies on uncommitted database state. But we're talking about a solved problem here; if you require ACID behavior, the means to achieve exactly that are available. Any exception to that statement, including hardware misfeatures or lack of two-phase commit across databases, is equivalent to "incorrectly designed."
In a correctly designed system quiescing the database isn't necessary, but might still be used as a precaution or a performance optimization.
The example of the browser profile is a classic case.
Imagine multiple writes occur and all of them have to complete for the profile to be correct (either multiple files are being written, or Firefox writes multiple blocks to one file). If the snapshot is taken in the middle, the snapshot will be "corrupt".
I believe this is actually the entire point of Windows' Volume Shadow Copy Service (which is sort of pooh-poohed in the article): to let applications tell the snapshot mechanism "wait, I'm in the middle of a file system transaction", finish that in-progress transaction, and then pause writes until the snapshot has been taken.
Without such a mechanism, you are always going to be at risk with snapshots.
In https://www.cs.columbia.edu/~nieh/pubs/sosp2007_dejaview.pdf we avoided this problem, without having to modify applications to use such a service, by combining 2 mechanisms:
1) we used a log-structured file system (which was inherently a snapshot; every log entry was individually mountable), and 2) we used a checkpoint/restart mechanism that saved process state and let us restart the processes together with the file system state as it was at checkpoint time. (Checkpointing would also sync all dirty pages to disk, and the fs state after that sync is what we tied to the checkpoint state.)
So when a process was resumed, the file system looked exactly as the process expected it to, even if the process was in the middle of what could be called a transaction. But that only worked because the processes were restored along with the file system; if we had only restored the file system, it could have been inconsistent.
Former maintainer of VSS here. Yes, that's exactly right. In fact, filesystem snapshots are usually not enough for true application level consistency. As others have noted elsewhere, a filesystem snapshot is the equivalent to yanking the power cord out of the back of your computer. It's good, assuming your filesystem does atomic writes / copy-on-write / write-to-new. But we can do even better.
Imagine I have two databases - one traditional relational DB storing my app content, and a second log database. You want to keep these in sync. Well, good luck doing this with the filesystem alone. Usually your DBMS will need to be involved as well, and this is where the VSS "writer" concept comes in. When a snapshot is being taken, applications such as SQL Server will be invited to participate. Typically, this means they'll start to hold up writes so that things will be quieter for the snapshot. But they will also have a chance, after the fact, to actually clean up the snapshot itself. In this case, the DBMS could roll back the log database to match the content database.
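Here is a rough sketch of that writer protocol in Python, to show the order of events. It is not the VSS API: the `Writer` interface, its method names and the log-trimming writer are all hypothetical. Writers freeze, the snapshot is taken, writers thaw, and then each writer gets a chance to fix up the snapshot copy. In this example, that means trimming the log back to what the content database in the snapshot actually contains.

```python
import copy
from typing import Callable, Iterable, Protocol

class Writer(Protocol):
    # Hypothetical interface, loosely modeled on the freeze/thaw/post-snapshot
    # phases described above; not the actual VSS COM API.
    def freeze(self) -> None: ...
    def thaw(self) -> None: ...
    def post_snapshot(self, snap: dict) -> None: ...

def coordinated_snapshot(writers: Iterable[Writer], volume: dict,
                         take: Callable[[dict], dict]) -> dict:
    writers = list(writers)
    frozen = []
    try:
        for w in writers:
            w.freeze()            # ask each writer to hold its new writes
            frozen.append(w)
        snap = take(volume)       # the snapshot sees a quiet volume
    finally:
        for w in reversed(frozen):
            w.thaw()              # resume writes as soon as possible
    for w in writers:
        w.post_snapshot(snap)     # writers fix up the snapshot copy, not the live data
    return snap

class LogTrimmingWriter:
    """Hypothetical writer for a log DB whose entries reference content rows by number."""
    def __init__(self) -> None:
        self.frozen = False
    def freeze(self) -> None:
        self.frozen = True
    def thaw(self) -> None:
        self.frozen = False
    def post_snapshot(self, snap: dict) -> None:
        # Roll the snapshot's log back so it never refers to content the snapshot lacks.
        snap["log"] = [n for n in snap["log"] if n <= len(snap["content"])]

volume = {"content": ["row1", "row2"], "log": [1, 2, 3]}   # log ran ahead of content
snap = coordinated_snapshot([LogTrimmingWriter()], volume, copy.deepcopy)
print(snap["log"], volume["log"])  # [1, 2] [1, 2, 3]
```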
It's correct that NTFS doesn't support snapshots natively, but Windows has the volsnap.sys driver that takes care of it. For all intents and purposes, NTFS does support copy-on-write snapshots.
Complicated? Sure. But it was quite capable, and actually a pretty cool (but sadly underappreciated) piece of technology.
As an intermediate step, it may be practical to consider shutting down your application, flush all writes (e.g. via sync command), and taking the file system snapshot then, restarting the application, and then taking the backup from that file system snapshot. At least then your critical data should have a consistent and safe backup, even if the rest of the OS _may_ be in a suspect state.
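Here is a sketch of that ordering. Hypothetical callables stand in for the real commands, such as stopping a service, running `sync`, and creating a ZFS or LVM snapshot. The main design choice is that the application is restarted in a `finally` block right after the snapshot. Its downtime then covers only the snapshot, not the much slower backup.

```python
from typing import Callable

def quiesced_snapshot_backup(
    stop_app: Callable[[], None],
    flush: Callable[[], None],
    take_snapshot: Callable[[], str],
    start_app: Callable[[], None],
    backup_from: Callable[[str], None],
) -> str:
    stop_app()
    try:
        flush()                 # e.g. os.sync() or the `sync` command
        snap = take_snapshot()  # returns a name/path for the frozen view
    finally:
        start_app()             # restart even if flushing or snapshotting failed
    backup_from(snap)           # the slow copy runs while the app is live again
    return snap

events = []
quiesced_snapshot_backup(
    stop_app=lambda: events.append("stop"),
    flush=lambda: events.append("sync"),
    take_snapshot=lambda: events.append("snapshot") or "snap-1",
    start_app=lambda: events.append("start"),
    backup_from=lambda s: events.append(f"backup {s}"),
)
print(events)  # ['stop', 'sync', 'snapshot', 'start', 'backup snap-1']
```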
If you're technical enough to split OS and application data onto separate volumes on your local drive, a volume snapshot can potentially help, but you still have registry or other issues.
The best way to handle backups, from the perspective of a backup company CEO, is to back up your critical data, and that could include application configurations.
Browsers support exporting your profile, and good backup software lets you run pre-backup and post-backup job scripts, so you can export your browser profile, lock certain files or folders during a backup, etc. to get a good recovery point while minimizing the risk of data corruption.
This is a way longer topic than fits in a comment here. If you do want full system state, pick a hypervisor (Parallels, etc.) and then back up the full VM each time you power the VM down. It'll make for much larger backups, so restores are slower, and storing all the recovery points will cost you a bit more too.
I use snapshots all the time because Windows uses them automatically to get complete and consistent backups.
Snapshots are always better than no snapshots.
Full backups are always safer than partial backups.
It takes just ONE forgotten path or file to make a backup completely impossible to restore.
Don’t be a smart ass with backups. Just don’t.
A hypothetical example: you have deeply embedded malware and you're unsure when it infected the system; it could have been sitting latent for months. In this situation you have to either restore back and lose months of application data, or restore a recent snapshot and then try to remove the malware in a clean room. It is much safer and easier to start from a clean OS and then restore only the data, which is much simpler to scan for infection and sanitize.
While I am sure the process has limitations, I think that modern Windows has some amount of flexibility there, because due to a defect I've recently had to swap out my mainboard + CPU, and the system booted up just fine. The most major issue was that with the default drivers Windows had reverted to, the Ethernet controller didn't work, which could have been a little bit of a Catch-22 (no internet without the correct drivers, and no correct drivers without internet), but in practice I just downloaded the drivers on a different device and transferred them over via USB.
For someone with your technical skill level it's very manageable, most folks don't know what a mainboard is.
To an average user it's much easier to work with restoring their My Documents (and if they need full system state have Parallels save VMs under that folder path).
But also, if you have no backups, incorrect ragged backups of only non-changing files are a _hell_ of a lot better than no backups at all.
And if you get in the habit of doing incremental backups every day, you might be lucky enough to find a non-corrupted version of a file that's a few days stale.
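With daily versions available, picking the newest copy that still passes a validity check can be automated. The sketch assumes each version is available as a `(date, bytes)` pair and uses JSON parsing as a stand-in validator. A real check would be format-specific (for SQLite, something like `PRAGMA integrity_check`). The dates and contents below are made up.

```python
import json
from datetime import date
from typing import Callable, Iterable, Optional, Tuple

Version = Tuple[date, bytes]

def is_valid_json(data: bytes) -> bool:
    # Stand-in validator: a torn or truncated JSON file fails to parse.
    try:
        json.loads(data)
        return True
    except ValueError:
        return False

def newest_valid(versions: Iterable[Version],
                 valid: Callable[[bytes], bool] = is_valid_json) -> Optional[Version]:
    # Walk from the newest version to the oldest and return the first that validates.
    for when, data in sorted(versions, key=lambda v: v[0], reverse=True):
        if valid(data):
            return when, data
    return None

versions = [
    (date(2024, 5, 1), b'{"todo": 3}'),
    (date(2024, 5, 2), b'{"todo": 4}'),
    (date(2024, 5, 3), b'{"todo": 5, "do'),   # hypothetical torn copy
]
print(newest_valid(versions))  # (datetime.date(2024, 5, 2), b'{"todo": 4}')
```

A parse check only catches structural damage. A file that is well-formed but logically inconsistent will still pass.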
Eventually I realized that some backups are better than no backups at all, and prioritized getting _something_ in place to at the very least have important files copied regularly to another drive somewhere.
It depends a lot on your use case.
If you're backing up a production database system that's in constant use, then the files are almost certain to change while the backup is in progress. And a backup of everything on the database server except the data probably isn't what you want!
On the other hand, if I'm backing up my personal computer, my family photos aren't getting regularly overwritten - and if the backup of my web browser cache is inconsistent? So be it.
I've not been seeing any of this much-vaunted reliability. 3 out of 3 of the last drives I lost data on were ZFS. Meanwhile NTFS and ext4 have been fairly trouble-free.
Though to be fair, it's hard to tell how much of that is TrueNAS vs ZFS.
That's what I thought given that it does have a good rep.
Haven't given up on it yet, but definitely less trusting - 3 different drives in 3 different ways wasn't expected.
That said - all consumer class gear in janky environments so there may very well be other factors lurking.
So by all means, play with it but don’t consider it to be stable.
https://bugs.launchpad.net/ubuntu/+source/ubiquity/+bug/1966...
https://code.launchpad.net/~ubuntu-installer/ubiquity/+git/u...
Edit: fixed link.
It recursively snapshots mounted pools, and recursively mounts snapshots of the mounted datasets into a target ready to point your backup tools at. I do so via a chroot so I didn't need to make any changes to my Borg setup - just to how I run it.
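Here is a sketch of the mounting half of that approach, assuming Linux and ZFS. The snapshot itself would come from a recursive `zfs snapshot -r`. Given the tab-separated output of `zfs list -H -o name,mountpoint`, the code plans one `mount -t zfs <dataset>@<snap> <target><mountpoint>` per mounted dataset, parents before children, so the tree under the target mirrors the live layout. The target directory, snapshot name and dataset names are hypothetical. The commands are returned rather than executed, and the target directory has to exist beforehand.

```python
from typing import List, Tuple

def parse_zfs_list(output: str) -> List[Tuple[str, str]]:
    # Parse `zfs list -H -o name,mountpoint` output (one tab-separated row per dataset).
    rows = []
    for line in output.splitlines():
        name, mountpoint = line.split("\t")
        if mountpoint.startswith("/"):   # skip "none", "legacy" and "-"
            rows.append((name, mountpoint))
    return rows

def mount_plan(rows: List[Tuple[str, str]], snap: str, target: str) -> List[List[str]]:
    target = target.rstrip("/")
    cmds = []
    # Sort by mount point depth so each parent is mounted before its children.
    for name, mountpoint in sorted(rows, key=lambda r: (r[1].rstrip("/").count("/"), r[1])):
        dest = target + ("" if mountpoint == "/" else mountpoint)
        cmds.append(["mount", "-t", "zfs", f"{name}@{snap}", dest])
    return cmds

out = "rpool/ROOT/ubuntu\t/\nrpool/home\t/home\nrpool/home/alice\t/home/alice\nrpool\tnone\n"
for cmd in mount_plan(parse_zfs_list(out), "borg-nightly", "/mnt/snap"):
    print(" ".join(cmd))
```

With the sample input this prints mounts for `/mnt/snap`, then `/mnt/snap/home`, then `/mnt/snap/home/alice`. Pointing Borg (via chroot or otherwise) at `/mnt/snap` then sees one consistent point in time across all datasets.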