Comparison of Attic vs. Bup vs. Obnam

(librelist.com)

42 pointsBrian-Puccio11y ago22 comments

20 comments · 5 top-level

static_noise11y ago· 5 in thread

Isn't the real conclusion that all three tools failed the test case at some point (data corruption/too slow/aborted)?

Which backup tools should we use for linux?

csirac211y ago

As a heavy btrfs user backups have always been on my mind. I run a lab with a handful of busy VMs, all using btrfs. I was frustrated that there were no backup solutions (at the time) which leveraged btrfs, so I created snazzer [1] (one day soon it will support ZFS).

You might scoff, but... btrfs send/receive is insanely fast and painless. To mitigate btrfs shenanigans, snapshots end up on non-btrfs filesystems too. I wrote a tool [2] which produces PGP signatures and sha512sums of snapshots to achieve reproducible integrity measurements regardless of FS.
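As an illustration of the checksum half of that idea, here is a minimal sketch of a filesystem-independent integrity measurement. It is not how snazzer's own tool works; the function names (`sha512_file`, `manifest`, `differences`) are made up for this example, and PGP signing is left out entirely.

```python
import hashlib
import os


def sha512_file(path, chunk_size=1 << 20):
    """Hash one file in chunks so large files are not read into memory at once."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest(root):
    """Map each regular file's path (relative to root) to its SHA-512 digest."""
    result = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Skip symlinks and special files; only file contents are measured.
            if os.path.isfile(full) and not os.path.islink(full):
                result[os.path.relpath(full, root)] = sha512_file(full)
    return result


def differences(before, after):
    """Return sorted paths that are missing, added, or changed between two manifests."""
    return sorted(p for p in before.keys() | after.keys() if before.get(p) != after.get(p))
```

Taking a manifest of a snapshot on the source and another of its copy on a different filesystem, then calling `differences` on the two, should give an empty list if the copy is intact.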

Of course, in the time it took to polish up snazzer a bit for public release, many [3] other [4] cool [5] solutions [6] have materialized [7]... :)

[1] https://github.com/csirac2/snazzer

[2] https://github.com/csirac2/snazzer/blob/master/doc/snazzer-m...

[3] https://github.com/masc3d/btrfs-sxbackup

[4] https://github.com/digint/btrbk

[5] https://github.com/jimsalterjrs/sanoid/

[6] https://github.com/lordsutch/btrfs-backup

[7] https://github.com/jf647/btrfs-snap

static_noise11y ago

Thanks! btrfs is certainly something to think about. Especially compared to ext4 it seems to make backups much easier and less painful.

dspillett11y ago

> Isn't the real conclusion that all three tools failed the test case at some point

No backup solution is perfect; anything, including your backups, can fail. The three rules of backups:

* If it isn't backed up you don't really care about it

* If you really care about something, back it up using at least two unrelated systems, one or more remote and one or more offline (soft-offline will do).

* If you haven't tested the backups, you don't have backups.

> Which backup tools should we use for linux?

I'm still using simple hand-rolled scripts to manage backups via rsync (for an old but still good tutorial on that sort of thing see http://www.mikerubel.org/computers/rsync_snapshots/) and, where consistency and/or downtime between backup start and end times might be an issue, LVM snapshots. I've needed to tweak them a bit over the years as my needs have changed and as I've become more paranoid about the "testing backups" thing, but I'd have had to do that with other tools too.
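For readers who haven't seen the rsync snapshot approach, a rough sketch of one common variant follows. It assumes rsync's `--link-dest` option, which hard-links files that are unchanged since the previous snapshot; the function names and the directory layout are invented for illustration and are not the scripts described above.

```python
import os
import subprocess


def snapshot_command(source, snapshot_root, name, previous=None):
    """Build an rsync command that writes a new snapshot directory.

    With --link-dest, files unchanged since `previous` become hard links
    into that older snapshot instead of fresh copies.
    """
    cmd = ["rsync", "-a"]
    if previous is not None:
        link_dest = os.path.abspath(os.path.join(snapshot_root, previous))
        cmd.append("--link-dest=" + link_dest)
    # A trailing slash on the source copies its contents, not the directory itself.
    cmd += [source.rstrip("/") + "/", os.path.join(snapshot_root, name) + "/"]
    return cmd


def take_snapshot(source, snapshot_root, name, previous=None):
    """Run the rsync command; raises CalledProcessError if rsync fails."""
    subprocess.run(snapshot_command(source, snapshot_root, name, previous), check=True)
```

Each snapshot then looks like a full copy, while unchanged files take up space only once.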

rakoo11y ago

restic (https://github.com/restic/restic) is another contender, and of course tarsnap (https://www.tarsnap.com/) has proven its worth.

e4011y ago

dump/restore have always done it for me.

reedlaw11y ago· 4 in thread

I wish there was more information about the kind of data corruption caused by Attic. Is there a related issue here? [1]

1. https://github.com/jborg/attic/search?q=corrupt&type=Issues&...

static_noise11y ago

In addition, some form of corruption is expected to happen due to factors besides the backup software, such as bugs in the kernel or hardware errors.

What I wonder more is why the corruption of the file data wasn't caught by the backup software since it uses checksums for deduplication.

A third option is user error, such as some files changing between the time the (user) checksums were taken and the time the backup ran.

StavrosK11y ago

He mentions some more details here:

http://librelist.com/browser/attic/2015/3/31/comparison-of-a...

gingerlime11y ago

I think this one in particular (related to msgpack dependency?): https://github.com/jborg/attic/issues/264

StavrosK11y ago

Also, wouldn't "attic check" mitigate it? My backup script always runs "backup" "prune" and "check", in that order.
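A script along those lines might look roughly like the sketch below. It is only a generic runner: `run_steps` is a made-up helper, and the actual attic command lines (subcommand arguments, repository path, retention options) are left to the caller rather than guessed here.

```python
import subprocess
import sys


def run_steps(steps):
    """Run each (label, argv) step in order; stop at the first non-zero exit.

    Returns the label of the failing step, or None if every step succeeded.
    """
    for label, argv in steps:
        completed = subprocess.run(argv)
        if completed.returncode != 0:
            print(
                f"step {label!r} failed with exit code {completed.returncode}",
                file=sys.stderr,
            )
            return label
    return None
```

Passing three steps labelled "backup", "prune" and "check" reproduces the order described above. Stopping at the first failure means a broken backup run is never followed by a prune.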

That said, I freaking love attic. It's the best backup program ever.

avar11y ago· 3 in thread

From the article:

    Of particular concern is that Obnam has a theoretical collision
    potential, in that if a block has the same MD5 hash as another
    block, it will assume they are the same. This behaviour is the
    default, but can be mitigated by using the verify option. I tried
    with and without, and interestingly did not notice any speed
    difference (2 seconds, which is statistically insignificant) and
    also did not encounter any bad data on restoration. So I don't
    know why it's off by default.

Worrying about this violates Taylor's Law of Programming Probability[1]:

    The theoretical possibility of a catastrophic occurrence in your
    program can be ignored if it's less likely than the entire
    installation being wiped out by meteor strike.

I've seen a lot of sysadmins or programmers nitpick systems that have the theoretical possibility of MD5 or SHA-1 collisions, but it's amazingly unlikely to happen in something like a backup system where you're backing up your own data, and not taking hostile user data where the users might be engineering collisions.

1. http://www.miketaylor.org.uk/tech/law.html
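To put a rough number on "amazingly unlikely", here is a small sketch of the standard union (birthday) bound. It assumes hash outputs behave like uniformly random 128-bit values, which holds only for accidental collisions and not for deliberately crafted inputs; the function name is made up for this example.

```python
from fractions import Fraction


def collision_probability_bound(num_blocks, hash_bits=128):
    """Upper bound on the chance that any two of num_blocks random hashes coincide.

    There are n*(n-1)/2 pairs of blocks, and each pair collides with
    probability 2**-hash_bits, so the union bound is their product.
    """
    pairs = num_blocks * (num_blocks - 1) // 2
    return Fraction(pairs, 2 ** hash_bits)


# About a trillion (2**40) distinct blocks hashed with a 128-bit hash:
print(float(collision_probability_bound(2 ** 40)))
```

For 2**40 blocks the bound comes out just under 2**-49, which is on the order of 1e-15.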

Uberphallus11y ago

It's unlikely to happen by chance, but it can be quite vulnerable to malicious attacks.

avar11y ago

"Quite". Let's look at the potential attack. You're running a backup system with user-supplied data, fair enough, and one of your users has:

    1) Access to an existing object, or its checksum.

    2) Can write a *new* object where they intentionally
       produce a collision with an existing object.

There's a trivial way to get around this attack in practice, which is that you just lazily write objects and don't re-write an object that exists already. This is what Git does with the objects it writes, which insulates it more from future SHA-1 collision attacks than just the security you'd get from SHA-1 itself.

This means that you've changed an attack where someone can maliciously clobber an existing object to an edge case where their object just won't get backed up.
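A minimal sketch of that write-once behaviour, in the spirit of the description above rather than a copy of how Git or any of the backup tools implement it. The class name, the status strings and the `key_func` parameter are all invented for illustration, and the byte-for-byte comparison on a key match is an extra check added in this sketch.

```python
import hashlib


def _md5_hex(data):
    """Default key function: hex MD5 digest of the object's bytes."""
    return hashlib.md5(data).hexdigest()


class WriteOnceStore:
    """In-memory content-addressed store that never overwrites an existing key."""

    def __init__(self, key_func=_md5_hex):
        self._key_func = key_func
        self._objects = {}

    def put(self, data):
        """Store data under its key; return (key, status).

        status is "stored", "duplicate" (same bytes already present), or
        "collision" (different bytes already hold this key, so data is skipped).
        """
        key = self._key_func(data)
        existing = self._objects.get(key)
        if existing is None:
            self._objects[key] = bytes(data)
            return key, "stored"
        if existing == data:
            return key, "duplicate"
        # The first writer wins: the existing object is left untouched.
        return key, "collision"

    def get(self, key):
        """Return the bytes stored under key; raises KeyError if absent."""
        return self._objects[key]
```

With this policy a crafted colliding object cannot replace one that is already stored; the worst case is that the new object is reported as a collision and not saved.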

static_noise11y ago

E.g. by backing up two files which are designed to demonstrate an MD5 collision.

zobzu11y ago· 3 in thread

YA: http://zbackup.org/

YA: http://duplicity.nongnu.org/

static_noise11y ago

Care to elaborate how they compare? Do they fit the use case of millions of files and terabytes of data?

StavrosK11y ago

I don't like duplicity very much, it requires you to reupload everything every so often (because it uses base backups and then diffs on top of that), which won't work for my slow connection and large dataset.

zobzu11y ago

I would like to have such a nice comparison as the page here but I don't really.

I like duplicity but it's not perfect. zbackup seems faster/more integrated and also doesn't have the reputation for losing data that attic has.

Never lost data with duplicity in about 5 years of use.

aidenn011y ago

So the conclusion was that the tool that corrupts your data is the fastest?

I have a backup solution that corrupts your data, but is even faster than Attic: tar cp > /dev/null
