Replicas are for failover and read scalability. For failover: when a master dies unexpectedly, Facebook's automation promotes a replica to be the new master in under 30 seconds, with no loss of committed data.
Backups are for when something goes horribly wrong -- e.g. due to human error -- and you need to restore the state of something (a row, table, entire db, ...) to a previous point in time. Or perhaps effectively skip one specific transaction, or set of transactions. Replicas don't help with this; as you mentioned, they're kept up-to-date with the master. So a bad statement run on the master will also affect the replicas.
Occasionally you have a massive failure involving both concepts: say you have 4 replicas and they're all broken or corrupted in some way. Backups are helpful in that case as well.
In failure modes like an accidental DELETE without a WHERE clause, or a write that corrupts the validity of the business logic, replication is useless: you can watch the logs and see all your slave machines keenly and unquestioningly repeating the issue.
The wording is a bit cryptic, but it does seem that they definitely have hot standby-esque capabilities in place, in addition to long-term storage of their incremental/full backups.
[1] - https://www.facebook.com/notes/facebook-engineering/under-th...
Do you diff against the same base, or create an incremental chain? How many diffs do you take in between recapturing a full image? At $DAYJOB we always take full backups into a fast in-house deduplicating store.
> Periodically, each peon syncs with the ORC DB to look for new jobs assigned to it
Is there no better way of handling this than polling?
> LOAD - Load the downloaded backup into the peon’s local MySQL instance. Individual tables are restored in parallel by parsing out statements pertaining to those tables from the backup file, similar to Percona's mydumper.
Presumably you can only get this parallelism by disabling FK integrity. Is it re-enabled in the following VERIFY stage?
We always diff against the same base and have 5 days between consecutive full dumps. The number of days just comes from a trade-off between the space occupied by the backups and the time it takes to generate them.
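To make the "same base" vs. "incremental chain" distinction concrete, here is a minimal sketch of which files a restore would need under each scheme. The file naming, the function name, and the reading of "5 days" as "a full dump on every 5th day" are assumptions for illustration, not a description of Facebook's tooling.

```python
FULL_INTERVAL_DAYS = 5  # assumption: a new full dump on every 5th day


def backups_needed(day: int, differential: bool = True) -> list[str]:
    """List the (hypothetical) backup files needed to restore the state as of `day`.

    Day 0 is the day of the first full dump.
    """
    base = day - day % FULL_INTERVAL_DAYS  # day of the most recent full dump
    needed = [f"full-{base}"]
    if day == base:
        return needed
    if differential:
        # Every diff is taken against the same full dump, so one diff suffices.
        needed.append(f"diff-{base}-to-{day}")
    else:
        # Incremental chain: every link since the full dump must be replayed.
        needed.extend(f"incr-{d - 1}-to-{d}" for d in range(base + 1, day + 1))
    return needed


print(backups_needed(8))                      # full dump plus a single diff
print(backups_needed(8, differential=False))  # full dump plus a chain of three
```

Diffing against the same base keeps a restore at two files at most, at the cost of each diff growing until the next full dump.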
> Is there no better way of handling this than polling?
There are definitely different ways to approach this; we find polling works well for us. We also use the same database for crash recovery, so doing the assignments through it serves both purposes.
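For readers unfamiliar with the pattern, a rough sketch of polling a job table might look like the following. It uses SQLite in place of MySQL so it runs standalone, and the `jobs` table, its columns, the state names, and both function names are all made up; the real ORC DB schema isn't described in the post.

```python
import sqlite3
import time

# Hypothetical schema; the actual ORC DB layout is not public in this thread.
SCHEMA = """
CREATE TABLE jobs (
    job_id INTEGER PRIMARY KEY,
    peon   TEXT NOT NULL,
    state  TEXT NOT NULL DEFAULT 'assigned'
)
"""


def claim_next_job(conn, peon):
    """Return the id of the next job assigned to `peon`, marking it running, or None."""
    row = conn.execute(
        "SELECT job_id FROM jobs WHERE peon = ? AND state = 'assigned' "
        "ORDER BY job_id LIMIT 1",
        (peon,),
    ).fetchone()
    if row is None:
        return None
    conn.execute("UPDATE jobs SET state = 'running' WHERE job_id = ?", (row[0],))
    conn.commit()
    return row[0]


def poll(conn, peon, handle, interval_s=30.0, max_polls=None):
    """Check for work up to `max_polls` times (forever if None), sleeping when idle."""
    polls = 0
    while max_polls is None or polls < max_polls:
        job_id = claim_next_job(conn, peon)
        if job_id is None:
            time.sleep(interval_s)
        else:
            handle(job_id)
            conn.execute("UPDATE jobs SET state = 'done' WHERE job_id = ?", (job_id,))
            conn.commit()
        polls += 1
```

Because every state change is written to the table, a restarted process can look for rows still marked `running` to see what was in flight, which is presumably the crash-recovery benefit mentioned above.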
> Presumably you can only get this parallelism by disabling FK integrity. Is it re-enabled in the following VERIFY stage?
I'm not sure what you mean by parallelism through disabling FK integrity. Splitting the backup into its tables means we can restore a subset of tables instead of the entire backup. This allows us to load individual tables concurrently, but also not have to wait to load a massive database if all we need is a few small tables.
Say you have a `user` table and a `post` table with `post.user_id` being a FOREIGN KEY on `user.user_id`. Without disabling FK integrity you would not be able to restore a post without restoring the user first. When restoring in parallel this might or might not work out.
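A small runnable illustration of the ordering problem, using SQLite rather than MySQL so it is self-contained (the function name is made up; in MySQL the corresponding switch is the `foreign_key_checks` session variable):

```python
import sqlite3


def restore_post_before_user(enforce_fk: bool) -> bool:
    """Insert a post before the user it references; return True if that succeeds."""
    conn = sqlite3.connect(":memory:")
    # SQLite stand-in for MySQL's SET foreign_key_checks = 0/1.
    conn.execute(f"PRAGMA foreign_keys = {'ON' if enforce_fk else 'OFF'}")
    conn.execute("CREATE TABLE user (user_id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute(
        "CREATE TABLE post ("
        " post_id INTEGER PRIMARY KEY,"
        " user_id INTEGER NOT NULL REFERENCES user(user_id),"
        " body TEXT)"
    )
    try:
        # Simulates a parallel restore where the `post` loader gets ahead of `user`.
        conn.execute("INSERT INTO post VALUES (1, 42, 'hello')")
        conn.execute("INSERT INTO user VALUES (42, 'alice')")
        conn.commit()
        return True
    except sqlite3.IntegrityError:
        return False
    finally:
        conn.close()


print(restore_post_before_user(enforce_fk=True))   # child row is rejected
print(restore_post_before_user(enforce_fk=False))  # load order no longer matters
```

With checks off the load succeeds in any order, but nothing has verified that every `post.user_id` actually has a matching `user` row, hence the question about whether the VERIFY stage covers that.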
Where I can't get mydumper deployed, I currently "fake it" by running multiple mysqldump processes and using "START TRANSACTION WITH CONSISTENT SNAPSHOT".
I had to design against that particular problem recently, and ended up running pt-table-checksum at the time of the dump and verifying newly restored backups against it to ensure the backup's integrity.
Unfortunately that requires halting replication temporarily, so I was hoping to hear of a more ingenious solution.
We also implemented table checksums inside of mysqldump, allowing us to dump out the restored data and compare checksums as well, if required. https://github.com/facebook/mysql-5.6/commit/54acbbf915935a0...
I'll definitely be seeing if I can replace my system with that; reducing the overhead to a single transaction would be a big win.
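The checksum-at-dump, verify-after-restore idea from the last few comments can be sketched roughly as below. This is a toy using SQLite and SHA-256, not how pt-table-checksum or the linked mysqldump patch actually computes checksums; `table_checksum` and `dump_and_restore` are hypothetical helpers.

```python
import hashlib
import sqlite3


def table_checksum(conn, table):
    """Order-independent checksum: sum of per-row hashes, modulo 2**64.

    The table name is interpolated directly, so only pass trusted names.
    """
    total = 0
    for row in conn.execute(f'SELECT * FROM "{table}"'):
        row_hash = hashlib.sha256(repr(row).encode("utf-8")).digest()
        total = (total + int.from_bytes(row_hash[:8], "big")) % 2**64
    return total


def dump_and_restore(source):
    """Dump `source` to SQL text and load it into a fresh in-memory database."""
    restored = sqlite3.connect(":memory:")
    restored.executescript("\n".join(source.iterdump()))
    return restored


source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE post (post_id INTEGER PRIMARY KEY, body TEXT)")
source.executemany("INSERT INTO post VALUES (?, ?)", [(1, "a"), (2, "b")])
source.commit()

checksum_at_dump = table_checksum(source, "post")  # recorded alongside the dump
restored = dump_and_restore(source)
print(table_checksum(restored, "post") == checksum_at_dump)
```

The catch raised above still applies: the checksum has to be taken against the same snapshot the dump sees, otherwise writes landing in between will make a perfectly good backup look corrupt.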