I imagine most people running a company would have a separate corporate account linked to a credit card, so that personal circumstances have less of an effect from month to month.
“I should blog about Instapaper's backup setup sometime. It's pretty extensive. A lot of places would need to burn down to lose your data.”
Maybe he just likes having a complete copy of the production data on his local development instance? Great for local data mining too.
1. Relying on a home computer on the critical path for data backup and persistence for a business
2. Relying on a high-latency, low-quality networking path between the slave db and the 'home mac' rather than a more reliable link between two machines in a datacenter.
3. A poor persistence model for long-lived backups
4. No easy way to programmatically recover old backups
What's even more disturbing is that this isn't a new problem. It's not like we don't know how to back up databases. This solution seems very poorly thought out.
This let us have day-to-day backups of individual users. This was necessary when broken clients would delete all of a user's items, so we could easily restore an individual user (or do a historical recovery).
There's a performance advantage here as well, since indexes aren't rebuilt and there's no table lock.
Also from a privacy perspective you can't keep people's data around forever.
It would be much better if these dumps went to S3, or somewhere else that is actually in a secure datacenter (with a step that includes the word 'encryption').
That said, I agree with you, and I hope it's at least encrypted.
[1] http://twitter.com/#!/marcoarment/status/6035374438621184
There is a reason datacenters were built.
Thinking about this after I left my comment: having all that data on your local machine is just crazy. You are one browser exploit or break-in away from having it fall into somebody else's hands. It isn't professional for a web service to be doing this, especially one that is now charging some customers.
A relatively easy boost, which he briefly mentioned, would be to also store the data in S3. That should be easy enough to automate, and it could provide a somewhat-reliable off-site backup.
However, Instapaper has the benefit of a (relatively) small DB. 22GB isn't too bad. I don't know how well this would scale to a 222GB DB with proportionally higher usage rates. It'd be possible, but it would have to be simplified, no?
"Amazon S3 is designed to provide 99.999999999% durability of objects over a given year. This durability level corresponds to an average annual expected loss of 0.000000001% of objects. For example, if you store 10,000 objects with Amazon S3, you can on average expect to incur a loss of a single object once every 10,000,000 years. In addition, Amazon S3 is designed to sustain the concurrent loss of data in two facilities."
It's slow as a result...but that's the trade-off you're looking for in a backup.
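For anyone who wants to check the arithmetic in the quoted S3 durability figure, here is a minimal sketch. It only reproduces the numbers from the quote; the function names are made up for illustration, and it says nothing about how Amazon actually measures durability.

```python
from fractions import Fraction

# Figures taken from the quoted S3 statement (exact fractions avoid float noise).
DURABILITY = Fraction(99_999_999_999, 100_000_000_000)  # 99.999999999% per year
ANNUAL_LOSS_RATE = 1 - DURABILITY                        # 0.000000001% of objects


def expected_losses_per_year(num_objects: int) -> Fraction:
    """Expected number of objects lost in one year (hypothetical helper)."""
    return num_objects * ANNUAL_LOSS_RATE


def years_per_lost_object(num_objects: int) -> Fraction:
    """Average number of years between single-object losses."""
    return 1 / expected_losses_per_year(num_objects)


if __name__ == "__main__":
    print("annual loss rate:", float(ANNUAL_LOSS_RATE))
    print("expected losses/year for 10,000 objects:",
          float(expected_losses_per_year(10_000)))
    print("years per lost object:", int(years_per_lost_object(10_000)))
```

So 10,000 objects times a loss rate of 1 in 10^11 per year gives 1 in 10^7 expected losses per year, which matches the "once every 10,000,000 years" in the quote.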
Marco recently left his position as CTO of Tumblr, and I think he now concentrates on Instapaper much more than before (I assume it was mostly a weekend project until then, requiring only simple fixes). So I have no doubt he will make the service more reliable in the future (a switch to S3 or similar).
Also, don't forget that the Instapaper web service is currently free, although the iOS applications are not (there is a free lite version too). There is a recently added subscription option (which AFAIK doesn't currently offer anything extra), and I hope it will only make the service better.
As for security, I do not consider my Instapaper reading list particularly confidential, so the thought of the backup computer being stolen doesn't trouble me much. Of course, your mileage may vary. As far as I know, some accounts don't even have passwords for Instapaper; you just log in with your email address.
In the end I _drove_ a copy of the DB over to a data center. Adding a slaved replica in another location is pretty easy these days.
- Instantiate the backup (at its binlog position 259)
- Replay the binlog from position 260 through 999
- Replay the binlog from position 1001 through 1200
And you’ll have a copy of the complete database as if that destructive query had never happened.
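The steps quoted above can be sketched with a toy model. This is not how MySQL stores or replays its binlog; the in-memory `Event` tuples, the `restore` function and the sample data are all hypothetical, and only the positions (259, 260 to 999, 1000, 1001 to 1200) come from the quote. With a real server you would more likely reach for mysqlbinlog and its `--start-position` / `--stop-position` options.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple

# (position, operation, key, value): a toy stand-in for real binlog events.
Event = Tuple[int, str, Optional[str], Optional[str]]


def apply_event(db: Dict[str, str], event: Event) -> None:
    """Apply one logged change to the in-memory 'database'."""
    _, op, key, value = event
    if op == "set":
        db[key] = value
    elif op == "delete_all":
        db.clear()
    else:
        raise ValueError(f"unknown operation: {op}")


def restore(snapshot: Dict[str, str], snapshot_position: int,
            log: Iterable[Event], skip: Set[int], stop: int) -> Dict[str, str]:
    """Start from the snapshot, then replay later events except the skipped ones."""
    db = dict(snapshot)  # instantiate the backup
    for event in sorted(log, key=lambda e: e[0]):
        position = event[0]
        if position <= snapshot_position or position > stop:
            continue  # already in the snapshot, or past the recovery point
        if position in skip:
            continue  # the destructive query we want to pretend never happened
        apply_event(db, event)
    return db


# Hypothetical sample data: backup taken at position 259, bad query at 1000.
SNAPSHOT: Dict[str, str] = {"a": "1"}
LOG: List[Event] = [
    (260, "set", "b", "2"),
    (999, "set", "c", "3"),
    (1000, "delete_all", None, None),
    (1001, "set", "d", "4"),
    (1200, "set", "a", "5"),
]

if __name__ == "__main__":
    print(restore(SNAPSHOT, 259, LOG, skip={1000}, stop=1200))
    print(restore(SNAPSHOT, 259, LOG, skip=set(), stop=1200))
```

Skipping position 1000 keeps keys b and c; replaying everything leaves only what was written after the delete.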
This only works if the changes in positions 1001-1200 were unaffected by the undesired changes in position 1000. Seems rather unlikely to me, but maybe in the case of his particular schema it works out.
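To make that concern concrete, here is a made-up example (the event names and the scenario are my assumptions, not anything from Instapaper's schema): a user notices an article vanished after the bad delete and saves it again. Replaying the log with the delete skipped then produces a duplicate, because the later event only exists as a reaction to the bad one.

```python
from typing import FrozenSet, Iterable, List, Tuple

Event = Tuple[int, str, str]  # (position, operation, article); toy format


def replay(events: Iterable[Event], skip: FrozenSet[int] = frozenset()) -> List[str]:
    """Replay save/delete_all events in position order, skipping some positions."""
    items: List[str] = []
    for position, op, article in sorted(events):
        if position in skip:
            continue
        if op == "save":
            items.append(article)
        elif op == "delete_all":
            items.clear()
    return items


LOG: List[Event] = [
    (999, "save", "article-1"),
    (1000, "delete_all", ""),      # the undesired change
    (1001, "save", "article-1"),   # user re-saves it after it vanished
    (1002, "save", "article-2"),
]

if __name__ == "__main__":
    print("what actually happened:  ", replay(LOG))
    print("replay with 1000 skipped:", replay(LOG, skip=frozenset({1000})))
```

Whether this matters in practice depends on the schema and on how clients react to the bad state, which is exactly the part an outsider can't see.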
Also, "gzip --rsyncable" increases the compressed size by only about 1%, but makes deduplication between successive compressed dump files possible.
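A rough illustration of why resetting the compressor periodically helps deduplication. This is not gzip's actual algorithm: as far as I understand it, --rsyncable picks its reset points from a rolling checksum of the content, whereas this sketch uses fixed-size blocks, so it only copes with in-place changes and not with insertions that shift the data. The block size, the fake dump format and the function names are all assumptions for the demo.

```python
import hashlib
import zlib
from typing import List

BLOCK_SIZE = 4096  # hypothetical block size, chosen only for illustration


def make_dump(rows: int, changed_row: int = -1) -> bytes:
    """Build a fake SQL dump; one row can be altered without changing its length."""
    lines = []
    for i in range(rows):
        name = "USER" if i == changed_row else "user"
        lines.append(f"INSERT INTO items VALUES ({i}, '{name}{i % 97}');\n")
    return "".join(lines).encode("ascii")


def compress_in_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> List[bytes]:
    """Compress each block independently, so a change stays local to its block."""
    return [zlib.compress(data[i:i + block_size])
            for i in range(0, len(data), block_size)]


def shared_blocks(old: List[bytes], new: List[bytes]) -> int:
    """Count compressed blocks in `new` that a deduplicating store already has."""
    old_hashes = {hashlib.sha256(block).digest() for block in old}
    return sum(1 for block in new
               if hashlib.sha256(block).digest() in old_hashes)


if __name__ == "__main__":
    monday = compress_in_blocks(make_dump(20_000))
    tuesday = compress_in_blocks(make_dump(20_000, changed_row=10_000))
    reused = shared_blocks(monday, tuesday)
    print(f"{reused} of {len(tuesday)} compressed blocks are already stored")
```

With a single compressed stream, the bytes after the first changed row generally come out different, so a deduplicating store sees little it can reuse. With independent blocks, only the block (or two, if the change straddles a boundary) containing the changed row needs to be uploaded again.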
(I cofounded SpiderOak.)
If you're doing full dumps every few days, you're doing it wrong.