Instapaper's backup method

(marco.org)

166 points · hugoahlberg · 15y ago · 43 comments

mseebach · 15y ago

It seems unnecessarily exposed to an event affecting Marco's home - fire, burglary, natural disaster etc. It would appear more prudent to back up to a cloud location. Either, as he mentions, S3, or a VPS somewhere.

tlack · 15y ago

The problem with backing up with S3 is that if you ever stop paying for S3 you lose your backups. If you get sick for a month and your bank balance goes negative, or the IRS takes control over your account for something, there go your backups. I find that to be way more scary and likely than losing all of my DVD backups.

lordmatty · 15y ago

The cost of storage on Amazon is very low - roughly $3 per month for Marco's 22GB.

I imagine most people running a company would have a separate corporate account linked to a credit card, so that personal circumstances have less of a major effect month to month.

joshd · 15y ago

Another issue is that your S3 credentials are stored on your primary server. An attacker who gains access to that machine will also gain access to off site backups, and can completely destroy your business.

rscott · 15y ago

I agree completely. It doesn't seem wise, legally, to store one's business backups at home. Whether my thoughts are warranted or not I have no idea, but I'd just pay the small S3 fee and be on with it.

jschuur · 15y ago

One of his earlier tweets that led to this article suggests it's not just backed up at his home:

“I should blog about Instapaper's backup setup sometime. It's pretty extensive. A lot of places would need to burn down to lose your data.”

Maybe he just likes having a complete copy of the production data on his local development instance? Great for local data mining too.

marcelo-br2 · 15y ago

I agree with that. Kinda expected a better setup for a real business.

lockesh · 15y ago

Anyone else find this scheme completely atrocious?

1. Relying on a home computer on the critical path for data backup and persistence for a business

2. Relying on a high latency, low quality networking path between the slave db and the 'home mac' rather than a more reliable link between two machines in a datacenter.

3. A poor persistence model for long lived backups

4. No easy way to programmatically recover old backups

What's even more disturbing is that this isn't a new problem. It's not like we don't know how to back up databases. This solution seems very poorly thought out.

zdw · 15y ago

Regarding point #1 - Marco's "Home Computer" is a Mac Pro (per other posts he's made) - it has Xeon processors, ECC RAM, etc. Much closer to a server than what you can pick up at Best Buy for $399.

adambyrtek · 15y ago

It's not about performance or price, but about the conditions in which the machine operates. Many servers used nowadays are cheaper than high-end desktop machines.

joshu · 15y ago

on delicious, we had a thing that would serialize a user to disk for every day they were active. inactive users were not re-serialized.

this let us have day-to-day backups of individual users. this was necessary when broken clients would delete all the user's items. so we could easily restore an individual user (or do a historical recovery.)
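A minimal sketch of the per-user daily snapshot idea described above. This is not Delicious's actual code: the function names (`snapshot_user`, `restore_user`), the JSON format, and the one-file-per-user-per-day directory layout are all assumptions made for illustration.

```python
import json
import tempfile
from datetime import date
from pathlib import Path


def snapshot_user(root: Path, user_id: str, items: list, day: date) -> Path:
    """Write one user's full state to <root>/<user_id>/<YYYY-MM-DD>.json."""
    user_dir = root / user_id
    user_dir.mkdir(parents=True, exist_ok=True)
    path = user_dir / f"{day.isoformat()}.json"
    path.write_text(json.dumps(items))
    return path


def restore_user(root: Path, user_id: str, on_or_before: date) -> list:
    """Load the newest snapshot taken on or before the given day."""
    # ISO dates sort lexicographically, so string comparison is enough here.
    candidates = sorted(
        p for p in (root / user_id).glob("*.json")
        if p.stem <= on_or_before.isoformat()
    )
    if not candidates:
        raise LookupError(f"no snapshot for {user_id} on or before {on_or_before}")
    return json.loads(candidates[-1].read_text())


def demo() -> list:
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Day 1: the user is active, so they are serialized.
        snapshot_user(root, "alice", ["link-a", "link-b"], date(2010, 11, 20))
        # Day 2: a broken client deletes everything. The user was active,
        # so the (now empty) state is serialized as well.
        snapshot_user(root, "alice", [], date(2010, 11, 21))
        # Historical recovery: take the state from before the bad day.
        return restore_user(root, "alice", date(2010, 11, 20))
```

Because inactive users are never rewritten, the cost of each daily run scales with the number of active users rather than with the size of the whole database.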

bl4k · 15y ago

that's why I never have a DELETE in any query, only UPDATE and a state field (i.e. deleted)

performance advantage here as well since indexes aren't rebuilt and no table lock
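For illustration, a minimal sketch of the soft-delete pattern described above, using SQLite from Python's standard library. The table and column names (`items`, `state`) are made up for the example, and the sketch only shows the recovery behaviour; it says nothing either way about the performance claim.

```python
import sqlite3


def demo() -> tuple:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE items (id INTEGER PRIMARY KEY, url TEXT, "
        "state TEXT NOT NULL DEFAULT 'active')"
    )
    conn.executemany(
        "INSERT INTO items (url) VALUES (?)",
        [("http://example.com/a",), ("http://example.com/b",)],
    )

    # "Delete" by flipping the state flag instead of issuing DELETE.
    conn.execute("UPDATE items SET state = 'deleted' WHERE id = ?", (1,))

    visible = conn.execute(
        "SELECT COUNT(*) FROM items WHERE state = 'active'"
    ).fetchone()[0]
    stored = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]

    # Undo: the row never left the table, so recovery is another UPDATE.
    conn.execute("UPDATE items SET state = 'active' WHERE id = ?", (1,))
    restored = conn.execute(
        "SELECT COUNT(*) FROM items WHERE state = 'active'"
    ).fetchone()[0]
    conn.close()
    return visible, stored, restored
```

The trade-off is that every read query has to filter on the state column, and the "deleted" rows stay in storage until something purges them.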

joshu · 15y ago

Indexes can certainly update. Pretty sure InnoDB does not table lock for delete either.

Also from a privacy perspective you can't keep people's data around forever.

bl4k · 15y ago

I don't think backing up the entire db to a laptop is a good idea, since laptops can get both lost and stolen. As somebody who uses the service, I am not super-comfortable with knowing that a full copy of my account and everything I save is sitting on a laptop somewhere.

It would be much better if these dumps were made to S3, or somewhere else that is actually in a secure datacenter (and a step that includes the word 'encryption').

larrywright · 15y ago

It's not explicitly stated in the article, but in the tweet that started it all[1], he mentioned it was a Mac Pro, rather than a laptop. So that's somewhat less likely to be stolen than a laptop that is taken out of the house regularly.

That said, I agree with you, and I hope it's at least encrypted.

[1] http://twitter.com/#!/marcoarment/status/6035374438621184

bl4k · 15y ago

ye not much better

there is a reason datacenters were built

thinking about this after I left my comment, having all that data on your local machine is just crazy - you are one browser exploit or break-in away from having it fall into somebody else's hands. It isn't professional for a web service to be doing this - esp. one that is now charging some customers.

zbanks · 15y ago

That's really an amazing system. Super redundant.

A relatively easy boost, which he briefly mentioned, would be to also store the data in S3. That should be easy enough to automate, and it could provide a somewhat-reliable off-site backup.

However, Instapaper has the benefit of a (relatively) small DB. 22GB isn't too bad. I don't know how well this would scale to a 222GB DB with proportionally higher usage rates. It'd be possible, but it would have to be simplified, no?

jforman · 15y ago

I'd call S3 super-reliable rather than somewhat-reliable:

"Amazon S3 is designed to provide 99.999999999% durability of objects over a given year. This durability level corresponds to an average annual expected loss of 0.000000001% of objects. For example, if you store 10,000 objects with Amazon S3, you can on average expect to incur a loss of a single object once every 10,000,000 years. In addition, Amazon S3 is designed to sustain the concurrent loss of data in two facilities."

It's slow as a result...but that's the trade-off you're looking for in a backup.

http://aws.amazon.com/s3/faqs/#How_durable_is_Amazon_S3
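The figures in the quoted FAQ text can be checked with a little arithmetic. A quick sketch using only the numbers in the quote (the variable names are made up here, and this models the stated design target only, not any measured loss rate):

```python
from fractions import Fraction

# 99.999999999% durability per object per year, as an exact fraction.
durability = Fraction(99_999_999_999, 100_000_000_000)

# Probability that a given object is lost in a year: 1e-11,
# which is the 0.000000001% quoted in the FAQ.
annual_loss_rate = 1 - durability
annual_loss_percent = annual_loss_rate * 100

# With 10,000 stored objects, expected objects lost per year.
objects_stored = 10_000
expected_losses_per_year = objects_stored * annual_loss_rate

# Average number of years between single-object losses.
years_per_lost_object = 1 / expected_losses_per_year
```

Working it through gives one lost object per 10,000,000 years for 10,000 objects, which matches the quoted figure.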

slaven · 15y ago

Yes, unless there is a billing problem with your account, someone is causing a problem with your account, etc. As an Amazon S3 and AWS user I would say they are fairly reliable overall - nowhere near super-reliable!

ludwigvan · 15y ago

[Disclaimer: Instapaper fan here, so my opinions might be biased. It is probably the application I love the most on my iPad and iPod Touch. Thanks Marco!]

Marco has recently left his position as the CEO of Tumblr; and I think concentrates on Instapaper much more than ever (I assume it was mostly a weekend project before, requiring simple fixes); therefore I have no doubt he will be making the service more reliable and better in the future (switch to S3 or similar).

Also, don't forget that Instapaper web service is currently free, although the iOS applications are not (There is a free lite version too.) There is a recently added subscription option (which AFAIK currently doesn't offer any additional thing); and I hope it will only make the service even better.

About security, I do not consider my Instapaper reading list too confidential, so the thought of the backup computer being stolen doesn't trouble me much. Of course, your mileage might vary. As far as I know, some Instapaper accounts do not even have passwords; you just log in with your email address.

stumm · 15y ago

He was actually the CTO of tumblr.

dcreemer · 15y ago

Are the primary and backup DBs in the same data center? If so, how would you restore from an "unplanned event" there? I ask because I faced that situation once years ago, and very quickly learned that uploading tens of GB of data from an offsite backup will keep your site offline for hours.

In the end I ended up _driving_ a copy of the DB over to a data center. Adding a slaved-replica in another location is pretty easy these days.
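To put rough numbers on "offline for hours", here is a back-of-the-envelope sketch. The link speeds in the usage note are assumptions picked for illustration, not figures from the thread; only the 22 GB database size comes from the discussion.

```python
def transfer_hours(size_gb: float, link_mbps: float) -> float:
    """Hours needed to move size_gb gigabytes over a link of link_mbps megabits/s.

    Uses decimal units (1 GB = 1000 MB) and ignores protocol overhead,
    so real transfers will be somewhat slower.
    """
    size_megabits = size_gb * 1000 * 8  # gigabytes -> megabytes -> megabits
    seconds = size_megabits / link_mbps
    return seconds / 3600
```

For example, `transfer_hours(22, 10)` comes to just under five hours for a 22 GB restore over an assumed 10 Mbit/s uplink, and about two days at 1 Mbit/s, which is the kind of gap that makes driving a disk to the data center competitive.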

ams6110 · 15y ago

"Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway" -- Andrew Tanenbaum

ams6110 · 15y ago

The scenario he presents of being able to recover from an unintentionally broad delete or update query would seem to only work in the simplest of databases. He says:

- Instantiate the backup (at its binlog position 259)

- Replay the binlog from position 260 through 999

- Replay the binlog from position 1001 through 1200

And you’ll have a copy of the complete database if that destructive query had never happened.

This only works if the changes in positions 1001-1200 were unaffected by the undesired changes in position 1000. Seems rather unlikely to me, but maybe in the case of his particular schema it works out.
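The concern can be shown with a toy model. This is a sketch only: it treats the database as a Python dict and each logged statement as a function, which is an assumption for illustration and not how MySQL's binary log is stored or replayed. The positions echo the ones in the quote.

```python
def replay(snapshot: dict, binlog: list, skip: set) -> dict:
    """Apply logged statements in order to a copy of the snapshot, skipping some positions."""
    state = dict(snapshot)
    for position, statement in binlog:
        if position not in skip:
            statement(state)
    return state


def demo() -> tuple:
    snapshot = {"a": 1, "b": 2}  # backup taken at position 259

    def add_c(state):            # position 260: harmless insert
        state["c"] = 3

    def delete_all(state):       # position 1000: the destructive query
        state.clear()

    def insert_if_empty(state):  # position 1001: effect depends on prior state
        if not state:
            state["d"] = 4

    binlog = [(260, add_c), (1000, delete_all), (1001, insert_if_empty)]
    what_happened = replay(snapshot, binlog, skip=set())
    with_skip = replay(snapshot, binlog, skip={1000})
    return what_happened, with_skip
```

In production the statement at position 1001 created row `d`, because the table had just been emptied. In the replay that skips position 1000 the table is not empty, so the same statement does nothing and `d` never appears. Whether a real schema has statements like this after the bad query is exactly the open question raised above.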

rarrrrrr · 15y ago

FYI You could run either tarsnap or SpiderOak directly on the server for a prompt offsite backup. Both have excellent support for archiving many versions of a file, with de-duplication of the version stream, and no limits on how many historical versions are kept.

Also, "gzip --rsyncable" increases the compressed size by only about 1%, but makes deduplication between successive compressed dump files possible.

(I cofounded SpiderOak.)

rbarooah · 15y ago

Would the people who are upset that Marco is using his 'home' computer feel the same if he instead said it was at his office? Offices get broken into or have equipment stolen too - I'm not sure why people think this is so irresponsible given that he works from home now.

philfreo · 15y ago

I upvoted this not because I think personal laptops and Time Machine are a good process for db backups, but because making backups is still a huge pain and problematic area, so the more attention it gets, the better.

hugoahlberg (OP) · 15y ago

Marco has now updated his system with automatic S3 backup: http://www.marco.org/1630412230

japherwocky · 15y ago

are those binlogs timestamped? what wonderful graphs you could make!

konad · 15y ago

I just dump data into Venti and dump my 4gb Venti slices encrypted to DVD and keep an encrypted copy of my vac scores distributed around my systems.

If you're doing full dumps every few days, you're doing it wrong.
