Postmortem of database outage of January 31

(about.gitlab.com)

377 points · mbrain · 9y ago · 257 comments

152 comments · 35 top-level

illumin8 · 9y ago · 35 in thread

I have to say - if they were using a managed relational database service, like Amazon's RDS Postgres, this likely would have never happened. RDS fully automates nightly database snapshots, and ships archive logs to S3 every 5 minutes, which gives you the ability to restore your database to any point in time within the last 35 days, down to the second.

Also, RDS gives you a synchronously replicated standby database, and automates failover, including updating the DNS CNAME that the clients connect to during a failover (so it is seamless to the clients, other than requiring a reconnect), and ensuring that you don't lose a single transaction during a failover (the magic of synchronous replication over a low latency link between datacenters).

For a company like Gitlab, that is public about wanting to exit the cloud, I feel like they could have really benefited from a fully managed relational database service. This entire tragic situation might never have happened if they had been willing to acknowledge the obvious, that managing relational databases is hard, and had allowed someone with better operational automation, like AWS, to do it for them.

renaudg · 9y ago

This is usually true, except when it's not:

I have personally experienced a near-catastrophic situation 3 years ago, where 13 out of 15 days' worth of nightly RDS MySQL snapshots were corrupt and would not restore properly.

The root cause was a silent EBS data corruption bug (RDS is EBS-based), that Amazon support eventually admitted to us had slipped through and affected a "small" number of customers. Unlucky us.

We were given exceptional support including rare access to AWS engineers working on the issue, but at the end of the day, there was no other solution than to attempt restoring each nightly snapshot one after the other, until we hopefully found one that was free of table corruption. The lack of flexibility to do any "creative" problem-solving operations within RDS certainly bogged us down.

With a multi-hundred gigabyte database, the process was nerve-wracking as each restore attempt took hours to perform, and each failure meant saying goodbye to another day's worth of user data, with the looming armageddon scenario that eventually we would reach the end of our snapshots without having found a good one.

Finally, after a couple of days of complete downtime, the second to last snapshot worked (IIRC) and we went back online with almost two weeks of data loss, on a mostly user-generated content site.

We got a shitload of AWS credits for our trouble, but the company obviously went through a very near-death experience, and to this day I still don't 100% trust cloud backups unless we also have a local copy created regularly.

koolba · 9y ago

> We got a shitload of AWS credits for our trouble, but the company obviously went through a very near-death experience, and to this day I don't 100% trust cloud backups unless we also have a local copy created regularly.

Cloud backups, and more generally all backups, should be treated like nuclear proliferation treaties: Trust, but verify!

If you periodically restore your backups, you'll catch this kind of crap when it's not an issue, rather than when shit has already hit the fan.

illumin8 · 9y ago

Wow, I'm sorry you experienced that. This points to the importance of regularly testing your backups. I hope AWS will offer an automated testing capability at some point in the future.

In the meantime, I hope you've developed automation to test your backups regularly. You could just launch a new RDS instance from the latest nightly snapshot, and run a few test transactions against it.
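A minimal sketch of that kind of restore test follows. It uses SQLite from the Python standard library as a stand-in, not RDS, so the function names, the `users` table, and the row-count threshold are all hypothetical; against RDS the same idea would mean restoring a snapshot to a throwaway instance and running the checks there.

```python
import sqlite3
import tempfile
from pathlib import Path


def make_backup(source: sqlite3.Connection, backup_path: Path) -> None:
    """Write a full copy of the live database to backup_path."""
    target = sqlite3.connect(backup_path)
    try:
        source.backup(target)
    finally:
        target.close()


def restore_smoke_test(backup_path: Path, table: str, min_rows: int) -> bool:
    """Open the backup as a fresh database and run basic sanity checks."""
    restored = sqlite3.connect(backup_path)
    try:
        # Structural check: SQLite reports "ok" when no corruption is found.
        (status,) = restored.execute("PRAGMA integrity_check").fetchone()
        if status != "ok":
            return False
        # Content check: the table name is trusted input in this sketch.
        (rows,) = restored.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
        return rows >= min_rows
    except sqlite3.DatabaseError:
        # An unreadable file counts as a failed restore, not a crash.
        return False
    finally:
        restored.close()


def demo() -> bool:
    """Build a tiny database, back it up, and verify the backup restores."""
    live = sqlite3.connect(":memory:")
    live.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    live.executemany("INSERT INTO users (name) VALUES (?)", [("a",), ("b",), ("c",)])
    live.commit()
    with tempfile.TemporaryDirectory() as tmp:
        backup_path = Path(tmp) / "nightly.db"
        make_backup(live, backup_path)
        return restore_smoke_test(backup_path, "users", min_rows=3)
```

Running something like this on a schedule, and alerting when it returns `False`, is what turns "we have backups" into "we have backups that restore".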

bogomipz · 9y ago

This is certainly true of all backups to an extent, though, not just the cloud. Back in the day of backing up to external tape storage, it was important to test restores in case heads weren't calibrated or were calibrated differently between different tape machines, etc.

I am curious: did you manage to automate a restore smoke test after going through this?

mikiem · 9y ago

Snapshots are not backups, although many people use them as backups and believe they are good backups. Snapshots are snapshots. Only backups are backups.

YorickPeterse · 9y ago

RDS, or any hosted database solution, is not some kind of silver bullet that solves all problems. While it's true it takes care of backups automatically, it does also restrict you in terms of what you can do.

For example, you can't load custom extensions into RDS. Also, to the best of my knowledge RDS does not support a hot standby replica you can use for read-only queries, and replication between RDS and non-RDS is also not supported. This means you can't balance load between multiple hosts, unless you're OK with running a multi-master setup (and I'm not sure how well that would play out on RDS).

Most important of all, we ship PostgreSQL as part of our Omnibus package. As a result the best way of testing this over time is to use it ourselves, something we strive to do with everything we ship. This means we need to actually run our own things. Using a hosted database would mean we wouldn't be using a part of what we ship, thus not being able to test it over time.

phonon · 9y ago

> Also, to the best of my knowledge RDS does not support a hot standby replica you can use for read-only queries

RDS has very nice Read Replicas.

http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_R...

For HA you can use High Availability (Multi-AZ).

http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Concep...

carterehsmith · 9y ago

>> RDS does not support a hot standby replica you can use for read-only queries

This is not true anymore.

I set up two read-only RDS replicas, one in a different AWS region, and another in the same region, for read-only queries, just by clicking in AWS console.

wahnfrieden · 9y ago

You can use the failover standby replica for reads with Aurora at least. And you can manually set up replication with non-RDS hosts via MySQL, just not via AWS APIs.

nilved · 9y ago

RDS also comes with its own set of tradeoffs. There is no free lunch, and the cloud is just another word for someone else's server. There are reasons Gitlab opposes that.

discodave · 9y ago

In the meantime, solution architects and sales people from AWS are going to run around with annotated copies of this public post-mortem to enterprises and say "look, RDS would have solved x,y,z and we can do that for you if you pay us."

carterehsmith · 9y ago

>> the cloud is just another word for someone else's server.

No. The cloud (AWS, GCE, Azure etc) is not "just" like your own server.

Just consider some basic details - you pay someone else to worry about things like power outages, disk failures, network issues, other hardware failures, and so on.

snackai · 9y ago

You are completely right. There are reasons to oppose the cloud, but maybe they should focus on improving their systems before moving out of the cloud. At this point in time it is clear that GitLab lacks the talent to run everything themselves. I mean 5 backups worthless or lost? You can't let interns write your backup system. After all, backup is a large portion of their product.

overcast · 9y ago

This likely would have never happened if one of their one hundred and sixty employees had just taken the time to make sure backups were set up at all. You also need to be a sufficiently large organization to warrant the prices that the "cloud" services demand. As stated below, cloud computing is just someone else's server somewhere, and they are making lots of money doing it. Unless you need that level of scalability, and processing, then it's not worth it. I think Gitlab stated their entire PostgreSQL database was only a few hundred gigabytes. That's not exactly huge.

RhodesianHunter · 9y ago

"As stated below, cloud computing is just someone else's server somewhere. Unless you need that level of scalability, and processing, then it's not worth it."

I keep seeing people throw this around as if it's God's truth and it frustrates the hell out of me. It may be the case for your organization but everywhere I have worked (from startups to Fortune 500) the cloud allowed our engineers to focus on our product rather than infrastructure maintenance and contributed massively to our success.

memracom · 9y ago

RDS PostgreSQL is like the Hotel California, you can check in any time you want, but you can never leave. Maybe it is OK as a simple data store for a single app, but not for a real database. I gained a lot of my knowledge of PostgreSQL internals by helping my company get off of RDS and onto a dedicated EC2 instance solution. RDS imposes too many limitations.

Also, your snapshot backup solution is trivial to implement on EC2 or anywhere else for that matter. But it is not easy to do it right in some scenarios. Read https://www.postgresql.org/docs/9.6/static/backup-file.html for details. LVM or ZFS are likely needed under the db layer.

RhodesianHunter · 9y ago

"Maybe it is OK as a simple data store for a single app, but not for a real database."

Currently working at company number 2 with large (many terabytes) databases on RDS and can safely say this is horse shit.

The amount of time and energy it allows our engineers to spend on our actual products instead of database management is worth all of the extra cost and lock in and then some.

Edit: I just realized that you were talking about Postgres on RDS in particular. I don't have experience with Postgres so you may well be right.

atmosx · 9y ago

Could you describe the limitations you encounter in the RDS PSQL setup vs a self-managed?

seldo · 9y ago

Funny you should mention a managed relational database service; Instapaper uses one of those and had more than 12 hours of downtime this week: http://blog.instapaper.com/post/157027537441

No database solution is totally reliable. If storing data is my primary job, like it is GitLab's, I'd like to have as much control of it as possible.

illumin8 · 9y ago

Let's just say that Instapaper's outage was self-inflicted. You don't see them blaming their cloud provider, do you? People make mistakes, and even with a managed relational database service, you can still make mistakes.

The difference is that Instapaper was able to restore from backups, because their managed service performed them properly. The archive data is taking longer to restore, but that's due to design decisions Instapaper made.

imperialdrive · 9y ago

I'm typing off the top of my head, but didn't they have like 400GB of database? That would probably take 27 hours to get fully available via S3 at 32,000 kbps, which is about what S3 will provide for first-time hits in my experience.
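The arithmetic behind that estimate can be checked with a short sketch. It assumes the rate is in kilobits per second and that gigabytes are decimal (10^9 bytes); both are readings of the comment, not something it states. Under those assumptions the figure lands close to the quoted 27 hours.

```python
def transfer_hours(size_gb: float, rate_kbps: float) -> float:
    """Hours needed to move size_gb gigabytes at rate_kbps kilobits per second."""
    bits = size_gb * 1e9 * 8             # decimal gigabytes to bits
    seconds = bits / (rate_kbps * 1e3)   # kilobits per second to bits per second
    return seconds / 3600


print(round(transfer_hours(400, 32_000), 1))  # 27.8
```

If the rate were instead kilobytes per second, the same transfer would take about an eighth of that time, so the unit matters more than the rounding.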

bdcravens · 9y ago

RDS can restore a point-in-time snapshot in a couple of hours on databases over 1TB (speaking from experience).

qaq · 9y ago

RDS has severe performance limitations, as in you can't provision more than 30K IOPS, which is about 1/2 the performance of a low-end consumer SSD and about 1/20 the performance of a decent PCI-E SSD. You're way better off running on decent dedicated hardware for the DB.

illumin8 · 9y ago

You can get 500K random reads per second and 100K random writes per second using RDS Aurora.

If you truly need more than 30K IOPS, I would recommend leveraging read-replicas, a Redis cache, and other solutions before just "throwing money at the problem" and purchasing a million IOPS.

yeukhon · 9y ago

I will add my input on RDS. I gave this comment on the GitLab incident thread. I actually managed to delete an RDS CloudFormation stack by accident. The night before, I pushed an update to CloudFormation to convert the storage class to provisioned IOPS. The next morning I woke up really early and drove my girlfriend to work. While waiting in the car I wanted to check the status of the update, so I went on the AWS mobile app to check. Mind you, I have an iPhone 7, but the app was very slow and laggy. As I was scrolling down to find the failure, there was a lag between the screen render and my click. Damn. I clicked on delete. Yeah, fucking delete. No confirmation. It went through. No stop button.

There was no backup because the cfn template I built at the time did not have the flag that said take a final snapshot. If you do not take the final snapshot (via console, api, cfn) you are doomed: all the auto snapshots taken by AWS are deleted upon the removal of the RDS instance.

This was our staging db for one of our active projects, which the dev team and I had spent about a month working to get to staging and which was under UAT. Fuck. I told my manager and he understood the impact, so he just let me get started on rebuilding. The next morning I got the DB up and running, since luckily I had compiled my runbook when I first deployed it to staging. But it was not fun, because the data is synced via AWS DMS from our on-premise Oracle db, so I needed to get sign-off from a number of departments.

So I learned my first lesson with RDS - make sure the final snapshot flag is enabled (for EC2 users, please remind yourself that anything stored on ephemeral storage is going to be lost upon a hard VM stop/start operation, so back up!!!).
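One way to catch a missing flag before deployment is to lint the template. The sketch below assumes the "final snapshot flag" corresponds to CloudFormation's `DeletionPolicy` resource attribute, and that templates are available as JSON; the function name, the list of resource types, and the sample resource names are hypothetical.

```python
import json
from typing import List

# Resource types checked here (an assumed list, not exhaustive).
SNAPSHOT_TYPES = {"AWS::RDS::DBInstance", "AWS::RDS::DBCluster"}
# Policies that keep the data, or a snapshot of it, when the stack is deleted.
SAFE_POLICIES = {"Snapshot", "Retain"}


def unprotected_databases(template_text: str) -> List[str]:
    """Return logical IDs of RDS resources without an explicit safe DeletionPolicy."""
    template = json.loads(template_text)
    flagged = []
    for logical_id, resource in template.get("Resources", {}).items():
        if resource.get("Type") not in SNAPSHOT_TYPES:
            continue
        if resource.get("DeletionPolicy") not in SAFE_POLICIES:
            flagged.append(logical_id)
    return flagged


# Hypothetical template for illustration only.
EXAMPLE = """
{
  "Resources": {
    "StagingDb": {"Type": "AWS::RDS::DBInstance", "Properties": {}},
    "ProdDb": {
      "Type": "AWS::RDS::DBInstance",
      "DeletionPolicy": "Snapshot",
      "Properties": {}
    },
    "Bucket": {"Type": "AWS::S3::Bucket"}
  }
}
"""

print(unprotected_databases(EXAMPLE))  # ['StagingDb']
```

The check deliberately requires the policy to be written out rather than relying on whatever the service default happens to be, so the intent is visible in review.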

I also learned that RDS is not truly HA in the case of upgrading servers, both minor and major upgrade. I've tested major upgrade and saw DB connection unavailable up to 10 min. In some minor version upgrades both primary and secondary had to be taken down.

Other small caveats, such as auto minor version upgrades, maintenance windows, retention for automated snapshots being capped at 35 days, event logs in the RDS console not lasting more than a day, and the cost of converting to provisioned IOPS, are just small annoyances that I would encourage folks to pay close attention to. Oh yeah, manual snapshots also have to be managed by yourself; kind of obvious, but there is no life cycle policy. And building a read replica took up to a day on my first attempt at ever creating one.

Of course, now that I have learned these lessons, we have auto and manual snapshots and a better schedule. I encourage you to take ownership of upgrades, even minor version ones, so you know how to design your applications to be more fault tolerant. In the end, the thing I like most about RDS is the extensive free CW metrics available. I also recommend that people not use the mobile app, and if you do, set up a read-only role / IAM user. The app is way too primitive and laggy. I still enjoy using RDS, the service is stable and quick to use, but just make sure you have the habit of backing up and take serious ownership and responsibility of the database.

illumin8 · 9y ago

There is no magic silver bullet that will let you upgrade a database without some minor amount of downtime. RDS minimizes this as much as possible by upgrading your standby database, initiating a failover, then creating a new standby. Clients will always be impacted because you have to, by definition, restart your database to be running the new version.

You can select your maintenance window, and you can defer updates as long as you want - nobody will force you to update, unless you check the "auto minor version update" box.

Please don't blame AWS for your lack of understanding of the platform. They try to protect you from yourself, and the default behavior of taking a final snapshot before deleting an instance is in both CloudFormation and the Console. If you choose to override those defaults, don't blame AWS.

cookiecaper · 9y ago

>So I learned my first lesson with RDS - make sure the final snapshot flag is enabled (for EC2 users, please remind yourself that anything stored on ephemeral storage is going to be lost upon a hard VM stop/start operation, so back up!!!).

This bit us once. Someone issued a `shutdown -h now` out of habit in an instance that was going for reboot, and it came back without its data, because "shutdown" is the same as "stop", and "stop" on ephemeral instances means "delete all my data". Since the command was issued from inside the VM, no warning or message that would've appeared on the EC2 console was displayed.

Amazon's position on ephemeral storage was shockingly unacceptable and unprofessional. They claimed they had to scrub the physical storage as soon as the stop button was pressed for security purposes, which is a complete cop-out. Of course they can't reallocate that chunk of the disk to the next instance while your stuff is on it, but they could've implemented a small cooldown period between stoppage, scrubbing, and reallocating the disk so that there would at least be a panic button and/or so accidental reboots-as-shutdowns don't destroy data. The only reason they didn't do that is because they didn't want to need to expand their infrastructure to accommodate it. Very sloppy, and not at all OK. That's not how you treat customer data.

Fortunately, AWS has moved on; I don't think that any new instances can be created with ephemeral storage anymore. Pure EBS now.

>I also learned that RDS is not truly HA in the case of upgrading servers, both minor and major upgrade. I've tested major upgrade and saw DB connection unavailable up to 10 min. In some minor version upgrades both primary and secondary had to be taken down.

You need multi-AZ for true HA. Failover within the same AZ has a small delay, as you've noted.

>I still enjoy using RDS, the service is stable and quick to use, but just make sure you have the habit of backing up and take serious ownership and responsibility of the database.

As many others in this thread have said, AWS and other cloud providers aren't a silver bullet. Competent people are still needed to manage these sorts of things. GitLab most likely would not have fared any better under AWS.

AlisdairO · 9y ago

Not sure how it works on cloudformation, but in the console and API you have to explicitly skip the final snapshot.

nhumrich · 9y ago

I know about the daily snapshots, but didn't know about the archive logs. Is this something I have to enable? How do I get the logs and how do I restore using them?

illumin8 · 9y ago

It's automatic. Go ahead and launch a new instance, restoring to a point in time (that's how you do restores in RDS). Notice that it gives you calendar day/date/time fields where you can select the recovery point down to the second. This is enabled by replaying the archive logs to get you to the exact point in time.

robodale · 9y ago

Jesus christ, thanks for the infomercial.

bitshepherd · 9y ago

A great number of issues can be attributed to the selection of Azure as the platform of choice. That said, a little bird told me that the decision was largely a cost factor. "You get what you pay for" never rang more true.

tedunangst · 9y ago

But none of the issues were Azure/cost related except for the slow recovery? I mean, neither AWS nor GCE can make you notice you're not getting cron mail.

ronack · 9y ago

Yes, I recall seeing a ticket that referenced Gitlab using Azure because it was heavily subsidized. My company uses Azure for much the same reason, and my experience has been largely positive.

MichaelGG · 9y ago

Is Azure cutting deals beyond the usual free 60k over a year or two or whatever it is for cool startups? Azure seems significantly more expensive in general, problems and slowness aside.

NPegasus · 9y ago · 17 in thread

  > Root Cause Analysis
  > [...]
  > [List of technical problems]

No, the root cause is you have no senior engineers who have been through this before. A collection of distributed remote employees, none of whom has enough experience to know any of the list of "Basic Knowledge Needed to Run a Website at Scale" that you list as the root causes. $30 million in funding and still running the company like a hobby project among college roommates.

Mark my words, the board members from the VC firms will be removed by the VC partners due to letting the kids run the show. Then VC firms will put an experienced CEO and CTO in place to clean up the mess and get the company on track. Unfortunately they will probably have wasted a couple years and be down to the last million $ before they take action.

ebiester · 9y ago

I am a senior engineer who has seen shit go down. I am quite literally the graybeard.

I am not a GitLab customer, I am not a startup junkie, and I'm usually considered one of the more conservative (in action, not politics) engineers in my peer group in technology adoption.

The cloud is just someone else's computer.

However, I've also seen graybeards who should have known better fuck something up. I've seen a team of smart people who in a moment of crisis made the wrong decision. I am currently in an organization that is full of careful people and have still seen data loss.

I went trawling through LinkedIn for GitLab employees, and they certainly have their fair share of senior engineers. If you want to fault them for being a remote company, that's fine, but is it that different than a Fortune 500 company that has developers in the Bay Area, Austin, India, China, Budapest, and remote workers in other locations?

Or is a company only legitimate if it's in an open space in the Valley?

colechristensen · 9y ago

I couldn't agree with you more.

As your beard greys you realize absolutely everywhere is a mess. Everybody is an imposter and nothing matches the ideals you think should exist. The most capable people are just as prone to fat fingering critical commands as the greenhorns.

People have the wrong attitude towards failure and it's actually quite harmful. If you don't actively study it and make avoiding failure the #1 priority of your company, you're absolutely doomed to commit a serious error at some point, and usually it's fine.

We're talking about a distributed version control system. Half the point is resilience to data loss. Compound that with the final result, which was a site down for a day and loss of 6 hours of data. I've lost a day of work before to doing nothing, and I've worked hard for a few hours and then accidentally deleted the result. If you haven't, you're probably lying to yourself. I didn't fall on my sword. It's just not that big of a deal. If it happens frequently? Sure. But it's going to happen once to a lot of people.

One of the very most important aspects to avoiding failure is being amiable when it happens. Fear of failure causes quite a bit of failure and stupid behavior to try to hide and avoid it.

I also simply don't understand the vitriol towards remote work.

cookiecaper · 9y ago

The issue is not that data loss occurred per se, nor is it that destructive accidents and oversights don't happen to senior people. The issue is that GitLab's surprisingly amateur and sloppy practices, many of which are blatantly obvious to people with a medium amount of ops experience, have bled through every aspect of this incident since it first occurred.

They didn't just lose data. They lost data and all of their actual backups were invalid. They had to restore from a system image that was taken for non-backup purposes, and, as luck would have it, was able to function as a backup in this instance. Not having working backups for months-long stretches rises to the level of negligence or incompetence from whoever is supposed to be supervising their infrastructure.

We all know that backups in the general sense are crucial and that they don't get done nearly often enough, but being lazy about backing up the home directory on your laptop is a lot different than allowing the company to sit without working backups for months.

I'm not saying that this doesn't happen to senior engineers who are victims of bad management, but qualified leadership doesn't allow it.

On top of that, it emerges that this condition occurred because they don't have good practices around when to log in to the master database server, they remove binary data directories before they pull down new copies, they don't know how to configure PgSQL and have to do a full standby resync after a couple of hours of high DB load because they don't have WAL archiving, replication slots, or even a semi-sane wal_keep_segments/min_wal_size set, they have no automated backup sanity check (let alone a schedule of human-verified backup restores) and other inadequate monitoring and alarming practices, and do I really need to go on? I could, because in this thread alone there are several other major faux pas mentioned.
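To make the settings named above concrete, here is a rough sketch of a check over a `postgresql.conf`-style file. It assumes the 9.x-era parameter names used in the comment, treats missing values as the 9.x defaults (`archive_mode` off, `max_replication_slots` and `wal_keep_segments` at 0), and uses a hypothetical threshold of 64 segments; none of this is GitLab's actual configuration, and the parser only handles plain `name = value` lines.

```python
def parse_conf(text: str) -> dict:
    """Parse simple `name = value` lines from a postgresql.conf-style file."""
    settings = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        name, _, value = line.partition("=")
        settings[name.strip().lower()] = value.strip().strip("'")
    return settings


def replication_warnings(settings: dict, min_keep_segments: int = 64) -> list:
    """Flag settings that make a standby likely to need a full resync."""
    warnings = []
    if settings.get("archive_mode", "off") not in ("on", "always"):
        warnings.append("archive_mode is off: no WAL archive to catch up from")
    if int(settings.get("max_replication_slots", "0")) == 0:
        warnings.append("max_replication_slots is 0: slots cannot be used")
    if int(settings.get("wal_keep_segments", "0")) < min_keep_segments:
        warnings.append("wal_keep_segments is low: WAL may be recycled early")
    return warnings


# Hypothetical settings for illustration only.
SAMPLE = """
archive_mode = off
wal_keep_segments = 8
"""

for warning in replication_warnings(parse_conf(SAMPLE)):
    print(warning)
```

Any one of the three mechanisms gives a lagging standby somewhere to fetch old WAL from; the point of the check is to notice when none of them is in place.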

I'm not sure how many of these sloppy, amateur errors you want to allow to stack on top of each other before you start thinking that GitLab is semi-responsible for this and that it's not within the typical senior-person margin of error, but it passed that threshold a long time ago for me.

GitLab severely underpays for any candidate not based in a top 10 real estate market, talking like 50-60% under market, because they punish candidates based on how much cheaper the real estate in their home market is than in New York City. The consensus is that this impedes their ability to obtain good talent and I would say that the events of the last couple of weeks have demonstrated that with spectacular clarity.

At least in my case, the impression has nothing to do with their operation as a remote company -- I'm a full-time remote worker and I learned about GitLab's atrocious salary formulae when I was checking them out as a potential employer because I wanted to move to an all-remote company (instead of the partially-remote company I work in now).

I'm sure that most of GitLab's engineers are good engineers relative to their experience levels. I'm also sure a small handful who accidentally align with their salary formula are senior in their particular fields. And I'm thirdly sure that no one with any inkling of experience in running a stable, reliable, production-level service and infrastructure has been allowed any fractional amount of influence in their infrastructure and deployment procedures.

gumby · 9y ago

> Mark my words, the board members from the VC firms will be removed by the VC partners

I have never, ever, seen this. Every firm has its own internal politics [1] but rarely if ever would they do this. They are more likely to just ignore it.

There is a belief that it takes a decade to tell if a VC is any good or not, and that includes "learning experiences" (all on the LPs' dime of course).

> Then VC firms will put an experienced CEO and CTO in place

Now this I have seen. It even works sometimes (e.g. Eric Schmidt/Google).

[1] I have had a firm invest in which all decisions were made by a single partner. I also had a firm invest (a sizable sum!) in which other senior partners never met me and only learned what my company even did when I gave a presentation at one of their LP meetings. Also some very large funds allow senior partners to make small seed investments ("science projects") without formal approval from the partnership.

pfarnsworth · 9y ago

At a startup I worked at, the VP of Engineering was brilliant. He was probably the smartest person I've ever worked for, and the most hard working. He was online almost all hours of the day, working. He also insisted on a 9-to-5 schedule for all the engineers, because he believed that killing your engineers with work was not a scalable way to build a team. He was great.

But the first month I was there, I kept pressing him on what our disaster recovery plan was. His answers were weak at best. It was never tested, and he only had broad ideas of how much time it would take for a full recovery. I don't understand his reticence for testing full disaster recovery, but as everyone knows, unless you have tested DR, you don't have DR.

It was very scary, but in the several years I was there, we never had the database go down hard and lose data. But that was more blind luck than anything else. If we had a data outage, it would have probably been worse than Gitlab's outage by far.

pzh · 9y ago

Maybe in his own cocky way, his disaster recovery plan was to be smart enough to never have a disaster. The problem with this approach is that it only works for a small company, and many people figure out the hard way that smarts don't scale.

pzh · 9y ago

I know somebody who knows the guy who deleted the data that caused the outage, and in his words (the guy who knows the GitLab employee), the GitLab engineer is one of the smartest people he's ever known--in fact, he's actually brilliant. So you can rest assured that the data loss wasn't caused by an inexperienced kid.

grzm · 9y ago

You're aware he's in this thread, right? And the hearsay matches up with his comments, from what I've read.

elliotec · 9y ago

They also openly pay well below market rate, so they should expect to get what they pay for.

bogomipz · 9y ago

Can you elaborate on that?

For instance, I just looked at their job listings, and they advertise a range for annual compensation; the high end of that range looks about right for the locations I checked (SF, and London in the UK) for PE positions.

https://about.gitlab.com/jobs/production-engineer/

gech · 9y ago

Didn't Samsung have some phones catch on fire? Didn't Delta go down recently?

Cyph0n · 9y ago

Nah, let's just all board the anti-GitLab train. I honestly don't understand why the HN crowd has their panties in a knot. HN wasn't this bad even during the VW fiasco.

developer2 · 9y ago

>> the root cause is you have no senior engineers who have been through this before

They openly publish their database hostnames in this postmortem (db1.cluster.gitlab.com and db2.cluster.gitlab.com). These actually have public DNS that resolves. The last straw: port 22 on each server is running an open sshd server (the fact that password auth is disabled is of little consolation).

A production database server should NEVER HAVE a public IP address to start with. This is simply unacceptable and proves they don't have a single person qualified to handle infrastructure. Their only concern is that their developers can ssh into every production server without having to deal with VPNs or firewalls.

Huge red flag that your data cannot be trusted.

milesrout · 9y ago

I can't tell if this is some kind of joke or not.

There's absolutely nothing wrong, whatsoever, with having a public IP address on a production database server.

p4lindromica · 9y ago

Exactly. The technical problems are not the root cause. How did the existing processes (or lack thereof) fail? How did the organization fail?

AsyncAwait · 9y ago

Like we haven't seen similar screwups at other, more 'professional' companies. They may sweep it under the rug better, but that's all.

dorianm · 9y ago

Exactly, I feel like it was nobody's job to make sure everything was resilient to failures.

meowface · 9y ago · 14 in thread

>Trying to restore the replication process, an engineer proceeds to wipe the PostgreSQL database directory, errantly thinking they were doing so on the secondary. Unfortunately this process was executed on the primary instead. The engineer terminated the process a second or two after noticing their mistake, but at this point around 300 GB of data had already been removed.

I could feel the sweat drops just from reading this.

I'd bet every one of us has experienced the panicked Ctrl+C of Death at some point or another.

lobster_johnson · 9y ago

Brings back memories, though not of anything I did. Quoting a comment I made on HN recently in a different thread:

---

Back in 2009 we were outsourcing our ops to a consulting company, who managed to delete our app database... more than once.

The first time it happened, we didn't understand what, exactly, had caused it. The database directory was just gone, and it seemed to have gone around 11pm. I (not they!) discovered this and we scrambled to recover the data. We had replication, but for some reason the guy on call wasn't able to restore from the replicas -- he was standing in for our regular ops guy, who was away on site with another customer -- so after he'd struggled for a while, I said screw it, let's just restore the last dump, which fortunately had run an hour earlier; after some time we were able to get a new master set up, and although we had lost one hour of data, it was fortunately from a quiet period with very few writes. Everyone went to bed around 1am and things were fine, the users were forgiving, and it seemed like a one-time accident. The techs promised that setting up a new replication slave would happen the next day.

Then, the next day, at exactly 11pm, the exact same thing happened! This obviously pointed to a regular maintenance job as being the culprit. It turns out the script they used to rotate database backup files did an "rm -rf" of the database directory by accident. Again we scrambled to fix it. This time the dump was 4 hours old, and there was no slave we could promote to master. We restored the last dump, and I spent the night writing and running a tool that reconstructed the most important data from our logs (fortunately we logged a great deal, including the content of things users were creating). I was able to go to bed around 5am. The following afternoon, our main guy was called back to help fix things and set up replication. He had to travel back to the customer, and the last thing he told the other guy was: "Remember to disable the cron job".

Then at 10pm... well, take a guess. Kaboom, no database. Turns out they were using Puppet for configuration management, and when the on-call guy had fixed the cron job, he hadn't edited Puppet; he'd edited the crontab on the machine manually. So Puppet ran 15 mins later and put the destructive cron job back in. This time we called everyone, including the CEO. The department head cut his vacation short and worked until 4am restoring the master from the replication logs.

We then fired the company (which filed for bankruptcy not too long after), got a ton of money back (we threatened to sue for damages), and took over the ops side of things ourselves. Haven't lost a database since.

lojack9y ago

I'm no sysadmin, and I know mistakes are inevitable and all... but I find this kind of mistake is unlikely to come from me. I feel as though a lot of developers are too nonchalant about production boxes. I think one or two close calls where I nearly did this exact thing served as a good wakeup call for me.

Steps I personally take to avoid this:

- Avoid prod boxes like the plague

- Set up a prompt (globally) to make it extremely obvious that you're in production. Something like a red background and black text saying "PRODUCTION"

- When changing data in production (DB's, config, etc) write a script (or just commands to copy and paste) and have that peer reviewed. If anything doesn't go to plan, treat it as a red flag. This serves a dual purpose of having a quick record of your actions without hunting through logs.

- Never ever leave open sessions

- Avoid prod boxes. This is important enough for me to say twice. Most of the time it can be avoided, especially if you use configuration management tools and write tools to perform common operations.

Now, let me just cross my fingers that I don't jinx myself :-)
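
A minimal sketch of the prompt idea from the list above, written as a small Python helper that generates a bash `PS1` value. The environment names, the colour choices and the `build_ps1` function are assumptions made up for illustration; only the bash prompt escapes (`\[`, `\]`, `\u`, `\h`, `\w`, `\$`) and the ANSI colour codes are standard.

```python
# Hypothetical helper: builds a bash PS1 string that labels the environment.
# Environment names and colours are assumptions made for this example.

RESET = "\\[\\033[0m\\]"

# ANSI SGR codes: 41 = red background, 43 = yellow background,
# 42 = green background, 30 = black text.
STYLES = {
    "production": "\\[\\033[41;30m\\]",
    "staging": "\\[\\033[43;30m\\]",
    "development": "\\[\\033[42;30m\\]",
}


def build_ps1(environment: str) -> str:
    """Return a PS1 value with a coloured environment banner.

    Unknown environments raise, so a box cannot silently end up with an
    unlabelled default prompt.
    """
    if environment not in STYLES:
        raise ValueError(f"unknown environment: {environment!r}")
    label = environment.upper()
    # \u, \h and \w are bash prompt escapes for user, host and working dir.
    return f"{STYLES[environment]} {label} {RESET} \\u@\\h:\\w\\$ "


if __name__ == "__main__":
    print(build_ps1("production"))
```

Labelling every environment, not only production, means an unchanged default prompt is itself a warning sign.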

StreamBright9y ago

This sounds really cool. But what if you are working for Amazon or Google and you have 1M production boxes? A nice prompt won't save you; change management will. Write down the exact steps, read them yourself, get others to read them, and then execute them line by line. In my experience this is a much better approach than avoiding production boxes or having a red prompt, and it also scales to larger infrastructures. If something goes sideways (and in some cases it will), you can pinpoint the root cause quickly.

DigitalJack9y ago

It may be unlikely to come from you, but it's for damn sure that it'll never happen to said engineer again.

Also, I would make sure to have a different prompt than default for non-prod systems too. That way you know to be suspicious if it hasn't been changed from default.

viraptor9y ago

I don't think most of your points really apply though. They were setting up replication in production, so they had to work on production boxes. Setting prompt to say just "production" wouldn't help for the same reason. Production was intended.

Peer review though - yes. That could help. I wouldn't say "I'm unlikely to make that mistake" - it's likely to go on the famous last words list...

chrismorgan9y ago

I messed up an XP computer at home with `cd D:\backups\something; del /s * ` many years ago; `cd` without the /D flag doesn’t change the current drive, so although D:\backups\something became the working directory on the D: drive, the shell's actual working directory was still C:\WINDOWS\system32, and cmd.exe was running as administrator.

Fortunately disks were slower back then, so it hadn’t deleted too many files when I interrupted it, and the computer was able to be recovered without too much inconvenience.

rectang9y ago

For me, it was when I meant to execute this...

    rm -rf ~/foo

... but executed this instead:

    rm -rf ~ /foo
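
To make the difference concrete, here is a small illustration using Python's `shlex`, which tokenizes a command line the way a POSIX shell does. It only splits words; it does not expand `~` or `$HOME` the way a real shell would afterwards. The paths are just examples.

```python
import shlex

# How a POSIX shell splits each command line into separate arguments.
intended = shlex.split("rm -rf ~/foo")
typo = shlex.split("rm -rf ~ /foo")
quoted = shlex.split('rm -rf "$HOME/my folder"')

print(intended)  # ['rm', '-rf', '~/foo']
print(typo)      # ['rm', '-rf', '~', '/foo']
print(quoted)    # ['rm', '-rf', '$HOME/my folder']
```

With the stray space, `~` becomes an argument of its own, so the home directory itself is a deletion target. Quoting keeps a path with spaces in one piece, but note that a real shell does not expand a quoted `~`, which is why the quoted example uses `$HOME`.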

cjbprime9y ago

My worst data loss was:

    $ tar cvfz mbox outbox mbox.tar.gz

The argument order is backwards -- the output file is supposed to be first, then the input files.

On my system, this overwrote my full mailbox with a gzipped copy of my outbox, and a complaint that the mbox.tar.gz input file didn't exist.

That's right, the worst data loss happened while I was trying to take a backup. :(
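
One way to make this class of mistake fail loudly is to create the archive in exclusive mode, so an existing file is never overwritten. A minimal sketch with Python's `tarfile` follows; `make_backup` is a hypothetical helper, not something from the thread.

```python
import tarfile
from pathlib import Path
from typing import Iterable


def make_backup(archive: Path, sources: Iterable[Path]) -> None:
    """Create a gzipped tar archive, refusing to overwrite an existing file."""
    sources = list(sources)
    # Check the inputs first so a typo does not leave a half-written archive.
    missing = [str(s) for s in sources if not s.exists()]
    if missing:
        raise FileNotFoundError(f"missing input files: {missing}")
    # Mode "x:gz" is exclusive creation: it raises FileExistsError if
    # `archive` already exists instead of truncating it.
    with tarfile.open(archive, "x:gz") as tar:
        for src in sources:
            tar.add(src, arcname=src.name)
```

With this, swapping the arguments so that `mbox` lands in the archive position raises `FileExistsError` and leaves the mailbox untouched.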

BinaryIdiot9y ago

I did the Windows equivalent once a long time ago (I think it's deltree?) and I did it on a university computer system. It cleared out a TON of files and the computer itself pretty much stopped working. I had to hard turn it off.

Fortunately the University was using some tool that can re-image a computer each time it boots before hitting Windows so starting it back up and all the deleted system and application files were back.

dsavinkov9y ago

this is the reason why I always use quotes when specifying folders with rm -rf :)

meowface9y ago

For me it was a simple `rm -rf .`. Thought I was in ~/somefolder/, was actually in ~...

I stopped it after about 3 seconds, but that was enough to do critical damage.

a_t489y ago

Misread that as "The engineer was terminated" at first. Poor guy.

haldean9y ago

Potential outage prevention plan: put an alias on all production machines that emails HR to schedule a disciplinary meeting every time you run `rm -rf`.

INTPenis9y ago

At a small web host, early in my career, I once saw the boss blur past my desk towards the server room, throw open the big vault door and disappear inside.

Turns out he had accidentally executed an rm of the home dir on a major web server in the background so in panic, instead of killing the right pid, he just ran to the server and pulled the power cords. :D

Ended up restoring a few home dirs from tape.

KayEss9y ago· 8 in thread

The engineers still seem to have a physical server mindset rather than a cloud mindset. Deleting data is always extremely dangerous and there was no need for it in this situation.

They should have spun up a new server to act as secondary the moment replication failed. This new server is the one you run all of these commands on, and if you make a mistake you spin up a new one.

Only when the replication is back in good order do you go through and kill the servers you no longer need.

The procedure for setting up these new servers should be based on the same scripts that spin up new UAT servers for each release. You spin up a server that is a near copy of production and then do the upgrade to new software on that. Only when you've got a successful deployment do you kill the old UAT server. This way all of these processes are tested time and time again and you know exactly how long they'll take and iron out problems in the automation.

matt_wulfeck9y ago

This type of thing always sounds good and all, but the reality is people get desperate and emotional when their website is down and everyone wants it up ASAP.

KayEss9y ago

I certainly don't disagree with that, but if you have this automated it is also the fastest way to get it back up and running. Besides, the site wouldn't have been down if they had this.

jjirsa9y ago

Desperate and emotional is no way to run a business

lomnakkus9y ago

> They should have spun up a new server to act as secondary the moment replication failed.

In a perfect world everything is cluster-ready &c at the outset. In this world it usually... isn't.

EDIT: ... and I'd posit that such cluster-readiness actually isn't worth it most of the time.

fredsir9y ago

Do you think it's not worth it most of the time because of the hassle of setting up and managing a cluster, or because clusters in and of themselves are not necessary for most?

kuschku9y ago

That’s a nice idea, if you’re willing to pay the massive extra cost to actually rent all those overpriced systems.

For me, personally, going from cloud servers to rented dedicated servers cut my bill by 93% – more than an order of magnitude. At same performance.

In fact, it’d be cheaper to run 10x as many dedicated servers than to use cloud solutions for me.

theptip9y ago

It cut your cloud services bill by 93%, but how much did it increase your engineering bill by?

If your engineering time is free, then this calculation is complete. Otherwise it is not.

Does that 93% saving pay for a DB engineer, or enough of your developers' time to build the same quality of redundancy as you'd get with a DBaaS?

This calculus is going to be different for every DB and every company, but the OpEx impact of switching to dedicated servers is a bit more complex than you suggest above.

jjirsa9y ago

Using single-master software is already an antiquated concept in 2017

yarper9y ago· 7 in thread

It's amazing how quickly it descends into "one of the engineers" did x or y. Who was steering this ship exactly?

It's really simple to point the finger and try to find a single cause of failure - but it's a fool's errand - comparable to finding the single source behind a great success.

jdavis7039y ago

Do you expect management to be staring over your shoulder every time you do some kind of `rm` on a production server? With great power comes great responsibility.

yarper9y ago

People get tired, sick, frustrated, panic - part of being a responsible engineer is accepting you're as fallible as the next person and building in protection against your own errors.

However, if "the engineer" that caused this happens to read this, the above is not a sign that you should quit the profession and become a hermit. A chain of events caused this, you just happened to be the one without a chair to sit in when the music stopped.

damagednoob9y ago

Management is often at fault for not giving engineers the resources to do their job properly. How much of the 'Improving Recovery Procedures' were already highlighted but ignored? Were they pressured to deliver other features instead of bedding down some of their operations procedures?

I'm not saying this is the case here but it's all too easy to blame someone for making a mistake. Even the most experienced make mistakes but reducing your MTTR is often overlooked in favour of other seemingly more pressing concerns.

_ph_9y ago

It was not management's job to prevent the engineer from typing "rm". It was management's job to make sure that typing "rm" would not result in big data loss. This is assuming the engineer was not already the highest-ranking technical person in the company.

I am very happy about their open post-mortem, so that anyone can learn from it. Reading it, it looks to me that the "rm" was not the cause of the disaster; it just triggered it. The real problem was the whole setup, which failed. And that is something that falls under management's responsibility.

chadcmulligan9y ago

Doesn't everyone alias rm to rm -i on prod?

Likewise, all ttys have red backgrounds on prod.

pfarnsworth9y ago

90% of all outages are caused by human error. That's why change management solutions were so big 10 years ago, trying to get rid of the human element of changes in an enterprise.

angry_octet9y ago

The way to deal with that is not to crush the people, but to analyse the ways human decision making fails and to provide assistance. More often than not the failures occur because the people had not been trained correctly, which is a management problem not a PEBKAC one (how many junior systems people have been trained in how to recover from cascading disk/network/database failures?). ITIL is just a tool for ensuring failures which no one understands because understanding has been devalued.

ky7389y ago· 6 in thread

RIP the engineer

YorickPeterse9y ago

Still here, and doing just fine.

sslalready9y ago

I love your tag line here on HN! Some 15 yrs ago I was hired as a UNIX administrator at some larger company. Despite being fresh from school I already had plenty of experience from spending the 90's hacking and programming on whatever UNIX system fyodor or rootshell.com had an exploit for. When the DBA was leaving for vacation they didn't hesitate to let me take over his daily routines. On this particular summer day, I had a simple job: dump the production database and load the data into the test environment. I had to make sure the dump was finished before 4pm when the daily production run started (which, FWIW, continued well into the evening). This was an Oracle shop so I believe the commands were mostly "exp" and "imp" -- with the caveat that the imp command would need an additional parameter to select the test database instead of the production one that was the default.

Yeah, you see where this is going. The prod dump finished in time and shortly before leaving work I started importing the data. Then I sat around for a while before I realized I had forgotten about that additional "use the test environment" parameter -- and now I was importing a several hours old dump into the production database while the daily production run was running. I had to call company execs and explain the catastrophe to them, who in turn had to call in the vendor that sold us the system. Those were some pretty scary hours for a 20 yr old kid. Luckily it was just a matter of aborting the production run, reloading the prod dump and then rescheduling the production run for the day.

The next day I had to start my day at the vendor's place to get some shaming, but also a good piece of advice - "always say destructive things out loud before doing them". Then they continued to tell me stories of people they had worked with who really messed things up, and we all had some good, evil laughs.
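
The "say it out loud" advice can be turned into a small guard: the destructive step only runs after the operator has typed the exact name of the target. This is a sketch under my own assumptions; `run_guarded` and the database names are hypothetical, and it is not how the Oracle `imp` tool or any particular product behaves.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def run_guarded(action: str, target: str, typed: str,
                operation: Callable[[], T]) -> T:
    """Run `operation` only if `typed` matches the target name exactly.

    There is deliberately no default target: the caller must name it, and
    the operator must repeat it.
    """
    if typed.strip() != target:
        raise PermissionError(
            f"refusing to {action} on {target!r}: confirmation did not match"
        )
    return operation()


if __name__ == "__main__":
    target = "test-db"  # hypothetical database name
    typed = input(f"Type the name of the database to import into ({target}): ")
    run_guarded("import", target, typed, lambda: print("importing..."))
```

Typing the name forces the operator to look at which system the command is aimed at, which a plain yes/no prompt does not.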

Mistakes build experience, and hard learned lessons even more so. You now have a pretty good conversation starter to put on your CV. Personally I'd rather hire someone who was a "removal" specialist over someone who hadn't learned the skill yet. :)

I believe both GitLab and the community in general will come out stronger from this incident. Thank you all for being so transparent about it.

tuyguntn9y ago

This failure will probably make GitLab more reliable in the future. What doesn't kill you makes you stronger.

boulos9y ago

I enjoy your updated profile.

AsyncAwait9y ago

Actually, I'll be surprised if he hasn't received any offers by now. I would perhaps specifically hire him to deal with databases as I am pretty sure he's never going to make this mistake again.

YorickPeterse9y ago

I haven't received any offers so far. I don't intend on leaving GitLab any time soon either.

nodesocket9y ago· 5 in thread

My main question is still:

>> Why did replication stop? - A spike in database load caused the database replication process to stop. This was due to the primary removing WAL segments before the secondary could replicate them.

Is this a bug/defect in PostgreSQL then? Incorrect PostgreSQL configuration? Insufficient hardware? What was the root cause of Postgres primary removing the WAL segments?

cookiecaper9y ago

The cause is bad configuration.

PgSQL, Mongo, and MySQL all use a transaction stream like this for replication and they all have to put some kind of cap on it or risk running out of disk space, but the cap should be made sufficiently large to allow automatic resumption of disconnected slaves without manual redumping, except in extraordinary circumstances. Log retention should be long enough to last at least a long weekend so that someone can come in and poke the DB back into action on Tuesday morning, but preferably more like 1 week. Alarms should be configured to fire well before replication lag gets anywhere near the log expiration timeout.
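
The sizing argument above is simple arithmetic, sketched here in Python. The function names, the write rate and the 25% warning threshold are assumptions for illustration; they are not PostgreSQL settings or GitLab's actual numbers.

```python
def retention_hours(retained_gib: float, write_rate_mib_per_s: float) -> float:
    """Hours a disconnected replica can be down before retained log runs out."""
    retained_mib = retained_gib * 1024
    return retained_mib / write_rate_mib_per_s / 3600


def lag_alarm(lag_hours: float, retention: float,
              warn_fraction: float = 0.25) -> str:
    """Classify replication lag relative to the log retention window."""
    if lag_hours >= retention:
        return "resync-required"   # needed log segments are already gone
    if lag_hours >= warn_fraction * retention:
        return "alert"             # page someone well before the cliff
    return "ok"


if __name__ == "__main__":
    window = retention_hours(retained_gib=512, write_rate_mib_per_s=1.0)
    print(f"retention window: {window:.1f} h")
    print(lag_alarm(lag_hours=4, retention=window))
```

The point of `warn_fraction` is that the alarm fires while there is still plenty of retained log left, so the replica can catch up without a full resync.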

In particular, PostgreSQL has a feature that allows automatic WAL archiving (i.e., it confirms that the WAL has been successfully shipped to a separate system before it removes it from the master) and a feature called "replication slots" that ensures that all WALs are kept if a regular subscriber is offline. If either of these features had been correctly configured, there would've been no need to do a full resync; the secondary database would've come back and immediately picked up where it left off.

Additionally, if one must resync the full database (and I've had to do this many times), tools like pg_basebackup and innobackupex are basically required to consistently perform the process of pulling the master dumps, and the old (unsynced) data directory should be allowed to linger until the full master snapshot has been fully confirmed and is ready to resync. It's very reckless to go around removing binary data directories until you're certain that the new stuff is running, even if you're "just on the replicant".
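
A sketch of the "let the old data directory linger" advice: rename the directory out of the way instead of deleting it, and only remove the parked copy once the new replica is confirmed healthy. `set_aside` and the naming scheme are hypothetical, and the database must already be stopped before anything like this runs.

```python
import time
from pathlib import Path


def set_aside(data_dir: Path) -> Path:
    """Rename a data directory instead of deleting it; return the parked path."""
    stamp = time.strftime("%Y%m%dT%H%M%S")
    parked = data_dir.with_name(f"{data_dir.name}.old-{stamp}")
    # A rename within the same filesystem is cheap and does not copy data.
    data_dir.rename(parked)
    # Fresh, empty directory with restrictive permissions for the new copy.
    data_dir.mkdir(mode=0o700)
    return parked
```

If the command turns out to have been run on the wrong host, undoing it is a rename back rather than a restore from backup.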

With pg_basebackup, you run it on the replicant server and it streams down the files, no need to log into the master server at all. With innobackupex, you need to have read access to the master's binary data directory, but should achieve this safely through something like a read-only NFS mount. mydumper is a possible alternative to innobackupex that tries to capture the binlog coords and doesn't require any direct access to the host beneath the database server.

scurvy9y ago

+1 for replication slots. Just remember to remove the slot if you decommission a server; otherwise storage on the master will grow forever.

innobackupex works fine locally on the server, streaming out to netcat or ssh on the remote side. Nothing wild like read only NFS required. It also copies all binlogs. Mydumper is pretty old at this point and doesn't do most of the things innobackupex can. I wouldn't recommend it.

nodesocket9y ago

Wow, thanks. This is like the best answer I've ever seen. You absolutely nailed it.

Are you by any chance looking for any DevOps/Ops consulting? I just founded my third startup Elastic Byte (https://elasticbyte.net) and always looking for smart people. We're a consulting startup that helps companies manage their cloud infrastructure.

ploxiln9y ago

There's always some limit. At $PREVIOUS_JOB I think it was at least 48 hours, probably over 72 hours, of replication log retention (usually measured in GiB though). So it's surprising that in GitLab's case it must have been less than 6 hours (IIRC from the original google doc the slave had more than 4 hours replication lag due to load, initially ...)

scaryclam9y ago

There was a nice response from the PostgreSQL guys here: http://blog.2ndquadrant.com/dataloss-at-gitlab/

There is a bug that might have been hit, but it appears as though there were other issues at play as well.

atmosx9y ago· 4 in thread

Great to have a full-featured, professional post-mortem. Incidentally I work at a company that suffered data loss because of this outage and we're looking for ways to move out of GL.

My 2 cents... I might be the only one, but I don't like the way GL handled this case. I understand transparency as a core value and all, but they've gone a bit too far.

IMHO this level of exposure has far-reaching privacy implications for the people who work there. Implications that cannot be assessed now.

The engineer in question might not have suffered PTSD, but some other engineer might have. Who knows how a bad public experience might play out? It's a fairly small circle, and I'm not sure I would like to be part of a company that would expose me in a similar fashion if I happened to screw up.

On the corporate side of things there is a saying in Greek: "Τα εν οίκω μη εν δήμω" meaning don't wash your dirty linen in public. Although they're getting praised by bloggers and other small-size startups, at the end of the day exposing your 6-layer broken backup policy and other internal flaws in between, while being funded to the tune of 25.62M in 4 rounds, does not look good.

sytse9y ago

Hi Panagiotis. I'm glad to hear you like the postmortem. I'm very sorry your company suffered data loss. If you want to move from GitLab.com please know that you can easily export projects and import them on a self-hosted instance https://gitlab.com/help/user/project/settings/import_export.... (and if in the future we regain your trust you can also go the other way).

It is not our intent to have one of our team members implicated by the transparency. That is why we redacted their name to team-member-1 and in any future incidents we'll do the same. It should be their choice to be identified or not. We are very aware of the stress that such a mistake might cause and the rest of the team has been very supportive.

I agree that we don't look good because of the broken backup policy. The way to fix that is to improve our processes. We recognize the risk to the company of being transparent, but your values are defined by what you do when it is hard.

JasonSage9y ago

This is a perfect response.

Every day I'm growing more to like GitLab. It took me way too long to realize that GitLab has a singular focus to change how people create and collaborate.

A person purely motivated on principle to see a specific change is going to find a way to make it happen. The hard part with such ideological ventures is that you have to have the business sense to make it sustainable. I'm gradually learning to recognize both aspects present in GitLab.

When you're guided on principle, it's much easier to accept losses here and there in the right way...

> If you want to move from GitLab.com please know that you can easily export projects and import them on a self-hosted instance (and if in the future we regain your trust you can also go the other way).

...and be able to stay focused on the bigger picture! Some customers were going to react this way no matter what. Sytse's response here characterizes GitLab's response as a whole here—we know we did wrong here, we learned from it, and we're going to be able to do a better job here on out regardless of whatever the fallout from the incident is.

Sytse, I love what you're doing and I look forward to seeing your continued resilience and dedication to your goal. The world needs more businesses like this.

atmosx9y ago

Thanks for the reply.

> It is not our intent to have one of our team members implicated by the transparency. That is why we redacted their name to team-member-1 and in any future incidents we'll do the same.

Great, good to know. I wish all the success in the world to you and everyone involved with Gitlab.

AsyncAwait9y ago

> your values are defined by what you do when it is hard.

Precisely.

Most companies would stay as quiet about this as possible, you guys remained transparent and this is why I'll remain a customer.

ancarda9y ago· 4 in thread

>Unfortunately DMARC was not enabled for the cronjob emails, resulting in them being rejected by the receiver. This means we were never aware of the backups failing, until it was too late.

At my dayjob, we gradually stopped using email for almost all alerts, instead we have several Slack channels like #database-log where errors to MySQL go. Any cron jobs that fail post in #general-log. Uptime monitoring tools post in #status. So on...

Email has so much anti-spam machinery, like DMARC, that delivery of your mail is less reliable. For something failing like a backup or database query, it's too important to risk the alert not reaching someone who can make sure it gets fixed.

My 2 cents.

gl-9y ago

This is a step in the right direction but still misses a big part of it IMO: push versus pull notifications. If the agent stops functioning correctly or someone makes a config change, the alerts just stop and no one notices.

At the very least you want some kinda dead-man's switch that gets pissed if it's seen no events in the last x amount of time. Ideally you want to be polling the box in a stateful way; although with ephemeral nodes & flexible infra being all the rage that's fallen to the side a bit lately.
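
A dead-man's switch along these lines can be very small. The sketch below is a hypothetical in-memory version: the job calls `heartbeat()` on every successful run, and a separate checker raises the alarm when `tripped()` is true. A real setup would persist the timestamp somewhere outside the monitored box.

```python
import time
from typing import Optional


class DeadMansSwitch:
    """Trips when no heartbeat has been seen for more than `max_silence` seconds."""

    def __init__(self, max_silence: float, now: Optional[float] = None) -> None:
        self.max_silence = max_silence
        # Start the clock at creation, so a job that never reports also trips.
        self.last_seen = time.time() if now is None else now

    def heartbeat(self, now: Optional[float] = None) -> None:
        """Record that the monitored job has just reported in."""
        self.last_seen = time.time() if now is None else now

    def tripped(self, now: Optional[float] = None) -> bool:
        """True if the job has been silent for longer than allowed."""
        current = time.time() if now is None else now
        return current - self.last_seen > self.max_silence
```

Because silence is the failure signal, a crashed agent, a removed cron entry and a rejected email all end in the same alert.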

ancarda9y ago

Absolutely, that's a great idea!

You could also check for evidence a run has been successful, although that does depend on what you're doing exactly.

For our backup system, we're going to build an audit cron job on our main server that checks all our Azure containers to see if each server has pushed a file lately. It'll alert us if a file hasn't been uploaded in a few days or if it's smaller than a few MB (which is suspiciously small; we'd expect a few hundred MB for mysqldump+files).
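
The audit described above boils down to two checks per backup: is it recent, and is it plausibly large. Here is a sketch against local files; the real version would list the Azure containers instead. `audit_backup` and its thresholds are hypothetical.

```python
import time
from pathlib import Path
from typing import List, Optional


def audit_backup(path: Path, max_age_s: float, min_bytes: int,
                 now: Optional[float] = None) -> List[str]:
    """Return a list of problems with a backup file; empty means it looks sane."""
    if not path.is_file():
        return [f"{path.name}: missing"]
    current = time.time() if now is None else now
    stat = path.stat()
    problems = []
    if current - stat.st_mtime > max_age_s:
        problems.append(f"{path.name}: stale")
    if stat.st_size < min_bytes:
        problems.append(f"{path.name}: too small ({stat.st_size} bytes)")
    return problems
```

A size floor would have caught backups that "succeed" but contain almost nothing. It does not prove a backup restores, so a periodic test restore is still needed.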

mkopinsky9y ago

How do you set cron to post to Slack?

ancarda9y ago

We use Monolog (PHP library) that is able to post to Slack.

Messages in Monolog, like syslog have a level attached, so DEBUG, INFO, NOTICE and WARNING will only be written to a log file on disk. Anything higher, so ERROR, CRITICAL, ALERT or EMERGENCY will write to Slack (as well as log to disk). This means we only get notified of things failing and we can go on the server and see everything from DEBUG upwards which lets us mentally step through the cron job's run.
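
The same level-based routing can be sketched with Python's standard `logging` module instead of Monolog. `ChatHandler` is a hypothetical stand-in that collects messages rather than posting to Slack, and Python has fewer levels than syslog (no NOTICE, ALERT or EMERGENCY).

```python
import logging
from typing import List


class ChatHandler(logging.Handler):
    """Stand-in for a chat webhook: collects messages instead of posting them."""

    def __init__(self) -> None:
        super().__init__(level=logging.ERROR)  # ERROR and CRITICAL only
        self.sent: List[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.sent.append(self.format(record))


def build_logger(name: str, logfile: str, chat: ChatHandler) -> logging.Logger:
    """Log everything to disk and forward only ERROR and above to chat."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    file_handler = logging.FileHandler(logfile)
    file_handler.setLevel(logging.DEBUG)  # full trail for later debugging
    logger.addHandler(file_handler)
    logger.addHandler(chat)
    return logger
```

In a real job, `emit` would send the formatted record to the chat service, and a failure inside `emit` should itself be logged to disk so the alert path cannot fail silently.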

It's a very cool library. https://github.com/Seldaek/monolog

You can see the handlers here: https://github.com/Seldaek/monolog/tree/master/src/Monolog/H... which includes Slack, HipChat, IFTTT, Pushover, etc...

cookiecaper9y ago· 3 in thread

I really hate to pile on, but after reading through this whole thread and the whole post-mortem, there are a few basic things that are troubling besides the widely-acknowledged backup methodology. I don't see issues directly related to addressing these things.

1. notifications go through regular email. Email should be only one channel used to dispatch notifications of infrastructure events. Tools like VictorOps or PagerDuty should be employed as notification brokers/coordinators and notifications should go to email, team chat, and phone/SMS if severity warrants, and have an attached escalation policy so that it doesn't all hinge on one guy's phone not being dead.

2. there was a single database, whose performance problems had impacted production multiple times before (the post lists 4 incidents). One such performance problem was contributing to breakage at this very moment. I understand that was the thing that was trying to be fixed here, but what process allowed this to cause 4 outages over the preceding year without moving to the top of the list of things to address? Wouldn't it be wise to tweak the PgSQL configuration and/or upgrade the server before trying to integrate the hot standby to serve some read-only queries? And since a hot standby can only service reads (and afaik this is not a well-supported option in PgSQL), wouldn't most of the performance issues, which appear write-related, remain? The process seriously needs to be reviewed here.

And am I reading this right, the one and only production DB server was restarted to change a configuration value in order to try to make pg_basebackup work? What impact did that have on the people trying to use the site a) while the database was restarting, and b) while the kernel settings were tweaked to accommodate the too-high max_connections value? Is it normal for GitLab to cause intermittent, few-minute downtimes like that? Or did that occur while the site was already down?

3. Spam reports can cause mass hard deletion of user data? Has this happened to other users? The target in this instance was a GitLab employee. Who has been trolled this way such that performance wasn't impacted? What's the remedy for wrongly-targeted persons? It's clear that backups of this data are not available. And is the GitLab employee's data gone now too? How could something so insufficient have been released to the public, and how can you disclose this apparently-unresolved vulnerability? By so doing, you're challenging the public to come and try to empty your database. Good thing you're surely taking good backups now! (We're going to gloss over the fact that GitLab just told everyone its logical DB backups are 3 days behind and that we shouldn't worry because LVM snapshots now occur hourly, and that it only takes 16 hours to transfer LVM snapshots between environments :) )

4. the PgSQL master deleted its WALs within 4 hours of the replica "beginning to lag" (<interrobang here>). That really needs to be fixed. Again, you probably need a serious upgrade to your PgSQL server because it apparently doesn't have enough space to hold more than a couple of hours of WALs (unless this was just a naive misconfiguration of the [min|max]_wal_size parameter, like the max_connections parameter?). I understand that transaction logs can get very large, but the disk needs to accommodate (usually a second disk array is used for WALs to ease write impact) and replication lag needs to be monitored and alarmed on.

There were a few other things (including someone else downthread who pointed out that your CEO re-revealed your DB's hostnames in this write-up, and that they're resolvable via public DNS and have running sshds on port 22), but these are the big standouts for me.

P.S. bonus point, just speculative:

Not sure how fast your disks were, but 300GB gone in "a few seconds" sounds like a stretch. Some data may've been recoverable with some disk forensics. Especially if your Postgres server was running at the time of the deletion, some data and file descriptors also likely could've been extracted from system memory. Linux doesn't actually delete files if another process is holding their handle open; you can go into the /proc virtual filesystem and grab the file descriptor again to redump the files to live disk locations. Since your database was 400GB and too big to keep 100% in RAM, this probably wouldn't have been a full recovery, but it may have been able to provide a partial.
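
The /proc trick mentioned above can be sketched as follows. This is Linux-only and `recover_from_fd` is a hypothetical helper. It copies whatever is currently behind the descriptor, so for a database that is still writing, the result is not guaranteed to be consistent.

```python
import shutil


def recover_from_fd(pid: int, fd: int, dest: str) -> None:
    """Copy the contents behind an open file descriptor to `dest` (Linux only)."""
    # /proc/<pid>/fd/<fd> still leads to the file's data after its last
    # directory entry has been unlinked, as long as the process keeps it open.
    with open(f"/proc/{pid}/fd/{fd}", "rb") as src, open(dest, "wb") as out:
        shutil.copyfileobj(src, out)
```

Finding which descriptors point at deleted files is a matter of listing `/proc/<pid>/fd` and looking for link targets that end in "(deleted)". Reading another user's process this way needs matching privileges.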

The theoretically best thing to do in such a situation would probably be to unplug the machine ASAP after ^C (without going through formal shutdown processes that may try to "clean up" unfinished disk work), remove the disk, attach it to a machine with a write blocker, and take a full-disk image for forensics purposes. This would maximize the ability to extract any data that the system was unable to eat/destroy.

In theory, I believe pulling the plug while a process kept the file descriptor open should keep you in reasonably good shape, as far as that goes after you've accidentally deleted 3/4 of your production database. The process never closes and the disk stops and the contents remain on disk, just pending unlink when the OS stops the process (this is one reason why it'd be important to block writes to the disk/be extremely careful while mounting; if the journal plays back, it may destroy these files on the next boot anyway). But someone more familiar with the FS internals would have to say definitively if it works that way or not.

I recognize that such speculative/experimental recovery measures may have been intentionally forgone since they're labor intensive, may have delayed the overall recovery, and very possibly wouldn't have returned useful data anyway. Mentioning it mainly as an option to remain aware of.

mschuster919y ago

> Not sure how fast your disks were, but 300GB gone in "a few seconds" sounds like a stretch.

That only depends on the # of files. If it's even a thousand files, any modern Linux rm -rf will remove them in less time than a blink.

> The theoretically best thing to do in such a situation would probably be to unplug the machine ASAP after ^C (without going through formal shutdown processes that may try to "clean up" unfinished disk work), remove the disk, attach it to a machine with a write blocker, and take a full-disk image for forensics purposes. This would maximize the ability to extract any data that the system was unable to eat/destroy.

Their infrastructure is cloud based. No way to get a physical disk - if there is a "disk" at all and not a couple of huge fat NetApp filers providing the storage to the CPU nodes. (This is how a couple web-hosters operate)

YorickPeterse9y ago

    > 1. notifications go through regular email. Email should be only one
    > channel used to dispatch notifications of infrastructure events. Tools
    > like VictorOps or PagerDuty should be employed as notification
    > brokers/coordinators and notifications should go to email, team chat, and
    > phone/SMS if severity warrants, and have an attached escalation policy so
    > that it doesn't all hinge on one guy's phone not being dead.

This is discussed in https://gitlab.com/gitlab-com/infrastructure/issues/1095. We're basically looking into moving to Prometheus monitoring combined with Pagerduty notifications.

    > 2. there was a single database, whose performance problems had impacted
    > production multiple times before (the post lists 4 incidents). One such
    > performance problem was contributing to breakage at this very moment. I
    > understand that was the thing that was trying to be fixed here, but what
    > process allowed this to cause 4 outages over the preceding year without
    > moving to the top of the list of things to address?

High availability has been a thing we wanted to do for a while, but for whatever reason we just never got to it (until recently). Not sure exactly why.

    > Wouldn't it be wise to tweak the PgSQL configuration and/or upgrade the
    > server before trying to integrate the hot standby to serve some read-only
    > queries?

The server itself is already quite powerful, and the settings should be fairly decent (e.g. we used pgtune, and spent quite a bit of time tweaking things). The servers currently have 32 cores, 440-something GB of RAM, and the disk containing the DB data uses Azure premium storage with around 700 GB of storage (we currently use 340).

    > And since a hot standby can only service reads (and afaik this is not a
    > well-supported option in PgSQL), wouldn't most of the performance issues,
    > which appear write-related, remain? The process seriously needs to be
    > reviewed here.

Based on our monitoring data we have vastly more reads than writes. This means load balancing gets very interesting. Hot standby is also supported just fine out of the box, you just need something third party for the actual load balancing.

    > And am I reading this right, the one and only production DB server was
    > restarted to change a configuration value in order to try to make
    > pg_basebackup work?

We suspect so. Chef handles restarting processes and we think it's currently still set to do a hard restart always, instead of doing a reload whenever possible.

    > What impact did that have on the people trying to use the site a) while the
    > database was restarting

A few minutes of downtime as the DB is unavailable.

    > and b) while the kernel settings were tweaked to accommodate the too-high
    > max_connections value?

No. We have now reduced max_connections to a lower value (1000), so we still have enough connections but don't have to tweak any kernel settings.

    > Is it normal for GitLab to cause intermittent, few-minute downtimes like
    > that? Or did that occur while the site was already down?

We've had a few too many cases like this in the past. We're aiming to resolve those, but unfortunately this is rather tricky and time consuming.

    > 3. Spam reports can cause mass hard deletion of user data?

Yes.

    > Has this happened to other users?

Not that I know of.

    > What's the remedy for wrongly-targeted persons?

A better abuse system, e.g. one that makes it easier to see _who_ was reported. We're also thinking of adding a quorum kind of feature: to remove users more than 3 people need to approve it, something like that.

    > And is the GitLab employee's data gone now too?

No. The removal procedure was throwing errors, causing it to roll back its changes. This kept happening, which prevented the user from being removed. So ironically an error saved the day here.

    > How could something so insufficient have been released to the public

Code is written by developers, and developers are humans. Humans in turn make mistakes. Most project removal related code also existed before we started enforcing stricter performance guidelines.

    > and how can you disclose this apparently-unresolved vulnerability? By so
    > doing, you're challenging the public to come and try to empty your database

There's no point in hiding it. Spending a few minutes digging through the code and you'll find it, and probably plenty other similar problems. If somebody tries to abuse it we'll deal with it on a case by case basis.

    > because LVM snapshots now occur hourly, and that it only takes 16 hours to
    > transfer LVM snapshots between environments :)

LVM snapshots are stored on the host itself. As such if e.g. db1 loses data we can restore the snapshot in a few minutes. They only have to be transferred if we want to recover other hosts. Furthermore, in the Azure ARM environment the file transfer would be much faster compared to the classic environment.

    > 4. the PgSQL master deleted its WALs within 4 hours of the replica
    > "beginning to lag" (<interrobang here>). That really needs to be fixed.

Yes, which is also something we're looking into.

    > Again, you probably need a serious upgrade to your PgSQL server because it
    > apparently doesn't have enough space to hold more than a couple of hours of
    > WALs (unless this was just a naive misconfiguration of the
    > [min|max]_wal_size parameter, like the max_connections parameter?)

Probably just a naive configuration value since we have plenty of storage available.

    > There were a few other things (including someone else downthread who pointed
    > out that your CEO re-revealed your DB's hostnames in this write-up, and that
    > they're resolvable via public DNS and have running sshds on port 22), but
    > these are the big standouts for me.

Revealing hostnames isn't really a big deal, neither is SSH running on port 22. In the worst case some bots will try to log in using "admin" usernames and the likes, which won't work. All hosts use public key authentication, and password authentication is disabled.

    > Not sure how fast your disks were, but 300GB gone in "a few seconds" sounds
    > like a stretch.

Nope, after about 2 seconds the data was gone. Context: I ran said command.

    > Some data may've been recoverable with some disk forensics.

When using physical disks not used by anything else, maybe. However, we're talking about disks used in a cloud environment. Are they actually physical? Are they part of larger disks shared with other servers? Who knows. The chance of data recovery using special tools in a cloud environment is basically zero.

    > Especially if your Postgres server was running at the time of the deletion,
    > some data and file descriptors also likely could've been extracted from
    > system memory

That only works for files still held on to by PostgreSQL. PostgreSQL doesn't keep all files open at all times, so it wouldn't help.

cookiecaper9y ago

> Hot standby is also supported just fine out of the box, you just need something third party for the actual load balancing.

I'm aware that hot standby is supported, though it's not the default configuration for the standby server (default and safest is a standby mode that you can't query at all; hot standby introduces possible conflicts between hot read queries and write transactions coming in from the WAL, so if failover is your primary intention, you should be cold standbying). I'm saying that mixing read queries in and dispersing them over hot standbys is not well-supported, which is why you need third-party tools to do it.

It can also be risky if your replication lag gets out of control, and you've indicated that it easily does. PgSQL replication is eventually consistent and you risk returning stale data on reads, which could cause all sorts of havoc if it's not accounted for by the application internally.

> We've had a few too many cases like this in the past. We're aiming to resolve those, but unfortunately this is rather tricky and time consuming.

This may take some upfront work, but it's pretty routine. A serious commercial-level offering should not need to take itself offline without announcement in order to restart the single database server and apply a configuration tweak.

> Code is written by developers, and developers are humans. Humans in turn make mistakes. Most project removal related code also existed before we started enforcing stricter performance guidelines.

The point is not that humans make mistakes, nor that bugs exist. The point is that such a feature was released without considering its easily-exploitable potential and the permanent consequences of its exploitation (permanent removal of data). That should trigger a process review.

> There's no point in hiding it. Spending a few minutes digging through the code and you'll find it, and probably plenty other similar problems. If somebody tries to abuse it we'll deal with it on a case by case basis.

There's a lot of risk in drawing attention to this type of vulnerability. I think GitLab should be taking this more seriously. All code has bugs, but this isn't a bug; it's an incomplete, dangerously-designed feature that can be easily used by a malicious actor to permanently destroy large quantities of user data. Your CEO has just highlighted it before the whole world while it's still active and exploitable on the public web site.

Reading the code isn't a dead giveaway because it takes a lot of effort to find the specific code in question and realize what it means, and because the general assumption would be that GitLab.com is running a souped-up or specialized flavor of the code and that such dangerous design flaws must have already been resolved on a presumably high-traffic site. However, this post highlights that it hasn't been, and that's bad. This is effectively irresponsible self-disclosure of a very high-grade DoS exploit.

> Probably just a naive configuration value since we have plenty of storage available.

Having the storage readily available means that the hard part is already done! Each WAL segment is 16MB. You have about 350 GB of unused disk. Set wal_keep_segments and min_wal_size to something reasonable and you won't need to do this obviously-risky resync operation every time you have a couple of hours of heavy DB load.
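
As a back-of-the-envelope sketch of that arithmetic: the 16 MB segment size comes from the comment above, while the 100 GB budget and the 200 segments/hour write rate are made-up numbers for illustration only, not GitLab's measured figures.

```python
WAL_SEGMENT_MB = 16  # WAL segment size quoted above


def segments_for_budget(budget_gb: float, segment_mb: int = WAL_SEGMENT_MB) -> int:
    """Whole WAL segments that fit in budget_gb gigabytes (1 GB = 1024 MB)."""
    return int(budget_gb * 1024) // segment_mb


def hours_covered(segments: int, segments_per_hour: float) -> float:
    """How long a lagging standby can fall behind before needed WAL is recycled."""
    return segments / segments_per_hour


if __name__ == "__main__":
    keep = segments_for_budget(100)   # hypothetical: reserve 100 GB of the free disk
    print(keep)                       # candidate value for wal_keep_segments
    print(hours_covered(keep, 200))   # hypothetical write rate of 200 segments/hour
```

Even a fraction of the unused disk buys a retention window measured in many hours rather than a couple.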

> Revealing hostnames isn't really a big deal, neither is SSH running on port 22. In the worst case some bots will try to log in using "admin" usernames and the likes, which won't work. All hosts use public key authentication, and password authentication is disabled.

See discussion at https://news.ycombinator.com/item?id=13621027. The worst case is not a bruteforced login, it's an exploited daemon that leads to an exploited box that leads to an exploited network that leads to an exploited company. The secondary concern would be a DoS attack; everyone now knows that you have only one functioning database server that everything depends on, and that that server's IP is x.x.y.y. That's enough to cause trouble even without exploits or zero days.

> When using physical disks not used by anything else, maybe. However, we're talking about disks used in a cloud environment. Are they actually physical? Are they part of larger disks shared with other servers? Who knows. The chance of data recovery using special tools in a cloud environment is basically zero.

Yes, this complicates things significantly. Something like EBS may be able to be used pretty similarly to a dd image, though there is no way to "pull the plug" on an EC2 server afaik (maybe it's exposed through the API). I've never used Azure so I don't know if this would be practicable there.

> That only works for files still held on to by PostgreSQL. PostgreSQL doesn't keep all files open at all times, so it wouldn't help.

Indeed. While PgSQL doesn't keep all files open at all times, it does keep some files open, and they may or may not have contained useful data. I personally would've also been interested in trying to freeze the memory state (something you can do with a lot of raw VMs that you can't do with physical servers, but admittedly probably not something the cloud provider exposes).

1 more reply

gr20209y ago· 2 in thread

Reading this, the thing that stuck out to me was how remarkably lucky they were to have the two snapshots. The one from 6 hours earlier was there seemingly by chance, as an engineer had created it for unrelated reasons. And for both the 6- and 24-hour snapshots, it seems just lucky that neither had any breaking changes made to them by pre-production code (they _were_ dev/staging snapshots, after all).

I'm glad it all worked out in the end!

sytse9y ago

We too are glad we had those snapshots. And while it was the worst thing that ever happened at GitLab, it is humbling to know that it could have been worse.

animex9y ago

What do you think would have happened if you had total data loss/failure?

1 more reply

Achshar9y ago· 2 in thread

Does anyone have a link to the YouTube stream they're talking about? Can't seem to find it on their channel. And the link in the doc is redirecting to the live link [1] which doesn't list the stream.

[1] - https://www.youtube.com/c/Gitlab/live

cmatija9y ago

Yup,

Thanks for taking the interest to check it out.

It's an unlisted YT video, so that's why it might be hard to find.

Here it is: https://www.youtube.com/watch?v=nc0hPGerSd4

Achshar9y ago

Thanks! I'll check it out.

tschellenbach9y ago· 2 in thread

Shouldn't the conclusion of this post mortem be a move to a managed database service like RDS? The database doesn't sound huge, RDS is affordable enough, sounds to me that you spend less money and have better uptime and sleep by moving away from this in-house solution.

carterehsmith9y ago

According to this article:

> https://www.theregister.co.uk/2016/11/14/gitlab_to_dump_clou...

...they were moving away from the cloud, to their own servers.

sytse9y ago

We changed our minds. We're working on a post in https://gitlab.com/gitlab-com/www-gitlab-com/merge_requests/... but we wanted to post this first.

1 more reply

dustinmoris9y ago· 2 in thread

Watching GitLab is somewhat painful. I feel like they make every possible mistake you could make as an IT startup, and because they are transparent about it people seem to love the fact that they screw up all the time. I don't know if I share the same mentality, because at the end of the day I don't trust GitLab even with the simplest task, let alone any valuable work of mine.

It's good to be humble, to know that mistakes can happen to anyone, and to learn from them. But when in 2017 you still make the same stupid mistakes that people have made a million times since 1990, mistakes that are well documented and that systems have been built to avoid, then I think it cannot be described as anything other than absolute stupidity and incompetence.

I know they have many fans who just look past every mistake, no matter how bad it was, only because they are open about it, but come on, this is now just taking the piss, no?

egwor9y ago

I had exactly the same feeling. One of my friends was a doctor at a hospital and there was a serious mistake which my friend reported to the consultant. The consultant made a good point that it was never just one mistake that caused the serious mistake; it was a series of smaller mistakes that hadn't been checked or addressed. (The argument was that there were processes to avoid these kinds of mistakes.) If you read the post, they also noted that they were in the process of accidentally removing one of their own employees, which makes me think that there are other problems going on here.

Maybe part of the problem is that the industry isn't willing to learn from the experiences of others? (I feel like we have 'just enough learning' and 'experienced folk who raise concerns are considered stuck in their outdated ways' and 'people who make a silly mistake like that must be an idiot'.) I think that since we clump together those that have had formal training with those that haven't, we aren't encouraging the value of this education. I'm also fully aware that some self-taught developers are much more competent than a college-educated developer.

dustinmoris9y ago

I don't actually think it has anything to do with education. I think it really comes down to common sense.

Any half intelligent engineer would always first research good practices, pitfalls and existing information which has been gathered from decades of other experienced engineers before doing anything stupid on their own. It seems that GitLab is lacking this attitude.

2 more replies

nowarninglabel9y ago· 1 in thread

Thanks so much for the post and transparency, Gitlab! We had just finished recovering from our own outage (stemming from a power loss and subsequent cascading failures) and were scheduled to do our post-mortem on 2/1, so the original document was a refreshing and reassuring read.

sytse9y ago

Glad to hear it was of use to you.

_Marak_9y ago· 1 in thread

I've noticed a lot of other positive activity and press for Gitlab in the past month.

It's unfortunate they had this technical issue, but it's good to see others ( besides Github ) operating in this space. I should give Gitlab a try sometime.

sperglord9y ago

We just switched from Github to Gitlab for our private repos. The choice (based upon cost alone) was between them and Bitbucket, and the professional way that this was handled and the transparent communication was really nice to see.

isoos9y ago· 1 in thread

sytse and GitLab folks: thank you for the transparency.

sytse9y ago

You're welcome. Thanks for all the kind responses we received https://twitter.com/i/moments/826818668948549632

samat9y ago· 1 in thread

Am I missing something or didn't they mention 'test recovery, not backups'?

sytse9y ago

Two of the issues linked from the article deal with testing the backups:

- Automated testing of recovering PostgreSQL database backups https://gitlab.com/gitlab-com/infrastructure/issues/1102

- Build Streaming Database Restore https://gitlab.com/gitlab-com/infrastructure/issues/1152

grhmc9y ago· 1 in thread

Thank you for redacting who the engineer was. Great write-up. Thank you!

sytse9y ago

You're very welcome. By the way the engineer was cool with being named, but we think this is good practice for future postmortems. I do hope it doesn't come soon.

EnFinlay9y ago· 1 in thread

Most destructive troll ever.

jolopes9y ago

Glad you read it till the end.

greenrd9y ago

GitHub also lost a bunch of PRs and issues sitewide early in their history. They claimed to have restored all the PRs from backup, but I was pretty sure I had opened a PR and it never came back. I emailed support and they basically told me tough luck.

matt_wulfeck9y ago

> Unfortunately this process was executed on the primary instead. The engineer terminated the process a second or two after noticing their mistake, but at this point around 300 GB of data had already been removed.

I can only imagine this engineer's poor old heart after the realization of removing that directory on the master. A sinking, awful feeling of dread.

I've had a few close calls in my career. Each time it's made me pause and thank my luck it wasn't prod.

aabajian9y ago

This is an outstanding writeup, but I wonder if it glosses over the real problem:

>>The standby (secondary) is only used for failover purposes.

>>One of the engineers went to the secondary and wiped the data directory, then ran pg_basebackup.

IMO, secondaries should be treated exactly as their primaries. No operation should be done on a secondary unless you'd be OK doing that same operation on the primary. You can always create another instance for these operations.

voidlogic9y ago

>When we went to look for the pg_dump backups we found out they were not there. The S3 bucket was empty, and there was no recent backup to be found anywhere. Upon closer inspection we found out that the backup procedure was using pg_dump 9.2, while our database is running PostgreSQL 9.6 (for Postgres, 9.x releases are considered major). A difference in major versions results in pg_dump producing an error, terminating the backup procedure.

Yikes. One common practice that would have avoided this is using the just-taken backup to populate stage. If the restore fails, pages go out. If integration tests that run after a successful restore/populate fail, pages go out.
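
A rough sketch of that flow, with every name hypothetical: `restore` stands in for whatever loads the newest backup into stage, `checks` for the post-restore integration tests, and `page` for the alerting hook. None of these are real GitLab or PostgreSQL APIs.

```python
from typing import Callable, Iterable


def verify_backup(restore: Callable[[], None],
                  checks: Iterable[Callable[[], bool]],
                  page: Callable[[str], None]) -> bool:
    """Restore the newest backup into stage, run the checks,
    and page someone if either step fails."""
    try:
        restore()
    except Exception as exc:  # any restore error is page-worthy
        page(f"backup restore failed: {exc}")
        return False

    # Collect the names of checks that returned False.
    failed = [getattr(check, "__name__", repr(check))
              for check in checks if not check()]
    if failed:
        page("post-restore checks failed: " + ", ".join(failed))
        return False
    return True
```

The point of the design is that a silently empty backup (as with the pg_dump version mismatch) fails the restore step, so it is noticed the same night rather than during an incident.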

Live and learn I guess.

pradeepchhetri9y ago

Just want to add here that using tools like safe-rm[1] across your infrastructure would help in preventing data losses by running rm on unintended directories.

[1]: https://launchpad.net/safe-rm
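
To illustrate the general idea only (this is not safe-rm's actual implementation or configuration format): a wrapper checks each target against a list of protected paths before deleting anything. The protected paths below are hypothetical, and refusing ancestors of a protected path is a choice made for this sketch.

```python
from pathlib import PurePosixPath
from typing import Iterable, Optional

# Hypothetical protected paths; a real deployment would load these from config.
PROTECTED = (PurePosixPath("/"), PurePosixPath("/var/lib/pgdata"))


def refuse_reason(target: str,
                  protected: Iterable[PurePosixPath] = PROTECTED) -> Optional[str]:
    """Return a refusal message if removing target would delete a protected
    path (the path itself or one of its ancestors), else None.
    Paths are compared as given; '..' and symlinks are not resolved here."""
    t = PurePosixPath(target)
    for p in protected:
        if t == p or t in p.parents:
            return f"refusing to remove {t}: would delete protected path {p}"
    return None


if __name__ == "__main__":
    for candidate in ("/var/lib/pgdata", "/var/lib", "/tmp/scratch"):
        print(candidate, "->", refuse_reason(candidate) or "allowed")
```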

jsperson9y ago

>An ideal environment is one in which you can make mistakes but easily and quickly recover from them with minimal to no impact.

This is a great attitude. Too often opportunity cost isn't considered when making rules to protect folks from doing something stupid.

XorNot9y ago

The backup situation stands out to me as a problem no one has really adequately solved. Verifying a task has happened in a way where the notifications are noticed is actually a really hard problem that it feels like we collectively ignore in this business.

How do you reliably check if something didn't happen? Is the backup server alive? Did the script work? Did the backup work? Is the email server working? Is the dashboard working? Is the user checking their emails (think: wildcard mail sorting rule dumping a slight change in failure messages to the wrong folder).

And the converse answer isn't much better: send a success notification...but if it mostly succeeds, how do you keep people paying attention to it when it doesn't (i.e. no failure message, but no success message)?

The best answer I've got, personally, is to use positive notifications combined with visibility - dashboard your really important tasks with big, distinctive colors - use time based detection and put a clock on your dashboard (because dashboards which mostly don't change might hang and no one notice).
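
One way to sketch the time-based part, assuming the backup job records a timestamp on success and a separate checker runs on its own schedule: the alert is driven by the absence of a recent success rather than by a failure message arriving. The 26-hour threshold and the function name are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def backup_status(last_success: Optional[datetime],
                  now: datetime,
                  max_age: timedelta = timedelta(hours=26)) -> str:
    """Classify the backup state from the last recorded success time.
    A missing or stale timestamp is an alert, so a job that never ran
    and a job that failed silently look the same to the checker."""
    if last_success is None:
        return "ALERT: no successful backup recorded"
    age = now - last_success
    if age > max_age:
        return f"ALERT: last successful backup is {age} old"
    return "OK"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(backup_status(now - timedelta(hours=3), now))
    print(backup_status(now - timedelta(days=6), now))
```

The checker itself still needs watching, which is where the clock on the dashboard comes in.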

dancryer9y ago

Can't help but notice that the new backup monitoring tool suggests that the latest PGSQL backup is almost six days old...

Is that correct? http://monitor.gitlab.net/dashboard/db/backups?from=14859419...

nierman9y ago

yes, wal archiving would have helped (archive_command = rsync standby ...), but it's also very easy in postgres 9.4+ to add a replication slot on the master so that wal is kept until it is no longer needed by the standby. simply reference the slot in the standby's recovery.conf file.

definitely monitor your replication lag--or at least disk usage on the master--with this approach (in case wal starts piling up there).

nstj9y ago

@sytse were you in contact with MS/Azure during the restore? If so, did they offer any assistance, e.g. in speeding up the restoration, disk speed, etc.?

jsingleton9y ago

TIL GitLab runs on Azure. If your CI servers or deployment targets are also on Azure then the latency should be pretty low (assuming you get the correct region). Good to know.

I moved from AWS to Azure years ago. Mainly because I run mostly .NET workloads and the support is better. I've recently done some .NET stuff on AWS again and am remembering why I switched.

AlexCoventry9y ago

Thank you for this informative postmortem and mitigation outline.

Are any organizational changes planned in response to the development friction which led to the outage? It seems to have arisen from long-standing operational issues, and an analysis of how prior attempts to address those issues got bogged down would be very interesting.

oli56799y ago

I found this entertaining, even if they did later admit that it was a hoax:

http://serverfault.com/questions/587102/monday-morning-mista...

khazhou9y ago

Every internal Ops manual needs to begin with the simple phrase:

DON'T PANIC

encoderer9y ago

If you want to up your cron job monitoring game there's a link in my profile.


illumin89y ago· 35 in thread

renaudg9y ago

This is usually true, except when it's not :

I have personally experienced a near-catastrophic situation 3 years ago, where 13 out of 15 days' worth of nightly RDS MySQL snapshots were corrupt and would not restore properly.

The root cause was a silent EBS data corruption bug (RDS is EBS-based), that Amazon support eventually admitted to us had slipped through and affected a "small" number of customers. Unlucky us.

Finally, after a couple of days of complete downtime, the second to last snapshot worked (IIRC) and we went back online with almost two weeks of data loss, on a mostly user-generated content site.

koolba9y ago

Cloud backups, and more generally all backups, should be treated like nuclear proliferation treaties: Trust, but verify!

If you periodically restore your backups you'll catch this kind of crap when it's not an issue, rather than when shit has already hit the fan.

3 more replies

illumin89y ago

Wow, I'm sorry you experienced that. This points to the importance of regularly testing your backups. I hope AWS will offer an automated testing capability at some point in the future.

bogomipz9y ago

I am curious, did you manage to automate a restore smoke test after going through this?

mikiem9y ago

Snapshots are not backups, although many people use them as backups and believe they are good backups. Snapshots are snapshots. Only backups are backups.

2 more replies

YorickPeterse9y ago

phonon9y ago

> Also, to the best of my knowledge RDS does not support a hot standby replica you can use for read-only queries

RDS has very nice Read Replicas.

http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_R...

For HA you can use High Availability (Multi-AZ).

http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Concep...

4 more replies

carterehsmith9y ago

>> RDS does not support a hot standby replica you can use for read-only queries

This is not true anymore.

I set up two read-only RDS replicas, one in a different AWS region, and another in the same region, for read-only queries, just by clicking in AWS console.

wahnfrieden9y ago

You can use the failover standby replica for reads with Aurora at least. And you can manually via MySQL set up replication with non RDS, just not via AWS APIs.

nilved9y ago

RDS also comes with its own set of tradeoffs. There is no free lunch, and the cloud is just another word for someone else's server. There are reasons Gitlab opposes that.

discodave9y ago

1 more reply

carterehsmith9y ago

>> the cloud is just another word for someone else's server.

No. The cloud (AWS, GCE, Azure etc) is not "just" like your own server.

Just consider some basic details - you pay someone else to worry about things like power outages, disk failures, network issues, other hardware failures, and so on.

1 more reply

snackai9y ago

2 more replies

overcast9y ago

RhodesianHunter9y ago

"As stated below, cloud computing is just someone else's server somewhere. Unless you need that level of scalability, and processing, then it's not worth it."

1 more reply

memracom9y ago

RhodesianHunter9y ago

"Maybe it is OK as a simple data store for a single app, but not for a real database."

Currently working at company number 2 with large (many terabytes) databases on RDS and can safely say this is horse shit.

The amount of time and energy it allows our engineers to spend on our actual products instead of database management is worth all of the extra cost and lock in and then some.

Edit: I just realized that you were talking about Postgres on RDS in particular. I don't have experience with Postgres so you may well be right.

2 more replies

atmosx9y ago

Could you describe the limitations you encounter in the RDS PSQL setup vs a self-managed?

seldo9y ago

Funny you should mention a managed relational database service; Instapaper uses one of those and had more than 12 hours of downtime this week: http://blog.instapaper.com/post/157027537441

No database solution is totally reliable. If storing data is my primary job, like it is GitLab's, I'd like to have as much control of it as possible.

illumin89y ago

imperialdrive9y ago

bdcravens9y ago

RDS can restore a point-in-time snapshot in a couple of hours on databases over 1TB (speaking from experience)

qaq9y ago

illumin89y ago

You can get 500K random reads per second and 100K random writes per second using RDS Aurora.

If you truly need more than 30K IOPs, I would recommend leveraging read-replicas, a Redis cache, and other solutions before just "throwing money at the problem" and purchasing a million IOPs.

1 more reply

yeukhon9y ago

illumin89y ago

You can select your maintenance window, and you can defer updates as long as you want - nobody will force you to update, unless you check the "auto minor version update" box.

1 more reply

cookiecaper9y ago

Fortunately, AWS has moved on; I don't think that any new instances can be created with ephemeral storage anymore. Pure EBS now.

You need multi-AZ for true HA. Failover within the same AZ has a small delay, as you've noted.

>I still enjoy using RDS, the service is stable and quick to use, but just make sure you have the habit of backuping and take serious ownership and responsibility of the database.

2 more replies

AlisdairO9y ago

Not sure how it works on cloudformation, but in the console and API you have to explicitly skip the final snapshot.

nhumrich9y ago

I know about the daily snapshots, but didn't know about the archive logs. Is this something I have to enable? How do I get the logs and how do I restore using them?

illumin89y ago

1 more reply

robodale9y ago

Jesus christ, thanks for the infomercial.

bitshepherd9y ago

tedunangst9y ago

But none of the issues were Azure/cost related except for the slow recovery? I mean, neither AWS nor GCE can make you notice you're not getting cron mail.

1 more reply

ronack9y ago

Yes, I recall seeing a ticket that referenced Gitlab using Azure because it was heavily subsidized. My company uses Azure for much the same reason, and my experience has been largely positive.

MichaelGG9y ago

Is Azure cutting deals beyond the usual free 60k over a year or two or whatever it is for cool startups? Azure seems significantly more expensive in general, problems and slowness aside.

1 more reply

NPegasus9y ago· 17 in thread

  > Root Cause Analysis
  > [...]
  > [List of technical problems]

ebiester9y ago

I am a senior engineer who has seen shit go down. I am quite literally the graybeard.

I am not a GitLab customer, I am not a startup junkie, and I'm usually considered one of the more conservative (in action, not politics) engineers in my peer group in technology adoption.

The cloud is just someone else's computer.

Or is a company only legitimate if it's in an open space in the Valley?

colechristensen9y ago

I couldn't agree with you more.

One of the very most important aspects to avoiding failure is being amiable when it happens. Fear of failure causes quite a bit of failure and stupid behavior to try to hide and avoid it.

I also simply don't understand the vitriol towards remote work.

cookiecaper9y ago

I'm not saying that this doesn't happen to senior engineers who are victims of bad management, but qualified leadership doesn't allow it.

gumby9y ago

> Mark my words, the board members from the VC firms will be removed by the VC partners

I have never, ever, seen this. Every firm has its own internal politics [1] but rarely if ever would they do this. They are more likely to just ignore it.

There is a belief that it takes a decade to tell if a VC is any good or not, and that includes "learning experiences" (all on the LP's dime of course).

> Then VC firms will put an experienced CEO and CTO in place

Now this I have seen. It even works sometimes (e.g. Eric Schmidt/Google).

pfarnsworth9y ago

pzh9y ago

pzh9y ago

grzm9y ago

You're aware he's in this thread, right? And the hearsay matches up with his comments, from what I've read.

elliotec9y ago

They also openly pay well below market rate, so they should expect to get what they pay for.

bogomipz9y ago

Can you elaborate on that?

https://about.gitlab.com/jobs/production-engineer/

gech9y ago

Didn't Samsung have some phones catch on fire? Didn't Delta go down recently?

Cyph0n9y ago

Nah, let's just all board the anti-GitLab train. I honestly don't understand why the HN crowd has their panties in a knot. HN wasn't this bad even during the VW fiasco.

developer29y ago

>> the root cause is you have no senior engineers who have been through this before

Huge red flag that your data cannot be trusted.

milesrout9y ago

I can't tell if this is some kind of joke or not.

There's absolutely nothing wrong, whatsoever, with having a public IP address on a production database server.

p4lindromica9y ago

Exactly. The technical problems are not the root cause. How did the existing processes (or lack thereof) fail? How did the organization fail?

AsyncAwait9y ago

Like we haven't seen similar screwups at other, more 'professional' companies. They may sweep it under the rug better, but that's all.

dorianm9y ago

Exactly, I feel like it was nobody's job to make sure everything was resilient to failures

meowface9y ago· 14 in thread

I could feel the sweat drops just from reading this.

I'd bet every one of us has experienced the panicked Ctrl+C of Death at some point or another.

lobster_johnson9y ago

Brings back memories, though not of anything I did. Quoting a comment I made on HN recently in a different thread:

---

Back in 2009 we were outsourcing our ops to a consulting company, who managed to delete our app database... more than once.

lojack9y ago

Steps I personally take to avoid this:

- Avoid prod boxes like the plague

- Set up a prompt (globally) to make it extremely obvious that you're in production. Something like a red background and black text saying "PRODUCTION"

- Never ever leave open sessions

Now, let me just cross my fingers that I don't jinx myself :-)
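A rough sketch of the "obvious production prompt" idea from the list above, for bash. The `env_prompt` helper, the `ENV_NAME` variable, and the environment names are all hypothetical; adapt them to however your hosts identify themselves.

```shell
# Hypothetical helper: build a bash prompt string from an environment name.
env_prompt() {
    case "$1" in
        production)
            # Red background (41) with black text (30), then reset (0).
            # \[ and \] tell bash the escape codes take up no screen width.
            printf '\\[\033[41;30m\\] PRODUCTION \\[\033[0m\\] \\u@\\h:\\w\\$ '
            ;;
        *)
            # Tag non-prod hosts too, so an untouched default prompt looks suspicious.
            printf '[%s] \\u@\\h:\\w\\$ ' "$1"
            ;;
    esac
}

# Possible use in a global profile script (assumed setup):
#   PS1="$(env_prompt "${ENV_NAME:-unknown}")"
```

Setting it globally rather than per user means a fresh session on a prod box is marked even for people who never customised their shell.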

StreamBright9y ago

DigitalJack9y ago

It may be unlikely to come from you, but it's for damn sure that it'll never happen to said engineer again.

Also, I would make sure to have a different prompt than default for non-prod systems too. That way you know to be suspicious if it hasn't been changed from default.

viraptor9y ago

Peer review though - yes. That could help. I wouldn't say "I'm unlikely to make that mistake" - it's likely to go on the famous last words list...

chrismorgan9y ago

Fortunately disks were slower back then, so it hadn’t deleted too many files when I interrupted it, and the computer was able to be recovered without too much inconvenience.

rectang9y ago

For me, it was when I meant to execute this...

    rm -rf ~/foo

... but executed this instead:

    rm -rf ~ /foo
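One way to blunt this particular typo is a small guard in front of `rm -rf`. This is only a sketch: `guarded_rmrf` is a made-up name, not a standard tool, and the list of refused paths is deliberately short.

```shell
# Hypothetical wrapper: refuse obviously dangerous targets before calling rm -rf.
guarded_rmrf() {
    for target in "$@"; do
        case "$target" in
            ""|/|.|..|"$HOME"|"$HOME/")
                echo "refusing to remove '$target'" >&2
                return 1
                ;;
        esac
    done
    # Only reached if every argument passed the check above.
    rm -rf -- "$@"
}
```

With this, `guarded_rmrf ~ /foo` stops at the first argument (the shell has already expanded `~` to `$HOME`) and deletes nothing. It does not catch every spelling of a dangerous path, so treat it as a seat belt, not a guarantee.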

cjbprime9y ago

My worst data loss was:

    $ tar cvfz mbox outbox mbox.tar.gz

The argument order is backwards -- the output file is supposed to be first, then the input files.

On my system, this overwrote my full mailbox with a gzipped copy of my outbox, and a complaint that the mbox.tar.gz input file didn't exist.

That's right, the worst data loss happened while I was trying to take a backup. :(
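For reference, a sketch of the intended invocation, wrapped in a hypothetical `make_backup` helper that also refuses to write over an existing file. The helper name and its overwrite check are additions for illustration, not part of tar.

```shell
# make_backup ARCHIVE FILE...
# -f takes the very next argument as the archive name, so the archive goes
# first and the input files follow.
make_backup() {
    archive=$1
    shift
    # Refuse to clobber anything that already exists at the archive path.
    if [ -e "$archive" ]; then
        echo "refusing to overwrite existing file: $archive" >&2
        return 1
    fi
    tar -czf "$archive" "$@"
}

# Intended usage for the case above:
#   make_backup mbox.tar.gz mbox outbox
```

With the arguments in the mistaken order (`make_backup mbox outbox mbox.tar.gz`), the existence check fails on `mbox` and nothing is written.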

BinaryIdiot9y ago

dsavinkov9y ago

This is the reason why I always use quotes when specifying folders with rm -rf :)

meowface9y ago

For me it was a simple `rm -rf .`. Thought I was in ~/somefolder/, was actually in ~...

I stopped it after about 3 seconds, but that was enough to do critical damage.

a_t489y ago

Misread that as "The engineer was terminated" at first. Poor guy.

haldean9y ago

Potential outage prevention plan: put an alias on all production machines that emails HR to schedule a disciplinary meeting every time you run `rm -rf`.

INTPenis9y ago

At a small web host, early in my career, I once saw the boss blur past my desk towards the server room, throw open the big vault door, and disappear inside.

Ended up restoring a few home dirs from tape.

KayEss9y ago· 8 in thread

The engineers still seem to have a physical server mindset rather than a cloud mindset. Deleting data is always extremely dangerous and there was no need for it in this situation.

They should have spun up a new server to act as secondary the moment replication failed. This new server is the one you run all of these commands on, and if you make a mistake you spin up a new one.

Only when the replication is back in good order do you go through and kill the servers you no longer need.

matt_wulfeck9y ago

This type of thing always sounds good and all, but the reality is people get desperate and emotional when their website is down and everyone wants it up ASAP.

KayEss9y ago

I certainly don't disagree with that, but if you have this automated it is also the fastest way to get it back up and running. Besides, the site wouldn't have been down if they had this.

jjirsa9y ago

Desperate and emotional is no way to run a business

lomnakkus9y ago

> They should have spun up a new server to act as secondary the moment replication failed.

In a perfect world everything is cluster-ready &c at the outset. In this world it usually... isn't.

EDIT: ... and I'd posit that such cluster-readiness actually isn't worth it most of the time.

fredsir9y ago

You don't think it's worth it most of the time because of the hassle of setting up and managing a cluster, or because clusters in and of themselves are not necessary for most?

kuschku9y ago

That’s a nice idea, if you’re willing to pay the massive extra cost to actually rent all those overpriced systems.

For me, personally, going from cloud servers to rented dedicated servers cut my bill by 93% – more than an order of magnitude. At same performance.

In fact, it’d be cheaper to run 10x as many dedicated servers than to use cloud solutions for me.

theptip9y ago

It cut your cloud services bill by 93%, but how much did it increase your engineering bill by?

If your engineering time is free, then this calculation is complete. Otherwise it is not.

Does that 93% saving pay for a DB engineer, or enough of your developers' time to build the same quality of redundancy as you'd get with a DBaaS?

This calculus is going to be different for every DB and every company, but the OpEx impact of switching to dedicated servers is a bit more complex than you suggest above.

jjirsa9y ago

The fact that they're using a single master software is already an antiquated concept in 2017

yarper9y ago· 7 in thread

It's amazing how quickly it descends into "one of the engineers" did x or y. Who was steering this ship exactly?

It's really simple to point the finger and try to find a single cause of failure, but it's a fool's errand, comparable to finding the single source behind a great success.

jdavis7039y ago

Do you expect management to be staring over your shoulder every time you do some kind of `rm` on a production server? With great power comes great responsibility.

yarper9y ago

People get tired, sick, frustrated, panic - part of being a responsible engineer is accepting you're as fallible as the next person and building in protection against your own errors.

damagednoob9y ago

_ph_9y ago

chadcmulligan9y ago

doesn't everyone alias rm to rm -i on prod?

likewise all ttys have red backgrounds on prod.

pfarnsworth9y ago

90% of all outages are caused by human error. That's why change management solutions were so big 10 years ago, trying to get rid of the human element of changes in an enterprise.

angry_octet9y ago

ky7389y ago· 6 in thread

RIP the engineer

YorickPeterse9y ago

Still here, and doing just fine.

sslalready9y ago

I believe both GitLab and the community in general will come out stronger from this incident. Thank you all for being so transparent about it.

tuyguntn9y ago

This failure will probably make GitLab more reliable in the future. What doesn't kill you makes you stronger.

boulos9y ago

I enjoy your updated profile.

AsyncAwait9y ago

Actually, I'll be surprised if he hasn't received any offers by now. I would perhaps specifically hire him to deal with databases as I am pretty sure he's never going to make this mistake again.

YorickPeterse9y ago

I haven't received any offers so far. I don't intend on leaving GitLab any time soon either.

nodesocket9y ago· 5 in thread

My main question is still:

>> Why did replication stop? - A spike in database load caused the database replication process to stop. This was due to the primary removing WAL segments before the secondary could replicate them.

Is this a bug/defect in PostgreSQL then? Incorrect PostgreSQL configuration? Insufficient hardware? What was the root cause of Postgres primary removing the WAL segments?

cookiecaper9y ago

The cause is bad configuration.

scurvy9y ago

+1 for replication slots. Just remember to remove the slot if you decommission a server; otherwise storage on the master will grow forever.
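A minimal sketch of the slot lifecycle this refers to, assuming PostgreSQL 9.4 or later. The helpers only print SQL so it can be reviewed before being piped to `psql`; the helper names and the slot name `standby_1` are made up for illustration.

```shell
# Print SQL for managing a physical replication slot on the primary.
create_slot_sql() {
    printf "SELECT pg_create_physical_replication_slot('%s');\n" "$1"
}

# Run this when a standby is decommissioned; otherwise the primary keeps
# retaining WAL for a consumer that will never come back.
drop_slot_sql() {
    printf "SELECT pg_drop_replication_slot('%s');\n" "$1"
}

# List slots and whether anything is currently connected to them.
list_slots_sql() {
    printf "SELECT slot_name, active, restart_lsn FROM pg_replication_slots;\n"
}

# Example (assumes psql can reach the primary with sufficient privileges):
#   create_slot_sql standby_1 | psql -U postgres
```

A slot that shows `active = f` for a long time is the one to look at before the primary's disk fills up.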

nodesocket9y ago

Wow, thanks. This is like the best answer I've ever seen. You absolutely nailed it.

ploxiln9y ago

scaryclam9y ago

There was a nice response from the PostgreSQL guys here: http://blog.2ndquadrant.com/dataloss-at-gitlab/

There is a bug that might have been hit, but it appears as though there were other issues at play as well.

atmosx9y ago· 4 in thread

Great to have a full-featured, professional post-mortem. Incidentally I work at a company that suffered data loss because of this outage and we're looking for ways to move out of GL.

My 2 cents... I might be the only one, but I don't like the way GL handled this case. I understand transparency as a core value and all, but they've gone a bit too far.

IMHO this level of exposure has far-reaching privacy implications for the people who work there. Implications that cannot be assessed now.

sytse9y ago

JasonSage9y ago

This is a perfect response.

Every day I'm growing more to like GitLab. It took me way too long to realize that GitLab has a singular focus to change how people create and collaborate.

When you're guided on principle, it's much easier to accept losses here and there in the right way...

Sytse, I love what you're doing and I look forward to seeing your continued resilience and dedication to your goal. The world needs more businesses like this.

atmosx9y ago

Thanks for the reply.

> It is not our intent to have one of our team members implicated by the transparency. That is why we redacted their name to team-member-1 and in any future incidents we'll do the same.

Great, good to know. I wish all the success in the world to you and everyone involved with Gitlab.

AsyncAwait9y ago

> your values are defined by what you do when it is hard.

Precisely.

Most companies would stay as quiet about this as possible. You guys remained transparent, and this is why I'll remain a customer.

ancarda9y ago· 4 in thread

>Unfortunately DMARC was not enabled for the cronjob emails, resulting in them being rejected by the receiver. This means we were never aware of the backups failing, until it was too late.

My 2 cents.

gl-9y ago

ancarda9y ago

Absolutely, that's a great idea!

You could also check for evidence a run has been successful, although that does depend on what you're doing exactly.
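Checking for evidence of a successful run could look something like the sketch below, which asks whether a backup directory contains a recent, non-empty file. The function name, directory, and age threshold are assumptions; it relies on `find -mmin`, which GNU and BSD find support but POSIX does not require.

```shell
# backup_is_fresh DIR MAX_AGE_MIN
# Succeeds only if DIR holds at least one non-empty regular file that was
# modified less than MAX_AGE_MIN minutes ago.
backup_is_fresh() {
    dir=$1
    max_age_min=$2
    [ -d "$dir" ] || return 1
    # -mmin -N: modified less than N minutes ago; -size +0: not an empty file.
    [ -n "$(find "$dir" -type f -mmin "-$max_age_min" -size +0 | head -n 1)" ]
}

# Example from a monitoring job (hypothetical path, 26 hours for a daily dump):
#   backup_is_fresh /var/backups/postgres 1560 || echo "backup is stale"
```

Because it looks at the output rather than at the job's own reporting, it still fires when the job silently stops running or its mail gets dropped.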

mkopinsky9y ago

How do you set cron to post to Slack?

ancarda9y ago

We use Monolog (PHP library) that is able to post to Slack.

It's a very cool library. https://github.com/Seldaek/monolog

You can see the handlers here: https://github.com/Seldaek/monolog/tree/master/src/Monolog/H... which includes Slack, HipChat, IFTTT, Pushover, etc...
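Without a logging library, a plain shell wrapper can do roughly the same job. This sketch assumes a Slack incoming webhook whose URL is stored in a `SLACK_WEBHOOK_URL` environment variable; the function names and the crontab paths in the comment are hypothetical.

```shell
# notify MESSAGE: post a message to a Slack incoming webhook.
# Assumes MESSAGE contains no double quotes or backslashes (no JSON escaping here).
notify() {
    curl -fsS -X POST -H 'Content-Type: application/json' \
        --data "{\"text\": \"$1\"}" "$SLACK_WEBHOOK_URL" >/dev/null
}

# run_and_report NAME COMMAND [ARGS...]
# Runs the command; on a non-zero exit, reports it and passes the status on.
run_and_report() {
    job=$1
    shift
    if "$@"; then
        return 0
    else
        status=$?
        notify "cron job '$job' failed with exit status $status on $(uname -n)"
        return "$status"
    fi
}

# Possible crontab entry (hypothetical paths):
#   0 2 * * * . /usr/local/lib/cron-report.sh; run_and_report nightly-backup /usr/local/bin/backup.sh
```

This only covers jobs that run and fail. A job that never starts sends nothing, so it is worth pairing with a freshness check on the job's output.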

cookiecaper9y ago· 3 in thread

P.S. bonus point, just speculative:

mschuster919y ago

> Not sure how fast your disks were, but 300GB gone in "a few seconds" sounds like a stretch.

That only depends on the # of files. If it's even a thousand files, any modern Linux rm -rf will remove them in less time than a blink.

YorickPeterse9y ago

    > 1. notifications go through regular email. Email should be only one
    > channel used to dispatch notifications of infrastructure events. Tools
    > like VictorOps or PagerDuty should be employed as notification
    > brokers/coordinators and notifications should go to email, team chat, and
    > phone/SMS if severity warrants, and have an attached escalation policy so
    > that it doesn't all hinge on one guy's phone not being dead.

This is discussed in https://gitlab.com/gitlab-com/infrastructure/issues/1095. We're basically looking into moving to Prometheus monitoring combined with PagerDuty notifications.

    > 2. there was a single database, whose performance problems had impacted
    > production multiple times before (the post lists 4 incidents). One such
    > performance problem was contributing to breakage at this very moment. I
    > understand that was the thing that was trying to be fixed here, but what
    > process allowed this to cause 4 outages over the preceding year without
    > moving to the top of the list of things to address?

High availability has been a thing we wanted to do for a while, but for whatever reason we just never got to it (until recently). Not sure exactly why.

    > Wouldn't it be wise to tweak the PgSQL configuration and/or upgrade the
    > server before trying to integrate the hot standby to serve some read-only
    > queries?

    > And since a hot standby can only service reads (and afaik this is not a
    > well-supported option in PgSQL), wouldn't most of the performance issues,
    > which appear write-related, remain? The process seriously needs to be
    > reviewed here.

    > And am I reading this right, the one and only production DB server was
    > restarted to change a configuration value in order to try to make
    > pg_basebackup work?

We suspect so. Chef handles restarting processes and we think it's currently still set to do a hard restart always, instead of doing a reload whenever possible.

    > What impact did that have on the people trying to use the site a) while the
    > database was restarting

A few minutes of downtime as the DB is unavailable.

    > and b) while the kernel settings were tweaked to accommodate the too-high
    > max_connections value?

No. We have now reduced max_connections to a lower value (1000), so we still have enough but don't have to tweak any kernel settings.

    > Is it normal for GitLab to cause intermittent, few-minute downtimes like
    > that? Or did that occur while the site was already down?

We've had a few too many cases like this in the past. We're aiming to resolve those, but unfortunately this is rather tricky and time consuming.

    > 3. Spam reports can cause mass hard deletion of user data?

Yes.

    > Has this happened to other users?

Not that I know of.

    > What's the remedy for wrongly-targeted persons?

    > And is the GitLab employee's data gone now too?

No. The removal procedure was throwing errors, causing it to roll back its changes. This kept happening, which prevented the user from being removed. So ironically an error saved the day here.

    > How could something so insufficient have been released to the public

Code is written by developers, and developers are humans. Humans in turn make mistakes. Most project removal related code also existed before we started enforcing stricter performance guidelines.

    > and how can you disclose this apparently-unresolved vulnerability? By so
    > doing, you're challenging the public to come and try to empty your database

    > because LVM snapshots now occur hourly, and that it only takes 16 hours to
    > transfer LVM snapshots between environments :)

    > 4. the PgSQL master deleted its WALs within 4 hours of the replica
    > "beginning to lag" (<interrobang here>). That really needs to be fixed.

Yes, which is also something we're looking into.

    > Again, you probably need a serious upgrade to your PgSQL server because it
    > apparently doesn't have enough space to hold more than a couple of hours of
    > WALs (unless this was just a naive misconfiguration of the
    > [min|max]_wal_size parameter, like the max_connections parameter?)

Probably just a naive configuration value since we have plenty of storage available.

    > There were a few other things (including someone else downthread who pointed
    > out that your CEO re-revealed your DB's hostnames in this write-up, and that
    > they're resolvable via public DNS and have running sshds on port 22), but
    > these are the big standouts for me.

    > Not sure how fast your disks were, but 300GB gone in "a few seconds" sounds
    > like a stretch.

Nope, after about 2 seconds the data was gone. Context: I ran said command.

    > Some data may've been recoverable with some disk forensics.

    > Especially if your Postgres server was running at the time of the deletion,
    > some data and file descriptors also likely could've been extracted from
    > system memory

That only works for files still held on to by PostgreSQL. PostgreSQL doesn't keep all files open at all times, so it wouldn't help.

cookiecaper9y ago

> Hot standby is also supported just fine out of the box, you just need something third party for the actual load balancing.

> We've had a few too many cases like this in the past. We're aiming to resolve those, but unfortunately this is rather tricky and time consuming.

> Code is written by developers, and developers are humans. Humans in turn make mistakes. Most project removal related code also existed before we started enforcing stricter performance guidelines.

> Probably just a naive configuration value since we have plenty of storage available.

> That only works for files still held on to by PostgreSQL. PostgreSQL doesn't keep all files open at all times, so it wouldn't help.

gr20209y ago· 2 in thread

I'm glad it all worked out in the end!

sytse9y ago

We too are glad we had those snapshots. And while it was the worst thing that ever happened at GitLab, it is humbling to know that it could have been worse.

animex9y ago

What do you think would have happened if you had total data loss/failure?

Achshar9y ago· 2 in thread

Does anyone have a link to the YouTube stream they're talking about? Can't seem to find it on their channel. And the link in the doc is redirecting to the live link [1] which doesn't list the stream.

[1] - https://www.youtube.com/c/Gitlab/live

cmatija9y ago

Yup,

Thanks for taking the interest to check it out.

It's an unlisted YT video, so that's why it might be hard to find.

Here it is: https://www.youtube.com/watch?v=nc0hPGerSd4

Achshar9y ago

Thanks! I'll check it out.

tschellenbach9y ago· 2 in thread

carterehsmith9y ago

According to this article:

> https://www.theregister.co.uk/2016/11/14/gitlab_to_dump_clou...

...they were moving away from the cloud, to their own servers.

sytse9y ago

We changed our minds. We're working on a post in https://gitlab.com/gitlab-com/www-gitlab-com/merge_requests/... but we wanted to post this first.

dustinmoris9y ago· 2 in thread

I know they have many fans who just look past every mistake, no matter how bad it was, only because they are open about it, but come on, this is now just taking the piss, no?

egwor9y ago

dustinmoris9y ago

I don't actually think it has anything to do with education. I think it really comes down to common sense.

nowarninglabel9y ago· 1 in thread

sytse9y ago

Glad to hear it was of use to you.

_Marak_9y ago· 1 in thread

I've noticed a lot of other positive activity and press for GitLab in the past month.

It's unfortunate they had this technical issue, but it's good to see others (besides GitHub) operating in this space. I should give GitLab a try sometime.

sperglord9y ago

isoos9y ago· 1 in thread

sytse and GitLab folks: thank you for the transparency.

sytse9y ago

You're welcome. Thanks for all the kind responses we received https://twitter.com/i/moments/826818668948549632

samat9y ago· 1 in thread

Am I missing something or didn't they mention 'test recovery, not backups'?

sytse9y ago

Two of the issues linked from the article deal with testing the backups:

- Automated testing of recovering PostgreSQL database backups https://gitlab.com/gitlab-com/infrastructure/issues/1102

- Build Streaming Database Restore https://gitlab.com/gitlab-com/infrastructure/issues/1152

grhmc9y ago· 1 in thread

Thank you for redacting who the engineer was. Great write-up. Thank you!

sytse9y ago

You're very welcome. By the way the engineer was cool with being named, but we think this is good practice for future postmortems. I do hope it doesn't come soon.

EnFinlay9y ago· 1 in thread

Most destructive troll ever.

jolopes9y ago

Glad you read it till the end.

greenrd9y ago

matt_wulfeck9y ago

I can only imagine this engineer's poor old heart after the realization of removing that directory on the master. A sinking, awful feeling of dread.

I've had a few close calls in my career. Each time it's made me pause and thank my luck it wasn't prod.

aabajian9y ago

This is an outstanding writeup, but I wonder if it glosses over the real problem:

>>The standby (secondary) is only used for failover purposes.

>>One of the engineers went to the secondary and wiped the data directory, then ran pg_basebackup.

voidlogic9y ago

Live and learn I guess.

pradeepchhetri9y ago

Just want to add here that using tools like safe-rm[1] across your infrastructure would help in preventing data losses by running rm on unintended directories.

[1]: https://launchpad.net/safe-rm

jsperson9y ago

>An ideal environment is one in which you can make mistakes but easily and quickly recover from them with minimal to no impact.

This is a great attitude. Too often opportunity cost isn't considered when making rules to protect folks from doing something stupid.

XorNot9y ago

dancryer9y ago

Can't help but notice that the new backup monitoring tool suggests that the latest PGSQL backup is almost six days old...

Is that correct? http://monitor.gitlab.net/dashboard/db/backups?from=14859419...

nierman9y ago

definitely monitor your replication lag--or at least disk usage on the master--with this approach (in case wal starts piling up there).
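For the disk-usage half of that advice, a minimal sketch: read the usage percentage of the filesystem holding the WAL directory and compare it to a threshold. The function names, the example path, and the 80% threshold are assumptions.

```shell
# disk_usage_percent PATH: print the usage (0-100) of the filesystem holding PATH.
disk_usage_percent() {
    # df -P prints the portable format; field 5 of line 2 is e.g. "42%".
    # Assumes the filesystem name in field 1 contains no spaces.
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# check_disk PATH LIMIT: succeed while usage is below LIMIT percent.
check_disk() {
    usage=$(disk_usage_percent "$1")
    [ -n "$usage" ] || return 2
    [ "$usage" -lt "$2" ]
}

# Example (hypothetical data directory):
#   check_disk /var/lib/postgresql 80 || echo "WAL may be piling up on the primary"
```

Disk usage is a coarse signal compared to replication lag itself, but it needs no database access and catches the failure mode where WAL accumulates for a standby that has gone away.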

nstj9y ago

@sytse were you in contact with MS/Azure during the restore? If so, did they offer any assistance, e.g. in speeding up restoration disk speed etc.?

jsingleton9y ago

TIL GitLab runs on Azure. If your CI servers or deployment targets are also on Azure then the latency should be pretty low (assuming you get the correct region). Good to know.

I moved from AWS to Azure years ago. Mainly because I run mostly .NET workloads and the support is better. I've recently done some .NET stuff on AWS again and am remembering why I switched.

AlexCoventry9y ago

Thank you for this informative postmortem and mitigation outline.

oli56799y ago

I found this entertaining, even if they did later admit that it was a hoax:

http://serverfault.com/questions/587102/monday-morning-mista...

khazhou9y ago

Every internal Ops manual needs to begin with the simple phrase:

DON'T PANIC

encoderer9y ago

If you want to up your cron job monitoring game there's a link in my profile.
