I'm aware that hot standby is supported, though it's not the default configuration for the standby server. The default, and the safest, is a standby that you can't query at all; hot standby introduces possible conflicts between read queries on the standby and the write transactions being replayed from the WAL, so if failover is your primary intention, you should be running a non-queryable (warm) standby. My point is that spreading read queries across hot standbys is not well-supported out of the box, which is why you need third-party tools to do it.
It can also be risky if your replication lag gets out of control, and you've indicated that it easily does. PgSQL replication is eventually consistent and you risk returning stale data on reads, which could cause all sorts of havoc if it's not accounted for by the application internally.
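To make "accounted for by the application" concrete, here is a minimal sketch of lag-aware read routing. The `Replica` class, the `pick_read_target` function, and the 5-second threshold are all hypothetical names and values of mine, not anything GitLab runs. I'm assuming the lag figure comes from something like running `SELECT now() - pg_last_xact_replay_timestamp()` on each standby, which overstates lag when the primary is idle.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Replica:
    name: str
    # Measured replay lag in seconds; None means it could not be measured.
    lag_seconds: Optional[float]


def pick_read_target(replicas: List[Replica],
                     max_lag_seconds: float,
                     primary: str = "primary") -> str:
    """Return the least-lagged replica within the threshold, else the primary."""
    fresh = [
        r for r in replicas
        if r.lag_seconds is not None and r.lag_seconds <= max_lag_seconds
    ]
    if not fresh:
        # Every replica is too stale or unmeasurable: read from the primary
        # rather than risk returning outdated rows.
        return primary
    return min(fresh, key=lambda r: r.lag_seconds).name


# Hypothetical lag readings for two standbys.
replicas = [Replica("standby-1", 0.4), Replica("standby-2", 7200.0)]
print(pick_read_target(replicas, max_lag_seconds=5.0))
```

The fallback to the primary is the important part: a replica that is hours behind should simply drop out of rotation instead of serving reads.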
> We've had a few too many cases like this in the past. We're aiming to resolve those, but unfortunately this is rather tricky and time consuming.
This may take some upfront work, but it's pretty routine. A serious commercial-level offering should not need to take itself offline without announcement in order to restart the single database server and apply a configuration tweak.
> Code is written by developers, and developers are humans. Humans in turn make mistakes. Most project removal related code also existed before we started enforcing stricter performance guidelines.
The point is not that humans make mistakes, nor that bugs exist. The point is that such a feature was released without considering its easily-exploitable potential and the permanent consequences of its exploitation (permanent removal of data). That should trigger a process review.
> There's no point in hiding it. Spending a few minutes digging through the code and you'll find it, and probably plenty other similar problems. If somebody tries to abuse it we'll deal with it on a case by case basis.
There's a lot of risk in drawing attention to this type of vulnerability. I think GitLab should be taking this more seriously. All code has bugs, but this isn't a bug; it's an incomplete, dangerously designed feature that can easily be used by a malicious actor to permanently destroy large quantities of user data. Your CEO has just highlighted it before the whole world while it's still active and exploitable on the public web site.
Reading the code isn't a dead giveaway, because it takes a lot of effort to find the specific code in question and realize what it means, and because the general assumption would be that GitLab.com is running a souped-up or specialized flavor of the code and that such dangerous design flaws must have already been resolved on a presumably high-traffic site. However, this post highlights that they haven't been, and that's bad. This is effectively irresponsible self-disclosure of a very high-grade DoS exploit.
> Probably just a naive configuration value since we have plenty of storage available.
Having the storage readily available means that the hard part is already done! Each WAL segment is 16 MB by default. You have about 350 GB of unused disk. Set wal_keep_segments and min_wal_size to something reasonable and you won't need to do this obviously risky resync operation every time you have a couple of hours of heavy DB load.
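For a sense of scale, here's the back-of-the-envelope arithmetic as a small script. The 50% budget is purely my assumption for illustration, and `segments_for_budget` is a hypothetical helper; the 16 MB segment size and 350 GB figure are the ones mentioned above.

```python
WAL_SEGMENT_MB = 16  # default PostgreSQL WAL segment size


def segments_for_budget(free_disk_gb: float, budget_fraction: float) -> int:
    """Return how many 16 MB WAL segments fit in a share of the free disk."""
    budget_mb = free_disk_gb * 1024 * budget_fraction
    return int(budget_mb // WAL_SEGMENT_MB)


free_gb = 350  # unused disk quoted above
# Assumption: dedicate half of the free space to retained WAL.
keep = segments_for_budget(free_gb, 0.5)
print(f"wal_keep_segments = {keep}")
print(f"retained WAL ~= {keep * WAL_SEGMENT_MB / 1024:.0f} GB")
```

Even a much smaller fraction than that leaves thousands of segments for a lagging standby to catch up from before a full resync becomes necessary.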
> Revealing hostnames isn't really a big deal, neither is SSH running on port 22. In the worst case some bots will try to log in using "admin" usernames and the likes, which won't work. All hosts use public key authentication, and password authentication is disabled.
See discussion at https://news.ycombinator.com/item?id=13621027. The worst case is not a brute-forced login; it's an exploited daemon that leads to an exploited box that leads to an exploited network that leads to an exploited company. The secondary concern would be a DoS attack; everyone now knows that you have only one functioning database server that everything depends on, and that that server's IP is x.x.y.y. That's enough to cause trouble even without exploits or zero days.
> When using psycial disks not used by anything else, maybe. However, we're talking about disks used in a cloud environment. Are they actually physical? Are they part of larger disks shared with other servers? Who knows. The chance of data recovery using special tools in a cloud environment is basically zero.
Yes, this complicates things significantly. Something like an EBS volume could probably be treated much like a dd image, though there is no way to "pull the plug" on an EC2 server afaik (maybe it's exposed through the API). I've never used Azure so I don't know if this would be practicable there.
> That only works for files still held on to by PostgreSQL. PostgreSQL doesn't keep all files open at all times, so it wouldn't help.
Indeed. While PgSQL doesn't keep all files open at all times, it does keep some files open, and they may or may not have contained useful data. I personally would've also been interested in trying to freeze the memory state (something you can do with a lot of raw VMs that you can't do with physical servers, but admittedly probably not something the cloud provider exposes).
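For reference, the "files still held open" check is easy to do on Linux by walking `/proc/<pid>/fd`, where the kernel marks unlinked targets with a ` (deleted)` suffix. This is only a sketch: it is Linux-specific, the function names are mine, and you'd need to run it as a user allowed to read the postgres process's fd directory.

```python
import os
from typing import List, Optional, Tuple

DELETED_SUFFIX = " (deleted)"


def original_path(link_target: str) -> Optional[str]:
    """Return the original path if the fd target is marked deleted, else None."""
    if link_target.endswith(DELETED_SUFFIX):
        return link_target[: -len(DELETED_SUFFIX)]
    return None


def deleted_open_files(pid: int) -> List[Tuple[str, str]]:
    """List (fd_link, original_path) for deleted files a process still holds open."""
    fd_dir = f"/proc/{pid}/fd"
    found = []
    for fd in os.listdir(fd_dir):
        link = os.path.join(fd_dir, fd)
        try:
            target = os.readlink(link)
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
        path = original_path(target)
        if path is not None:
            found.append((link, path))
    return found
```

Each returned `fd_link` can still be read while the process stays alive, so copying it somewhere safe (e.g. `cp /proc/<pid>/fd/<n> /some/other/disk/`) recovers the contents. The catch is exactly the one you raised: it only helps for whatever the server happened to have open at that moment.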