I've generally found AWS more reliable than GCP - even when GCP isn't having downtime, you'll occasionally get 503s from their APIs, so you need to wrap all your calls to them in retries.
AWS has had multiple instances of cascading EBS backplane failures, but outside of that I've found their core services pretty reliable -- 400+ days of uptime on a lot of VMs in systems I've worked on -- I avoid EBS when I can.
My advice is to keep your stuff simple - PaaS might seem attractive, but as you mention, you have so little control when something goes down. Embrace multi-cloud by using the lowest common denominator of tech available - virtual machines, DNS, networking, and instance storage if that suits your needs. Treat VMs as disposable - and make sure you have system, service, and data redundancy at that level to survive the failure of an entire availability zone across your application.
If your app is that big, someone is always carrying a pager for when there are problems. The difference is on PaaS, you can't do a damn thing about it if it's a problem with the platform.
I've helped multiple companies get off of App Engine because even for companies losing money (startups), it's too unreliable -- and actually very slow (datastore) if your app is relational. Also, it's very very expensive if you hit the datastore hard.
No matter which cloud platform you're using you should do this[1]. I'm not familiar with the GCP SDK but I know the AWS SDK has it built in[2]. If you're not using the SDK then you have to build it yourself. There will always be a small percentage of transient errors due to the network, DNS, timeouts, hardware failure, etc.
[1] This is a blanket generalization, there are some situations where you shouldn't use the backoff/retry pattern even for retryable errors.
[2] https://docs.aws.amazon.com/general/latest/gr/api-retries.ht...
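For anyone who has to build the backoff/retry themselves, here is a minimal sketch of retrying with exponential backoff and full jitter. The names `call_with_retries` and `TransientError` are made up for illustration; this is not how the AWS SDK implements it, and deciding which errors count as retryable is left to you.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (e.g. an HTTP 503)."""


def call_with_retries(func, max_attempts=5, base_delay=0.1, max_delay=5.0,
                      sleep=time.sleep):
    """Call func(); on TransientError, wait and try again.

    After failed attempt n (counting from 0) the wait is a random value
    between 0 and min(max_delay, base_delay * 2**n), i.e. "full jitter".
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the error
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, cap))
```

The `sleep` parameter is only there so the waiting can be swapped out in tests. As footnote [1] says, only wrap calls that are safe to repeat.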
In terms of reliability, I think the first step is multi-region -- being able to fail over to another region should your primary region have major failage. But assuming you can do that, doing multi-cloud for the same thing shouldn't be so hard provided you have some sort of common open source runtime to run on both platforms.
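A rough sketch of the failover idea at the client level, assuming you already run the same service in more than one region. `call_with_failover` and the region names are hypothetical; in practice failover is often handled at the DNS or load balancer layer instead, and keeping the data replicated is the harder part that this does not address.

```python
def call_with_failover(regions, request):
    """Try request(region) for each region in order; return the first success.

    `regions` is an ordered list, primary first. `request` is any callable
    that raises an exception when the region is unavailable.
    """
    errors = {}
    for region in regions:
        try:
            return region, request(region)
        except Exception as exc:  # broad on purpose for the sketch
            errors[region] = exc
    raise RuntimeError(f"all regions failed: {errors}")
```

The same shape works for multi-cloud: the list just holds endpoints on different providers instead of different regions.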
Doesn't mean that we don't need SSH ever, but 99% of the time it's something we use because we're too lazy to set up automation.
I reckon you're using open-source here to mean self-hosted, but that doesn't really change anything. For example, the reason every small company I've worked at didn't have a way to analyze their logs/stderr and correlate them with other events for debugging was because they didn't, not because they couldn't.
If having solar panels becomes consistently cheaper than buying electricity from the grid (per megawatt), then individuals and businesses will all switch to solar panels... Especially if the business uses a lot of electricity.
The main reason that PaaS solutions are popular now is advertising and hype. It's a bubble.
The official status page breaks down availability by-service with descriptions of each outage and updates with timestamps.
I think this is both the issue with the article, and the issue with Firebase (ironically).
First of all, it's an issue with Firebase. All software will break. This is inevitable. It's just a matter of time. Well-engineered software/infrastructure gives you, the consumer, tools to mitigate this so your consumers never see it. If we look at Amazon, they expose AZs and Regions; well-architected applications use these failure domains to accept that an AZ, and possibly even a region, will fail. So you can do failover.
Firebase really doesn't expose these primitives, in an effort to be simple and easy to use. Maybe they're doing something in the backend to use them, but the proof is in the pudding; if their stability is bad, it means they're not doing a good enough job at abstracting away these unavoidable failure domain principles.
Which brings us to the second problem: It's Always Your Fault. Stop trying to pass blame to Firebase. Your customers, seriously, full stop, unequivocally, no exceptions, do not care that Firebase caused you to go down. They care that you went down. You don't get to say "it's not our fault!"
Because It's Always Your Fault. It's your fault that you chose Firebase. It's your fault that you chose a service which doesn't expose core failure domain primitives that you can engineer to support. It's your fault for not getting off Firebase when you recognize these core issues with the platform.
Firebase's status page is for you, the engineer, to understand and diagnose issues. It's for you to interpret and surface on your own status page. It's not for you to link to your customers and say "see that red dot? that's why we went down."
And by the way, yes: even perfectly architected applications on AWS/Gcloud/whatever, failing over across AZs and Regions, can go down due to things outside of your control. AWS ain't perfect. Remember: All software breaks. But when you word that to your customer, You Always Take The Blame. Period. This is what "it's always your fault" means; it's not about saying that there are ways to write an application that never breaks. It's about accepting that when (not if) it does break, your customers will blame you, so you need to accept that blame wholly.
That’s part of the problem, actually. I’ve noticed for years that some Firebase service disruptions go unreported, and it was clear that reporting individual services was a way to avoid showing the end-to-end summary. It doesn’t matter that all of Firebase’s servers are up and running, if the end-to-end service they provide isn’t working.
Add to that the ability to recover from dropped connections (common on mobile) and that their libraries have been ported all over the place, and Firebase is a de facto answer for mobile developers. It can be up and running in less than 30 minutes for someone who has 0 experience in cloud development.
It is hard to replicate that.
I understand the skepticism, but I would highly suggest taking a look and playing around. It's really, really good plus you get to fully own everything you build ;)
Less than 99.95% but equal to or greater than 99.0%: 10% credit
Less than 99.0%: 30% credit
[1] https://firebase.google.com/terms/service-level-agreement/
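To put those tiers in perspective, a small sketch of the arithmetic. Only the two tiers quoted above are encoded; treating 99.95% and above as zero credit, and measuring uptime over a 30-day month, are assumptions here, so check the SLA itself [1] for the exact terms. The function names are made up.

```python
def sla_credit_percent(uptime_percent):
    """Credit tier for an uptime percentage, per the two tiers quoted above."""
    if uptime_percent < 99.0:
        return 30
    if uptime_percent < 99.95:
        return 10
    return 0  # assumed: no credit at or above 99.95%


def allowed_downtime_minutes(uptime_percent, days_in_month=30):
    """Minutes of downtime in a month that still meet the given uptime."""
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (100 - uptime_percent) / 100
```

Under those assumptions, 99.95% leaves room for roughly 21.6 minutes of downtime in a month and 99.0% for 432 minutes (a bit over 7 hours), so a 2 hour outage in a month would land in the 10% tier.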
https://firebase.google.com/docs/firestore/rtdb-vs-firestore
I see this is flagged, but FWIW, you might want to actually learn something about what they mean by “realtime database” because it’s incredibly useful, and people using Firebase aren’t the only people who think so.
https://en.m.wikipedia.org/wiki/Real-time_database
Firebase is also easy to use and scales to large sites and complex applications, despite the complaints here about reliability, reporting and control, or lack thereof. A flat file and simple web socket server crumbles under loads that Firebase handles easily.
Amazon on the other hand has a history of committing to clear product direction which makes committing to their platforms much easier. Amplify and AppSync for instance feel like safer choices.
I have always wondered what a reliable backup to the realtime db could be. Haven't found much to date.
The unacceptable thing is that not only are outages fairly common, but many smaller, briefer outages and disruptions are not even reported. For example, the day after the 2 hour outage mentioned in the article, there was an issue where writes to the database were seemingly successful, but the clients listening for changes would NOT receive the notification that the data they were observing had been updated, for more than 30 minutes. It wasn't reported on Firebase's status dashboard.
Google bought Firebase back then, and to replace Firebase Realtime Database, Google developed Firebase Firestore (now in beta). I suspect that Firebase Realtime Database isn't receiving much attention these days and that the service will be closed after some time.
It really is possible to design a system around Firebase with a much smaller team. You give up control but control is a myth anyway. And Firestore is actually designed to support offline mode, so I wonder if they neglected to design for that feature, which might help here.
The unfortunate reality is that we are in a moment where Firestore is beta and Firebase Database is not supported as it should be. Google should do a better job of helping people to migrate and explaining the roadmap. I imagine the writer of this article just doesn't have as much company clout to get that level of involvement from Google. This was probably an attempt to get the attention that other higher paying clients can get.
The idea is that you upload your documents (PDF, HTML, etc) into Polar, tag them, and then we sync them to the cloud. Then when you go to another machine like work or home your documents are always synchronized.
At first I fell in love with Firebase and was very very excited to start implementing it.
They've spent a ton of time working on the initial implementation experience.
Their Firebase Auth support was amazingly simple to set up. Same with Firebase hosting. It's top notch. You can be up and running with CDN hosting with SSL in like 2 minutes and the Firebase tools are exceptional.
Cloud Firestore seems really interesting and easy to set up. It's basically designed for 'apps', i.e. user-facing apps, and works pretty well if all the data is private to the user.
I do struggle with these issues of reliability though. At Datastreamer (http://www.datastreamer.io/) we use Hetzner and have about a half petabyte stored there.
It's a blog content search engine which we license to other startups so high availability is critical.
Their infra is amazingly reliable. Very very happy here.
The problem of course is that you then have to manage your own software stack, which requires extra effort on your part.