Firebase outages and misleading status reporting

(medium.com)

143 points · sauldcosta · 7y ago · 49 comments

mattbillenstein · 7y ago · 8 in thread

AppEngine had the same problems - seemingly every week some component of the service would be down for some non-negligible amount of time (laughably it was often search -- we're talking about Google here).

I've generally found AWS more reliable than GCP - even when GCP isn't having downtime, you'll occasionally get 503's from their APIs, so you need to wrap all your calls to them in retries.

AWS has had multiple instances of cascading EBS backplane failures, but outside of that I've found their core services pretty reliable -- 400+ days of uptime on a lot of VMs in systems I've worked on -- I avoid EBS when I can.

My advice is to keep your stuff simple - PaaS might seem attractive, but, as you mention, you have so little control when something goes down. Embrace multi-cloud by using the lowest common denominator of tech available - virtual machines, DNS, networking, and instance storage if that suits your needs. Treat VMs as disposable - and make sure you have system, service, and data redundancy at that level to survive the failure of an entire availability zone across your application.

latchkey · 7y ago

AppEngine had some big failures early on, but I (and some friends) built a $$$$$$$$$$$$ company on AppEngine (and GCP) and couldn't have done it without it. The stability the last few years has been extremely good. Our base logic was that we trust Google to hire and train talented DevOps more than we can do it and it sure sucks carrying a pager.

mattbillenstein · 7y ago

Snapchat?

If your app is that big, someone is always carrying a pager for when there are problems. The difference is on PaaS, you can't do a damn thing about it if it's a problem with the platform.

I've helped multiple companies get off of app engine because even for companies losing money (startups), it's too unreliable -- and actually very slow (datastore) if your app is relational. Also, it's very very expensive if you hit the datastore hard.

kevan · 7y ago

>you'll occasionally get 503's from their APIs, so you need to wrap all your calls to them in retries.

No matter which cloud platform you're using you should do this[1]. I'm not familiar with the GCP SDK but I know the AWS SDK has it built in[2]. If you're not using the SDK then you have to build it yourself. There will always be a small percentage of transient errors due to the network, DNS, timeouts, hardware failure, etc.

[1] This is a blanket generalization, there are some situations where you shouldn't use the backoff/retry pattern even for retryable errors.

[2] https://docs.aws.amazon.com/general/latest/gr/api-retries.ht...
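
A minimal sketch of the backoff/retry pattern discussed above, in Python. Everything named here is an assumption for illustration: `TransientError` is a hypothetical marker for retryable failures (an HTTP 503, a timeout), and `call_with_retries` is not part of any SDK. It uses capped exponential backoff with random jitter, which is one common way to do this, not necessarily what any particular SDK does internally.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for retryable failures such as an HTTP 503."""


def call_with_retries(func, max_attempts=5, base_delay=0.1, max_delay=5.0,
                      sleep=time.sleep):
    """Call func(), retrying on TransientError with capped exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # Double the ceiling on each attempt, cap it, then wait a random
            # amount below the ceiling so many clients do not retry in lockstep.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))
```

Only errors that are safe to repeat should be mapped to `TransientError`; as footnote [1] says, retrying is not always appropriate, for example when the call is not idempotent.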

samstave · 7y ago

Pretty amazing how virtualization has come to the point that we even need to virtualize our reliance on cloud vendors across multiple vendors to ensure reliability.

mattbillenstein · 7y ago

I'll say, I've only really done multi-cloud for cases where I liked different products on different clouds -- the main app and data on AWS, but using Google's data stack (BigQuery >> Redshift imho).

In terms of reliability, I think the first step is multi-region -- being able to failover to another region should your primary region have major failage. But assuming you can do that, doing multi-cloud for the same thing shouldn't be so hard provided you have some sort of common open source runtime to run on both platforms.
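
As a rough illustration of the multi-region failover idea, here is a Python sketch that tries an ordered list of regions and returns the first successful result. The function name, the region names in the usage note, and the choice to treat `ConnectionError` as "region down" are all assumptions made for the example; real failover usually also involves DNS, health checks, and data replication, none of which is shown.

```python
def call_with_failover(regions, request):
    """Try each (name, handler) pair in order.

    Returns (name, result) from the first region whose handler succeeds.
    """
    failures = []
    for name, handler in regions:
        try:
            return name, handler(request)
        except ConnectionError as exc:
            # Treat connection-level errors as "this region is down" and move on.
            failures.append(f"{name}: {exc}")
    raise RuntimeError("all regions failed: " + "; ".join(failures))
```

A caller would pass something like `[("primary", primary_client), ("secondary", secondary_client)]`, where both handlers are hypothetical callables that speak to the same open source runtime deployed in each region or cloud.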

sauldcosta (OP) · 7y ago

Appreciate the thoughtful reply and summary of experiences with GCP vs AWS.

romed · 7y ago

Retrying localizable errors (such as 503 Service Unavailable) should be universally practiced in any RPC scheme. Nobody can make a backend that's 100% reliable.

pbarnes_1 · 7y ago

AWS suffers from the same status page gaslighting though.

jondubois · 7y ago · 4 in thread

I think that PaaS and BaaS where you don't have access to the back end is a dead end. It's going to go the way of Windows Server. Open source solutions will always win in the end when it comes to developers.

wild_preference · 7y ago

People also balked during the transition away from FTP. SSHing into servers is precisely the thing you want to get away from whether it's to change code or to hotfix your nginx.conf or to do a quick apt-get install.

Doesn't mean that we don't need SSH ever, but 99% of the time it's something we use because we're too lazy to set up automation.

I reckon you're using open-source here to mean self-hosted, but that doesn't really change anything. For example, the reason every small company I've worked at didn't have a way to analyze their logs/stderr and correlate them with other events for debugging was because they didn't, not because they couldn't.

anothergoogler · 7y ago

"FTP -> SSH -> proprietary console" does not look like evolution over a gradient of control to me. I don't understand why you're comparing FTP to SSH when SSH is lower level than FTP. FTP "throw it on the server and let mod_php deal with it" deployments were decidedly higher level than SSH-based ones. FTP deployments were often coupled with GUI-based steps, for instance database migrations run from a Drupal web app.

cortesoft · 7y ago

You are always going to have to rely on other service providers for critical things - networking, power, etc. I don’t think there is going to be some massive move for every business to be in control of every aspect of their supply chain. It simply isn’t feasible.

jondubois · 7y ago

It's different. If a business has the option to do something themselves and doing so would cost them less in the long run and give them more flexibility, then doing it themselves is a competitive advantage.

If having solar panels becomes consistently cheaper than buying electricity from the grid (per megawatt), then individuals and businesses will all switch to solar panels... Especially if the business uses a lot of electricity.

The main reason that PaaS solutions are popular now is advertising and hype. It's a bubble.

sampl · 7y ago · 3 in thread

Not really convinced firebase is “covering it up”.

The official status page breaks down availability by-service with descriptions of each outage and updates with timestamps.

https://status.firebase.google.com

013a · 7y ago

From the article: "When I tell our customers something is wrong outside of our control"

I think this is both the issue with the article, and the issue with Firebase (ironically).

First of all, it's an issue with Firebase. All software will break. This is inevitable. It's just a matter of time. Well-engineered software/infrastructure gives you, the consumer, tools to mitigate this so your consumers never see it. If we look at Amazon, they expose AZs and Regions; well-architected applications use these failure domains to accept that an AZ, and possibly even a region, will fail. So you can do failover.

Firebase really doesn't expose these primitives, in an effort to be simple and easy to use. Maybe they're doing something in the backend to use them, but the proof is in the pudding; if their stability is bad, it means they're not doing a good enough job at abstracting away these unavoidable failure domain principles.

Which brings us to the second problem: It's Always Your Fault. Stop trying to pass blame to Firebase. Your customers, seriously, full stop, unequivocally, no exceptions, do not care that Firebase caused you to go down. They care that you went down. You don't get to say "it's not our fault!"

Because It's Always Your Fault. It's your fault that you chose Firebase. It's your fault that you chose a service which doesn't expose core failure domain primitives that you can engineer to support. It's your fault for not getting off Firebase when you recognize these core issues with the platform.

Firebase's status page is for you, the engineer, to understand and diagnose issues. It's for you to interpret and surface on your own status page. It's not for you to link to your customers and say "see that red dot? that's why we went down."

And by the way, yes: even perfectly architected applications on AWS/Gcloud/Whatever, failing over across AZs and Regions, can go down due to things outside of your control. AWS ain't perfect. Remember: All software breaks. But when you word that to your customer, You Always Take The Blame. Period. This is what "it's always your fault" means; it's not about saying that there are ways to write an application that never breaks. It's about accepting that when (not if) it does break, your customers will blame you, so you need to accept that blame wholly.

dahart · 7y ago

> The official status page breaks down availability by-service

That’s part of the problem, actually. I’ve noticed for years that some Firebase service disruptions go unreported, and it was clear that reporting individual services was a way to avoid showing the end-to-end summary. It doesn’t matter that all of Firebase’s servers are up and running, if the end-to-end service they provide isn’t working.

Guidii · 7y ago

Firebase offers a variety of individual services, and most apps pick up only the services they need. So reporting service-by-service makes more sense.

pg_bot · 7y ago · 2 in thread

If you need to build a product that relies heavily on real time updates, I would look into using Elixir and Phoenix.[0] They nailed the channel abstraction, which is the main entry point for realtime communication over websockets. It takes me hours to build scalable realtime applications that would normally take me days using other systems. The language may take some time to get used to, and the ecosystem isn't as mature as other languages, but what is there is incredibly impressive.

[0]: https://phoenixframework.org/

com2kid · 7y ago

Firebase does a lot more, including a slew of Auth options that make life much easier.

Add to that the ability to resolve connections dropping out (common on mobile) and that their libraries have been ported all over the place, and Firebase is a de facto answer for mobile developers. It can be up and running in less than 30 minutes for someone who has 0 experience in cloud development.

It is hard to replicate that.

pg_bot · 7y ago

The common use cases for Firebase can be easily reproduced with Phoenix. Phoenix also comes with a handy presence feature that allows you to track whether someone is currently using the product. (Think: which users are present in a chat room.)

I understand the skepticism, but I would highly suggest taking a look and playing around. It's really, really good plus you get to fully own everything you build ;)

dotmanish · 7y ago · 2 in thread

I stopped using the realtime database once Firestore was released in beta, so I haven't experienced the downtime you have demonstrated in the status graphs. But Firebase's SLA [1] for the realtime database apparently guarantees service credit for monthly uptime less than 99.95%. To corroborate your observations, check if you received this credit:

Less than 99.95% but equal to or greater than 99.0%: 10% credit

Less than 99.0%: 30% credit

[1] https://firebase.google.com/terms/service-level-agreement/
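
To make the tiers above concrete, here is a small Python sketch that maps a month's downtime to the credit tier quoted in this comment. The thresholds come from the comment; the function names, the 30-day month, and the simple "minutes down divided by minutes in the month" uptime formula are assumptions for illustration, and the actual SLA defines downtime in its own terms.

```python
def uptime_percent(downtime_minutes, days_in_month=30):
    """Monthly uptime as a percentage, assuming a simple minutes-based measure."""
    total_minutes = days_in_month * 24 * 60
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


def sla_credit_percent(monthly_uptime_percent):
    """Credit tier quoted above: 10% below 99.95%, 30% below 99.0%."""
    if monthly_uptime_percent >= 99.95:
        return 0
    if monthly_uptime_percent >= 99.0:
        return 10
    return 30
```

Under these assumptions, 99.95% of a 30-day month allows about 21.6 minutes of downtime, so a single 2 hour outage (roughly 99.72% uptime) would fall in the 10% tier.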

joeblau · 7y ago

Is Firestore a more reliable version of Firebase's real-time database?

dotmanish · 7y ago

It's a different database altogether - document-oriented at that.

https://firebase.google.com/docs/firestore/rtdb-vs-firestore

nslog · 7y ago · 2 in thread

Check out the AWS offerings (Amplify + AppSync) if you're rolling off Firebase: https://aws-amplify.github.io https://docs.aws.amazon.com/appsync/latest/devguide/welcome....

jaxondu · 7y ago

Amplify+AppSync client SDK support is pathetic compared with Firebase. No official support for Flutter, Xamarin and Unity apps.

mxuribe · 7y ago

I never heard of these offerings. Thanks for mentioning them!

novaleaf · 7y ago · 2 in thread

I think that Firebase is now built on Google Cloud Datastore. I have used Datastore in production since 2015, and have had no outages, but if I had to do it again I think I'd go with a normal RDB, just because query support is extremely limited (no full text search) and because of "schema change == data rebuild" issues.

dfee · 7y ago

Do you mean that you’d use something like a managed Postgres AND build and run a backend service that interfaces a web client to that database?

novaleaf · 7y ago

yeah exactly.

iamleppert · 7y ago · 2 in thread

Serves you right for using a “real-time database” (whatever that is). I’m sure your chat product feature could have been designed using a flat file as a datastore and a simple web socket server.

dang · 7y ago

Please don't be a jerk on Hacker News. The idea here is: if you have a substantive point to make, make it thoughtfully; if you don't, please don't comment until you do.

https://news.ycombinator.com/newsguidelines.html

dahart · 7y ago

> Serves you right for using a “real-time database” (whatever that is).

I see this is flagged, but FWIW, you might want to actually learn something about what they mean by “realtime database” because it’s incredibly useful, and people using Firebase aren’t the only people who think so.

https://en.m.wikipedia.org/wiki/Real-time_database

Firebase is also easy to use and scales to large sites and complex applications, despite the complaints here about reliability, reporting and control, or lack of. A flat file and simple web socket server crumbles under loads that Firebase handles easily.

crystaln · 7y ago · 1 in thread

Firebase is really awesome. However, these kinds of reliability issues and the lack of integrity and communication with which Google handles such things are major reasons I would avoid committing to it. On top of that, Google's history of overlapping products (Firebase or Firestore?) and of discontinuing products or dragging its feet on support makes decisions confusing and commitment harrowing.

Amazon on the other hand has a history of committing to clear product direction which makes committing to their platforms much easier. Amplify and AppSync for instance feel like safer choices.

nslog · 7y ago

The Amplify and AppSync models are also architecturally more scalable as you don't have one big opaque DB and endpoint in a single region.

ankit219 · 7y ago · 1 in thread

Yeah, we have suffered too. Initially we were using Firebase Real Time DB for authentication as well as delivering messages. Messages suffered outages every now and then (and we suffered more because our backend is in Python Django and Pyrebase comes with its own set of issues on top of Firebase). When we found out messages weren't being delivered, we switched to Pusher as a backup first and then to websockets. Now we use Firebase only for authentication (via the realtime database) and for sending notifications, and still have a backend/app trigger every time there is an error on Firebase.

I have always wondered what a reliable backup to the realtime db could be. Haven't found much to date.

nslog · 7y ago

AWS AppSync and Amplify

EZ-E · 7y ago

We had the same exact problem with Firebase Realtime Database. Our product uses it heavily and is dependent on its latency, so we notice any time an issue appears.

The unacceptable thing is that not only are outages fairly common, but many smaller, briefer outages and disruptions are not even reported. For example, the day after the 2 hour outage mentioned in the article, there was an issue where writes to the database seemingly succeeded, but the clients listening for changes would NOT receive the notification that the data they were observing had been updated, for more than 30 minutes. It wasn't reported in Firebase's status dashboard.

Google bought Firebase back then, and to replace Firebase Realtime Database, Google developed Firebase Firestore (now in beta). I suspect that Firebase Realtime Database isn't receiving much attention these days and that the service will be closed after some time.

xrd · 7y ago

Have to say, having worked in a huge organization with multiple clients accessing services, I much prefer the firebase solution. You still have downtime in any polyglot solution and the problem is pretty clear here (it's firebase database, not one of dozens of legacy layers...). When you own the entire stack it is amazing how much of the organizational effort goes into obscuring who is responsible. And the stack is much more opaque.

It really is possible to design a system around Firebase with a much smaller team. You give up control, but control is a myth anyway. And Firestore is actually designed to support offline mode, so I wonder if they neglected to design for that feature, which might help here.

The unfortunate reality is that we are in a moment where Firestore is beta and Firebase Database is not supported as it should be. Google should do a better job of helping people to migrate and explaining the roadmap. I imagine the writer of this article just doesn't have as much company clout to get that level of involvement from Google. This was probably an attempt to get that attention that other higher paying clients can get.

joeblau · 7y ago

I just built a service/website last weekend using App Engine and Firebase. After reading these comments and this blog post, I think I might migrate it over to AWS. I didn't realize that Firebase was so unreliable.

romed · 7y ago

Firebase RTDB is basically the legacy product that barely works. Firestore is the post-acquisition product built on Google tech. It’s a rotten situation. I noticed the outage mentioned in this post because it took down Ford GoBike (and Citibike and all the other Motivate/Lyft bike share systems).

burtonator · 7y ago

I've been thinking about implementing Firebase as part of Polar: https://getpolarized.io/

The idea is that you update your documents (PDF, HTML, etc) into Polar, tag them, and then we sync them to the cloud. Then when you go to another machine like work or home your documents are always synchronized.

At first I fell in love with Firebase and was very very excited to start implementing it.

They've spent a ton of time working on the initial implementation experience.

Their Firebase Auth support was amazingly simple to set up. Same with Firebase hosting. It's top notch. You can be up and running with CDN hosting with SSL in like 2 minutes, and the firebase tools are exceptional.

Cloud Firestore seems really interesting and easy to set up. It's basically designed for 'apps', i.e. user-facing apps, and works pretty well if all the data is private to the user.

I do struggle with these issues of reliability though. At Datastreamer (http://www.datastreamer.io/) we use Hetzner and have about a half petabyte stored there.

It's a blog content search engine which we license to other startups so high availability is critical.

Their infra is amazingly reliable. Very very happy here.

The problem of course is that you then have to manage your own software stack which of course requires extra effort on your part.

ramkalari · 7y ago

How does firestore fare in terms of reliability? I heard it is a cleaner and more scalable version of Firebase.
