The most upvoted comment forgot (or doesn't know) how to spell petabyte.
I swear criticising someone's spelling is the last bastion in an argument/debate/discussion. When you haven't got anything else, attack their spelling.
I'm not saying that you're getting into an argument or debate but come on, you know what the guy meant.
Correct spelling and grammar are what make your OS function and keep prescribed doses of medicine correct, and they can sometimes be the difference between life and death.
I view it as the foundation for anything important that you want to communicate.
In fact, from the sentence structure, I know that he (definitely male, aged in mid thirties, right handed, wears a hand knitted cardigan, keeps a cucumber sandwich in his briefcase) was typing this on an iPhone 4S (still running iOS 8) on the 5:30pm train from Campbeltown and reached the middle of the word just as the train entered the Chuddingsworth tunnel. There is a slight offset in the tracks there that causes the train carriages to lurch to the left a little, thus causing his thumb to slide across just that little bit after he hit the 't' in 'petabyte'. If you need any more info, please do not hesitate to contact me: 222b@holm.es
The problem is that a 0-downtime system, at a certain scale, is very costly to create and maintain. You need redundancy everywhere, load balancing everywhere, data replication, synchronization. Those are hard problems.
Basically you need to reach the level of being able to release the Netflix Chaos Monkey in prod to be sure it works even if part of your system is busy with the update, or just out of sync. This is certainly doable. It's also very expensive, and requires a lot of time and many experts to work on the problem.
Putting a site in maintenance mode can be a middle ground you choose, because you don't want to invest that much just to avoid taking down your site for a little time once in a while.
Economics.
Of course, if you do choose the road of zero downtime, your site will gain more than just availability; it will gain reliability as well, since those best practices serve both purposes.
It's like changing the engine of a moving car.
It's not:
- because it can't be done;
- because we don't know the way to do it;
- because it's better that way.
It's because it's very expensive. In your case, either because you didn't build your whole system yourself or because you have a legacy system to use, doing it would be prohibitive. It's still a matter of cost.
1. Companies don't know how to do the engineering for maximum uptime, like you describe. It's way more complicated than the usual CRUD operations.
2. Companies know how to but they decide not to invest this time (we often traded one hour of downtime against 2-3 man-days for preparing online schema changes with nasty and inconsistent backfilling in the early days).
And 3. Don't forget disaster recovery. I've seen some of the smartest companies go down for hours due to a DB misconfiguration, or a rack PSU faulting with only one side of the servers connected, even with a reasonably highly available setup. Stuff like this happens - and then you'd better have a proper 503 Maintenance page up and running to prevent Google from delisting your site. In this case though, "maintenance" is rather a euphemism :)
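For what a "proper" 503 page might look like, here is a minimal sketch using only the Python standard library. It answers every request with status 503 plus a Retry-After header, which tells crawlers and clients the outage is temporary. The helper names, the one-hour retry value and the port are my own assumptions for illustration, not anything from a particular setup.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Assumed length of the maintenance window, in seconds (illustrative value).
RETRY_AFTER_SECONDS = 3600


def maintenance_response(retry_after=RETRY_AFTER_SECONDS):
    """Build (status, headers, body) for a temporary-outage reply."""
    body = b"<h1>Down for maintenance</h1><p>Please try again later.</p>"
    headers = {
        "Content-Type": "text/html; charset=utf-8",
        "Content-Length": str(len(body)),
        # Retry-After signals that the outage is temporary.
        "Retry-After": str(retry_after),
    }
    return 503, headers, body


class MaintenanceHandler(BaseHTTPRequestHandler):
    """Replies 503 to every path, so no URL looks 'gone' or 'fine but empty'."""

    def _reply(self):
        status, headers, body = maintenance_response()
        self.send_response(status)
        for name, value in headers.items():
            self.send_header(name, value)
        self.end_headers()
        if self.command != "HEAD":
            self.wfile.write(body)

    do_GET = do_POST = do_HEAD = _reply


def serve(port=8080):
    """Run the maintenance server until interrupted (hypothetical entry point)."""
    HTTPServer(("127.0.0.1", port), MaintenanceHandler).serve_forever()
```

Calling serve() starts it locally. In practice this usually lives in the load balancer or web server config rather than in application code, so that it still works when the application itself is down.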
Cost increases exponentially for diminishing returns once you get into serious availability. For most businesses, the investment in moving from 99.9% to 99.99% or 99.99% to 99.999% uptime just isn't worth it - most customers are quite willing to "try again later" in practice, especially if you give them advance notice or have a regular maintenance slot.
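To put numbers on those tiers, here is a quick sketch of the downtime budget each availability target leaves per year. The function name and the 365-day year are my own assumptions.

```python
def allowed_downtime_minutes(availability_percent, period_days=365):
    """Minutes of downtime permitted per period at a given availability."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


for target in (99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% uptime leaves about {minutes:.1f} minutes of downtime per year")
```

That works out to roughly 8.8 hours, 53 minutes and 5 minutes per year respectively. Each extra nine cuts the budget by a factor of ten, which is why a regular one-hour maintenance slot fits comfortably in 99.9% but not in 99.99%.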
MySQL only added support for online schema migrations with 5.6; prior to that you had to use a tool like pt-online-schema-change. I've heard claims (which I haven't verified, so it's entirely possible they are incorrect) of performance issues when performing migrations, which effectively bring the database down. Doesn't RDS sometimes require downtime for maintenance and upgrades? Is there any safe way around that?
If that looks complicated, it is - and you'd better only start with these things if your site earns more money per minute than you need to pay engineers and project managers to pull that off.
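For anyone wondering what such a tool does in general terms, the usual idea is a shadow table: create a copy with the new schema, mirror live writes into it with triggers, copy existing rows over in small chunks, then swap the names. Below is a toy sketch of that idea against an in-memory SQLite database, not MySQL and not the actual implementation of any tool. The table, column and function names are invented for illustration, and on_chunk is a hypothetical hook that only exists to simulate traffic arriving mid-copy.

```python
import sqlite3


def online_add_column(conn, chunk_size=1000, on_chunk=None):
    """Add an 'email' column to 'users' via a shadow table (toy example)."""
    cur = conn.cursor()
    # 1. Shadow table with the new schema.
    cur.execute(
        "CREATE TABLE users_new "
        "(id INTEGER PRIMARY KEY, name TEXT, email TEXT DEFAULT '')"
    )
    # 2. Triggers mirror writes that arrive while the copy is running.
    cur.execute(
        "CREATE TRIGGER users_ins AFTER INSERT ON users BEGIN "
        "INSERT OR REPLACE INTO users_new (id, name) VALUES (NEW.id, NEW.name); "
        "END"
    )
    cur.execute(
        "CREATE TRIGGER users_upd AFTER UPDATE ON users BEGIN "
        "DELETE FROM users_new WHERE id = OLD.id; "
        "INSERT OR REPLACE INTO users_new (id, name) VALUES (NEW.id, NEW.name); "
        "END"
    )
    cur.execute(
        "CREATE TRIGGER users_del AFTER DELETE ON users BEGIN "
        "DELETE FROM users_new WHERE id = OLD.id; "
        "END"
    )
    # 3. Copy existing rows in small chunks so no single statement runs long.
    last_id = 0
    while True:
        rows = cur.execute(
            "SELECT id, name FROM users WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, chunk_size),
        ).fetchall()
        if not rows:
            break
        # OR IGNORE: rows already mirrored by a trigger are newer, keep them.
        cur.executemany(
            "INSERT OR IGNORE INTO users_new (id, name) VALUES (?, ?)", rows
        )
        conn.commit()
        last_id = rows[-1][0]
        if on_chunk is not None:
            on_chunk(conn)  # hypothetical hook: simulate concurrent traffic
    # 4. Drop the triggers and swap the tables in one transaction.
    cur.execute("BEGIN")
    for trigger in ("users_ins", "users_upd", "users_del"):
        cur.execute(f"DROP TRIGGER {trigger}")
    cur.execute("ALTER TABLE users RENAME TO users_old")
    cur.execute("ALTER TABLE users_new RENAME TO users")
    conn.commit()
```

The performance worry is visible even in the toy: every write is doubled by the triggers while the copy runs, and the copy itself competes with normal traffic, so chunk size and throttling matter a lot on a busy database.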
- Moving from AWS to our own datacenter.
- Payment processor issues. We weren't making money with the payment processor down... "down for maintenance" meant lower customer service costs.
- Because the CEO told me to. I shit you not. Be wary of working for someone who has a name that sounds like it belongs on a Bond villain.
- Because sometimes you NEED all the resources to get something done quickly.
- In the days before AWS and "cloud computing" you only had hardware on hand. It is hard to get your boss to budget for a traffic spike of one hour that is greater than the sum of the previous 6 months of traffic.
- Because non-technical people have access to technology: "It was just some JavaScript" -or- "I didn't think I needed to tell you before I emailed 5 million people with an offer for free stuff" -or- "why is everything on sale for 25% off ...."
- Because load and time and complex systems sometimes do funny things together, "maintenance" means we're finally getting enough data to reproduce it.
- The very beginning of a DDoS attack (only for some industries & sites).
You don't see 'Maintenance' on the systems of companies that have been doing this for a long time. You might see it at 'normal' companies, or at smaller ones who used the 'wrong' database and had to migrate it.
If you start with one database and 'forget', or just don't think, to set up a master, slave, slave combination, you have to fix that once.
When you've made a mistake, you have to fix it once.
Also, today you can maintain quite a big page with a very small number of people. The chance that one of them didn't think about all the necessary elements of an always-online system is not far-fetched.
For some types of websites, zero-downtime upgrades and maintenance are costly.
Online banking is a good example. I have accounts with several banks, and all of them periodically "go down for maintenance". I assume that's because the talent and infrastructure needed to do those tasks with zero downtime are more expensive than whatever customer service hit they take for planned outages.
For example, at Google "down for maintenance" is not on the table. That can in some cases lead to lots of extra work or time, e.g. dual writes for a period of time followed by mapreduces to fix the remaining part.
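A very rough sketch of the dual-write idea, with plain dicts standing in for the old and new datastores and a simple loop standing in for the batch job (a mapreduce in the comment above). All the names here are made up for illustration; this is not how any particular company does it.

```python
class DualWriteStore:
    """During the migration, every write goes to both stores.

    Reads keep coming from the old store, which stays the source of
    truth until the backfill has finished and been verified.
    """

    def __init__(self, old, new):
        self.old = old
        self.new = new

    def put(self, key, value):
        self.old[key] = value
        self.new[key] = value

    def get(self, key):
        return self.old[key]


def backfill(old, new):
    """Copy records written before dual writes started.

    Keys already present in the new store were dual-written and are at
    least as fresh, so they are left alone. Returns the number copied.
    """
    copied = 0
    for key, value in list(old.items()):
        if key not in new:
            new[key] = value
            copied += 1
    return copied
```

Usage would be: switch writers to DualWriteStore, run backfill(old, new), compare the two stores, then move reads over to the new one and retire the old. The extra work mentioned above is exactly this: running two write paths for a while, plus the batch job and the verification.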
My internet bank is often down for maintenance on Sunday nights. I assume it is because they have a very old system.
- occasionally benefits from clean-up tasks which can be long-running and would result in an irritating experience. While slow read operations may be possible in theory, it is better to tell the users to come back later than to erode their confidence.
- sometimes the database of a board can become corrupted. The repair operations (sort of a disk fsck for the board) require exclusive access to the database.
- software upgrades
If the system has not been designed from the ground up for that type of service, then the on-board expertise would also have to be gifted at developing workarounds on-the-spot that reliably work the first time.