Why are websites sometimes “down for maintenance”?

(softwareengineering.stackexchange.com)

57 points · arachnids · 9y ago · 47 comments

43 comments · 17 top-level

Demcox9y ago· 7 in thread

"petrabytes"...really?

The most upvoted comment forgot how to spell (or doesn't know) to petabyte.

welly9y ago

Or it was a typo or they were thinking of something else at the time or they were typing their response on their mobile or any other number of reasons other than "forgot how to spell" or "doesn't know how to spell".

I swear criticising someone's spelling is the last bastion in an argument/debate/discussion. When you haven't got anything else, attack their spelling.

I'm not saying that you're getting into an argument or debate but come on, you know what the guy meant.

Demcox9y ago

I'm sorry that you feel this way, but being a CS student, ones know the horrors such simple spelling errors can unravel in codes, machine architecture and programming to name a few.

Correct spelling and grammar is what makes your OS function, the doses of medicine prescribed correct and it can sometimes be the difference between life and death.

I view it as the foundation for any thing important that you want to communicate.

nicky09y ago

I feel it necessary to point out the erroneous extra "to" in your sentence.

DanBC9y ago

Skitt's Law!

http://knowyourmeme.com/memes/skitt-s-law

dkersten9y ago

Or they were typing on a phone which "autocorrected" it. Happens to me all the time.

cyberferret9y ago

I posit that the offender may have rather large thumbs. Because if you look very very carefully at the standard QWERTY keyboard, you will notice that the letter 't' and the letter 'r' are in fact located precisely adjacent to each other.

In fact, from the sentence structure, I know that he (definitely male, aged in mid thirties, right handed, wears a hand knitted cardigan, keeps a cucumber sandwich in his briefcase) was typing this on an iPhone 4S (still running iOS 8) on the 5:30pm train from Campbeltown and reached the middle of the word just as the train entered the Chuddingsworth tunnel. There is a slight offset in the tracks there that causes the train carriages to lurch to the left a little, thus causing his thumb to slide across just that little bit after he hit the 't' in 'petabyte'. If you need any more info, please do not hesitate to contact me: 222b@holm.es

grkvlt9y ago

Would the train not need to lurch to the right for the thumb to slide to the left? Since, based on the sentence structure, it is obvious that the writer always sits facing the direction of travel in a train... ;)

sametmax9y ago· 6 in thread

No server NEEDS to go down for maintenance. You can avoid doing so for anything, at any scale, DB change, server updates, etc.

The problem is that a 0-downtime system, at a certain scale, is very costly to create and maintain. You need redundancy everywhere, load balancing everywhere, data replication, synchronization. Those are hard problems.

Basically you need to get to the level of being able to release the Netflix Chaos Monkey in prod to be sure it works even if part of your system is busy with the update, or just out of sync. This is certainly doable. It's also very expensive, and requires a lot of time and many experts to work on the problem.

Putting a site in maintenance mode can be a middle ground you choose, because you don't want to invest that much just to avoid taking down your site for a little time once in a while.

Economics.

Of course, if you do choose the road of zero downtime, your site will gain more than just availability: it will gain reliability as well, since those best practices serve both purposes.

amelius9y ago

One of the biggest problems is migrating your data to match updated code, while the system keeps running.

It's like changing the engine of a driving car.

sametmax9y ago

Yes, but again, you can create a system that allows that from the beginning. It's just very, very costly to dev. And if you have legacy code, then a rewrite is even more prohibitive.

jholman9y ago

It's like changing the beating heart of a living person. That's probably impossible!

ebbv9y ago

This comment makes a lot of assumptions; the primary one being you are able to completely build your infrastructure from scratch. Which just isn't the case for the majority of companies or web sites out there.

mseebach9y ago

Of course they're able to. They just (rationally, typically) don't want to spend the money and effort getting there, and so they don't.

sametmax9y ago

The question is "why do they do it". And the answer is "cost".

It's not:

- because it can't be done;
- because we don't know the way to do it;
- because it's better that way.

It's because it's very expensive. In your case, either because you don't create your whole system, or have a legacy system to use, doing it would be prohibitive. It's still a matter of cost.

greenleafjacob9y ago· 5 in thread

Always avoidable if that's a priority - schema changes can be done online in MySQL. Patches can be done on subsets of servers. Erlang even supports hot code reloading so that even if you had a single point of failure you can upgrade without losing file descriptors or in memory state. It is a lot simpler if you have the choice though, since you don't have to have multiple versions online at the same time. "Divisions of Ericsson that do [hot code reloading] spend as much time testing them as they do testing their applications themselves." [1]

[1]: http://learnyousomeerlang.com/relups
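A rough sketch of "patches can be done on subsets of servers", assuming a load balancer that can drain and re-add hosts. The names `rolling_update`, `batch_size` and `upgrade` are made up for illustration and do not come from any particular tool; the function only simulates the bookkeeping and reports the lowest number of hosts that stayed in service.

```python
def rolling_update(servers, batch_size, upgrade):
    """Upgrade servers one batch at a time.

    Returns the smallest number of servers that were in service at any
    point, which is the capacity the site has to survive on.
    """
    in_service = set(servers)
    min_in_service = len(in_service)
    for start in range(0, len(servers), batch_size):
        batch = servers[start:start + batch_size]
        # Take the batch out of rotation (hypothetical load balancer drain).
        in_service.difference_update(batch)
        min_in_service = min(min_in_service, len(in_service))
        for server in batch:
            upgrade(server)
        # Put the upgraded batch back into rotation.
        in_service.update(batch)
    return min_in_service


versions = {name: 1 for name in ["a", "b", "c", "d", "e"]}
lowest = rolling_update(list(versions), 2, lambda s: versions.__setitem__(s, 2))
```

With five hosts and batches of two, three hosts keep serving at every step. The catch mentioned in the comment still applies: old and new versions are online at the same time during the rollout.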

endymi0n9y ago

There are more, and more nuanced, reasons actually:

1. Companies don't know how to do the engineering for maximum uptime, like you describe. It's way more complicated than the usual CRUD operations

2. Companies know how to but they decide not to invest this time (we often traded one hour of downtime against 2-3 man-days for preparing online schema changes with nasty and inconsistent backfilling in the early days).

And 3. Don't forget disaster recovery. I've seen some of the smartest companies go down for hours due to a DB misconfiguration, or a rack PSU faulting with only one side of the servers connected, even with a reasonably highly available setup. Stuff like this happens - and then you better have a proper 503 Maintenance page up and running to prevent Google from delisting your site. In this case though, "maintenance" is rather a euphemism :)
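For what a "proper 503 Maintenance page" might look like, here is a minimal sketch using only Python's standard `http.server`. The one-hour `RETRY_AFTER_SECONDS` value, the message text and the `make_server` helper are assumptions for illustration; in practice this would more likely be a rule in the load balancer or web server than a separate process.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

RETRY_AFTER_SECONDS = 3600  # assumed maintenance window, purely illustrative


class MaintenanceHandler(BaseHTTPRequestHandler):
    """Answers every GET with 503 plus a Retry-After hint."""

    def do_GET(self):
        body = b"Down for maintenance. Please try again later.\n"
        self.send_response(503)  # Service Unavailable: temporary, not gone
        self.send_header("Retry-After", str(RETRY_AFTER_SECONDS))
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet


def make_server(port=0):
    """Bind to localhost; port 0 lets the OS pick a free port."""
    return HTTPServer(("127.0.0.1", port), MaintenanceHandler)
```

Calling `make_server(8080).serve_forever()` would serve the page until interrupted. The point is the status code: a 503 signals a temporary condition, whereas a 200 with an error page or a 404 could be read as the real content of the site.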

gaius9y ago

Companies know how to but they decide not to invest this time

Cost increases exponentially for diminishing returns once you get into serious availability. For most businesses, the investment in moving from 99.9% to 99.99% or 99.99% to 99.999% uptime just isn't worth it - most customers are quite willing to "try again later" in practice, especially if you give them advance notice or have a regular maintenance slot.
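To put numbers on those availability levels, a small back-of-the-envelope calculation, assuming a 365-day year and counting all downtime equally (the function name is just for illustration):

```python
def downtime_minutes_per_year(availability_percent: float) -> float:
    """Downtime budget per year implied by an availability percentage."""
    minutes_per_year = 365 * 24 * 60  # 525,600 minutes
    return minutes_per_year * (1 - availability_percent / 100)


for nines in (99.9, 99.99, 99.999):
    print(f"{nines}% -> {downtime_minutes_per_year(nines):.2f} minutes/year")
```

That works out to roughly 8.8 hours a year at 99.9%, about 53 minutes at 99.99%, and a little over 5 minutes at 99.999%. A single one-hour maintenance window already uses up more than the yearly budget at 99.99%.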

TheAceOfHearts9y ago

From a purely abstract point of view it's probably avoidable, but I'd conjecture many teams don't have the collective knowledge to effectively pull it off. Even if you plan things out carefully, something usually goes wrong :(. It only takes a small oversight to have it come crashing down. I think it's better to let your customers know ahead of time that you'll be performing maintenance, assume that something will go down, but still try to avoid it anyway.

MySQL only added support for online schema migrations with 5.6; prior to that you had to use a tool like pt-online-schema-change. I've heard claims (which I haven't verified, so it's entirely possible they are incorrect) of performance issues when performing migrations, which effectively bring the database down. Doesn't RDS sometimes require downtime for maintenance and upgrades? Is there any safe way around that?

abritinthebay9y ago

Also immutable deploys.

icebraining9y ago

There's no such thing if you have a database.

ploggingdev9y ago· 5 in thread

I don't recall sites like Google or Facebook ever being down for maintenance. Are there any articles that discuss how they manage application layer and database layer migrations?

endymi0n9y ago

A good start would be all of http://highscalability.com - but it more or less boils down to being able to roll back, and that rules out hard schema changes. So the proper and hard way is always a variant of:

1) Create another column.
2) Write to both columns at the same time from the database.
3) Create code to run on the new column.
4) Enable the feature switch to run everything on the new column.
5) Remove the code dealing with the old column.
6) Remove the old column.
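The steps above can be sketched with an in-memory SQLite database. This is only an illustration under stated assumptions: the `users` table, the `name` and `full_name` columns, and the helper names are all hypothetical, and the explicit backfill of old rows is an added assumption that is not one of the listed steps. Steps 5 and 6 are left out because they are just deletions once nothing reads the old column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES ('ada')")  # written before the migration

# Step 1: add the new column next to the old one.
conn.execute("ALTER TABLE users ADD COLUMN full_name TEXT")


def save_user(conn, name):
    """Step 2: write to both columns so either one can serve reads."""
    cur = conn.execute(
        "INSERT INTO users (name, full_name) VALUES (?, ?)", (name, name)
    )
    return cur.lastrowid


def backfill(conn):
    """Assumed extra step: copy rows that predate the dual writes."""
    cur = conn.execute(
        "UPDATE users SET full_name = name WHERE full_name IS NULL"
    )
    return cur.rowcount  # number of rows that needed fixing


def load_name(conn, user_id, use_new_column):
    """Steps 3 and 4: a feature switch decides which column is read."""
    column = "full_name" if use_new_column else "name"  # fixed choices only
    row = conn.execute(
        f"SELECT {column} FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    return row[0]


new_id = save_user(conn, "grace")
backfilled = backfill(conn)
```

Because the old column keeps receiving writes, turning the switch back off is a safe rollback at any point before step 6.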

If that looks complicated, it is - and you better only start with these things if your site earns more money per minute than you need to pay engineers and project managers to pull that off.

jholman9y ago

This is correct, except your step 4. It should say something like: 4a) Enable feature switch on 1% of requests, ensure that they're working correctly. 4b) go to 10%. 4c) start rolling it out across all requests.
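One way such a percentage switch could work, as a sketch: hash a stable request or user identifier into one of 100 buckets and enable the new path for the buckets below the current percentage. The function name `in_rollout` and the choice of SHA-256 are assumptions for illustration, not a description of any specific feature-flag product.

```python
import hashlib


def in_rollout(request_id: str, percent: int) -> bool:
    """True if this id falls inside the first `percent` of 100 buckets."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket 0..99
    return bucket < percent
```

Since the bucket depends only on the id, anything enabled at 1% stays enabled at 10% and at 100%, so the same requests keep hitting the new column as the rollout widens.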

tyingq9y ago

Both have the advantage of not having to present consistent data to end users.

dotancohen9y ago

I actually have seen Google down once, I might even have a screenshot. An immediate F5 (possibly after the screenshot) showed them back up. I'm not sure if it was my local Google office down (Israel), but the message was in English.

brianwawok9y ago

Google is so big that some % of it is always down. There's just a low chance of hitting it, and if you do a reload you will hit another server.

visarga9y ago· 2 in thread

They need to change the oil.

dingaling9y ago

Which tangentially is why the USAF's E-4B airborne command posts have to land after about three days. Fuel isn't a problem but they don't have a way to replenish engine-oil in-flight.

arachnidsOP9y ago

Haha. I liked one of the answers further down in the thread - it points out that there is a cost associated with making your service smoothly upgradable, probably in both engineering time and in hardware. It's possible a lot of these companies are balancing this against the cost of just being offline for an hour or so, and making the rational choice.

nickjackson9y ago· 1 in thread

I really don't think there is any excuse for it in this day and age, especially when building sites from scratch. There are so many different techniques and technologies for doing zero downtime deploys, not to mention the numerous PaaS that will do it out of the box if you don't know how.

viraptor9y ago

There's still cost to it. It basically boils down to: do you lose more money during a manual maintenance period, or by hiring extra people to do all changes in zero-downtime style. (Or doing slower development with the existing team) The technology for transparent changes has been available for decades, although it's true - it's much easier to use today. But it still needs extra work. And someone has to pay for that work in the end.

zer00eyz9y ago

Reasons I have been "down for maintenance" in the past.

- Moving from AWS to our own datacenter.
- Payment processor issues. We weren't making money with the payment processor down... “down for maintenance” meant lower customer service costs.
- Because the CEO told me to. I shit you not. Be wary of working for someone who has a name that sounds like it belongs on a Bond villain.
- Because sometimes you NEED all the resources to get something done quickly.
- In the days before AWS and "cloud computing" you only had hardware on hand. It is hard to get your boss to budget for a traffic spike of one hour that is greater than the sum of the previous 6 months of traffic.
- Because non-technical people have access to technology: "It was just some javascript" -or- "I didn't think I needed to tell you before I emailed 5 million people with an offer for free stuff" -or- "why is everything on sale for 25% off ...."
- Because load and time and complex systems sometimes do funny things together; "maintenance" means we're finally getting enough data to reproduce it.
- The very beginning of a DDoS attack (only for some industries & sites).

sigi459y ago

Because of things they had not thought of.

You don't see 'Maintenance' on the systems of companies which have been doing this for a long time. You might see it at 'normal' companies: smaller ones who used the 'wrong' database and had to migrate it.

If you start with one database and 'forget', or just don't think about having a master, slave, slave combination, you have to fix that once.

When you've made a mistake, you have to fix it once.

Also, today you are able to maintain quite a big page with a very small number of people. The chance that one of them didn't think about all the necessary elements of an always-online system is not far-fetched.

e0m9y ago

"They're replacing the vacuum tubes in the servers"

curt159y ago

I've always wondered whether Apple takes its website "down for maintenance" before a product launch out of necessity or simply to build excitement.

seanwilson9y ago

Common causes are things like software upgrades and database changes. There's probably always a way to avoid it but going down for maintenance might be less effort and cheaper overall depending on the site. For example, if you can do it during a known time of low traffic or when you know users will just come back later. I've noticed several UK bank websites go down for maintenance during the night.

tyingq9y ago

The short answer is cost versus benefit.

For some types of websites, zero-downtime upgrades and maintenance are costly.

Online banking is a good example. I have accounts with several banks, and all of them periodically "go down for maintenance". I assume that's because the talent and infrastructure needed to do those tasks with zero downtime are more expensive than whatever customer service hit they take for planned outages.

petters9y ago

Because it is much easier than performing complicated modifications while the site is running.

For example, at Google "down for maintenance" is not on the table. That can in some cases lead to lots of extra work or time, e.g. dual writes for a period of time followed by mapreduces to fix the remaining part.

My internet bank is often down for maintenance on Sunday nights. I assume it is because they have a very old system.

heisenbit9y ago

PHP board software:

- occasionally benefits from clean-up tasks which can be long running and would result in an irritating experience. While slow read operations in theory may be possible it is better to tell the users to come back later than to erode their confidence.

- sometimes the database of a board can become corrupted. The repair operations (sort of a disk fsck for the board) require exclusive access to the database.

- software upgrades

fuzzfactor9y ago

Not every aircraft has all the expertise, tools, and spares on board at all times to be able to service or replace their engines in flight.

If the system has not been designed from the ground up for that type of service, then the on-board expertise would also have to be gifted at developing workarounds on-the-spot that reliably work the first time.

formula_ninguna9y ago

Because computers also need to rest sometimes.

protomyth9y ago

Mistakes were made during the deploy of the new website to production. A failed website deploy is a bit more noticeable to the public than the failed deployment of an internal only system.
