you also don't want your automation guessing at what the problem is, or what the effects are. you want real info from a real person even if it isn't given to you the millisecond you look for it.
this is why status pages aren't updated by automation. if they're updated by a person, you know that people know about the problem and that people are working on it, which is good. but while they figure out what's going on, you see a "green" status page.
this is normal.
(this is for future readers, more than the person I am replying to.)
Approached that way, a status page is almost useless, since it is not reliable and is only updated after I've already found out about the problem via other sources.
I am perfectly happy with a status page that shows the, mm, status of the service. It could be as simple as "not reachable", "slower than usual", or similar generic information (a traffic light). I disagree that a status page has to show the why of the error, although of course that would be nice.
Actually, it looks like the metrics part of Reddit's status page broke over two weeks ago.
With proper reporting it's trivial to know which subsystem is experiencing problems, if any. It doesn't have to be very granular, just "normal", "experiencing issues", "offline". If reporting doesn't work, you should be alerted it doesn't work, and if alerting doesn't work, there needs to either be out-of-band alerting for that or someone monitoring the status at all times.
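As a minimal sketch of what that coarse-grained reporting could look like, here's a toy mapping from per-subsystem check results onto the three states. The subsystem names, error-rate threshold, and check results are made up for illustration; real values would come from your monitoring system:

```python
from enum import Enum

class Status(Enum):
    NORMAL = "normal"
    ISSUES = "experiencing issues"
    OFFLINE = "offline"

def subsystem_status(error_rate: float, reachable: bool) -> Status:
    """Map raw check results onto the three coarse states."""
    if not reachable:
        return Status.OFFLINE
    if error_rate > 0.05:  # arbitrary example threshold
        return Status.ISSUES
    return Status.NORMAL

# Hypothetical check results per subsystem; in practice these would be
# produced by probes or metrics queries, not hard-coded.
checks = {
    "api":      {"error_rate": 0.01, "reachable": True},
    "webhooks": {"error_rate": 0.12, "reachable": True},
    "pages":    {"error_rate": 0.00, "reachable": False},
}

for name, result in checks.items():
    print(name, "->", subsystem_status(**result).value)
```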
Manual overrides for status pages should exist for when the automation doesn't work of course.
At my last job we had a big Grafana screen in the office that we watched, and we usually saw problems before the alerting kicked in - it had about a minute of delay. Outside office hours, the on-call engineer received alerts. It wasn't technically or organisationally complex.
"The whole point" (as you put it) of status pages was to publish high-level monitoring data to users. The monitoring process should occur outside the system that is being monitored, perhaps even on a different cloud.
Eventually, many companies realized this revealed expensive SLA violations and ended that level of transparency.
Your status page can and should report important metrics to users, like elevated error rates. Most status pages used to.
As a user, you often don't know if the vendor's system is really down or if there's something wrong with your own system.
At least that's what AWS Health[1] looks like to me.
Seems like a huge spike in load.
Spikes in request latency can be caused by a bunch of things, including more traffic, but in my experience it's usually a missing optimization for some data structure that only gets triggered after N items, or a new deploy containing code that wasn't as optimal as its author thought. This is especially true in distributed systems, where sub-optimal code in one part can cascade performance issues to other parts of the system.
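As a toy illustration of the "triggered after N items" case (purely illustrative, not tied to any particular incident): a linear scan over a list is invisible at small sizes and only shows up in latencies once the collection grows, whereas a set lookup stays flat.

```python
import time

def timed_lookups(container, probes):
    """Time membership checks: O(n) per lookup for a list, O(1) for a set."""
    start = time.perf_counter()
    for p in probes:
        _ = p in container
    return time.perf_counter() - start

for n in (1_000, 100_000, 1_000_000):
    items = list(range(n))
    probes = range(0, n, max(1, n // 1_000))  # ~1000 membership checks
    as_list = timed_lookups(items, probes)
    as_set = timed_lookups(set(items), probes)
    print(f"n={n:>9}: list {as_list:.4f}s vs set {as_set:.4f}s")
```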
How would I know? If my website doesn't have any monitoring and I use a payment system, shouldn't I automatically be notified when that payment system is down? What if it's down for a week? I think service-providing companies should always announce outages and even suspected outages.
For this reason I believe they would not be pointless if they were simply status pages, instead of "incident response pages". My hypothesis for why they are this way instead is that it would be too much transparency for some companies, for PR and legal reasons.
Those GitHub badges... they are as ugly as it gets.
But soon after, the legal/executive team apparently took ownership of them; the status pages no longer automatically show downtime/response time, and notice of when things are actually down can take a while.
So I think it's nice that there is at least one place where I can see if it's a problem on my end, or if it's global. It helps to remove some frustration at least.
However, I have a feeling that most companies are set up to download 50 MiB of dependencies on every run, so a website being down breaks the entire thing.
Now, 30 minutes later, I've refreshed the issue and see that my reply and the comment I was replying to (by another user) are both gone. Hopefully it's eventually consistent and these comments will re-appear later.
{ "code": 500, "message": "internal server error" }
Is anyone having any luck? Any workaround to fix it?
EDIT: Seems to be a routing issue. I've enabled a UK VPN and it's working fine now.
For engaged, happy engineers it's the equivalent of getting a surprise snow day. When you're grown up and have to go dig your car out of the snow, it's just a normal day with extra steps.
Not if you self-host Git
Self-hosting everything else GitHub does is harder, which is why they are building out all of those things: they don't want people to be able to move elsewhere so easily.
Hopefully these constant outages make more developers pissed off that issues are not stored in git as well, and get them to start working on tooling to solve this shitty problem once and for all.
P2P/Local First software for everyone! \o/
You can self-host the whole of GitHub, can't you?
edit: oopsie I misread.
Not a huge problem, unless it lasts for hours or, gasp, days.