
0 points · deathanatos · 7y ago · 0 comments

It's DNS, so it is somewhat inherently global. Route53 isn't region specific either, so I could see an issue with that having a global effect, too.


9 comments · 2 top-level

y0y · 7y ago · 7 in thread

DNS is also inherently distributed. This should make it resilient to all of the most common outage scenarios, and is likely why AWS offers a 100% uptime SLA for Route 53.

I'll be interested in the post-mortem from Azure on this one.

deathanatos (OP) · 7y ago

> likely why AWS offers a 100% uptime SLA for Route 53

Well, that's interesting. We occasionally see getaddrinfo() calls fail claiming domains that we know exist at the failure time (b/c the records are completely static) don't exist. (We've not got a reproducible case for this yet, and it's incredibly rare for any given VM/service. But across our fleet, it crops up fairly regularly.)
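One way to narrow down intermittent getaddrinfo() failures like these is to log the resolver's error code on every failure, since "name not found" and "temporary failure" point at different causes. A minimal Python sketch, not a diagnosis of this particular fleet; the names `classify_gaierror` and `resolve_logged` and the logger name are hypothetical:

```python
import logging
import socket
import time

log = logging.getLogger("dnscheck")  # hypothetical logger name


def classify_gaierror(err: socket.gaierror) -> str:
    """Map a getaddrinfo failure to a coarse label for later correlation."""
    code = err.args[0]  # first argument of gaierror is the EAI_* code
    if code == socket.EAI_NONAME:
        return "not_found"   # resolver reported the name as unknown
    if code == socket.EAI_AGAIN:
        return "temporary"   # resolver could not answer right now
    return "other"


def resolve_logged(host: str, port: int = 443):
    """Resolve host; on failure, log a timestamped record and return None."""
    try:
        return socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror as err:
        log.warning(
            "getaddrinfo failed at=%.3f host=%s kind=%s detail=%s",
            time.time(), host, classify_gaierror(err), err,
        )
        return None
```

Timestamps in these records can then be lined up against whatever logs exist on the resolver or authoritative side.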

donavanm · 7y ago

I used to work on Route 53 for a few years. I can't speak to your specific issue. Too much depends on your clients, your networks, your resolvers. But ... turn on query logging at a minimum. You should get a timestamp, qname, and rtype to identify NXDOMAIN responses.

That said, the most common cause of an authoritative NXDOMAIN is adding or deleting records and querying them before propagation is complete. You may want to log/poll your rrset change status separately to correlate.

The other is that, depending on the networks involved, intermediate DNS tampering happens all the time. Qname, rname, rtype, all get modified. Responses and queries are duplicated, intercepted, and manipulated. There's some good research on this out of DNS-OARC and from a dude out of Australia (IIRC).
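The suggestion to poll rrset change status could look roughly like the sketch below. It assumes a boto3 Route 53 client (`boto3.client("route53")`), whose `get_change` call reports a status of `PENDING` or `INSYNC` for a change ID returned by `change_resource_record_sets`. The function name `wait_for_insync` and its parameters are hypothetical:

```python
import time


def wait_for_insync(client, change_id, timeout_s=120.0, poll_s=5.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll a Route 53 change until it is INSYNC.

    `client` is assumed to be a boto3 Route 53 client (anything with a
    compatible get_change method works). Returns the elapsed seconds when
    the change reaches INSYNC, or None if timeout_s passes first.
    """
    start = clock()
    while True:
        status = client.get_change(Id=change_id)["ChangeInfo"]["Status"]
        elapsed = clock() - start
        if status == "INSYNC":
            return elapsed
        if elapsed >= timeout_s:
            return None
        sleep(poll_s)
```

Logging the submit time and the returned elapsed time next to any NXDOMAIN timestamps shows whether the failures fall inside a propagation window.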

cthalupa · 7y ago

> We occasionally see getaddrinfo() calls fail claiming domains that we know exist at the failure time (b/c the records are completely static) don't exist.

That could be whatever resolvers you're hitting failing rather than an issue with Route 53 authoritative nameservers, though. The resolving DNS servers in EC2 are not actually part of Route 53, for example.
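To tell a resolver-side failure from an authoritative one, the same question can be sent to both and the response codes compared. A rough standard-library sketch under stated assumptions: it speaks plain DNS over UDP, ignores truncation, and does not validate the response ID. The helper names `build_query`, `rcode_of`, and `ask` are hypothetical, and the server IPs are whatever resolver and authoritative nameserver addresses apply in your setup:

```python
import socket
import struct

RCODES = {0: "NOERROR", 2: "SERVFAIL", 3: "NXDOMAIN", 5: "REFUSED"}


def build_query(name, qtype=1, query_id=0x1234, recursion_desired=True):
    """Build a single-question DNS query (qtype 1 = A, class IN)."""
    flags = 0x0100 if recursion_desired else 0  # RD bit only
    header = struct.pack(">HHHHHH", query_id, flags, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack(">HH", qtype, 1)


def rcode_of(response):
    """Extract the response code from the low 4 bits of the flags word."""
    (flags,) = struct.unpack(">H", response[2:4])
    code = flags & 0x000F
    return RCODES.get(code, str(code))


def ask(server_ip, name, timeout_s=2.0, recursion_desired=True):
    """Send one A query to server_ip over UDP and return the rcode label."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout_s)
        query = build_query(name, recursion_desired=recursion_desired)
        sock.sendto(query, (server_ip, 53))
        data, _ = sock.recvfrom(4096)
    return rcode_of(data)
```

If the authoritative server answers NOERROR while the resolver returns NXDOMAIN or SERVFAIL for the same name at the same time, the problem is more likely on the resolver path. When asking an authoritative server directly, pass `recursion_desired=False`.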


leesalminen · 7y ago

We’ve experienced the same thing. I’ve never been able to figure it out. If you ever do, please let me know! I’ll owe you a beer ;)

hfern · 7y ago

You may be hitting EC2 DNS rate limits.


el_duderino · 7y ago

Do they typically provide a postmortem?

crankylinuxuser · 7y ago

It's Microsoft. I'm sure they just rebooted it!

(I had to, see username!)

edit: seriously, -3? It was a joke.

aioprisan · 7y ago

Sure, but that's hypothetical, and I don't recall AWS having any such issue in recent history.
