An article sticking around too long on the home page. Semi-stale data creeping into your pipeline. Someone's security token being accepted post-revocation. All really hard to spot unless (1) you're explicitly looking, or (2) manure hits the fan.
- Using asynchronous database replication and reading data from database slaves (a toy sketch of this case follows the list)
- Duplicating the same data across multiple database tables (possibly for performance reasons)
- Having an additional system that duplicates some data; for example, being in the middle of rewriting a legacy system, where the rewrite was split into phases so that functionality between the new and old systems overlaps for some period of time.
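To make the first scenario concrete, here's a minimal sketch using in-memory dicts as stand-ins for a primary and an asynchronously updated replica (all the class and variable names here are my own, purely for illustration). A read routed to the replica can return stale data for as long as the replication lag window lasts:

```python
# Toy model of asynchronous replication: writes go to the primary,
# a background job copies them to the replica after a delay, and
# reads from the replica are stale until the job catches up.

import threading
import time

class Primary:
    def __init__(self):
        self.data = {}
        self.log = []  # append-only replication log

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))

class Replica:
    def __init__(self, primary, lag_seconds):
        self.primary = primary
        self.lag = lag_seconds
        self.data = {}
        self.applied = 0  # index into the primary's log

    def replicate_forever(self):
        while True:
            time.sleep(self.lag)  # simulated replication lag
            pending = self.primary.log[self.applied:]
            for key, value in pending:
                self.data[key] = value
            self.applied += len(pending)

    def read(self, key):
        return self.data.get(key)

primary = Primary()
replica = Replica(primary, lag_seconds=0.5)
threading.Thread(target=replica.replicate_forever, daemon=True).start()

primary.write("headline", "new article")
print(replica.read("headline"))  # likely None: the replica hasn't caught up yet
time.sleep(1)
print(replica.read("headline"))  # "new article", once the lag window has passed
```

Real replication is far more involved, but the failure mode is the same shape: nothing errors, the read just quietly returns yesterday's answer.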
Based on my experience, I always assume that inconsistency is unavoidable when the same information is stored in more than one place.
The long-listen-queue -> multiple-queued-up-retries feedback loop is a classic: TCP/IP "congestion collapse" (RFC 896, https://datatracker.ietf.org/doc/html/rfc896) and the 1986 Internet meltdown, described in various sources.
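The usual way to break that loop (a generic sketch of the mitigation, not anything specific to RFC 896; the function and its parameters are hypothetical) is to make retries back off exponentially with jitter, so the queued-up retries don't themselves become the load that keeps the server overloaded:

```python
# Capped exponential backoff with "full jitter": each retry waits a random
# time in [0, min(cap, base * 2^attempt)], so retries from many clients
# spread out instead of arriving in synchronized waves.

import random
import time

def call_with_backoff(request, max_attempts=5, base_delay=0.1, cap=10.0):
    """Retry `request` with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return request()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            delay = random.uniform(0, min(cap, base_delay * 2 ** attempt))
            time.sleep(delay)

# Example: a flaky call that fails twice before succeeding.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError("server overloaded")
    return "ok"

print(call_with_backoff(flaky))  # "ok", after two backed-off retries
```

Without the backoff, every timeout immediately adds another request to an already-full queue, which is exactly the amplification the RFC describes.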
The fact that there are at least three Twitter clones that are less well put together, have a decent number of users, and are handling the load proves that it is possible.
Obviously, you cannot copy decades of improvements and scaling lessons unless someone has turned those parts into a product you can actually use.
I'd imagine most lost knowledge is not the result of an explicit decision, however, which means such historical scenarios, documentation, and so on are simply lost in the course of doing business. Lost knowledge is the default for companies.
Twitter is likely better than most, given that their documentation is all digital and explicit processes exist to catalogue such incidents. I'd also be curious to see how much of this knowledge has been implicitly exported to their open source codebases.
As you say, the default tendency in many companies when failures occur is information loss. That can be attributed to using too many communication tools, cultural expectations that problems should be hidden, siloed or disparate documentation stores, or a lack of process.
Intentional, open, thorough and replicated note-taking with cross-references before, during and after incidents can create radically different environments which allow for querying, recovery and improvement regardless of failure mode(s). Kudos to Dan for moving in that direction with these writeups (and to you for raising the subtext).
> There are only three hard things in Computer Science: cache invalidation and naming things.
Horrifically inappropriate inclusion of PII in this post. Didn’t someone at legal go through this?
> Wolfstar_Bachi @tigertwo
> Wolfstar is an online and social media PR agency that specialises in helping some of the world’s best companies to communicate more effectively.