An article sticking around too long on the home page. Semi-stale data creeping into your pipeline. Someone's security token being accepted post-revocation. All really hard to spot unless (1) you're explicitly looking, or (2) manure hits the fan.
- Using asynchronous database replication and reading data from database slaves (a toy sketch of this case follows the list)
- Duplicating the same data across multiple database tables (possibly for performance reasons)
- Having an additional system that duplicates some data; for example, being in the middle of rewriting a legacy system, where the rewrite was split into phases so that functionality between the new and old systems overlaps for some period of time.
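To make the first scenario concrete, here's a minimal sketch using in-memory dicts as stand-ins for a primary and an asynchronously updated replica (all the class and variable names here are my own, purely for illustration). A read routed to the replica can return stale data for as long as the replication lag window lasts:

```python
# Toy model of asynchronous replication: writes go to the primary,
# a background job copies them to the replica after a delay, and
# reads from the replica are stale until the job catches up.

import threading
import time

class Primary:
    def __init__(self):
        self.data = {}
        self.log = []  # append-only replication log

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))

class Replica:
    def __init__(self, primary, lag_seconds):
        self.primary = primary
        self.lag = lag_seconds
        self.data = {}
        self.applied = 0  # index into the primary's log

    def replicate_forever(self):
        while True:
            time.sleep(self.lag)  # simulated replication lag
            pending = self.primary.log[self.applied:]
            for key, value in pending:
                self.data[key] = value
            self.applied += len(pending)

    def read(self, key):
        return self.data.get(key)

primary = Primary()
replica = Replica(primary, lag_seconds=0.5)
threading.Thread(target=replica.replicate_forever, daemon=True).start()

primary.write("headline", "new article")
print(replica.read("headline"))  # likely None: the replica hasn't caught up yet
time.sleep(1)
print(replica.read("headline"))  # "new article", once the lag window has passed
```

Real replication is far more involved, but the failure mode is the same shape: nothing errors, the read just quietly returns yesterday's answer.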
Based on my experience, I always assume that inconsistency is unavoidable when the same information is stored in more than one place.
The long-listen-queue -> multiple-queued-up-retries feedback loop is a classic: TCP/IP "congestion collapse" (RFC 896, https://datatracker.ietf.org/doc/html/rfc896) and the 1986 Internet meltdown, described in various sources.
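The usual way to break that loop (a generic sketch of the mitigation, not anything specific to RFC 896; the function and its parameters are hypothetical) is to make retries back off exponentially with jitter, so the queued-up retries don't themselves become the load that keeps the server overloaded:

```python
# Capped exponential backoff with "full jitter": each retry waits a random
# time in [0, min(cap, base * 2^attempt)], so retries from many clients
# spread out instead of arriving in synchronized waves.

import random
import time

def call_with_backoff(request, max_attempts=5, base_delay=0.1, cap=10.0):
    """Retry `request` with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return request()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            delay = random.uniform(0, min(cap, base_delay * 2 ** attempt))
            time.sleep(delay)

# Example: a flaky call that fails twice before succeeding.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError("server overloaded")
    return "ok"

print(call_with_backoff(flaky))  # "ok", after two backed-off retries
```

Without the backoff, every timeout immediately adds another request to an already-full queue, which is exactly the amplification the RFC describes.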
The fact that there are at least three Twitter clones that are less well put together, have a decent number of users, and are handling the load proves that it is possible.
Obviously, you cannot copy decades of improvements and scaling lessons unless someone has turned those parts into a product you can actually use.
I'd imagine most lost knowledge is not the result of an explicit decision, however, which means such historical scenarios, documentation, and so on are simply lost in the course of doing business. Lost knowledge is the default for companies.
Twitter is likely better than most, given that their documentation is all digital and explicit processes exist to catalogue such incidents. I'd also be curious to see how much of this knowledge has been implicitly exported to their open source codebases.
As you say, the default tendency in many companies when failures occur is information loss. That can be attributed to using too many communication tools, cultural expectations that problems should be hidden, siloed or disparate documentation stores, or a lack of process.
Intentional, open, thorough and replicated note-taking with cross-references before, during and after incidents can create radically different environments which allow for querying, recovery and improvement regardless of failure mode(s). Kudos to Dan for moving in that direction with these writeups (and to you for raising the subtext).
> There are only three hard things in Computer Science: cache invalidation and naming things.
Horrifically inappropriate inclusion of PII in this post. Didn’t someone at legal go through this?
> Wolfstar_Bachi @tigertwo
> Wolfstar is an online and social media PR agency that specialises in helping some of the world’s best companies to communicate more effectively.