- Their database permissions changed unexpectedly (??)
- This caused a 'feature file' to be changed in an unusual way (?!)
- Their SQL query made assumptions about the database; their permissions change thus resulted in queries getting additional results, permitted by the query
- Changes were propagated to production servers which then crashed those servers (meaning they weren't tested correctly)
- They hit an internal application memory limit and that just... crashed the app
- The crashing did not result in an automatic backout of the change, meaning their deployments aren't blue/green or progressive
- After fixing it, they were vulnerable to a thundering herd problem
- Customers who were not using bot rules were not affected; CloudFlare's bot-scorer generated a constant bot score of 0, meaning all traffic is bots
In terms of preventing this from a software engineering perspective, they made assumptions about how their database queries work (and didn't validate the results), and they ignored their own application limits and didn't program in either a test for whether an input would hit a limit, or some kind of alarm to notify the engineers of the source of the problem.From an operations perspective, it would appear they didn't test this on a non-production system mimicing production; they then didn't have a progressive deployment; and they didn't have a circuit breaker to stop the deployment or roll-back when a newly deployed app started crashing.