Good observability on operational metrics, as much automation as possible, clear processes and playbooks for escalations.
Most engineering orgs I’ve seen require Eng oncalls. The perspective above is a simple non expensive way to minimize time investigations eg 5% movement of a metric. Teams spend lots of time asking “is this a real world effect, if so, should we be worried?”