Well, at first I was able to gather and correlate enough CPU, temperature, and entrypoint data for the apparently problematic servers.
The servers were shutting down due to high temperatures caused by persistently high CPU usage.
Knowing that, I installed Datadog with APM on just a couple of the servers (because $$), which led me to Postgres issues (indexing), WeasyPrint PDF generation issues (a Python lib), and some bad Django code (converting a queryset to a list before pagination).
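
To illustrate that last point, here is a rough sketch of why calling `list()` on a queryset before paginating hurts. This is not the actual code from those servers: `LazyRows`, `page_slow`, and `page_fast` are made-up names, and `LazyRows` is only a pure-Python stand-in that mimics how a Django queryset stays lazy until it is iterated or sliced.

```python
class LazyRows:
    """Hypothetical stand-in for a lazy queryset; counts rows 'fetched'."""

    def __init__(self, total):
        self.total = total
        self.fetched = 0  # how many rows were pulled from the "database"

    def count(self):
        # Comparable to SELECT COUNT(*): no rows are loaded.
        return self.total

    def __getitem__(self, key):
        # Comparable to LIMIT/OFFSET: only the requested slice is loaded.
        if not isinstance(key, slice):
            raise TypeError("this sketch only supports slices")
        rows = list(range(self.total)[key])
        self.fetched += len(rows)
        return rows

    def __iter__(self):
        # Iterating the whole thing loads every row.
        self.fetched += self.total
        return iter(range(self.total))


def page_slow(rows, page, per_page):
    # Anti-pattern: list() forces every row into memory, then slices in Python.
    everything = list(rows)
    start = (page - 1) * per_page
    return everything[start:start + per_page]


def page_fast(rows, page, per_page):
    # Slice the lazy object directly so only one page of rows is loaded.
    start = (page - 1) * per_page
    return rows[start:start + per_page]


slow, fast = LazyRows(10_000), LazyRows(10_000)
assert page_slow(slow, 2, 25) == page_fast(fast, 2, 25)
print(slow.fetched, fast.fetched)
```

In real Django the equivalent fix is usually to hand the queryset itself to the paginator rather than `list(queryset)`, so the paginator can issue a count query and a sliced query per page instead of loading the whole table on every request.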