Even with, say, n=5, P2P pushes the required upstream bandwidth beyond what is practical under tight realtime constraints on a surprising proportion of home and office connections, especially when Wi-Fi is involved. An SFU makes the upstream bandwidth requirement constant rather than proportional to the number of other participants in the conference.
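To put rough numbers on that, here is a minimal back-of-the-envelope sketch. The 1500 kbps per-stream bitrate is an assumed figure picked only for illustration, the function names are hypothetical, and it assumes each participant sends a single encoding (no simulcast or layered coding).

```python
# Per-participant upstream bandwidth: full-mesh P2P vs. SFU.
# ASSUMPTION: 1500 kbps per outgoing video stream is an illustrative
# figure, not a measurement. Single encoding per sender, no simulcast.

ASSUMED_STREAM_KBPS = 1500.0


def mesh_upstream_kbps(n: int, stream_kbps: float = ASSUMED_STREAM_KBPS) -> float:
    """Full mesh: each participant uploads one copy to each of the other n-1 peers."""
    if n < 2:
        raise ValueError("a conference needs at least 2 participants")
    return (n - 1) * stream_kbps


def sfu_upstream_kbps(n: int, stream_kbps: float = ASSUMED_STREAM_KBPS) -> float:
    """SFU: each participant uploads one copy to the server, whatever n is."""
    if n < 2:
        raise ValueError("a conference needs at least 2 participants")
    return stream_kbps


if __name__ == "__main__":
    for n in (2, 3, 5, 8):
        print(f"n={n}: mesh {mesh_upstream_kbps(n):.0f} kbps up, "
              f"SFU {sfu_upstream_kbps(n):.0f} kbps up")
```

Under that assumed bitrate, n=5 means 6000 kbps of sustained upstream per participant in a mesh versus 1500 kbps through an SFU, and the gap widens linearly with every extra participant.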
There's also the fact that a significant proportion of offices, and an increasing proportion of homes, simply cannot establish the needed P2P connectivity because of double NAT, firewalls, and so on. For them you'll be providing a TURN server, which brings back the server-side bandwidth requirement anyway.
The bandwidth costs, if you move off the clouds, are actually quite manageable compared to the compute costs you take on if you try to decode and re-encode on the servers.
If we are talking about doing realtime video at large scale, we can assume that there is good ops competency in-house. If not, honestly, what are you doing?
Managing fleets of servers across several of the "automation-friendly-but-not-cloud" providers using modern tools is not difficult for some moderately-expensive-but-not-as-expensive-as-cloud-services ops engineers.