SFUs aren't forwarding everything. Using either discrete simulcast layers or SVC they're forwarding one size out of several (usually 3) to each other participant.
Bandwidth will be a high cost no matter what (and anyone planning to scale up real-time video better be prepared to move off the cloud at some point) but needing compute resources for encode/decode on the backend makes large scale realtime video infeasible for any reasonable cost. The only successful large scale deployments push most compute work to the clients.
An extra decode/encode step also adds latency and the total latency budget before UX is impacted is small.