Peer-to-peer communication in WhatsApp, in the network topology sense, happens where possible for voice and video calls. This is probably WebRTC-derived (it is WebRTC in everything else these days), which concretely means some kind of call signalling followed by p2p setup to talk RTP directly if possible. The media is not protected by Signal Protocol or Noise itself: it is most likely the S in SRTP, with key agreement done over the Signal Protocol. In other words, there is no key ratcheting between voice or video packets, and I'm actually not sure the session key is ever changed during a given call. To make this clear: call setup happens via a central server, but the media streams go from your IP to theirs directly if possible (or are proxied via WhatsApp if not). The reason for doing calls p2p where possible is to reduce latency.
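To illustrate the "direct if possible, proxied if not" part, here is a minimal sketch of the decision. It is not WhatsApp's code: real WebRTC stacks do this with ICE, using STUN to discover addresses and TURN as the relay, and the `Candidate` type, `choose_media_path` function and the `is_reachable` check are all names I made up for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    """A possible media path (hypothetical type, for illustration only)."""
    kind: str     # "direct" (peer's own address) or "relay" (server-proxied)
    address: str  # "ip:port"


def choose_media_path(
    candidates: Iterable[Candidate],
    is_reachable: Callable[[Candidate], bool],
) -> Optional[Candidate]:
    """Return the first reachable candidate, trying direct paths before relays."""
    # sorted() is stable and False sorts before True, so direct candidates
    # come first and the original order is kept within each group.
    ordered = sorted(candidates, key=lambda c: c.kind != "direct")
    for candidate in ordered:
        if is_reachable(candidate):
            return candidate
    return None  # no usable path: the call cannot carry media


# Example addresses come from the documentation ranges, not real hosts.
candidates = [
    Candidate("relay", "198.51.100.7:3478"),
    Candidate("direct", "203.0.113.5:50000"),
]

# Both paths work: the direct one wins, so media goes peer to peer.
print(choose_media_path(candidates, lambda c: True).kind)  # direct

# A NAT or firewall blocks the direct path: fall back to the relay.
print(choose_media_path(candidates, lambda c: c.kind == "relay").kind)  # relay
```

Either way the signalling that exchanges these candidates goes through the central server; only the media path changes.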
This is also, last time I looked, true of Signal. We are good at end-to-end text. We are less good at voice/video, particularly voice/video group calls that might not be p2p-able and rather require the server to do something with the RTP streams.
Now, what you're actually missing is that WhatsApp was, in its early days, based on a fork of ejabberd, the Erlang XMPP server, with (if I understand correctly) custom extensions. So at some stage WhatsApp actually was somewhat compatible with open standards.
We've also kinda been here before. Google Talk used to interoperate with XMPP just fine and at one stage my own XMPP server could talk to my friends on Google Talk and they'd pretty much not notice.
I agree, however, that it would be better to have a new protocol that starts from end-to-end key agreement like Signal/Noise, rather than using XMPP. Or perhaps use XMPP _inside_ this protocol. This is because "opt-in" crypto is a disaster that has probably already happened. Signal and Noise also don't specify what the body of those messages should look like, or standards for negotiating things like calls, media transfer and so on: basically all the non-crypto parts.