Heh, welcome to the world of livestreaming media. The reason it's hard to create this kind of simple "stream in, stream out" abstraction is that most IP voice/video stacks are architected very differently from the stateless network protocols that are popular today. IP streaming generally works in two phases:
1. A signaling layer that helps set up the connection metadata (a layer where the sender can say they're the sender, that they'll be sending data to port n, that the data will be encoded using codec foo, etc.)
2. Media streams that are opened based on the metadata transferred over the signaling layer, which are usually just streams of encoded packets pushed over the wire as fast as the media source and the network allow.
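The two-phase pattern above can be sketched in a few lines of Python. This is a toy, hypothetical wire format (the `OFFER`/`ANSWER` strings are made up, not any real protocol): the signaling phase runs over TCP and advertises which UDP port the media should be pushed to, then the media phase just fires packets at that port.

```python
import socket
import threading

def receiver(sig_srv, results):
    # Media socket, bound ahead of time so its port can be advertised
    # over the signaling channel.
    media = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    media.bind(("127.0.0.1", 0))
    media_port = media.getsockname()[1]

    # Phase 1: signaling. Read the sender's offer, answer with the
    # UDP port the media stream should be sent to.
    conn, _ = sig_srv.accept()
    results["offer"] = conn.recv(1024).decode()
    conn.sendall(f"ANSWER media_port={media_port}".encode())
    conn.close()

    # Phase 2: media. Packets arrive as fast as the sender pushes them.
    results["packets"] = [media.recvfrom(1500)[0] for _ in range(3)]
    media.close()

# Signaling listener is set up first so the sender can connect to it.
sig_srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sig_srv.bind(("127.0.0.1", 0))
sig_srv.listen(1)
sig_port = sig_srv.getsockname()[1]

results = {}
t = threading.Thread(target=receiver, args=(sig_srv, results))
t.start()

# Sender side: signal first ("I'll be sending codec foo"), then stream.
sig = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sig.connect(("127.0.0.1", sig_port))
sig.sendall(b"OFFER codec=foo")
media_port = int(sig.recv(1024).decode().split("=")[1])
sig.close()

media = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
for i in range(3):
    media.sendto(f"frame-{i}".encode(), ("127.0.0.1", media_port))
media.close()

t.join()
sig_srv.close()
```

Note how the two channels are completely separate sockets: a NAT or proxy in the middle that only sees the TCP signaling connection has no idea the UDP media flow even exists, which is exactly what makes these stacks fragile.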
Most IP media stacks (RTSP, RTMP, WebRTC, SIP, XMPP, Matrix, etc.) follow this same pattern. This is different from "modern" protocols like HTTP, where signaling is bound together with data using framing (e.g. HTTP headers for signaling vs. the HTTP request/response body for data). This design makes IP media stacks especially fragile to NAT connectivity issues and especially hard to proxy. There are typically good reasons for the split (latency, non-blocking reads, head-of-line blocking, etc.), but these "good reasons" are becoming less good as innovations in lower networking layers (like QUIC or TCPLS) create conditions that make it much easier to organize IP media in a manner more similar to HTTP. Hopefully one day you'll just be able to take IP media streams and "convert" or "proxy" them from one format to another.
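For contrast, here's what "signaling bound together with data using framing" looks like in HTTP: the headers and the payload travel on one byte stream, separated by a blank line and delimited by `Content-Length`, so any middlebox that can parse the framing sees the whole picture.

```python
# One byte stream carries both the "signaling" (headers) and the
# "data" (body); framing, not a second connection, separates them.
raw = (
    b"POST /media HTTP/1.1\r\n"
    b"Content-Type: video/foo\r\n"   # signaling: what the payload is
    b"Content-Length: 7\r\n"         # framing: where the payload ends
    b"\r\n"
    b"frame-0"                        # data: the payload itself
)

# A proxy can recover both halves from the single stream.
head, body = raw.split(b"\r\n\r\n", 1)
headers = dict(line.split(b": ", 1) for line in head.split(b"\r\n")[1:])
assert len(body) == int(headers[b"Content-Length"])
```

That self-describing framing is what makes HTTP trivial to proxy, and what the split signaling/media design of IP media stacks gives up.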