Total audio delay is the sum of record buffering, sampling (typically 20 ms samples), encoding, packetization (1-5 samples per packet), time in transit, decoding, the jitter buffer, and the playout buffer.
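To get a feel for how those stages add up, here is a rough sketch in Python. Only the 20 ms sample size and the 1-5 samples per packet come from the list above; every other number is a made-up placeholder, not a measurement, and the function name is hypothetical. "Frame" in the code is what I've been calling a sample.

```python
def one_way_delay_ms(
    frame_ms=20,            # sample (frame) duration, from the text
    frames_per_packet=1,    # 1-5 per packet, from the text
    record_buffer_ms=10,    # assumed placeholder
    encode_ms=5,            # assumed placeholder
    transit_ms=40,          # assumed placeholder
    decode_ms=2,            # assumed placeholder
    jitter_buffer_ms=40,    # assumed placeholder
    playout_buffer_ms=10,   # assumed placeholder
):
    """Sum the per-stage delays for one direction of an audio stream."""
    # The sender cannot emit a packet until every frame in it has been
    # captured, so sampling plus packetization costs this much.
    packetization_ms = frame_ms * frames_per_packet
    return (
        record_buffer_ms
        + packetization_ms
        + encode_ms
        + transit_ms
        + decode_ms
        + jitter_buffer_ms
        + playout_buffer_ms
    )


if __name__ == "__main__":
    for n in (1, 3, 5):
        print(n, "frame(s) per packet:", one_way_delay_ms(frames_per_packet=n), "ms")
```

With these placeholder numbers, going from one to five samples per packet adds 80 ms, all of it from waiting to fill the packet.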
You could reduce the sample size and send fewer samples per packet to reduce total delay, but the packet header overhead goes way up (it is near 50% at 20 ms samples, one per packet). In theory, you should be able to do something nice for people sending both audio and video by carrying the audio in the video packets, but it's not simple, so I think most conferencing systems don't do it.
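The overhead trade-off is easy to check with a little arithmetic. The sketch below assumes plain IPv4 + UDP + RTP headers (20 + 8 + 12 = 40 bytes, no options or extensions) and, as an assumption of mine, a 16 kbit/s codec, which happens to give the roughly 50% figure at 20 ms samples, one per packet. Other codecs will give different numbers, and the function names are just for illustration.

```python
HEADER_BYTES = 20 + 8 + 12  # IPv4 + UDP + RTP, no options or extensions


def payload_bytes(bitrate_bps, frame_ms, frames_per_packet):
    """Encoded audio bytes carried in one packet."""
    return bitrate_bps * frame_ms * frames_per_packet // 8000


def header_overhead(bitrate_bps, frame_ms=20, frames_per_packet=1):
    """Fraction of each packet that is header rather than audio."""
    payload = payload_bytes(bitrate_bps, frame_ms, frames_per_packet)
    return HEADER_BYTES / (HEADER_BYTES + payload)


if __name__ == "__main__":
    # 16 kbit/s is an assumed codec rate, not something stated in the post.
    for frame_ms, n in ((20, 5), (20, 1), (10, 1)):
        pct = 100 * header_overhead(16000, frame_ms, n)
        print(f"{frame_ms} ms x {n} per packet: {pct:.0f}% overhead")
```

Under those assumptions, five 20 ms samples per packet is about 17% overhead, one per packet is 50%, and dropping to 10 ms samples pushes it to about 67%, which is the "overhead goes way up" effect.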