Technically you wouldn't have to re-render (i.e. recompress); you could just dynamically splice in different compressed audio frames. AAC (what MP4 uses) handles this well, MP3 less so, but for the vast majority of listeners it's still indiscernible, especially if there's a slight audio gap. The CPU load is de minimis, though you do still need to track frame boundaries--you can't just move blobs from disk to socket using sendfile(2).

I did this for a radio streaming server that supported per-listener ads: everybody received the same content stream (or at least, one of several codec+format pipelines, e.g. RTP+AAC, FLV+MP3, etc.), but when ad spots were detected they were dynamically swapped with different spots that varied per output stream. From the perspective of downstream software clients it was still always one long, unbroken encoded stream.

I never got around to supporting compressed video frame splicing, but it would work the same way, modulo the fact that key frames tend to be spaced farther apart, so it wouldn't always be so seamless with ad hoc content (you'd definitely want a brief screen blackout or carefully placed key frames for the transition, like you often see in broadcast television).
But while this is conceptually easy, I've never seen other streaming servers do it, certainly not open source ones. And of course the most obvious, simplest approach is to just pre-render a bunch of static files and serve them from a dumb HTTP server.