There could be a way for the responder to signal where the content they are answering starts, perhaps with some sort of fuzzy automation in the future. I have strong doubts about the actual experience of this for the listener, but maybe that's solvable.
I meant the situation where I have already consumed the whole recording, but it gets a response later on.
I do not have a mental model for context being logically attached to the response. Do you think of response+context as a single valid piece of content?