While I have not been this deep into the inner workins, I personally have skimmed that rabbit hole when I started a side-project that was essentially grep but for (mainly) MKV files.
Idea was that you could "grep" by specific text in the subtitles and automatically create a clip of every occurrence of the text (by looking at the subtitle timing and padding that in both directions).
The biggest source of my frustration was that I was unable to get the clipping to work exactly as I wanted, where the start or end of the clip would seemingly drift back and forth. That was until I realized it boiled down to how the different seek modes in ffmpeg handled keyframes.
I still haven't gotten the clipping to work exactly as I want but I figured doing two passes might be the way to go: first pass would do a fuzzy match and ensure there is enough extra on both ends of the desired clip and the second pass could re-encode the fuzzy-matched clip to shuffle the keyframes around, allowing more accurate clipping.