There's also the issue that if we make it search for distinct frames (like "scene change" type things), some safe content could end up auto-moderated, which would be another controversy. Perhaps they could mark start/stop times for the inappropriate parts of the source video and then check later uploads for any of those segments, but then instead of asking human moderators to judge "is this video good or bad" we're forcing them to record timestamps of when the bad elements occur, which means they'd have to actually watch the video.
Dunno, I'm of the belief that it's just not an easily solved problem. I hope I'm wrong and they can toss money at the problem to fix it, but I genuinely don't know how they could.