I find it surprisingly hard to judge, because the editing style is so obnoxious and poorly executed. I think it’s done by a human, but the speech and editing are both sufficiently lousy that I’m not confident from this video alone, though another couple of videos I tried were definitely somewhat better. The aggressive cutting style is just bad editing, and the speaker has spoken each phrase independently with no attempt to bridge them (which a competent speaker would do). The first ten seconds are particularly grating, with thoroughly unnatural emphasis on the start of most syllables, almost as though each word or even syllable had been spoken independently and then glued together (e.g. And, A, Light, Weight, Rust, Back, End). The prosody is also regularly quite a bit off, and it’s hard to determine whether that’s related to the bad editing and speech segmentation, or an independent issue. There’s enough that feels natural in things like intonation that I don’t think it’s TTS, but it’s also making a lot of the sorts of errors that even the best TTS engines habitually make.
I get the impression that it’s a human who doesn’t speak or edit particularly well but normally gets away with it (as I say, other videos seem to be better, though they still suffer from the aggressive cut style), and who compromised harshly on quality in this case in order to squish more into the hundred seconds… and still ended up 50% over time.
I very strongly dislike the general style.