chrismorgan · 4y ago · 0 points

I find it surprisingly hard to judge, because the edit style is so obnoxious and poorly executed. I think it’s done by a human, but the speech and editing are both sufficiently lousy that I’m not confident from this video alone—though another couple of videos I tried were definitely somewhat better. The aggressive cutting style is just bad editing, and the speaker has spoken each phrase independently with no attempt to bridge them (which a competent speaker would do). The first ten seconds are particularly grating, with thoroughly unnatural emphasis on the start of most syllables, almost as though each word or even syllable had been spoken independently and then glued together (e.g. And, A, Light, Weight, Rust, Back, End). The prosody is also regularly quite a bit off, and it’s hard to determine if that’s related to the bad editing and speech segmentation, or an independent issue. There’s enough that feels natural in things like intonation that I don’t think it’s TTS, but it’s also making a lot of the sorts of errors that even the best TTS engines habitually make.

I get the impression that it’s a human who doesn’t know how to speak or edit particularly well but normally gets away with it not being too bad (as I say, other videos seem to be better, though they still suffer from the aggressive cut style), but compromised harshly on quality in this case in order to squish more into the hundred seconds… and still ended up 50% over time.

I very strongly dislike the general style.

3 comments · 2 top-level

jeffhuys · 4y ago

I guess that happens when you try to fit everything in 100 seconds.

Also, it will become almost literally impossible to hear the difference in the near future, so I personally wouldn't keep assuming it's one or the other.

Also also, did you give this feedback to Jeff?

chrismorgan (OP) · 4y ago

> I guess that happens when you try to fit everything in 100 seconds.

What, you overshoot by 50%? :-) But more seriously, the job that has been done is simply low-quality, even disregarding the choppy edit style—probably a mixture of shoddy work (from comparing it with some of the others), and the creator not being capable of better as regards the speech—most aren’t particularly aware of how to improve. Much better is readily possible.

> Also, it will become almost literally impossible to hear the difference in the near future

I am an excellent reader, highly regarded for my diction and prosody and for conveying the sense of a matter. So far, I haven’t heard a TTS demo, hand-picked or otherwise, that would stand a chance against a skilled reader: the veriest dunce would savvy which was which within a couple of sentences. (And at least a couple of the demos were deliberately trying to address these sorts of shortcomings in inflection and such.) Certainly they’ve reached the stage where you often can’t reliably identify the computer versus the mediocre reader (and there are an awful lot of them), but I don’t think we’ll see computers beating skilled readers and speakers any time soon, when I consider how poor a job AI is still doing on coherent longform writing, and then add the degrees of nuance and information conveyed in speech. (Supplant them? Sure. But you don’t have to be better than something to supplant it, just cheaper, or some such thing. My favourite example of this is book binding where hot melt glue is vastly inferior to cold glue, but it’s much faster and thus cheaper to produce with, and it’s good enough that you’ll struggle to find cold glue ever used in production now.)

syzygyhack · 4y ago

Different strokes. I love Jeff's style and video pacing, though I don't think this is one of his best.
