undefined | Better HN

0 pointsthfuran3y ago0 comments

>~97% accuracy over hour-long conversations. I'm sure it's been an absolute godsend for law enforcement

97% accuracy means roughly three or four errors per minute of speech. That seems potentially extremely problematic for something like law enforcement use where decisions with significant impact on people's day and/or life might be made on the basis of "evidence".

0 comments

anigbrowl3y ago

No it isn't. That just means 2-3% of your content needs to be double-checked by a person at the audio level, saving huge amounts of time - equally true of human transcription, in which individual words are often [UNINTELLIGEBLE].

Would you want to review this fully before going into court, absolutely - because you'd want to play the recording to a jury for emotional impact. Can you rely on it when you want to quickly read through hours of conversation and make decisions about whether to invest further resources (which might just mean another hour of listening back to the original audio)? Also absolutely. Bear in mind that a lot of these errors have little to no semantic impact, being on the same level as typos or misspellings in a written communication.

Bear in mind too that if law enforcement (honest or not) is so interested in you that they're willing to record your conversations, your day is already ruined, you just don't know it yet. The change here is one of scale rather than quality.

wging3y ago

Doesn't it mean 100% of your content needs to be double-checked? You can't easily identify which 2-3% of your content has errors. I'm aware that errors are more likely when the model is less confident of its predictions, but that shouldn't be enough.

(edit for clarification: errors are not always something like "[UNINTELLIGIBLE]", where the system knows it doesn't know; they can also be misrecognitions that the system believes in with high confidence.)

anigbrowl3y ago

By the time you're prosecuting someone in court, yes of course you double, triple, quadruple check everything. That's why lawyers get paid the big bucks (for now...). But yes you can identify which content probably has errors and flag it as such.

Look, I have decades of experience dealing with human speech, and not just as an editor - I can trace the human voice from neural impulses in Broca's region through the physiology of vocal production, mechanical transduction into electrical signals, discrete fourier transforms of the resultant waveforms into spectral information and back again, the reproduction of altered signals from time-aligned speakers to create a sense of spatialization, how those are processed in the human ear, and how the cilia are connected by nerves back to your brain. I'm a good enough editor that I can recognize many short words by sight of a waveform, or make 10 edits in a row by sight and know it will sound good on playback.

So when I say that machine transcription is as good as human realtime transcription now, I say so with the clear expectation that those decades of craft are very close to being rendered obsolete. I absolutely expect to hand off the mechanical part of editing to a machine within 2 years or so. It's already at the stage where I edit some interviews as text, like in a word processor, and then export the edited document as audio and it's Good Enough - not for every speaker, but more than half the time.

NPR and a lot of commercial broadcasters cut their material this way already, because you can get the same result from 30 minutes of reading and text editing that would require 3 hours of pure audio editing with no transcription.

2 more replies

etienne6183y ago

Presumably you can use the 97% that is correctly transcribed to rapidly filter out the relevant content. This is likely to be only a small portion of the total content. Then you check 100% of that.

woah3y ago

You double check things that you think are important, in this case, passages that will be used as evidence in court.

6gvONxR4sf7o3y ago

> I'm aware that errors are more likely when the model is less confident of its predictions, but that shouldn't be enough.

Suppose 90% of the errors are in the 10% where the model is least confident. Then you can review just 10% of your content and take a 2% error rate down to 0.2% error rate.

vivegi3y ago

You can also use multiple transcription engines and then use mismatches among the text streams to narrow down the % of content that needs to be reviewed. This is quite similar to multi-voting OCR for document images.

The principle is that the engines have different failure modes (hopefully) and therefore the 2-3% error rate of each engine is in different areas of the audio. The key underlying assumption is that the events are mutually exclusive.

With 3 engines, you can use something like 2-of-3 stream matches to override the stream that mismatches.

u83y ago

I had to do a lot of manual transcription in Journalism school. Using a tool like Descript saved HOURS of my life. Generally it was 80% accurate, but going over an two-hour-long recording again at 3x speed while reading over the transcript, fixing errors from memory or pausing took a five hour job down to 30-40 minutes. Either way, somebody is going to have to listen to the recording. This just removes a layer of grunt work.

TheCapeGreek3y ago

Having done audio transcription in college as a side gig, it takes a lot longer than it sounds. Even at a decent 100wpm you'll take about 5 minutes to type out 1 minute of audio.

Not having to pause + rewind will save a ton of time for that 3%.

guelo3y ago

Maybe you could run the text through a grammar checker to identify the errors.

1 more reply

thfuranOP3y ago

>equally true of human transcription, in which individual words are often [UNINTELLIGEBLE].

ML systems somewhat notoriously do not necessarily make the same sorts of errors that a human would. And I'd expect a large portion of the errors to be transcribing the wrong words rather that indicating that a word couldn't be transcribed. That sort of error means that you can't really get away with manually reviewing just 3% of the audio.

notahacker3y ago

ML tending to make weird mistakes rather than subtle ones that make sense in context like human transcribers is likely to make them easier to spot.

And there are humans in the loop too, and an enormous amount of redundancy in the questions and answer, so even plausible false transcriptions will get picked up on if they matter. Nobody gets sent to jail simply because the transcription process - human or machine - accidentally substitutes "I did it" in place of "I didn't" midway through a two hour interview.

1 more reply

iroh27273y ago

+1. There is a widespread "metric fallacy" or "task fallacy" going around. Models of course optimize for metrics, so they tend to perform well on those related metrics.

Humans, however, are not simply metric optimizers. Though it's always in the interest of those corporations producing metric optimizers (i.e. models) to paint humans as such, so their models shine in comparison. They want humans to look like bad machines, so it looks like they should be automated. Not to say they shouldn't in many cases, just that there's a clear one-sidedness in all corporate PR (and funded research, especially that research which is also PR).

All this to say that yes I agree with you. And if we humans don't want our unsustainable economic growth to turn us even more into machines (as our bureaucratic creep has done quite well thus far), we should fight such rhetoric that aims to paint humans simply as machines or task-doers.

datalopers3y ago

If you know which 2-3% are the false positives, you have a very lucrative business model.

MonkeyMalarky3y ago

When doing validation, I find it will often be the same errors repeated again and again in a transcription. Like it will fail on someone or some thing's name (that is rare / unique) and map it onto a known similar sounding word.

dotancohen3y ago

Sometimes even human will disagree about what was said in a recording - I had this happen recently. I heard a specific sentence, the other person heard the exact opposite. I cannot say who was right, even after listening to the recording several times on headphones and speakers I'm as certain of my interpretation as was the other party.

gnramires3y ago

I think an [UNINTELLIGIBLE] indication would be a great addition to automatic transcription systems.

inanutshellus3y ago

It'd [UNINTELLIGIBLE score="92%" alternatives="pro-rabble; pourable"]probably[/UNINTELLIGIBLE] be useful to make a markup-based output... though you'd probably find it gave you more info than you wanted.

1 more reply

yencabulator3y ago

Google Voice voicemail transcription used to do this, with varying levels of gray. It seems that feature is gone, now.

gs173y ago

Yeah, I tried to use automated transcription for a research project and we had to do it all manually because the few errors (I would say it did pretty well given our recording quality) were often dropping words like "not", which changed the whole meaning of a sentence! It was a useful assistance during transcription, but I really hope they would verify it was correct before arresting anyone based on it.

hadlock3y ago

Microsoft announced their voice transcription technology a couple years ago and were also touting ~97-98% accuracy which was actually better than human transcription error rates. The errors are usually in part people garbling their own speech, or they move their head while talking and the microphone misses a syllable. Anything in that error bar would probably fall under "reasonable doubt"

kyriakos3y ago

If its anything like Microsoft teams transcription I doubt the 97%+ accuracy.

selfmodruntime3y ago

I've worked with similar technology in the law enforcement space and the software is never used to make decisions. You can make out critical timestamps in conversations and a law enforcement officer will always manually confirm the software's assessments.

JohnFen3y ago

Given that law enforcement has made similar claims about technology use in the past that turned out to be false, I have no faith in this claim.

selfmodruntime3y ago

In all honesty, this is the correct mindset to have. I have limited expertise in this topic, and you should be aware that other law enforcement agencies probably do not handle this the same way.

CTDOCodebases3y ago

I imagine a certain percentage of a given population is on a voice call at any one time.

1. Set up a computer with voice recognition software that flags certain patterns.

2. Connect computer to voice call communication network.

3. Configure computer to switch between calls every x number of seconds.

Think of it like a system to generate leads for law enforcement that can be integrated with other systems to produce the best quality leads.

NaturalPhallacy3y ago

This is called "a fishing expedition" and is wildly unconstitutional in the US.

>The right of the people to be secure in their persons, houses, papers, and effects, against unreasonable searches and seizures, shall not be violated, and no Warrants shall issue, but upon probable cause, supported by Oath or affirmation, and particularly describing the place to be searched, and the persons or things to be seized.

CTDOCodebases3y ago

Are you sure about that? [0]

Besides I wasn't talking about the USA when I said this. I was remembering a conversation I once had with a person who worked as a technician in a telephone exchange.

[0] - https://en.wikipedia.org/wiki/Jewel_v._NSA

jjoonathan3y ago

Yes, it is wildly unconstitutional, but in practice don't the courts endorse the asinine "it's not a search unless we find something" argument from the NSA?

Power always just finds a way to rationalize what it wants to do.

1 more reply

Thorentis3y ago

Not really. Imagine that they do simple keyword matching on the text. Anything that's missed (part of the 97%) the criminals get away with. Anything that matches in the 3% is then checked by a human (by listening to the audio at that time stamp). So you only need to manually check the 3%, and even then only if something you're interested in is found.

golem143y ago

One would think that the few crucial bits of information gleaned are listened to manually, and the machine translation is not the only thing the judge or a jury sees.

thfuranOP3y ago

You have absolutely ruined someone's day way before they're sitting in front of a jury.

formerly_proven3y ago

Stuff like that is a very good tell that someone has zero experience with law enforcement.

j / k navigate · click thread line to collapse

0 comments

anigbrowl3y ago

wging3y ago

anigbrowl3y ago

2 more replies

etienne6183y ago

Presumably you can use the 97% that is correctly transcribed to rapidly filter out the relevant content. This is likely to be only a small portion of the total content. Then you check 100% of that.

woah3y ago

You double check things that you think are important, in this case, passages that will be used as evidence in court.

6gvONxR4sf7o3y ago

> I'm aware that errors are more likely when the model is less confident of its predictions, but that shouldn't be enough.

Suppose 90% of the errors are in the 10% where the model is least confident. Then you can review just 10% of your content and take a 2% error rate down to 0.2% error rate.

vivegi3y ago

With 3 engines, you can use something like 2-of-3 stream matches to override the stream that mismatches.

u83y ago

TheCapeGreek3y ago

Having done audio transcription in college as a side gig, it takes a lot longer than it sounds. Even at a decent 100wpm you'll take about 5 minutes to type out 1 minute of audio.

Not having to pause + rewind will save a ton of time for that 3%.

guelo3y ago

Maybe you could run the text through a grammar checker to identify the errors.

1 more reply

thfuranOP3y ago

>equally true of human transcription, in which individual words are often [UNINTELLIGEBLE].

notahacker3y ago

ML tending to make weird mistakes rather than subtle ones that make sense in context like human transcribers is likely to make them easier to spot.

1 more reply

iroh27273y ago

+1. There is a widespread "metric fallacy" or "task fallacy" going around. Models of course optimize for metrics, so they tend to perform well on those related metrics.

datalopers3y ago

If you know which 2-3% are the false positives, you have a very lucrative business model.

MonkeyMalarky3y ago

dotancohen3y ago

gnramires3y ago

I think an [UNINTELLIGIBLE] indication would be a great addition to automatic transcription systems.

inanutshellus3y ago

1 more reply

yencabulator3y ago

Google Voice voicemail transcription used to do this, with varying levels of gray. It seems that feature is gone, now.

gs173y ago

hadlock3y ago

kyriakos3y ago

If its anything like Microsoft teams transcription I doubt the 97%+ accuracy.

selfmodruntime3y ago

JohnFen3y ago

Given that law enforcement has made similar claims about technology use in the past that turned out to be false, I have no faith in this claim.

selfmodruntime3y ago

In all honesty, this is the correct mindset to have. I have limited expertise in this topic, and you should be aware that other law enforcement agencies probably do not handle this the same way.

CTDOCodebases3y ago

I imagine a certain percentage of a given population is on a voice call at any one time.

1. Set up a computer with voice recognition software that flags certain patterns.

2. Connect computer to voice call communication network.

3. Configure computer to switch between calls every x number of seconds.

Think of it like a system to generate leads for law enforcement that can be integrated with other systems to produce the best quality leads.

NaturalPhallacy3y ago

This is called "a fishing expedition" and is wildly unconstitutional in the US.

CTDOCodebases3y ago

Are you sure about that? [0]

Besides I wasn't talking about the USA when I said this. I was remembering a conversation I once had with a person who worked as a technician in a telephone exchange.

[0] - https://en.wikipedia.org/wiki/Jewel_v._NSA

jjoonathan3y ago

Yes, it is wildly unconstitutional, but in practice don't the courts endorse the asinine "it's not a search unless we find something" argument from the NSA?

Power always just finds a way to rationalize what it wants to do.

1 more reply

Thorentis3y ago

golem143y ago

One would think that the few crucial bits of information gleaned are listened to manually, and the machine translation is not the only thing the judge or a jury sees.

thfuranOP3y ago

You have absolutely ruined someone's day way before they're sitting in front of a jury.

formerly_proven3y ago

Stuff like that is a very good tell that someone has zero experience with law enforcement.

j / k navigate · click thread line to collapse