97% accuracy means roughly three to five errors per minute of speech. That could be seriously problematic for something like law enforcement use, where decisions with significant impact on people's day (or life) might be made on the basis of "evidence".
Would you want to review this fully before going into court? Absolutely - because you'd want to play the recording to a jury for emotional impact. Can you rely on it when you want to quickly read through hours of conversation and make decisions about whether to invest further resources (which might just mean another hour of listening back to the original audio)? Also absolutely. Bear in mind that a lot of these errors have little to no semantic impact, being on the same level as typos or misspellings in a written communication.
Bear in mind too that if law enforcement (honest or not) is so interested in you that they're willing to record your conversations, your day is already ruined, you just don't know it yet. The change here is one of scale rather than quality.
(edit for clarification: errors are not always something like "[UNINTELLIGIBLE]", where the system knows it doesn't know; they can also be misrecognitions that the system believes in with high confidence.)
Look, I have decades of experience dealing with human speech, and not just as an editor - I can trace the human voice from neural impulses in Broca's area through the physiology of vocal production, mechanical transduction into electrical signals, discrete Fourier transforms of the resultant waveforms into spectral information and back again, the reproduction of altered signals from time-aligned speakers to create a sense of spatialization, how those are processed in the human ear, and how the cilia are connected by nerves back to your brain. I'm a good enough editor that I can recognize many short words by sight of a waveform, or make 10 edits in a row by sight and know it will sound good on playback.
So when I say that machine transcription is as good as human realtime transcription now, I say so with the clear expectation that those decades of craft are very close to being rendered obsolete. I absolutely expect to hand off the mechanical part of editing to a machine within 2 years or so. It's already at the stage where I edit some interviews as text, like in a word processor, and then export the edited document as audio and it's Good Enough - not for every speaker, but more than half the time.
NPR and a lot of commercial broadcasters cut their material this way already, because you can get the same result from 30 minutes of reading and text editing that would require 3 hours of pure audio editing with no transcription.
Suppose 90% of the errors are in the 10% where the model is least confident. Then you can review just 10% of your content and take a 2% error rate down to 0.2% error rate.
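A minimal sketch of that triage idea, assuming the recognizer exposes a per-segment confidence score (the field name here is hypothetical; real engines report confidence in various shapes):

```python
def flag_for_review(segments, review_fraction=0.10):
    """Return the lowest-confidence fraction of segments for human review.

    `segments` is a list of dicts with a (hypothetical) "confidence" key.
    Reviewing only this slice catches most errors if errors concentrate
    where the model is least confident.
    """
    ranked = sorted(segments, key=lambda s: s["confidence"])
    cutoff = max(1, int(len(ranked) * review_fraction))
    return ranked[:cutoff]

# The arithmetic from the comment: if 90% of errors fall inside the
# reviewed 10%, correcting them takes a 2% error rate down to
# 2% * (1 - 0.9) = 0.2%.
residual_error_rate = 0.02 * (1 - 0.90)
```

The whole scheme stands or falls on how well calibrated the confidence scores actually are; if the engine is confidently wrong (as L4's caveat notes), the errors don't concentrate in the reviewed slice.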
The principle is that the engines have different failure modes (hopefully), and therefore the 2-3% error rate of each engine falls in different areas of the audio. The key underlying assumption is that the errors are roughly independent, so the engines rarely all fail on the same word.
With 3 engines, you can use something like 2-of-3 stream matches to override the stream that mismatches.
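A sketch of that 2-of-3 vote, assuming the three transcripts have already been time-aligned word-for-word (a real system would need an alignment step first, e.g. dynamic programming over timestamps; that part is omitted here):

```python
from collections import Counter

def majority_word(a, b, c):
    """Pick the word at least two engines agree on; None marks a 3-way split."""
    word, count = Counter([a, b, c]).most_common(1)[0]
    return word if count >= 2 else None

def merge_streams(s1, s2, s3):
    """Merge three aligned word streams by 2-of-3 vote, overriding
    whichever single stream mismatches at each position."""
    return [majority_word(a, b, c) for a, b, c in zip(s1, s2, s3)]
```

So `merge_streams(["i", "did", "not"], ["i", "did", "not"], ["i", "dig", "not"])` recovers `["i", "did", "not"]`, while a position where all three engines disagree surfaces as `None` - a natural flag for human review.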
Not having to pause + rewind will save a ton of time for that 3%.
ML systems somewhat notoriously do not necessarily make the same sorts of errors that a human would. And I'd expect a large portion of the errors to be transcribing the wrong words rather than indicating that a word couldn't be transcribed. That sort of error means that you can't really get away with manually reviewing just 3% of the audio.
And there are humans in the loop too, and an enormous amount of redundancy in the questions and answers, so even plausible false transcriptions will get picked up on if they matter. Nobody gets sent to jail simply because the transcription process - human or machine - accidentally substitutes "I did it" in place of "I didn't" midway through a two hour interview.
Humans, however, are not simply metric optimizers. Though it's always in the interest of those corporations producing metric optimizers (i.e. models) to paint humans as such, so their models shine in comparison. They want humans to look like bad machines, so it looks like they should be automated. Not to say they shouldn't in many cases, just that there's a clear one-sidedness in all corporate PR (and funded research, especially that research which is also PR).
All this to say that yes I agree with you. And if we humans don't want our unsustainable economic growth to turn us even more into machines (as our bureaucratic creep has done quite well thus far), we should fight such rhetoric that aims to paint humans simply as machines or task-doers.
1. Set up a computer with voice recognition software that flags certain patterns.
2. Connect computer to voice call communication network.
3. Configure computer to switch between calls every x number of seconds.
Think of it like a system to generate leads for law enforcement that can be integrated with other systems to produce the best quality leads.
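The three steps above can be condensed into a toy sketch. Everything here is illustrative: the inputs stand in for the x-second samples from step 3 after they've been run through the recognition software from step 1, and the pattern list is whatever the system is configured to flag:

```python
def scan_transcripts(call_transcripts, patterns):
    """Return IDs of calls whose sampled transcript matches any flagged pattern.

    `call_transcripts` maps a call ID to the text recognized from a short
    sample of that call; `patterns` is the configured watch list.
    Matched calls become "leads" for later (human or automated) review.
    """
    return [call_id
            for call_id, text in call_transcripts.items()
            if any(p in text.lower() for p in patterns)]
```

In practice the interesting (and contested) part isn't this loop at all, but the legal authority to tap the calls in the first place - which is where the quote below comes in.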
>The right of the people to be secure in their persons, houses, papers, and effects, against unreasonable searches and seizures, shall not be violated, and no Warrants shall issue, but upon probable cause, supported by Oath or affirmation, and particularly describing the place to be searched, and the persons or things to be seized.
Besides, I wasn't talking about the USA when I said this. I was remembering a conversation I once had with a person who worked as a technician in a telephone exchange.
Power always just finds a way to rationalize what it wants to do.