https://news.berkeley.edu/2017/02/24/faq-on-legacy-public-co...
Discussed at the time (2017) https://news.ycombinator.com/item?id=13768856
Clearly the result of a regulation that meant well. But the road to hell is paved with good intentions.
It's a bit reminiscent of a law that prevents institutions from continually offering employees non-permanent work contracts. As in, after two fixed-term contracts, the third one must be permanent. The idea is to guarantee workers more stable, long-term prospects. The result, however, is that the employee's contract won't get renewed at all after the second one, and instead someone else will be hired on a non-permanent contract.
But the print form was even less accessible, and they kept publishing that...
Anything that's even remotely domain-specific becomes a garbled mess. Even the captions on documentaries about light engineering/archeology/history subjects are hilariously bad. Names of historical places and people are only randomly correct and almost never consistent.
The second anyone has a bit of an accent, it's completely useless.
I keep them on partially because I'm of the "everything needs to have subtitles else I can't hear the words they're saying" cohort. So I can figure out what they really mean, but if you couldn't hear anything I can see it being hugely distracting/distressing/confusing/frustrating.
If the automatically-generated captions are now of a similar quality as human-generated ones, then that changes things.
[1] https://news.berkeley.edu/wp-content/uploads/2016/09/2016-08...
IME youtube transcripts are completely devoid of meaningful information, especially when domain-specific vocabulary is used.
It would be great if they were annotated and served in a more user-friendly fashion.
As a bonus link, one of my favorite courses from the time: https://archive.org/details/ucberkeley_webcast_itunesu_35482...
If it's, say, 5000 hours, then running it through the best model at assembly.ai with no discounts would cost less than $2000. I know someone could do Whisper for cheaper, and there would likely be discounts at this volume, but worst case it seems very doable even for an individual.
It is not perfect: it sometimes replaces words with a synonym, but it is much faster and cheaper.
Gemini 1.5 Flash-8B is cheap: about $1 per 500 hours of transcript.
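To put the two figures quoted above side by side, here is a rough back-of-the-envelope sketch. The only inputs are the numbers from these comments (under $2000 for 5000 hours, and $1 per 500 hours); the 5000-hour archive size is the hypothetical from the comment, not a measured figure, and the function names are made up for illustration.

```python
def usd_per_hour(total_usd: float, hours: float) -> float:
    """Convert a quoted 'X dollars for Y hours' figure into a per-hour rate."""
    return total_usd / hours


def total_cost(hours: float, rate_usd_per_hour: float) -> float:
    """Cost of transcribing `hours` of audio at a flat per-hour rate."""
    return hours * rate_usd_per_hour


# Figures as quoted in the thread; treat them as rough, not as a price list.
ASR_RATE_CEILING = usd_per_hour(2000, 5000)  # "less than $2000" for 5000 hours
FLASH_8B_RATE = usd_per_hour(1, 500)         # "$1 per 500 hours"

ARCHIVE_HOURS = 5000  # hypothetical archive size ("say, 5000 hours")

if __name__ == "__main__":
    print(f"ASR ceiling:  ${total_cost(ARCHIVE_HOURS, ASR_RATE_CEILING):,.2f}")
    print(f"Flash-8B:     ${total_cost(ARCHIVE_HOURS, FLASH_8B_RATE):,.2f}")
```

On those quoted rates the same archive comes to at most $2000 one way and roughly $10 the other, which is the gap the two comments are pointing at. It says nothing about the accuracy difference discussed elsewhere in the thread.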
Is building a ramp to meet ADA requirements not using technology to solve a legal issue?
1. It brings everything back to the "average." Any outliers get discarded. For example, someone who is a circus performer plays fetch with their frog. An LLM would think this is an obvious error and correct it to "dog."
2. LLMs want to format everything as internet text which does not align well to natural human speech.
3. Hallucinations still happen at scale, regardless of model quality.
We've done a lot of experiments on this at Rev and it's still useful for the right scenario, but not as reliable as you may think.
This is just like playing a game of markov telephone where the step in OP's solution is likely higher compute cost than the step YT uses, because YT is interested in minimizing costs.
And to run it on a "free" product they probably use a very tiny, heavily quantized version of their already weak ASR.
There's lots and lots of better meeting bots if you don't mind paying or have low usage that works for a free tier. At Rev we give away something like 300 minutes a month.
We also compared Amazon, Google, Microsoft Azure as well as a bunch of smaller players (from Edinburgh and Cambridge) and - consistent with what you reported - we also found Google ranked worst - but that was a one-off study from 2019 (unpublished) on financial news.
Word Error Rate (WER), the standard metric for the task, is not everything. For some applications, the ability to upload custom lexicons is paramount: ASR systems that are word-based (almost all of them), as opposed to phoneme-based, require each word to be defined ahead of time before they can recognize it.
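For readers who haven't met the metric: WER is the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the ASR output, divided by the number of reference words. A minimal sketch follows, assuming simple lowercasing and whitespace tokenization; real evaluations usually apply more careful text normalization, and the function name is just for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "the performer plays fetch with their frog"
    hypothesis = "the performer plays fetch with their dog"
    print(word_error_rate(reference, hypothesis))
```

The frog/dog example from earlier in the thread shows why the number alone can mislead: one substituted word out of seven is a low WER, yet it is the one word that carried the meaning.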
Since you have experience in this, I’d like to hear your thoughts on a common assumption.
It goes like this: don’t build anything that would be a feature for a hyperscaler, because ultimately they win.
I guess a lot of it is a question of timing?
>the only hyperscaler with a good ASR is Azure
How would you say the non-hyperscalers compare? Speechmatics for example?
Whether you want to deal with it being annoying is your call.
One of the better reasons I've found to use Cerebras/Groq is that you can return huge amounts of clean text fast for processing in other ways.
Are you saying it works with 70B models on Groq? Mixtral, Llama? Other?
https://github.com/google-gemini/generative-ai-js/issues/269...
I have had little success with Gemini and long videos. My pipeline is video -> ffmpeg strip audio -> whisperX ASR -> groq (L3-70b-specdec) -> gpt-4o/sonnet-3.5 for summarization. Works great.
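The shape of that pipeline can be sketched roughly as below. This is not the commenter's code: the ffmpeg flags are standard ones, but the choice of 16 kHz mono WAV is an assumption, and the four stage callables are hypothetical placeholders standing in for the whisperX, Groq, and GPT-4o/Sonnet calls, whose exact invocations are not given in the thread.

```python
import subprocess
from typing import Callable


def ffmpeg_extract_audio_cmd(video_path: str, audio_path: str) -> list[str]:
    """Build an ffmpeg command that drops the video stream and keeps audio.

    -vn disables video, -ac 1 downmixes to mono, -ar 16000 resamples to
    16 kHz (an assumed input format for the ASR step), -y overwrites output.
    """
    return ["ffmpeg", "-y", "-i", video_path,
            "-vn", "-ac", "1", "-ar", "16000", audio_path]


def extract_audio(video_path: str, audio_path: str) -> str:
    """Run ffmpeg (must be on PATH) and return the path of the audio file."""
    subprocess.run(ffmpeg_extract_audio_cmd(video_path, audio_path), check=True)
    return audio_path


def run_pipeline(
    video_path: str,
    strip_audio: Callable[[str], str],  # video path -> audio path
    transcribe: Callable[[str], str],   # audio path -> raw transcript (ASR)
    clean_up: Callable[[str], str],     # raw transcript -> cleaned text (LLM)
    summarize: Callable[[str], str],    # cleaned text -> summary (LLM)
) -> str:
    """Chain the stages in the order described: audio, ASR, cleanup, summary."""
    audio_path = strip_audio(video_path)
    raw_transcript = transcribe(audio_path)
    cleaned = clean_up(raw_transcript)
    return summarize(cleaned)
```

Passing the stages in as callables keeps each one swappable, so a different ASR model or summarizer can be dropped in without touching the rest.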
Talking has a lot of nuances to it. Just try to read a Donald Trump transcript. A professional author would never write a book's dialogue like that.
Using a generic LLM on transcripts almost always reduces accuracy as a whole. We have endless benchmark data to demonstrate this at RevAI. It does, however, help with custom vocabulary, rare words, proper nouns, and some people prefer the "readability" of an LLM-formatted transcript. It will read more like a wikipedia page or a book as opposed to the true nature of a transcript, which can be ugly, messy, and hard to parse at times.
That's a bit too far. Ever read Huck Finn?
Once you accept that it's okay for the LLM to just replace words in a transcript, you might as well let it make up a story based on character names you've provided.
That's a wild exaggeration. Professional transcripts often have small (and not so small) mistakes, caused by typos, mishearing, or lack of familiarity with the subject matter. Depending on the case, these are then manually proofread, but even after proofreading some mistakes often remain, and occasionally new ones are even introduced.
But I understand it can be difficult to trust: that’s why the project is on GitHub so you can run it on your own machine and look at how the key is used.
I will try to offer a version that doesn’t require any key.