Fine-grained Visual Transcription for YouTube videos

(vlm-docs.nos.run)

9 points · EarlyOom · 2y ago · 3 comments

EarlyOom (OP) · 2y ago

TLDR: There are dozens of audio transcription APIs, but none for video and visual transcription. So we built one.

If you want visual chaptering, summarization, OCR / text-extraction, audio transcriptions, and sentiment analysis on your videos, there’s really nothing out there. We tried stitching this together with several audio/video understanding APIs but kept running into rate limits, hallucinations, high costs and poor accuracy.

Analyzing Audio Podcasts: https://vlm-docs.nos.run/guides/guide-audio-podcasts

Understanding Video Podcasts: https://vlm-docs.nos.run/guides/guide-video-podcasts

arthurdelerue · 2y ago

I'm not sure why you say that current video transcriptions are bad. I use Whisper on NLP Cloud for video transcription (https://docs.nlpcloud.com/#automatic-speech-recognition) and it works very well.

As far as I understand, video transcription is a no-brainer as long as you install ffmpeg.

EarlyOom (OP) · 2y ago

Hi Arthur! There's a bit of confusion here. It looks like you're referring to _audio_ transcription; that is, passing the audio component into an ASR pipeline (like Whisper, Otter, etc.) to generate a transcript of any spoken words. Our pipeline is meant for fine-grained 'transcriptions' of the _visual_ content of the video: for instance, any text on screen, the contents of plots and graphs, the clothing worn by any participants, etc. (Though we do transcribe the audio as well; it's a multimodal pipeline!)
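To make the audio/visual split concrete, here is a rough sketch of how a video could be separated into the two inputs with ffmpeg. It is only an illustration of the distinction, not the pipeline described in this thread: the function names, the 16 kHz mono audio format, and the 1 frame-per-second sampling rate are all assumptions chosen for the example. The functions only build the ffmpeg argument lists; running them requires ffmpeg on the PATH.

```python
def audio_extract_cmd(video: str, wav_out: str, sample_rate: int = 16000) -> list[str]:
    """Build an ffmpeg command that keeps only the audio track (input for ASR).

    -vn drops the video stream, -ac 1 downmixes to mono, -ar sets the sample rate.
    The 16 kHz mono default is an assumption, picked because many ASR models use it.
    """
    return ["ffmpeg", "-y", "-i", video, "-vn", "-ac", "1", "-ar", str(sample_rate), wav_out]


def frame_sample_cmd(video: str, frames_dir: str, fps: float = 1.0) -> list[str]:
    """Build an ffmpeg command that samples still frames (input for visual analysis).

    The fps video filter resamples the stream to `fps` frames per second;
    frames are written as numbered PNG files into frames_dir (hypothetical layout).
    """
    pattern = f"{frames_dir}/frame_%06d.png"
    return ["ffmpeg", "-y", "-i", video, "-vf", f"fps={fps}", pattern]


if __name__ == "__main__":
    # Print the commands instead of running them; pass either list to
    # subprocess.run(cmd, check=True) to actually execute it.
    print(" ".join(audio_extract_cmd("talk.mp4", "talk.wav")))
    print(" ".join(frame_sample_cmd("talk.mp4", "frames")))
```

The audio file would go to an ASR model such as Whisper, while the sampled frames would go to whatever OCR or vision model handles on-screen text, plots, and so on; that second half is the part an audio-only setup does not cover.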