Listening with LLM

(paul.mou.dev)

132 points · ppymou · 2y ago · 11 comments

10 comments · 3 top-level

modeless · 2y ago · 4 in thread

I love this research direction! Multimodal is the future, and the possibilities of gluing together pretrained models are underexplored. As tinkerers, it's something we can do at home that doesn't require a datacenter full of H100s or a terabyte dataset.

Crazy that you were able to trace your issues to bad RAM! I probably would have torn all my hair out long before suspecting bad RAM.

I imagine that Whisper-based embeddings wouldn't be great for analyzing music, but they should be excellent for allowing LLMs to understand speech. Although it might seem trivial to hook up Whisper to LLMs already using text, I think using embeddings instead (or in addition) would allow the LLM to understand much more about speech: cadence, tone, accent, etc. I think something like this will be necessary for speech agents in the medium term. It should allow an LLM to respond much more naturally to speech input than just giving it the text output of a speech-to-text system. Maybe it could be done on the output side too, hooking it up to the internals of a text-to-speech system for an end-to-end audio-to-audio chatbot!
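To make the "embeddings instead of text" idea concrete, here is a minimal sketch of one common way to glue an audio encoder to an LLM: project each audio frame embedding into the LLM's embedding space and prepend the result to the text token embeddings. This is not the blog author's code. The function names, the toy dimensions, and the random (untrained) projection are all assumptions made for illustration; in a real setup the frames would come from an encoder such as Whisper and the projection would be learned.

```python
import random


def make_projection(audio_dim, llm_dim, seed=0):
    """Hypothetical adapter: a randomly initialised linear map (llm_dim x audio_dim).

    In practice this matrix would be trained; here it is random for illustration.
    """
    rng = random.Random(seed)
    scale = audio_dim ** -0.5
    return [[rng.gauss(0.0, scale) for _ in range(audio_dim)] for _ in range(llm_dim)]


def project(frames, weight):
    """Map each audio frame embedding into the LLM's embedding space."""
    return [[sum(w * x for w, x in zip(row, frame)) for row in weight] for frame in frames]


def build_llm_input(audio_frames, text_embeddings, weight):
    """Prepend projected audio 'soft tokens' to the text token embeddings."""
    return project(audio_frames, weight) + text_embeddings


# Toy sizes, chosen only for illustration (real models use hundreds or thousands of dims).
AUDIO_DIM, LLM_DIM = 4, 6
audio_frames = [[0.1 * (i + j) for j in range(AUDIO_DIM)] for i in range(3)]  # 3 audio frames
text_embeddings = [[0.0] * LLM_DIM for _ in range(2)]  # 2 text tokens
weight = make_projection(AUDIO_DIM, LLM_DIM)

sequence = build_llm_input(audio_frames, text_embeddings, weight)
print(len(sequence), len(sequence[0]))  # 5 positions, each of width LLM_DIM
```

Because the LLM sees the projected frames rather than a transcript, information that text would discard (cadence, tone, accent) can in principle survive into the model's input.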

Do you have a Twitter account or some other way to follow your progress?

ppymou (OP) · 2y ago

Thanks for the comment. I will be chronicling my progress on the blog!

modeless · 2y ago

BTW I submitted your post to /r/LocalLLaMA. I expect it will see some good attention there. It seems to be the best community around (outside of various Discords I don't participate in) for people doing ML at home, even endorsed by Andrej Karpathy himself. https://old.reddit.com/r/LocalLLaMA/comments/1970zhf/merging...

rcarmo · 2y ago

The blog has RSS.

modeless · 2y ago

I don't have an RSS reader these days. I found that I just wasn't keeping up with all the subscriptions and I wasn't getting that much out of reading them anyway. Something with a little bit of collaborative filtering like HN or social media works better for me.

asymmetric · 2y ago · 2 in thread

Very OT, but I love the style of your resume. Is the source available somewhere?

ppymou (OP) · 2y ago

Haha, thanks! The HTML is on GitHub: https://github.com/moomou/moomou.github.io/blob/master/resum... From there you can see the imported CSS, etc. Be warned, though: the resume and CSS have accumulated over the years, so they are not particularly clean.

asymmetric · 2y ago

Thanks! I was hoping I'd discover some nice LaTeX class, but alas :)

refulgentis · 2y ago · 1 in thread

If author is around: amazing work!!! Multimodal from scratch :)

I'm curious whether you have the test clip you used. I got to the end and was like, "Wait... is that a good result? The words are completely different!"

Then I re-read a couple times scanning carefully for references to what the audio is.

This quote[^1] makes me think the sample is music, as that would explain why the end result is good -- it's trying to describe a sound file of just music, not a sound file that is a spoken word version of the "ground truth":

[^1] "For dataset, I chose MusicCaps. I did not see any convenient links to download processed/segmented audio files, so I wrote a small script to download the Youtube videos."

ppymou (OP) · 2y ago

Thanks for reading, and yes, you are right: the input audio consists of music clips.

MusicCaps [1] is a dataset of music clips paired with natural-language descriptions. The reason the result is good, IMO, is that the trained model was able to generate a description that shares features with the ground truth.

[1] https://huggingface.co/datasets/google/MusicCaps
