After All Is Said and Indexed – Unlocking Information in Recorded Speech

(github.com)

57 points · jeadie · 3y ago · 13 comments

13 comments · 5 top-level

jeadie (OP) · 3y ago · 2 in thread

A really interesting blog post I found on using LLMs for audio search, which I think is a pretty nifty, new idea.

I've found it cumbersome to build end-to-end systems with some of the new vector DBs (Chroma, FAISS, etc.), but with Marqo it doesn't seem too hard.

thomasahle · 3y ago

> I've found it cumbersome to build end-to-end systems with some of the new vector DBs (Chroma, FAISS, etc.)

What parts are cumbersome?

jeadie (OP) · 3y ago

Most people who end up needing vector DBs, me included, want to use LLMs on a specific, often private dataset or use case. Typically one starts with something like unstructured JSON data, then needs to pick and manage LLMs to create embeddings, then store these and the original JSON data in a vector DB. The application is then some variety of CRUD operations plus searching over both the original data and the embeddings.

Chroma, Pinecone, and I guess FAISS, HNSWlib, etc. only handle vector operations. What I really want, and what Marqo does, is something that handles everything end to end.
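To make the "end to end" point concrete, here is a minimal sketch of the workflow described above: JSON-like documents go in, an embedding is computed and stored next to each original document, and the application does CRUD plus search over both. Everything in it is an assumption for illustration: `ToyIndex`, `embed`, and the sample documents are hypothetical, the "embedding" is a plain bag-of-words count rather than an LLM, and none of this is Marqo's (or any vector DB's) actual API.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words term count."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyIndex:
    """Hypothetical index that keeps each original document next to its embedding."""

    def __init__(self, text_field):
        self.text_field = text_field
        self.docs = {}     # doc_id -> original JSON-like document
        self.vectors = {}  # doc_id -> embedding of the text field

    def upsert(self, doc_id, doc):
        # Create or update: store the document and (re)compute its embedding.
        self.docs[doc_id] = doc
        self.vectors[doc_id] = embed(doc[self.text_field])

    def delete(self, doc_id):
        # Remove the document and its embedding together.
        self.docs.pop(doc_id, None)
        self.vectors.pop(doc_id, None)

    def search(self, query, limit=3, where=None):
        # Rank by vector similarity, optionally filtering on the original fields.
        q = embed(query)
        hits = [
            (cosine(q, self.vectors[doc_id]), doc_id)
            for doc_id, doc in self.docs.items()
            if where is None or where(doc)
        ]
        hits.sort(reverse=True)
        return [(doc_id, round(score, 3)) for score, doc_id in hits[:limit] if score > 0]


index = ToyIndex(text_field="text")
index.upsert("a", {"speaker": "alice", "text": "We should ship the search feature next week"})
index.upsert("b", {"speaker": "bob", "text": "The billing bug is still open"})
index.upsert("c", {"speaker": "alice", "text": "Search latency improved after the index rebuild"})

results = index.search("search feature")
filtered = index.search("search", where=lambda d: d["speaker"] == "bob")
```

With a vector-only library, the `docs` dictionary, the re-embedding on update, and the filter over original fields are the parts the application has to build and keep in sync itself.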

notjulianjaynes · 3y ago · 2 in thread

This is interesting but what problem does it solve better than CTRL+F-ing a transcript? It seems like this would be a worse solution when the precise way someone says something could be important (e.g. journalists parsing an interview, students studying their recorded lectures), and that it would be most useful if you were working with a large volume of recorded audio, such as customer service calls. This makes me somewhat uncomfortable, but perhaps I am not fully understanding how it works.

Edit: wording

jeadie (OP) · 3y ago

Being able to handle and ask questions of audio data is a pretty big field. AssemblyAI (https://www.assemblyai.com/), for example, is a company entirely dedicated to audio intelligence. They have some great example use cases on their page.

UncleEntity · 3y ago

> This is interesting but what problem does it solve better than CTRL+F-ing a transcript?

Producing the transcript?

Being able to classify and search data seems like a pretty big deal these days too.

password4321 · 3y ago · 2 in thread

Both speaker and speech recognition are done in the article using Hugging Face.

Is there anything as good that is ready to use on-prem for the diarization (speaker recognition)?

I've heard good things about whisper(.cpp) for speech recognition, and Vosk used to be king of that hill...

rolisz · 3y ago

Diarization can be done on-premises using pyannote (which is what they use in the article). Hugging Face offers a library to run things locally and an API to run things on their cloud. Pyannote is available under an MIT licence.
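Once diarization (who spoke when) and speech recognition (what was said when) both run locally, the two outputs still have to be joined. A minimal sketch of one way to do that, assigning each transcript segment to the speaker turn it overlaps most in time. The dictionary layout, the `SPEAKER_00`-style labels, and the sample values are assumptions for illustration, not the output format of pyannote or Whisper and not necessarily how the article does it.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_segments(transcript, turns):
    """Attach to each transcript segment the speaker whose turn overlaps it most."""
    labelled = []
    for seg in transcript:
        best = max(
            turns,
            key=lambda t: overlap(seg["start"], seg["end"], t["start"], t["end"]),
        )
        # A segment that overlaps no turn at all gets no speaker.
        has_overlap = overlap(seg["start"], seg["end"], best["start"], best["end"]) > 0
        labelled.append({**seg, "speaker": best["speaker"] if has_overlap else None})
    return labelled


# Hypothetical diarization output: who spoke when.
turns = [
    {"start": 0.0, "end": 4.2, "speaker": "SPEAKER_00"},
    {"start": 4.2, "end": 9.0, "speaker": "SPEAKER_01"},
]
# Hypothetical speech-recognition output: what was said when.
transcript = [
    {"start": 0.3, "end": 3.9, "text": "Thanks for calling, how can I help?"},
    {"start": 4.0, "end": 8.5, "text": "My order never arrived."},
    {"start": 9.5, "end": 10.0, "text": "Hello?"},
]

labelled = label_segments(transcript, turns)
```

The labelled segments (text, timestamps, speaker) are then ordinary structured records that can be indexed and searched like any other document.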

boredemployee · 3y ago

Vosk is really good, but it's also a good example of an open-source project with great potential that doesn't scale up because the person behind it is a douchebag.

Documentation is poor, and what you do find is sparse, outdated shit on the web, so it's really hard to find help.

rektide · 3y ago · 1 in thread

Hadn't heard of the thing they were putting their data into: Marqo, a "tensor search for humans", https://github.com/marqo-ai/marqo

jeadie (OP) · 3y ago

It's a great tool. Unlike vector DBs alone, Marqo helps with the full process that a lot of people end up wanting to use vector DBs for (e.g. have structured data, use LLMs to create embeddings, and perform search/CRUD on the embeddings plus the original data).

moneywoes · 3y ago · 1 in thread

How does this compare to using Whisper, feeding the output into a vector DB, and querying with an LLM?

Pardon the dumb question; I only have an elementary understanding.

jeadie (OP) · 3y ago

Not a dumb question at all! Essentially, what this blog shows is that there is a lot of logic and work involved in doing what you said (i.e. pass raw data into an LLM, get embeddings, store them in a vector DB, then query both the embeddings and the original data), and Marqo can do that for you.
