I've seen people on here discuss techniques for getting more accurate search results from embeddings. For example, some suggest extracting keywords from the text and embedding those instead of the full text, which can reportedly work better. Does anyone have articles that test these methods, or know of other methods worth trying?
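To make the keyword idea concrete, here is a rough sketch of what I understand people to mean. Everything in it is an assumption for illustration: `extract_keywords` is just stopword filtering plus frequency counting, and `toy_embed` is a hashed bag-of-words stand-in so the example runs without any dependencies. In practice you would swap `toy_embed` for a call to whatever embedding model you actually use.

```python
import math
import re
import zlib
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "the", "of", "on", "in", "for", "to", "with",
             "is", "are", "that", "this", "it", "what", "how"}


def extract_keywords(text, top_k=5):
    """Return up to top_k most frequent non-stopword tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    return [word for word, _ in counts.most_common(top_k)]


def toy_embed(words, dim=256):
    """Hypothetical stand-in for a real embedding model.

    Hashes each word into one of `dim` buckets and L2-normalizes the result.
    """
    vec = [0.0] * dim
    for w in words:
        # crc32 is stable across runs, unlike Python's built-in hash() for str.
        vec[zlib.crc32(w.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(a, b):
    """Cosine similarity; inputs are already unit length (or all zeros)."""
    return sum(x * y for x, y in zip(a, b))


def search(query, docs, top_k=5):
    """Rank docs by similarity between keyword embeddings of query and doc.

    Returns a list of (doc_index, score) pairs, best match first.
    """
    q_vec = toy_embed(extract_keywords(query, top_k))
    scored = []
    for i, doc in enumerate(docs):
        d_vec = toy_embed(extract_keywords(doc, top_k))
        scored.append((i, cosine(q_vec, d_vec)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Calling `search("video captioning model", docs)` returns the document indices sorted by score. To compare against the plain approach, you could embed the full text instead of the keywords and check which ranking looks better on your own data; I haven't seen a solid benchmark either way.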
I'm looking for a model that takes a video as input and outputs a caption describing what is happening in it. I've looked on Hugging Face and elsewhere, but the only thing I can find is XCLIP from Microsoft, and that only does video classification. It doesn't write its own caption.
I want to build a site for publishing articles and some personal documents as blog posts, and I'm looking for inspiration from how the best designers on the web do it.