This first part is the high-level introduction, useful for the project-planning and architecture decisions that have to be made early in development. Any feedback is welcome, along with wishes for the follow-up parts if there is something specific you would like to see covered.
Do you have any materials on word embedding strategies past Word2Vec? BERT and beyond?
I am currently working on a recommendation engine for a large library - the original idea being to find "similar" documents - the funding comes from a plagiarism-checking project.
I was slightly surprised by how deceptively simple the widely cited winnowing paper is https://dl.acm.org/doi/10.1145/872757.872770 . The key idea is a simple mod reduction of hashed fingerprints.
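It really is short enough to sketch in a few lines of Python. This is just the winnowing selection step (hash every k-gram, keep the rightmost minimal hash in each window of w); the values of k and w and the use of Python's built-in hash() are arbitrary stand-ins, not choices from the paper:

```python
# Sketch of winnowing fingerprint selection: hash every k-gram, then keep
# the rightmost minimal hash in each window of w consecutive hashes.
# k, w, and the built-in hash() are stand-ins, not choices from the paper.

def kgram_hashes(text: str, k: int) -> list[int]:
    """Hash every k-character substring."""
    return [hash(text[i:i + k]) for i in range(len(text) - k + 1)]

def winnow(text: str, k: int = 5, w: int = 4) -> set[tuple[int, int]]:
    """Return (position, hash) fingerprints selected by winnowing."""
    hashes = kgram_hashes(text, k)
    fingerprints = set()
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        m = min(window)
        # rightmost occurrence of the window minimum, as in the paper
        offset = max(i for i, h in enumerate(window) if h == m)
        fingerprints.add((start + offset, window[offset]))
    return fingerprints

# Documents sharing a substring of length >= w + k - 1 are guaranteed
# to share at least one fingerprint hash:
a = winnow("the quick brown fox jumps over the lazy dog")
b = winnow("a quick brown fox jumps over a sleeping cat")
shared = {h for _, h in a} & {h for _, h in b}
```

The guarantee is the nice part: any match at least w + k - 1 characters long contains a full window in both documents, and both documents select the same minimum from it.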
My project's goal is to find phrase level similarities to assist researchers.
It seems k-grams, n-grams, tf-idf, and even Word2Vec are not going to cut it. A "smarter", context-aware embedding is in order. My foray into training BERT from scratch was not very successful - my corpora are not in English...
PS. As usual, I spend most of my time on improving OCR quality and preprocessing the corpora...
But a better bet might actually be looking into simpler embeddings and methods and attempting to improve them directly by incorporating some domain knowledge into the method or the process. Again, it is hard to judge what might work better just by looking at the surface.
In case you really need to work with labeled datasets: set up a strong baseline, look into active-learning methods and set up the loop, do a few iterations, and try to predict whether it will scale to your target accuracy fast enough.
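As a toy version of that loop: this is a minimal uncertainty-sampling sketch, where synthetic linearly separable data stands in for the real corpus and a known label array plays the annotator; everything here is illustrative, nothing is from a specific project.

```python
# Toy uncertainty-sampling loop: synthetic data stands in for the corpus,
# and the known label array y plays the role of the human annotator.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))
y = (X @ rng.normal(size=8) > 0).astype(int)   # stand-in "oracle" labels

labeled = list(range(20))                      # small seed set
unlabeled = list(range(20, len(X)))

for _ in range(5):                             # a few iterations
    model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
    # query the points the model is least sure about (closest to p = 0.5)
    margin = np.abs(model.predict_proba(X[unlabeled])[:, 1] - 0.5)
    query = [unlabeled[i] for i in np.argsort(margin)[:10]]
    labeled += query                           # "annotate" and move over
    unlabeled = [i for i in unlabeled if i not in query]

accuracy = model.score(X[unlabeled], y[unlabeled])
```

Tracking accuracy per iteration gives the learning curve you would extrapolate to decide whether the labeling budget will get you to target accuracy.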
A great upside to this approach is that it works for a variety of different types of unstructured data (images, video, molecular structures, geospatial data, etc), not just text. The rise of multimodal models such as CLIP (https://openai.com/blog/clip) makes this even more relevant today. Combine it with a vector database such as Milvus (https://milvus.io) and you'll be able to do this at scale with very minimal effort.
[1] https://www.nyckel.com/semantic-image-search
[2] https://www.nyckel.com/docs/text-search-quickstart
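At its core the setup is just nearest-neighbor search over embedding vectors. A brute-force sketch, where random vectors stand in for real CLIP embeddings and a numpy matmul stands in for the vector database:

```python
# Brute-force stand-in for embedding search: random unit vectors play the
# role of real CLIP embeddings; a matmul plays the vector database.
import numpy as np

rng = np.random.default_rng(1)
corpus = rng.normal(size=(10_000, 512))                  # one vector per item
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)  # unit-normalize

def top_k(query: np.ndarray, k: int = 5) -> np.ndarray:
    """Indices of the k most cosine-similar corpus items."""
    q = query / np.linalg.norm(query)
    return np.argsort(corpus @ q)[::-1][:k]

# a slightly perturbed copy of item 42 should retrieve item 42 first
noisy = corpus[42] + 0.05 * rng.normal(size=512)
```

A vector database like Milvus replaces the O(n) matmul with an approximate index, but the interface - embed, insert, query by similarity - is the same.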
This is one of the methods mentioned in the article. I don't have implementation experience with the other string-distance measures in the article (under "normalized string" in the table), except for Q-grams. Compared to the above method, Q-grams don't scale as well and are not as robust, because they don't capture the semantics of the text.
[1] github.com/facebookresearch/faiss
[2] github.com/google-research/google-research/tree/master/scann
[3] www.pinecone.io
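For reference, the Q-gram method itself fits in a few lines. This sketch uses Jaccard overlap of character q-grams; q = 3 is an arbitrary choice, not something from the article:

```python
# Q-gram similarity sketch: Jaccard overlap of character q-gram sets.
# q = 3 is an arbitrary choice, not a value from the article.
def qgrams(text: str, q: int = 3) -> set[str]:
    return {text[i:i + q] for i in range(len(text) - q + 1)}

def qgram_similarity(a: str, b: str, q: int = 3) -> float:
    ga, gb = qgrams(a, q), qgrams(b, q)
    return len(ga & gb) / len(ga | gb) if ga or gb else 0.0

# purely lexical: scores high on surface overlap, blind to paraphrase
qgram_similarity("plagiarism checker", "plagiarism detection")
```

That last comment is the robustness point: a paraphrase with no shared character sequences scores near zero no matter how close the meaning is.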
In the structural-comparison case, I imagine you might have better luck just doing cosine similarity across the term-frequency vectors or some such, possibly doing a random projection first to reduce dimensionality.
Or really, an LSH would do the trick.
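A rough sketch of both ideas, with made-up toy documents and dimensions: cosine over term-frequency vectors, plus a random projection to reduce dimensionality. Taking just the sign of each projected coordinate is essentially the bit signature a SimHash-style LSH would store.

```python
# Toy sketch: cosine similarity over term-frequency vectors, with a
# random projection for dimensionality reduction. Documents and the
# target dimension (8) are made up for illustration.
import numpy as np
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "stochastic gradient descent converges",
]
vocab = sorted({w for d in docs for w in d.split()})

def tf_vector(doc: str) -> np.ndarray:
    """Raw term-frequency vector over the shared vocabulary."""
    counts = Counter(doc.split())
    return np.array([counts[w] for w in vocab], dtype=float)

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

proj = np.random.default_rng(0).normal(size=(len(vocab), 8))
vecs = [tf_vector(d) @ proj for d in docs]   # reduced to 8 dimensions
# sign(vecs[i]) is essentially the bit signature SimHash-style LSH keeps
```

Near-duplicate documents stay close after projection, and hashing the sign bits buckets them together without any pairwise comparison.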
- https://dzone.com/articles/build-a-plagiarism-checker-using-...
We used something similar to build a “similar articles” feature & it gave us de-duplication essentially for free.