The tutorial is fun marketing material and all, but it's far too slow to be used on anything at scale. Please, for your sanity, don't treat this as anything other than a fun toy example.
ER wants to be an O(n^2) problem, and you have to fight very hard to keep it from turning into one. Most people doing this at scale follow basically the same playbook:
1) Use a very complicated, speed-optimized, non-ML algorithm to find groups of entities that are highly likely to be the same, usually based on string similarity, hashing, or extremely complicated heuristics. This process is called blocking or filtering.
2) Use fancy ML to determine matches within these blocks.
If you try to combine 1 & 2, skip 1, or do 1 with ML on large datasets, you are guaranteed to have a bad time. The difference between mediocre and amazing ER is how well you do blocking/filtering.
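To make the playbook concrete, here's a minimal sketch. The records, the four-character blocking key, and the 0.6 ratio cutoff are all hypothetical stand-ins for the much more complicated heuristics real systems use:

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

records = [
    "Acme Corp", "ACME Corporation", "Acme Inc",
    "Widget Works", "Widget Wrks LLC", "Globex",
]

# Step 1: blocking -- a cheap key (first 4 chars, lowercased) buckets
# likely duplicates together, with no pairwise work across buckets.
blocks = defaultdict(list)
for r in records:
    blocks[r.lower()[:4]].append(r)

# Step 2: the expensive comparison runs only *within* each block.
matches = []
for block in blocks.values():
    for a, b in combinations(block, 2):
        if SequenceMatcher(None, a.lower(), b.lower()).ratio() > 0.6:
            matches.append((a, b))
```

The point is that the expensive similarity call never sees pairs from different blocks, so total comparisons scale with block sizes, not n^2.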
It sucks to work on a whole product catalog or Rolodex.
When I wrote this system I didn't even know what ER was; an article like this, even just its first line defining ER, would have helped a lot.
The main idea I wanted to add to the discussion (which is not that crazy of an addition) is that you can possibly use sentence embeddings instead of fuzzy matching on the actual characters to get more "domain expertise".
How to actually compare these embeddings with all the other embeddings you have in a large dataset is a problem that is completely outside the scope of this tutorial.
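To illustrate what character-level fuzzy matching misses (the acronym pair below is my own made-up example), a pure string ratio scores these near zero even though they name the same entity, which is the kind of gap a sentence embedding might close:

```python
from difflib import SequenceMatcher

# Character-level fuzzy matching has no domain knowledge: these strings
# name the same entity but share almost no character structure.
ratio = SequenceMatcher(None, "ibm", "international business machines").ratio()
# An embedding model trained on real-world text *might* place these close
# together; letters alone never will.
```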
The point I was trying to make is that at scale one does not simply:
> compare these embeddings with all the other embeddings you have
You just can't: pairwise similarity metrics (especially cosine) on 768-dim vectors are prohibitively slow at that scale.
Using embeddings is quite common in the literature and in deployment (I have, in fact, deployed ER that uses embeddings), but only as part of #2, the matching step. In many projects, doing full pairwise comparisons would take on the order of years; you have to do something else to refine the comparison sets first.
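A quick back-of-envelope illustrates the "order of years" claim. Both numbers are hypothetical: 10M records, and a generous 1M cosine similarities per second on 768-dim vectors:

```python
# Full pairwise comparison cost at an assumed scale.
n = 10_000_000                    # records (hypothetical)
pairs = n * (n - 1) // 2          # ~5e13 candidate pairs
rate = 1_000_000                  # cosine sims per second (hypothetical)
years = pairs / rate / (3600 * 24 * 365)
```

Even at that throughput it's over a year of pure similarity computation, before any of the actual matching logic runs.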
Very easy to read, and it explains the tradeoffs of the algorithms.
My suspicion is O(kn), where k is both constant and rather large. This would be highly reliant on an extremely good blocking algorithm. Maybe trending towards O(n log n)? I don't know. It's really tough.
The blocking algorithm for this hypothetical solution would most likely be an approximate hashing algorithm sort of resembling a Bloom filter. The input data would have to be amenable to hashing, like a single string field. Add multiple fields and shit gets complicated fast. Look up q-gram blocking if you want a fairly simple example of a blocking algorithm like this.
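A stripped-down sketch of the q-gram idea (real implementations usually require a minimum number of shared grams, or hash subsets of grams, rather than a single shared gram as here):

```python
from collections import defaultdict

def qgrams(s, q=3):
    s = f"##{s.lower()}##"  # pad so string edges generate grams too
    return {s[i:i + q] for i in range(len(s) - q + 1)}

# Records sharing any 3-gram land in the same bucket; pairwise comparison
# then runs only over co-bucketed ids.
records = ["jonathan smith", "johnathan smith", "mary jones"]
index = defaultdict(set)
for rid, name in enumerate(records):
    for g in qgrams(name):
        index[g].add(rid)

candidates = set()
for ids in index.values():
    ids = sorted(ids)
    for i in range(len(ids)):
        for j in range(i + 1, len(ids)):
            candidates.add((ids[i], ids[j]))
```

Note it overgenerates: "mary jones" shares the gram "jon" with the other two, so those pairs survive to the matching step too. That's fine; blocking is recall-oriented.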
I opened a PR to add it to your list
[1] https://github.com/dedupeio/dedupe
[3] http://www.cs.utexas.edu/~ml/papers/marlin-dissertation-06.p...
The problem is still not solved. Having a list of the most similar entities does not tell you which are the same and which are merely related. You need a classifier. For that you can label a few hundred pairs of positive and negative examples and use the same sbert.net library to fine-tune a transformer. The authors take the easier route of thresholding the cosine similarity score at 0.8, but that threshold might not work for your case.
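The thresholding route looks roughly like this (assuming the embeddings are already computed, e.g. with sbert.net; the 0.8 cutoff is the authors' choice, not a universal constant):

```python
import math

def cosine(u, v):
    # cosine similarity of two (precomputed) embedding vectors
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv)

def is_match(u, v, threshold=0.8):
    # hard-coded decision boundary -- the part a trained classifier
    # would replace with one learned from labeled pairs
    return cosine(u, v) >= threshold
```

The classifier route replaces the fixed `>= 0.8` with a learned decision boundary over labeled pairs, which is why it transfers better across datasets.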
To me, this looks very similar to local sequence similarity search (e.g. BLAST), where there are very rapid methods that use tuple-lookup and banded alignment to quickly identify "homologs" (the same entity). The nice thing about similarity searching algorithms is that they give you a very accurate probability of whether two strings are "homologous" (belong to the same entity). Perhaps I have the scale wrong, but it is routine to look for thousands of queries (new entities) among hundreds of millions of sequences (known entities) in an hour or so (and sequences are typically an order of magnitude longer than entity names). The problem is embarrassingly parallel, and very efficient algorithms are available for entities that are usually very similar.
(1) Local sequence (string) similarity scores (scores that allow mismatches but need not extend to the end of either string) follow the extreme value distribution, which makes it very easy to distinguish similarity by chance from similarity due to common ancestry (I think of the names for the same entity as ancestral). This is a very powerful concept that allows one to accurately set false-positive rates (it does not help with false negatives).
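In Karlin-Altschul form (the statistics BLAST uses), that EVD claim becomes a closed-form tail probability. The `K` and `lam` values below are hypothetical placeholders; in practice they're derived from the scoring system:

```python
import math

def p_value(S, m, n, K=0.1, lam=0.3):
    # Expected number of chance hits with score >= S between a query of
    # length m and a database of total length n (the "E-value"), and the
    # probability of seeing at least one such hit.
    E = K * m * n * math.exp(-lam * S)
    return 1.0 - math.exp(-E)
```

This is what lets you set a false-positive rate directly: pick the p-value you can tolerate and solve for the score cutoff.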
(2) BLAST calculates local similarity scores in O(n^2)/K time, where K is very, very large, by first identifying significant conserved "words" using a lookup table, then looking for runs of those words along an alignment diagonal, and then trying to extend that diagonal using an efficient banded dynamic programming algorithm.
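A toy sketch of that first word-lookup stage (everything here is a simplification; real BLAST scores words against a substitution matrix rather than requiring exact matches):

```python
from collections import defaultdict

def seed_diagonals(query, db, k=3):
    # index every k-word of the database string
    index = defaultdict(list)
    for i in range(len(db) - k + 1):
        index[db[i:i + k]].append(i)
    # count query-word hits per diagonal (query_pos - db_pos); diagonals
    # with runs of hits are the ones worth extending with banded DP
    diag_hits = defaultdict(int)
    for q in range(len(query) - k + 1):
        for d in index.get(query[q:q + k], ()):
            diag_hits[q - d] += 1
    return diag_hits
```

The banded dynamic programming step then only explores a narrow band around the promising diagonals instead of the full O(n^2) alignment matrix.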
"entity resolution" as a process should more properly imply determining whether two records that aren't obviously the same entity actually are the same entity, and how you would go about discovering, proving, and declaring that.