Entity Resolution: The most common data science challenge (opens in new tab)

(docs.magniv.io)

85 pointsjonpon4y ago26 comments

26 comments

23 comments · 8 top-level

smeeth4y ago· 8 in thread

Danger awaits all ye who enter this tutorial and have large datasets.

The tutorial is fun marketing material and all but its FAR too slow to be used on anything at scale. Please, for your sanity, don't treat this as anything other than a fun toy example.

ER wants to be an an O(n^2) problem and you have to fight very hard for it not to turn into one. Most people doing this at scale are following basically the same playbook:

1) Use a very complicated speed optimized non-ml algorithm to find groups of entities that are highly likely to be the same, usually based on string similarity, hashing, or extremely complicated heuristics. This process is called blocking or filtering.

2) Use fancy ML to determine matches within these blocks.

If you try to combine 1 & 2, skip 1, or try to do 1 with ML on large datasets you are guaranteed to have a bad time. The difference between mediocre and amazing ER is how well you are doing blocking/filtering.

jandrewrogers4y ago

So much this. I work on spatiotemporal entity resolution and it requires extreme care and cleverness to not turn it into a proverbial "boiling the ocean" compute problem for all but the most trivial cases. The implied state that needs to be managed alone is often effectively intractable. At one time supercomputers were used for large-scale entity resolution problems; they may still.

code_biologist4y ago

The beautiful thing about spatiotemporal entities is that physical time provides "out of the box common sense" decent locality, and physical space excellent locality, for the blocking process your parent comment mentions.

It sucks to work on a whole product catalog or rolodex.

staticassertion4y ago

Cool, TIL about blocking/ filtering. We do entity resolution such that if we have a "process id" (ex: 1234) in two different event logs we can determine if they're the same process (since pids get reused). We don't have to compare every process to every other process, only ones that share the same pid, which drastically reduces the data size, and then we do filtering on top of that based on other optional attributes.

When I wrote this system I didn't know what ER even was, an article like this would have helped a lot, even just the first line defining ER would have helped a lot.

jonponOP4y ago

You are 100% correct, this is a toy example that I decided to put together for fun after talking to a bunch of people who mentioned it as a problem that they experience.

The main idea I wanted to add to the discussion (which is not that crazy of an addition) is that you can possible use sentence embedding instead of fuzzy matching on the actual letters to get more "domain expertise"

How to actually compare these embeddings with all the other embeddings you have in a large dataset is a problem that is completely out of the scope of this tutorial

smeeth4y ago

Yeah, totally, I get you. I'm not trying to do a takedown, ER is just hard.

The point I was trying to make is that at scale one does not simply:

> compare these embeddings with all the other embeddings you have

You just can't, similarity metrics (especially cosine) on 768 dim arrays are prohibitively slow.

Using embeddings is quite common in the literature and in deployment (I have, in fact, deployed ER that uses embeddings), but only as part of #2, the matching step. In many projects, doing full pairwise comparisons would take on the order of years, you have to do something else to refine the comparison sets first.

1 more reply

rkwz4y ago

This is my favourite paper on the different indexing (blocking) techniques - http://users.cecs.anu.edu.au/~Peter.Christen/publications/ch...

Very easy to read and explains the tradeoffs of the algorithms

pfisherman4y ago

Great advice! What is your rough guesstimate - in big O - of a general least upper bound of what you can do with ER?

smeeth4y ago

Caveat: just my personal hunch.

My suspicion is O(kn), where k is both constant and rather large. This would be highly reliant on an extremely good blocking algorithm. Maybe trending towards O(nlogn)? I don't know. It's really tough.

The blocking algorithm for this hypothetical solution would most likely be an approximate hashing algorithm sort of resembling a bloom filter. The input data would have to be amenable to hashing, like a single string field. Add multiple fields and shit just got complicated fast. Look up q-gram blocking if you want to read a sort of simple example of a blocking algorithm like this.

1 more reply

ropeladder4y ago· 3 in thread

Entity resolution/record linkage/deduplication is an oddly specialized domain of knowledge given that it's such a common problem. I put together a page of resources a while back if anyone is interested: https://github.com/ropeladder/record-linkage-resources

chrisoakman4y ago

I gave a talk about Record Linking at Clojure/conj 2019: https://www.youtube.com/watch?v=rGKEOMUtJfE

I opened a PR to add it to your list

ropeladder4y ago

Merged! Thanks for the addition, looks like an interesting talk with some good real-world lessons.

jonponOP4y ago

This is amazing! Thanks for sharing :)

fastaguy884y ago· 2 in thread

Sorry to post on at topic I know nothing about.

To me, this looks very similar to local sequence similarity search (e.g. BLAST), where there are very rapid methods that use tuple-lookup and banded alignment to quickly identify "homologs" (the same entity). The nice thing about similarity searching algorithms is that they give you a very accurate probability of whether two strings are "homologous" (belong to the same entity). Perhaps I have the scale wrong, but it is routine to look for thousands of queries (new entities) among hundreds of millions of sequences (known entities) in an hour or so (and sequences are typically an order of magnitude longer than entity names). The problem is embarrassingly parallel, and very efficient algorithms are available for entities that are usually very similar.

jonponOP4y ago

Very interesting comparison -- do you have a TLDR of how BLAST works or a good resource to read more about it?

fastaguy884y ago

BLAST is so deeply embedded in modern genome biology that it is hard to find short descriptions. Basically, there are two parts:

(1) local sequence (string) similarity scores (scores that allow mismatches, but need not extend to the end of either string) are distributed as the "extreme value distribution", which means that it is very easy to distinguish between similarity by chance and similarity because of common ancestry (I think of the names for the same entity as ancestral). This is a very powerful concept, that allows one to accurately set false-positive rates (it does not help with false negatives)

(2) BLAST calculates local similarity scores at O(n^2)/K where K is very very large by first identifying significant conserved "words" using a lookup table, then looking for runs of those words along an alignment diagonal, and then trying to extend that diagonal using an efficient banded dynamic programming algorithm.

mynegation4y ago· 1 in thread

I remember helping my little sister who got entity resolution (people’s names and company names) homework assignment for programming class 26 years ago (she is economics major and I am CS). That was infuriating and intellectually challenging at the same time. We came up with a combination of n-grams, Levenshtein distance, and common abbreviation (think “Inc.” and “Corp.”) canonicalization. It worked reasonably well.

jonponOP4y ago

The reason why I love this problem is because of this! I feel like there are a lot of fun ways to be creative here, but as the other comments mentioned -- to get a scalable and really good solution is extremely difficult.

benmanns4y ago· 1 in thread

What are people doing with entity resolution/record linkage? At Doximity we use it to match messy physician, hospital, and medical publication data from various sources into more coherent data sets to power profiles and research tools. Mostly with https://dedupe.io/ but with some custom tooling to handle larger (1m+ entities) datasets.

nwatson4y ago

Replacing sensitive entity content with pseudorandom seeded junk for subsequent training and transforms in exposed settings. Not the main use case for ER.

stevesimmons4y ago

A good starting point for Entity Resolution/Deduplication is the Python Dedupe project [1, 2] and the PhD thesis on whose work it is based [3]

[1] https://github.com/dedupeio/dedupe

[2] https://dedupe.io/

[3] http://www.cs.utexas.edu/~ml/papers/marlin-dissertation-06.p...

visarga4y ago

I was expecting the use of sentence-transformers (sbert.net). If you have a long list of entities you could use an approximate similarity search library such as annoy. The authors store the embeddings in a database and decode json for each comparison. Very inefficient in my opinion. At least load the whole table of embeddings in a np.array from the start, np.dot is plenty fast if your list not huge.

The problem is still not solved. Having a list of the most similar entities does not tell you which are similar and which are just related. You need a classifier. For that you can label a few hundred pairs of positive and negative examples and use the same sbert.net to finetune a transformer. The authors use the easier route of thresholding cosine similarity score at 0.8, but this threshold might not work for your case.

markjspivey4y ago

this doesn't make any sense though, your write up starts with an assumption that multiple records in the dataset are "obviously" the same entity ... so we wouldn't even need entity resolution then ...

"entity resolution" as a process should moreso imply how to determine whether two records that aren't obviously the same entity are actually the same entity ... and how you would go about discovering and proving and declaring that ...

j / k navigate · click thread line to collapse

26 comments

23 comments · 8 top-level

smeeth4y ago· 8 in thread

Danger awaits all ye who enter this tutorial and have large datasets.

The tutorial is fun marketing material and all but its FAR too slow to be used on anything at scale. Please, for your sanity, don't treat this as anything other than a fun toy example.

ER wants to be an an O(n^2) problem and you have to fight very hard for it not to turn into one. Most people doing this at scale are following basically the same playbook:

2) Use fancy ML to determine matches within these blocks.

jandrewrogers4y ago

code_biologist4y ago

It sucks to work on a whole product catalog or rolodex.

staticassertion4y ago

When I wrote this system I didn't know what ER even was, an article like this would have helped a lot, even just the first line defining ER would have helped a lot.

jonponOP4y ago

You are 100% correct, this is a toy example that I decided to put together for fun after talking to a bunch of people who mentioned it as a problem that they experience.

How to actually compare these embeddings with all the other embeddings you have in a large dataset is a problem that is completely out of the scope of this tutorial

smeeth4y ago

Yeah, totally, I get you. I'm not trying to do a takedown, ER is just hard.

The point I was trying to make is that at scale one does not simply:

> compare these embeddings with all the other embeddings you have

You just can't, similarity metrics (especially cosine) on 768 dim arrays are prohibitively slow.

1 more reply

rkwz4y ago

This is my favourite paper on the different indexing (blocking) techniques - http://users.cecs.anu.edu.au/~Peter.Christen/publications/ch...

Very easy to read and explains the tradeoffs of the algorithms

pfisherman4y ago

Great advice! What is your rough guesstimate - in big O - of a general least upper bound of what you can do with ER?

smeeth4y ago

Caveat: just my personal hunch.

1 more reply

ropeladder4y ago· 3 in thread

chrisoakman4y ago

I gave a talk about Record Linking at Clojure/conj 2019: https://www.youtube.com/watch?v=rGKEOMUtJfE

I opened a PR to add it to your list

ropeladder4y ago

Merged! Thanks for the addition, looks like an interesting talk with some good real-world lessons.

jonponOP4y ago

This is amazing! Thanks for sharing :)

fastaguy884y ago· 2 in thread

Sorry to post on at topic I know nothing about.

jonponOP4y ago

Very interesting comparison -- do you have a TLDR of how BLAST works or a good resource to read more about it?

fastaguy884y ago

BLAST is so deeply embedded in modern genome biology that it is hard to find short descriptions. Basically, there are two parts:

mynegation4y ago· 1 in thread

jonponOP4y ago

benmanns4y ago· 1 in thread

nwatson4y ago

Replacing sensitive entity content with pseudorandom seeded junk for subsequent training and transforms in exposed settings. Not the main use case for ER.

stevesimmons4y ago

A good starting point for Entity Resolution/Deduplication is the Python Dedupe project [1, 2] and the PhD thesis on whose work it is based [3]

[1] https://github.com/dedupeio/dedupe

[2] https://dedupe.io/

[3] http://www.cs.utexas.edu/~ml/papers/marlin-dissertation-06.p...

visarga4y ago

markjspivey4y ago

j / k navigate · click thread line to collapse