Thanks for the detailed reply. :-)
> As for the aggregation (grouping) algorithm, I'll just say that it's straight out of the textbook http://infolab.stanford.edu/~ullman/mmds/ch3.pdf
So, in other words, you're using MinHash together with locality-sensitive hashing (LSH)? How much data are you able to process, and in how much time?
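
To be concrete about what I mean by MinHash plus LSH banding, here is a toy sketch of the chapter 3 approach as I understand it. It is not your implementation: the function names, the 5-character shingles, the BLAKE2-based hash family, and the 20 bands of 5 rows are all my own assumptions for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=5):
    """Return the set of k-character shingles of text (k=5 is an assumption)."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def _hash(seed, shingle):
    """Deterministic 64-bit hash of a shingle; the seed picks the hash function."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(shingle_set, num_hashes=100):
    """MinHash signature: the minimum hash value under each seeded hash function.

    Assumes shingle_set is non-empty (min() raises ValueError otherwise).
    """
    return [min(_hash(seed, s) for s in shingle_set) for seed in range(num_hashes)]


def lsh_candidate_pairs(signatures, bands=20, rows=5):
    """Banding step of LSH: documents that agree on every row of at least
    one band land in the same bucket and become candidate pairs.

    signatures maps a document name to a signature of length bands * rows.
    """
    buckets = defaultdict(set)
    for name, sig in signatures.items():
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(name)

    pairs = set()
    for names in buckets.values():
        for pair in combinations(sorted(names), 2):
            pairs.add(pair)
    return pairs
```

Candidate pairs would still need a final check against the actual Jaccard similarity, since the banding step only narrows down which pairs are worth comparing.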
By the way, I first learned about this topic through Stanford’s “Mining of Massive Datasets” (MMDS) course that used to be free on Coursera. So it's thrilling to see someone put it to use in the real world and talk about it, too! :-)