More info about synonyms at Google (2010) | Better HN

bms2297OP2y ago· 13 in thread

One of the most important components of Pre-2010 Google's search system was its synonym discovery mechanism. Simply put, queries would be "expanded" with synonyms. Google automatically generated synonym choices that took into account the context of surrounding words, with the understanding that synonyms are highly context dependent. Steven Baker, John Lamping, and a couple of others were key engineers of the system.

Does anyone with a NLP background care to take some guesses on how the synonym extraction methodology worked? My only piece of information is that it likely used the query log itself to do so.

evmar2y ago

I was on the team too, with less impact than the names you mentioned. The team filed a number of patents that describe parts of how it worked. You can query a patent search engine with terms like [baker synonyms]. Looking now I think Steve was on most of the patents and you can also gather adjacent coauthor names from there.

[I am not a fan of patents, but to the extent they have any positives they in principle serve to share knowledge about how inventions work. Also I am not a lawyer but I think patents last 20 years from filing date and these were filed ~20 years ago maybe?]

robrenaud2y ago

I got a couple patents while at Google. I sent a nice readable 4 page design doc that I wrote to a patent lawyer, and I got back 40 pages of nonsense that I basically didn't understand.

I wish there was some kind of readability requirement for patents, if they are to continue to exist.

bms2297OP2y ago

Very cool that you worked on it! I've found most, I think, of the patents. They are.... as has been hinted at in this thread, very difficult to parse through and (imo) don't actually reveal much, though I may just lack the expertise! That's why I was hoping to get some NLP folks to speculate!

Your point on dates is something I did want to call out - I wouldn't be asking this if it wasn't ancient history. I have no interest in doing anything sinister. Just trying to explore a fun part of Internet history. Any shot I could shoot you an email to chat?

RaoulP2y ago

I just realised that this technique is absent from local/desktop search. Meaning that in most systems you’re expected to recall how something was phrased, if you want to have a chance of finding it.

I know “Google Desktop” used to be a product years ago. What’s the state of that space today?

> in principle serve to share knowledge about how inventions work

Emphasis on "in principle". Most patents - especially software patents - are completely unintelligible. They also tend to describe the system well enough that the holder can sue people who do the same thing, but nowhere near well enough that you could actually implement it based on the patent.

choppaface2y ago

Also under-rated feature of 2010-era search was Matt Cutts, author of the article. He was an outlier at Google in that he did real community engagement as well as anti-spam, which is a huge contrast to today’s Google and how the internet has reacted to present-day SEO.

While the Matt Cutts era search tech is interesting, it’s crucial to keep in mind that the dataset was very different then too as a result of Matt Cutts’ own attitude towards spam and SEO.

Back in 2010 LDA was big and Google had used probabilistic networks e.g. Rephil / large noisy-OR networks as models

https://uh.edu/nsm/computer-science/events/seminars/2016/110...

Would the same things work today given how SEO spam and Google ads work? The same models are probably useful but it’s the noise and the long tail of the data that makes the problem hard.

bpiche2y ago

I was a fan of LDA but would not agree that it is 'probably useful' today. It's an unsupervised clustering algorithm based on Gibbs sampling. Like k-means, it's gonna return a few buckets that will have to be reviewed by a human for data exploration. In this case instead of neatly labeled buckets, these are unlabeled distributions of distributions (lists of single word tokens). If you do some kind of multiword tokenization preprocessing, it'll return a few lists of words and multiword tokens for each document. How is this useful to an end user? Even internally, they're not useful embeddings/vectorizations. Would love to hear some contrary opinions

jjtheblunt2y ago

Around 2001 I was using WordNet to do the same in my Motorola Labs days.

https://en.wikipedia.org/wiki/WordNet

Wordnet is insufficient for disambiguation, right? That’s why you need the query log.

cdavid2y ago

If you have access to the query log (aka "who makes which query in what context"), you can see which queries are "close" to others in context.

For example, with session, you can detect manual query rewriting, and use this as a signal to see which queries are close to others in the time context. You can do various fancy things from just that.
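A minimal sketch of the session idea, in Python. It is not how Google did it; it only illustrates the signal. It assumes each session is a list of `(timestamp_seconds, query)` tuples, and the function name `rewrite_pairs` and the 60-second window are made up for illustration. It counts cases where a user retyped a query with exactly one word swapped, and keeps the unchanged words as context.

```python
from collections import Counter


def rewrite_pairs(sessions, max_gap_seconds=60):
    """Count one-word substitutions between consecutive queries in a session.

    sessions: list of sessions, each a list of (timestamp_seconds, query).
    Returns a Counter keyed by (old_word, new_word, context_words).
    The 60-second default is an arbitrary assumption, not a known value.
    """
    counts = Counter()
    for session in sessions:
        for (t1, q1), (t2, q2) in zip(session, session[1:]):
            # Only treat quick follow-ups as manual rewrites.
            if t2 - t1 > max_gap_seconds:
                continue
            a, b = q1.lower().split(), q2.lower().split()
            # Keep it simple: same length, exactly one position differs.
            if len(a) != len(b):
                continue
            diffs = [(x, y) for x, y in zip(a, b) if x != y]
            if len(diffs) != 1:
                continue
            old, new = diffs[0]
            context = tuple(x for x, y in zip(a, b) if x == y)
            counts[(old, new, context)] += 1
    return counts


sessions = [
    [(0, "cheap car rental"), (20, "cheap auto rental")],
    [(0, "cheap car rental"), (10, "cheap auto rental")],
    [(0, "car wash"), (500, "auto wash")],  # too slow, ignored
]
pairs = rewrite_pairs(sessions)
```

Keeping the context in the key is what would let "car" and "auto" count as substitutes next to "rental" without assuming the same holds next to other words.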

Nowadays, a simple way to start would be to use SOTA LLMs to generate synonyms offline, and use this for query expansion at query time. At least in a context where queries are small, that should give decent results. This has however diminishing returns because of cost (the more synonyms the more expensive querying the index), and also you lose precision with diminishing returns on increased recall.
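To make the cost/precision trade-off concrete, here is a small sketch of query-time expansion from a precomputed synonym table. The table contents, the function names, and the `max_per_term` cap are all assumptions for illustration; how the table gets built (LLM or otherwise) is out of scope.

```python
def expand_query(query, synonyms, max_per_term=2):
    """Turn a query into a list of OR-groups, one group per term.

    synonyms: dict mapping a term to a ranked list of alternatives,
    assumed to be generated offline. max_per_term caps how many are
    used, since each extra alternative adds index lookups and tends
    to cost some precision.
    """
    groups = []
    for term in query.lower().split():
        alternatives = synonyms.get(term, [])[:max_per_term]
        groups.append([term] + alternatives)
    return groups


def to_boolean(groups):
    """Render the OR-groups as a boolean query string."""
    return " AND ".join("(" + " OR ".join(group) + ")" for group in groups)


# Hypothetical offline table.
synonyms = {"car": ["auto", "automobile", "vehicle"]}
groups = expand_query("cheap car", synonyms)
boolean_query = to_boolean(groups)
```

Raising `max_per_term` is the knob that trades recall against cost and precision.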

Ofc, for complex search like google, I am sure it is much more complicated

bms2297OP2y ago

Re: LLMs, I was trying to better understand how pre-LLM search worked, hence the interest in the topic.

Any chance you have any open source links that discuss how you practically operate a system based on the concept you describe (manual query rewrite w/i a session as your data set)? Perhaps it's obvious to an NLP person how to reduce that "idea" to practice, but it is not to me!

You're definitely right about the idea though - a former Search engineer obliquely mentioned that this sort of session based manual query rewriting was very core to how the synonym system worked.

bpiche2y ago

Maybe pointwise mutual information (pmi)

bms2297OP2y ago

Say more!? :)
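For reference, PMI compares how often two terms show up together with how often you would expect that if they were independent: PMI(x, y) = log2( p(x, y) / (p(x) p(y)) ). A rough sketch in Python, assuming the unit of co-occurrence is a session represented as a list of terms (that choice, and the name `pmi_table`, are assumptions, not anything known about the real system):

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(sessions):
    """Compute PMI for every pair of terms that co-occur in a session.

    sessions: list of lists of terms. Probabilities are estimated as
    the fraction of sessions containing the term (or both terms).
    """
    n = len(sessions)
    term_counts = Counter()
    pair_counts = Counter()
    for terms in sessions:
        unique = sorted(set(terms))  # sorted so pair keys are stable
        term_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    return {
        (x, y): math.log2(
            (count / n) / ((term_counts[x] / n) * (term_counts[y] / n))
        )
        for (x, y), count in pair_counts.items()
    }


sessions = [
    ["car", "auto"],
    ["car", "auto"],
    ["dog", "puppy"],
    ["car", "dog"],
]
table = pmi_table(sessions)
```

Positive PMI means the pair co-occurs more than chance would predict, negative means less. With counts this small the estimates are very noisy, which is one reason a large query log matters.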

dkjaudyeqooe2y ago· 10 in thread

> A lot of people seem to think that Google only does simple-minded matching of the users’ keywords with words that we indexed

Oh what a dream if that were true! Instead, every year, the 'synonyms' get broader and broader. To me it looks like they're using synonyms of synonyms (of synonyms).

One thing is surely true: Google abhors a vacuum, they will show you results, no matter how tenuously connected to your query.

> Oh what a dream if that were true! Instead, every year, the 'synonyms' get broader and broader. To me it looks like they're using synonyms of synonyms (of synonyms).

Forget about synonyms of synonyms - I've seen antonyms bolded as matches in google search results. I have to imagine I'm not the only one.

userbinator2y ago

I've also seen a few cases where it decided to cross out the word "not", basically inverting all the results.

Viliam12342y ago

Searching "how to do X on Linux" reliably returns "how to do X on Windows" and vice versa.

There seems to be a part of the algorithm smart enough to figure out that both Linux and Windows are operating systems, but not smart enough to realize that the difference usually matters a lot for the person asking the question.

marginalia_nu2y ago

While I agree that Google's query interpretation as it works right now is annoying and frustrating more than it is helpful, that's largely owing to how inscrutable it is. If you search for cats and get dogs, it's not obvious why that happened and not obvious how to prevent it from happening. That problem is in no way intrinsic to synonym generation, but likely a result of leaning too much into embeddings to do the heavy lifting.

That said, skipping synonyms and other query variant generation is definitely throwing the baby out with the bathwater. When it works well it massively increases the recall of the search engine at very little loss of precision, which is important given the scale and noisiness of web results.

dkjaudyeqooe2y ago

> That said, skipping synonyms and other query variant generation is definitely throwing the baby out with the bathwater.

I disagree, I want control back. They've neutered almost every query narrowing option they used to have. They don't care anymore they just want to give you results.

I'm not saying the, let's call it 'query automation', is always bad, I just want the option to turn it off to some degree. Apparently we can no longer be trusted with that.

I don't know why this is downvoted, it's overstating things a little but there's some truth to it for sure. The other day I was googling for the name of the \mid LaTeX math symbol. It's basically a pipe with some space around it, a separator line useful in definitions.

So I tried to google various variations of "separator line latex math" but it always synonymed them to include "line break latex", which is obviously "\\" and not what I want.

I've found Google totally useless for all things LaTeX and resorted to using LLMs.

dkjaudyeqooe2y ago

Some people can't handle, or don't recognize, hyperbole.

It was written, and was meant to be read in, an exasperated and slightly dramatic tone. I thought that was obvious from the opening sentence, but apparently not.

wizzwizz42y ago

Fwiw, the usual method for finding LaTeX symbols is Detexify: https://detexify.kirelabs.org/classify.html

For what it’s worth Kagi gets what you wanted (I think) in the first four results for your first search.

mmastrac2y ago· 1 in thread

Given this was published in 2010, and https://en.wikipedia.org/wiki/Word2vec was published in 2013, perhaps this was an early precursor?

From the article linked from this blog post: "Enabling computers to understand language remains one of the hardest problems in artificial intelligence."

I worked on this task for a year and it doesn't work very well, because in embedding space relatedness, synonymy and antonymy are mixed up and require pairwise thresholding. You can probably get to 90% but not 99% this way. Better to use a crossentropy approach.
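A toy illustration of the thresholding problem, in Python. The vectors below are invented by hand to mimic the effect described (an antonym sitting close to the query word); they are not from any real embedding model, and `synonym_candidates` is a hypothetical helper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def synonym_candidates(word, vectors, threshold):
    """Return words whose cosine similarity to `word` meets the threshold."""
    target = vectors[word]
    return sorted(
        other
        for other, vec in vectors.items()
        if other != word and cosine(target, vec) >= threshold
    )


# Hand-made toy vectors, NOT real embeddings.
TOY_VECTORS = {
    "hot": [0.9, 0.1, 0.3],
    "warm": [0.8, 0.2, 0.3],
    "cold": [0.85, 0.1, -0.2],
    "banana": [0.0, 1.0, 0.1],
}

loose = synonym_candidates("hot", TOY_VECTORS, threshold=0.8)
strict = synonym_candidates("hot", TOY_VECTORS, threshold=0.95)
```

In this toy setup the looser cutoff lets "cold" through alongside "warm", and only the stricter one drops it. A single global cutoff that works for every word pair is the hard part.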

In modern RAG applications we return top-k results for this reason - retrieval can't reliably give the correct snippet in one result, so the hard part of making sense of what is useful and what is not is left to the LLM.

1970-01-012y ago· 1 in thread

Not ignoring the deliberate use of verbatim operators (text in quotes), and delivering verbatim results, just as they were doing a decade ago, would be a fantastic improvement. I've found this problem in every search engine, and it's infuriating. Quick example:

    "A+Z" becomes "A-Z":

https://www.google.com/search?q="A%2BZ"

     "(dog)" becomes "dog"

https://search.yahoo.com/search?p="(dog)"

https://www.google.com/search?&q="(dog)"

https://www.bing.com/search?q="(dog)"

https://yandex.com/search/?text="(dog)"

https://kagi.com/search?q=%22%28dog%29%22

https://www.searchenginewatch.com/2011/11/18/google-introduc...

48864w6ui2y ago

Nearly all customer facing computing is now optimized for people who don't know how computers work, not for people who do.

fuzzy_biscuit2y ago

Oh man, I miss the days when Matt Cutts was the de facto search liaison at Google. When I was doing agency SEO, I read his posts and followed him with fervor.

e____g2y ago

(2010)
