Does anyone with an NLP background care to take some guesses on how the synonym extraction methodology worked? My only piece of information is that it likely used the query log itself to do so.
[I am not a fan of patents, but to the extent they have any positives they in principle serve to share knowledge about how inventions work. Also I am not a lawyer but I think patents last 20 years from filing date and these were filed ~20 years ago maybe?]
I wish there was some kind of readability requirement for patents, if they are to continue to exist.
Your point on dates is something I did want to call out - I wouldn't be asking this if it wasn't ancient history. I have no interest in doing anything sinister. Just trying to explore a fun part of Internet history. Any shot I could shoot you an email to chat?
I know “Google Desktop” used to be a product years ago. What’s the state of that space today?
Emphasis on "in principle". Most patents - especially software patents - are completely unintelligible. They also tend to describe the system enough that they can sue people that do the same thing, but nowhere near enough that you could actually implement it based on the patent.
While the Matt Cutts era search tech is interesting, it’s crucial to keep in mind that the dataset was very different then too as a result of Matt Cutts’ own attitude towards spam and SEO.
Back in 2010, LDA was big, and Google had used probabilistic networks, e.g. Rephil / large noisy-OR networks, as models.
https://uh.edu/nsm/computer-science/events/seminars/2016/110...
Would the same things work today given how SEO spam and Google ads work? The same models are probably useful but it’s the noise and the long tail of the data that makes the problem hard.
For example, with session data, you can detect manual query rewriting, and use this as a signal to see which queries are close to others in the time context. You can do various fancy things from just that.
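To make the session idea a bit more concrete, here is a minimal sketch, not a description of what Google actually did. It assumes a hypothetical log schema of (session_id, unix_timestamp, query) tuples, and it treats two consecutive queries in a session as a manual rewrite when they are close in time and differ in exactly one token. The function name, the 60-second gap, and the count threshold are all made up for illustration.

```python
from collections import Counter, defaultdict


def mine_rewrites(log, max_gap_s=60, min_count=2):
    """Count single-token substitutions between consecutive queries in a session.

    log: iterable of (session_id, unix_ts, query) tuples (hypothetical schema).
    Returns {(term_a, term_b): count} for pairs seen at least min_count times.
    """
    # Group tokenized queries by session.
    sessions = defaultdict(list)
    for session_id, ts, query in log:
        sessions[session_id].append((ts, query.lower().split()))

    counts = Counter()
    for events in sessions.values():
        events.sort(key=lambda e: e[0])  # order each session by time
        for (t1, q1), (t2, q2) in zip(events, events[1:]):
            # Only compare queries that are close in time and equally long.
            if t2 - t1 > max_gap_s or len(q1) != len(q2):
                continue
            diffs = [(a, b) for a, b in zip(q1, q2) if a != b]
            if len(diffs) == 1:
                # Unordered pair, so "car -> auto" and "auto -> car" pool together.
                counts[frozenset(diffs[0])] += 1

    return {tuple(sorted(pair)): n for pair, n in counts.items() if n >= min_count}


example_log = [
    ("s1", 0, "cheap car rental"),
    ("s1", 20, "cheap auto rental"),
    ("s2", 0, "auto insurance quotes"),
    ("s2", 10, "car insurance quotes"),
    ("s3", 0, "red shoes"),
    ("s3", 500, "blue shoes"),  # too far apart in time to count as a rewrite
]
candidates = mine_rewrites(example_log)
```

A real system would presumably need far more than this (click signals, context-dependent synonyms, filtering out substitutions like "red" vs "blue" that are related but not interchangeable), but the counting step is the core of the signal.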
Nowadays, a simple way to start would be to use SOTA LLMs to generate synonyms offline, and use this for query expansion at query time. At least in a context where queries are small, that should give decent results. This has diminishing returns, however: cost grows with the number of synonyms (the more synonyms, the more expensive it is to query the index), and you lose precision while each added synonym buys less additional recall.
Ofc, for a complex search engine like Google, I am sure it is much more complicated.
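A rough sketch of the query-time half of that, assuming the synonym table was already generated offline (the hard-coded dict below stands in for LLM output and is purely illustrative). The `max_per_term` cap is a hypothetical knob for the cost/precision trade-off: fewer alternatives per term means a cheaper, more precise query, more alternatives means higher recall.

```python
def expand_query(query, synonyms, max_per_term=2):
    """Return one OR-group per query term: the term plus up to max_per_term synonyms."""
    groups = []
    for term in query.lower().split():
        alternatives = synonyms.get(term, [])[:max_per_term]
        groups.append([term] + alternatives)
    return groups


def to_boolean(groups):
    """Render OR-groups as a generic boolean query string (not any engine's real syntax)."""
    return " AND ".join(
        "(" + " OR ".join(g) + ")" if len(g) > 1 else g[0] for g in groups
    )


# Stand-in for a table produced offline, e.g. by prompting an LLM per vocabulary term.
offline_synonyms = {
    "car": ["auto", "automobile", "vehicle"],
    "cheap": ["inexpensive"],
}

groups = expand_query("cheap car rental", offline_synonyms, max_per_term=2)
boolean_query = to_boolean(groups)
```

Here "vehicle" is dropped by the cap, which is the kind of broad synonym that tends to cost precision.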
Any chance you have any open source links that discuss how you practically operate a system based on the concept you describe (manual query rewrite w/i a session as your data set)? Perhaps it's obvious to an NLP person how to reduce that "idea" to practice, but it is not to me!
You're definitely right about the idea though - a former Search engineer obliquely mentioned that this sort of session based manual query rewriting was very core to how the synonym system worked.
Oh what a dream if that were true! Instead, every year, the 'synonyms' get broader and broader. To me it looks like they're using synonyms of synonyms (of synonyms).
One thing is surely true: Google abhors a vacuum, they will show you results, no matter how tenuously connected to your query.
Forget about synonyms of synonyms - I've seen antonyms bolded as matches in Google search results. I have to imagine I'm not the only one.
There seems to be a part of the algorithm smart enough to figure out that both Linux and Windows are operating systems, but not smart enough to realize that the difference usually matters a lot for the person asking the question.
That said, skipping synonyms and other query variant generation is definitely throwing the baby out with the bathwater. When it works well it massively increases the recall of the search engine at very little loss of precision, which is important given the scale and noisiness of web results.
I disagree, I want control back. They've neutered almost every query narrowing option they used to have. They don't care anymore they just want to give you results.
I'm not saying the, let's call it, 'query automation' is always bad, I just want the option to turn it off to some degree. Apparently we can no longer be trusted with that.
So I tried to google various variations of "separator line latex math" but it always synonymed them to include "line break latex", which is obviously "\\" and not what I want.
It was written, and was meant to be read in, an exasperated and slightly dramatic tone. I thought that was obvious from the opening sentence, but apparently not.
From the article linked from this blog post: "Enabling computers to understand language remains one of the hardest problems in artificial intelligence."
In modern RAG applications we return top-k results for this reason - it can't simply give the correct snippet in one result, leaving the hard part to the LLM to make sense of what is useful and what is not.
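For anyone unfamiliar with the top-k step, a toy sketch: score every snippet against the query and hand the k best to the LLM rather than betting on a single hit. Real RAG systems typically score with embeddings; the Jaccard token overlap here is a swapped-in stand-in so the example stays self-contained, and the snippets are made up.

```python
import heapq


def top_k(query, snippets, k=3):
    """Return the k snippets with the highest token-overlap (Jaccard) score."""
    q = set(query.lower().split())

    def score(snippet):
        t = set(snippet.lower().split())
        union = q | t
        return len(q & t) / len(union) if union else 0.0

    return heapq.nlargest(k, snippets, key=score)


snippets = [
    "install latex on linux",
    "latex line break",
    "horizontal separator line in latex math",
    "cooking pasta",
]
results = top_k("separator line latex math", snippets, k=2)
```

Note that the near-miss ("latex line break") still comes back in second place, which is exactly why sorting out what is useful gets left to the LLM.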
"A+Z" becomes "A-Z":
https://www.google.com/search?q="A%2BZ"
"(dog)" becomes "dog":
https://search.yahoo.com/search?p="(dog)"
https://www.google.com/search?&q="(dog)"
https://www.bing.com/search?q="(dog)"
https://yandex.com/search/?text="(dog)"
https://kagi.com/search?q=%22%28dog%29%22
https://www.searchenginewatch.com/2011/11/18/google-introduc...