Disturbing Proof They're Quietly Deleting the Internet

(youtube.com)

14 points · kk6mrp · 3y ago · 16 comments

16 comments · 5 top-level

salawat · 3y ago · 4 in thread

The thing about an actual index is it's reversible, and transparent. You've digested a bunch of information, and if I want to see the least relevant result by some query, that should be doable.

I would not be surprised if the folks at Google Search have forgotten that one of the first tenants of organizing information is not to omit it.

Either that, or they implemented their delist functionality as "force relevance zero, and truncate results before you get there".
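For anyone wondering what that kind of delisting would look like, here is a minimal sketch. It is pure speculation about the mechanism described above, not anything Google has documented; `rank`, `delisted`, and `cutoff` are made-up names.

```python
def rank(scores, delisted, cutoff=0.0):
    """Return doc ids by descending score, hiding anything at or below cutoff.

    Hypothetical illustration: a delisted doc is not removed from the index,
    its relevance is forced to zero and the result list is cut off before it.
    """
    adjusted = {doc: (0.0 if doc in delisted else s) for doc, s in scores.items()}
    # Sort by score descending, then by id so ties are stable.
    ranked = sorted(adjusted, key=lambda d: (-adjusted[d], d))
    return [d for d in ranked if adjusted[d] > cutoff]


scores = {"a": 0.9, "b": 0.4, "c": 0.7}
print(rank(scores, delisted={"c"}))   # "c" is still indexed, just never shown
print(rank(scores, delisted=set()))
```

Under this scheme the document is still in the index, but no query can ever page far enough to reach it.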

PaulHoule · 3y ago

Google doesn't think that way at all.

Back when Gerard Salton was writing the first papers on IR he had a set of 60 or so documents and kept his index on a deck of punched cards.

With a small set of documents the main problem you run into is that some of the relevant documents don't use the exact words in the query so you might miss most of the relevant results.

With Google, on the other hand, you could have millions or billions of relevant documents, and the challenge is to do so well on the first few results that odds are good (say 70% -- this is limited by the ambiguity of the query) that the first result is a "direct hit".

If you are answering questions like that in a huge distributed system the query process probably looks like a set of funnels that feed answers into funnels that feed answers into funnels. If you want to answer questions quickly at low cost the best you can do is kill low relevance documents as quickly as possible.
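A toy sketch of the funnel idea, assuming a two-stage setup: every shard keeps only its best few candidates using a cheap score, and only the survivors get the expensive score. The scoring functions and the `per_shard` / `final` parameters are stand-ins invented for illustration, not how any real engine ranks.

```python
import heapq


def cheap_score(query_terms, doc_terms):
    # Stage-1 stand-in: how many query terms appear in the document.
    return len(query_terms & doc_terms)


def expensive_score(query_terms, doc_terms):
    # Stage-2 stand-in for a costly ranker: Jaccard overlap.
    union = query_terms | doc_terms
    return len(query_terms & doc_terms) / len(union) if union else 0.0


def funnel(query, shards, per_shard=2, final=3):
    q = set(query.lower().split())
    survivors = []
    for shard in shards:
        scored = [
            (cheap_score(q, set(text.lower().split())), doc_id, text)
            for doc_id, text in shard.items()
        ]
        # Kill zero-relevance documents immediately, keep a few per shard.
        scored = [s for s in scored if s[0] > 0]
        survivors.extend(heapq.nlargest(per_shard, scored))
    # Only the survivors ever see the expensive ranker.
    reranked = [
        (expensive_score(q, set(text.lower().split())), doc_id)
        for _, doc_id, text in survivors
    ]
    reranked.sort(key=lambda t: (-t[0], t[1]))
    return [doc_id for _, doc_id in reranked[:final]]


shards = [
    {"d1": "punched card index", "d2": "patent search engine", "d3": "cooking pasta"},
    {"d4": "search engine index design", "d5": "engine repair"},
]
print(funnel("search engine index", shards))
```

Note that nothing in this design ever produces an exact count of matching documents; anything dropped at stage 1 is simply never looked at again.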

I worked on a search engine for patents that had multiple nodes and could get slightly different answers from different nodes because each node had a neural net for semantic indexing of its contents. You might have the system report that there were 15,091 relevant results one time, another time it might be 15,094. Management thought that our customers would lose confidence in our search because of this so we implemented something that made the selection of nodes used for a particular query deterministic, which hurt the scalability of our search.
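The comment above doesn't say how the deterministic node selection was done. One common way to get that property is to hash the query together with each node name and always take the same top few (rendezvous hashing). A sketch under that assumption, with `pick_nodes` and the node names invented for the example:

```python
import hashlib


def pick_nodes(query, nodes, k=2):
    """Pick k nodes for a query so the same query always hits the same nodes.

    Assumed technique (rendezvous hashing), not the patent engine's actual code.
    """
    def weight(node):
        # Stable across processes and machines, unlike Python's built-in hash().
        return hashlib.sha256(f"{node}|{query}".encode("utf-8")).hexdigest()

    return sorted(nodes, key=weight)[:k]


nodes = ["node-a", "node-b", "node-c", "node-d"]
first = pick_nodes("solar panel mounting bracket", nodes)
second = pick_nodes("solar panel mounting bracket", list(reversed(nodes)))
print(first == second)
```

The cost is the one described above: a query is pinned to specific nodes, so the system can no longer send it to whichever nodes happen to be least busy.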

salawat · 3y ago

Given that the distributed architecture for massively parallel processing adds complications.

Given that neural nets add in levels of "Oh, what the hell?" That explains a lot actually about the uselessness of Google results without getting really creative.

I still grade Google as failing to produce a semantically valid index that checks off the usability criteria of an actual index.

If I have a corpus of data, I want to be able to examine the structure, even if only through a window. Way back before they threw neural nets at things (if that is what they are doing now), you could actually get a sense for it.

I used to love trawling Google results into the hundreds of pages, because you'd get real feedback w.r.t. the effects of additional predicates on your query. It was more of a data processing and query refinement exercise than "throw it at their ML models and hope they decide to be useful today."

Organizing information isn't just about vomiting results... It's about imposing enough structure that people can help themselves. It's like architecture: a poorly planned building, or an excellently planned building optimized for discomfort, is a hell on Earth.

One that actually reflects and accommodates the natural flow and needs of the occupants/users is a joy for all to behold.

Google had that. Now it doesn't, and it's increasingly difficult to get the darn thing to stop playing Bayesian/gradient descent/backprop buggers and just show me stuff that matches what I asked for in the Boolean sense, and don't you dare tell me there are only 13 bloody results.

There is search, then there is Search. I prefer the latter.
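To make "matches what I asked for in the Boolean sense" concrete, here is a minimal sketch of an inverted index with a strict AND query. It is a textbook illustration, not a claim about how Google ever worked; `build_index` and `boolean_and` are names made up for the example.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to the set of doc ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def boolean_and(index, query):
    """Return every doc id containing all query terms, nothing more, nothing less."""
    terms = query.lower().split()
    if not terms:
        return []
    # .get avoids creating empty entries in the defaultdict for unknown terms.
    postings = [index.get(t, set()) for t in terms]
    return sorted(set.intersection(*postings))


docs = {
    1: "google search index",
    2: "search engine",
    3: "index of patents",
}
index = build_index(docs)
hits = boolean_and(index, "search index")
print(hits, len(hits))
```

With this kind of index the result count is exact by construction, and adding a predicate can only ever shrink the result set, which is the feedback loop described above.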

hammyhavoc · 3y ago

The word is "tenet", not "tenant".

salawat · 3y ago

Internal hash table collisions for English are a real wench at times.

mrguyorama · 3y ago · 4 in thread

Google has openly stated the first "number of results" numbers are literally made up, and have no relevance to actual things. It was way too expensive to calculate for every query when the vast majority of them ended after the first page, so they don't even try.

Nothing is being "deleted", especially not actively. This is also why you shouldn't use "number of results" as data in research: it is meaningless.

This video is worse than misinformation and clickbait.

mrguyorama · 3y ago

I can't edit anymore but to anyone looking, I found stackexchange references to un-named googlers and other third hand accounts. I am wrong, Google is NOT open about this fact, but I am still confident in stating you cannot use the "estimated results" for anything.

My bet is that you can compare the original "estimated" results number to the actual number google gives you at page tenish for thousands of search terms and queries and find no relation between the two.

If you are still concerned about things google is ACTUALLY fucking with when it comes to search results, check out the Mozilla organization's research into the matter.

pauldenton · 3y ago

So Google making up the information they present to users is not Misinformation and clickbait, but this youtube video is.

ninju · 3y ago

citation please

hammyhavoc · 3y ago

Yes, this requires a source. I've never heard anything like it, especially considering all I hear from Google about AI.

blinded · 3y ago · 2 in thread

The good stuff stays, the not-so-good gets deleted. This isn't an episode of Hoarders, is it?

anon22334556 · 3y ago

I think the point is “what is the good stuff”

If it says 6.8 billion but only had 448 total…

I’ve had an issue in google where a study I found in 2012 is no longer available in 2022 even in their search

hammyhavoc · 3y ago

And the link to that study is what? Because the domain may have expired, data may have been pulled, company may have gone bust, DNS may be wrong, they may have been hacked and started serving malicious content. I could go on for about 30 minutes as to what might have happened to it.

PaulHoule · 3y ago · 1 in thread

Google's search results have never been particularly good past the first page. Now that they are stuffed full of ads, even the first page sucks.

rekabis · 3y ago

uBlock Origin. Doesn’t remove “promoted content” and other in-line crap, but it does remove all the content from their ad network.

pcdoodle · 3y ago

Break up Google. They are too big.
