edit: "incorrect" is perhaps too strong; it is incomplete.
While it is true that click tracking can be used as a relevance signal, the people who were really pissed off when the data stream got dumped were advertisers who wanted to buy AdWords. That was a very simple system: pay someone for clickstream data, extract trending queries, front those with AdWords buys to get your page to the top of Google's results, and profit.
Having built a search engine and run it for 5 years, we got to see, in a very loose way, what people felt was relevant and what wasn't from clickstream data. Basically, for a query with 10 blue links you can split the results into quartiles and figure out whether the thing they clicked on was in the top half, bottom half, top quarter, second quarter, etc., and do A/B testing to see how that played out. But what we found was that the best indication of what a page was about was the text that linked to it. If you have an in-link to a page which was "<href='page'>great radio site"[1], then "great radio site" should be a query that returns that page, even though the page might be titled something like "bob's electromagnetic spectrum imaginarium" or something equally unlikely to come up in a query string.
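To make the quartile idea concrete, here is a rough sketch of the kind of bucketing I'm describing; the field names and the data are made up for illustration, not our actual pipeline:

    # Hypothetical sketch: bucket clicked result positions into quartiles
    # for a query with 10 blue links, so two ranking variants can be
    # compared in an A/B test. Field names and data are invented.
    import math
    from collections import Counter

    def quartile(position, num_results=10):
        """Map a 1-based click position to a quartile label (Q1..Q4)."""
        return "Q%d" % min(4, math.ceil(position / (num_results / 4)))

    def summarize(clicks):
        """clicks: list of (query, clicked_position) pairs from the click log."""
        return Counter(quartile(pos) for _, pos in clicks)

    # Variant A concentrates clicks near the top of the page; variant B does not.
    variant_a = [("radio", 1), ("radio", 2), ("spectrum", 3), ("radio", 9)]
    variant_b = [("radio", 6), ("radio", 8), ("spectrum", 3), ("radio", 10)]
    print(summarize(variant_a))  # Counter({'Q1': 2, 'Q2': 1, 'Q4': 1})
    print(summarize(variant_b))  # Counter({'Q4': 2, 'Q3': 1, 'Q2': 1})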
So the bottom line is that there are lots of ways to try to determine relevance; clickstream data is a part of that, but by no means the biggest factor.
[1] neutered html for obvious reasons.
This is reflected in Google's search results. A Google query which can possibly be interpreted as related to a popular culture item usually will be. Google has become more aggressive about this over the years. Their "Did you mean" result tag once offered an alternative for a second search. Now, they return results for the more popular interpretation first.
The back side of search, page quality and ranking, is weaker than many think. Links are less useful than they used to be. Most links to business sites are now from "social" sites or forums, which are easily spammed. Using social signals was a disaster back in 2012, when, for a few months, Google went all-in on social signals. Google tried to recognize sites that "look like spam", but everybody knows that now and spam sites look better than ever. (The same thing happened with spam emails a decade ago.) Google doesn't recognize provenance, so they can be fooled by scraper sites. Google doesn't recognize the business behind the web page, so they can be fooled by marginal businesses. There are even SEO companies using machine learning to reverse engineer Google's algorithms, to find out how far they can go with keyword stuffing before a penalty kicks in.
Google does far more manual adjustment than they did two years ago. There's an army of people doing manual ranking, and a smaller unit handling appeals from manual penalties. There was a time when Google boasted they did no manual adjustments to ranking. The automation is starting to fail.
Whatever Google did for innovation in smartphones, tablets, and browsers, they have gone and done the opposite for search.
So how do you think Google is succeeding so well, if it's not clickstream data? And why can't it be a combination of things that strongly depends on clickstream data, which others couldn't copy?
Many people realize that if you put Google ads on Bing's results and Bing's ads on Google's results, the profitability would switch (not that I am entirely sure what that says, other than that having a credible search engine and top-end ad inventory is required to make excess money in search).
It will be interesting to see if Marissa gets back into the game with Yahoo when their agreement to use Bing results for Yahoo searches expires.
The interesting linkage is that you can't sell search advertising unless people send the search request to you, and if you're not the most common place that people search, you're unlikely to get first shot at advertising. You can "buy" traffic (that is called Paid Distribution) by putting your search box on people's web sites, or causing someone's browser to send you search queries first, or paying a phone maker to send you all their search queries, but you have to make enough money from the ads to offset what you pay. And as I mentioned, over the last 8 years Google has been paying more and more for their traffic (up to $968M last quarter), and very few entrants into the business are going to compete with that. If you already have a platform (like Mozilla has Firefox, Apple has the iPhone, Facebook has pretty much everyone's Facebook page) so you "own" the ingress point, you can leverage that with a good search engine to make a lot of revenue. But if you need to pay for access to the ingress point, and pay a big chunk to the ad provider, it is really hard to support a lot of infrastructure (whose cost is proportional to index size). That is the constraint box of search today.
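To put rough numbers on that constraint box, here is a back-of-the-envelope sketch; every per-query figure below is invented purely for illustration, only the shape of the equation matters:

    # Hypothetical economics of paid search distribution. All numbers are
    # invented; the point is that ad revenue per query has to cover both
    # the payment for the ingress point and the infrastructure behind it.
    revenue_per_1000_queries = 20.00  # assumed ad revenue per 1,000 queries (USD)
    distribution_per_1000    = 8.00   # assumed payment to whoever owns the ingress
    infrastructure_per_1000  = 7.00   # assumed crawl/index/serving cost

    margin = revenue_per_1000_queries - distribution_per_1000 - infrastructure_per_1000
    print("Margin per 1,000 queries, paying for traffic: $%.2f" % margin)

    # If you already own the ingress point (browser, phone, portal), the
    # distribution line item drops toward zero and the margin more than doubles.
    margin_owned = revenue_per_1000_queries - infrastructure_per_1000
    print("Margin per 1,000 queries, owning the ingress:  $%.2f" % margin_owned)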
The interesting thing for me is that in every one of the last 16 quarters, Bing has been making more money per click and Google less; that cost equation is balancing out. That is going to put a lot of pressure on the non-core parts of Google.
To answer your question: Google succeeded by capturing the value of linkage data to extract page relevance (the original PageRank patent), and then created an advertising incentive which broke their own algorithm (you want a billion in-links to your page? no problem! say the black-hat SEO folks). Google is still making tons of money on search, but you can look at their performance over the last 4 years to see the air coming out of the balloon. What comes next is still an open question.
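For anyone who hasn't seen what "capturing the value of linkage data" looks like mechanically, here is a minimal power-iteration sketch of the PageRank idea; the toy link graph is invented and real implementations differ considerably:

    # Minimal PageRank sketch via power iteration on a toy link graph.
    # The graph and damping factor are illustrative; production systems
    # handle dangling pages, personalization, and scale very differently.
    def pagerank(links, damping=0.85, iterations=50):
        """links: dict mapping page -> list of pages it links to."""
        pages = list(links)
        n = len(pages)
        rank = {p: 1.0 / n for p in pages}
        for _ in range(iterations):
            new_rank = {p: (1.0 - damping) / n for p in pages}
            for page, outlinks in links.items():
                if not outlinks:
                    continue
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            rank = new_rank
        return rank

    toy_graph = {
        "bobs_imaginarium":    ["great_radio_reviews"],
        "great_radio_reviews": ["bobs_imaginarium"],
        "spam_farm":           ["spam_farm", "bobs_imaginarium"],  # nobody links back
    }
    for page, score in sorted(pagerank(toy_graph).items(), key=lambda x: -x[1]):
        print(page, round(score, 3))

Which is exactly why the black-hat answer was to manufacture in-links at scale: the score is, at bottom, a weighted count of who links to you.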
This is interesting because of the browser choice enforced by the EU on Windows. IE, whose default is Bing, lost share to other browsers like Chrome, Firefox and Opera, which all had Google as the default. So an attempt to fix the browser market totally distorted the web search market. I wonder why MS didn't ask the EU to require that the alternate browsers in the browser choice screen have Bing as the default search.
I wonder if the EU will mandate that search relevancy data must be shared by Google with rival search engines like DDG just like they mandated that SMB shares and Office formats must be documented by MS and released to developers.
Can Microsoft and other US-based technology companies theoretically just keep doing their own thing, tell the EU government "to hell with it, we're abiding by US laws, you have a choice to stop importing Windows and invent your own OS if you don't like us"?
What then happens if a competitor is established to take your former position in the European market - chances are they're not just going to stay in the EU. They're going to eat your lunch elsewhere too.
Google's biggest PR success is convincing everyone that the quality of web rankings depends almost purely on algorithms. It does not. What allows Google to hold their monopoly is the $100s of millions (or more) they continuously pay to amass more manually created training data:
http://www.theregister.co.uk/2012/11/27/google_raters_manual
http://www.forbes.com/sites/timworstall/2012/11/27/is-google...
A new search engine could appear today with algorithms 10x better than Google, but without access to this scale of training data, their rankings wouldn't even be close to Google's quality.
Google maintains their position by paying cash for this monopoly on training data made by tens of thousands of $9/hour workers, not through superior algorithms!
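To illustrate why the labeled data matters more than the cleverness of the model, here is a toy learning-to-rank sketch; the features and the handful of rater labels are invented, and a real system needs many orders of magnitude more of both:

    # Toy sketch: human-rater judgments as training data for a ranking model.
    # Features and labels are invented; the model is only as good as the
    # volume and quality of labeled (query, page) examples behind it.
    from sklearn.linear_model import LogisticRegression

    # Each row: [anchor_text_match, title_match, log_inbound_links]
    X = [
        [1.0, 1.0, 3.2],  # rater judged the page relevant to the query
        [0.9, 0.0, 2.8],  # relevant
        [0.1, 1.0, 0.5],  # not relevant (keyword-stuffed title)
        [0.0, 0.0, 4.0],  # not relevant (popular but off-topic)
    ]
    y = [1, 1, 0, 0]      # the human rater labels

    model = LogisticRegression().fit(X, y)
    print(model.predict_proba([[0.8, 0.5, 2.0]])[0][1])  # estimated relevance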
Computers introduce a means of locking people in that doesn't exist in other markets. In software products there are often ecosystems that tie directly into the product or service and which, unlike road systems for cars, are not required to be shared with competitors.
Regulators ought to look into ways to enforce measures that require the companies to completely open their ecosystem to competitors. Or look into ways to standardize these ecosystems and require every service/application/website comply with them (similar to how media companies are forced to include closed captioning).
I would think a Windows laptop or a Macbook because the users and developers can install or develop any application, yet we have everyone singing the praises of heavily DRM'ed and locked up Chromebooks and iPads. Sometimes I feel it's more about Microsoft hate than about a free computing environment. At least RMS is consistent and is less prone to company fanboyism than the tech crowd.
Because any number of 3rd parties have been injecting their ads and other crap as MITMs. SSL is a better, but not foolproof way to make sure the content you get was the content served by the remote server.
(a) At the time Chrome was launched, IE was dominating with ~69% market share: https://d28wbuch0jlv7v.cloudfront.net/images/infografik/norm... And Firefox/Mozilla was topping out at 25% market share! They were basically resting on their laurels! Remember that the SPDY protocol, which is the prototype standard for HTTP 2.0, was invented at Google and was the main innovation within Chrome 1.0. If you do a date-restricted Google search (2008-2010) for SPDY you will see that the SPDY whitepaper page is dated Nov 12, 2009: https://www.chromium.org/spdy/spdy-whitepaper So Chrome was launched to make web browsing faster.
(b) Google Search does not want to be excluded from all browsers. The solution to this problem is to fund your own browser. If IE will dominate Firefox forever and Google was depending on Firefox defaults for much of its search traffic, then Google was virtually FORCED to create its own browser or they could always be limited to 25% (or less) search traffic share.
I think that having a "browser account" which synchronizes bookmarks, settings, and history across all instances of Chrome for a given user is one of the greatest improvements in browsers in the past 5 years, and all other browsers seem to be copying the idea. If Google were the evil empire you imply, it would be suing the pants off these other browsers, but it is not.
MS didn't do that from IE; they did it for users who installed the Bing bar, a huge difference.
The author states that "For some 90% of searches, a modern search engine analyzes and learns from past queries, rather than searching the Web itself, to deliver the most relevant results." This may be true in some types of searches but overall, I think the statement is misleading.
Rather, it's better to think of it like this: one important part of the algorithmic process involves constantly crawling the web and updating the index with new information. (Important / frequently-updated web sites may get crawled all day every day, while ones that are less important may get crawled only weekly or monthly.) Meanwhile, another part of the algorithmic process constantly analyzes new info discovered in the crawl and combines it with, as the author mentioned, click-through data learned from past queries.
The answers to many queries don't change, while the answers to many other queries deserve freshness. For example, I'm quite certain Einstein's date of birth hasn't changed in quite a while, but his theory of relativity is in constant discussion and there is always new information and new queries pertaining to it. As a result, there is not much need for a search engine to go digging for the latest info on an "einstein's birthday" query, but it's to everyone's advantage that Google is able to identify which pages on the web deserve priority crawling and that Google has retrieved and incorporated the fresh info those pages contain into its index when it comes to a topical type of query like "diffraction of light with quantum physics".
In the end, the results to every query depend on info gathered from the web and user data helps refine the results. Info that is more static can be prioritized with more input from click-through data, while new information found on the web must rely more on Google's artificial intelligence to push it up in front of searchers.
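Here is a rough sketch of that freshness-versus-history blend; the weights, field names, and the per-query "freshness demand" knob are all invented just to make the idea concrete, not a description of Google's actual system:

    # Hypothetical sketch: blend historical click-through signal with crawl
    # freshness, weighted by how much a given query "deserves freshness".
    # All weights and field names are invented for illustration.
    import time

    def blended_score(page, freshness_demand, now=None):
        """page: dict with 'clickthrough_score' (0-1, from past queries)
           and 'last_crawled' (unix seconds).
           freshness_demand: near 0 for stable queries ("einstein's birthday"),
           near 1 for topical ones ("diffraction of light with quantum physics")."""
        now = now or time.time()
        age_days = (now - page["last_crawled"]) / 86400
        freshness = 1.0 / (1.0 + age_days)  # decays as the crawl gets stale
        history = page["clickthrough_score"]
        return (1 - freshness_demand) * history + freshness_demand * freshness

    stable_page = {"clickthrough_score": 0.9, "last_crawled": time.time() - 30 * 86400}
    fresh_page  = {"clickthrough_score": 0.2, "last_crawled": time.time() - 0.5 * 86400}

    print(blended_score(stable_page, freshness_demand=0.1))  # mostly past click-through
    print(blended_score(fresh_page,  freshness_demand=0.9))  # mostly crawl freshness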
Another reason that "90%" statement sticks out to me is the fairly often-used factoid tossed around by industry experts that between 6% and 20% of the queries asked every day have never been asked before. Google can't rely heavily on past query data for these types of searches.
Also, if such changes were to be made, there's a decent likelihood that someone would have noticed that data leakage and told us about it.
So since Mozilla is a pretty decent company, we should currently give them the benefit of the doubt.
Firefox is still open source, unlike IE, Safari and Chrome, so just look.
[1] Smaller than Google. The search box isn't given away for free.
Has anybody noticed this happening?
Google does not "track the full browsing experience of chrome users". Please read the privacy policy which is very clear on this subject: https://www.google.com/chrome/browser/privacy/
I particularly draw your attention to this paragraph: "If you use Chrome to access other Google services, such as using the search engine on the Google homepage or checking Gmail, the fact that you are using Chrome does not cause Google to receive any special or additional personally identifying information about you."
"If you sign in to Chrome browser, Chrome OS or an Android device that includes Chrome as a preinstalled application with your Google Account, this will enable the synchronization feature. Google will store certain information, such as HISTORY, bookmarked URLs as well as an image and a sample of text from the bookmarked page, passwords and other settings, on Google's servers "
And this isn't that far from full browsing behavior. And that's from a few minutes reading this page - we don't know if they track deeper details, like how long a page was open.
Also - Google doesn't have to collect this data. The claimed purpose is letting you share history across multiple devices, but that can also be achieved by sending encrypted history to Google and decrypting it on each device you use (I think browser extensions with similar functions implement it that way). So it's clear the purpose here is collecting data.
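For what it's worth, a client-side-encrypted sync scheme along those lines is not hard to sketch; below is a minimal illustration using a passphrase-derived key, where the sync server only ever sees ciphertext. This is my own sketch, not how Chrome or any particular extension actually implements it:

    # Minimal sketch of client-side-encrypted history sync: the sync server
    # stores only ciphertext; each device derives the same key from a user
    # passphrase and decrypts locally. Illustrative only.
    import base64, json, os
    from cryptography.fernet import Fernet
    from cryptography.hazmat.primitives import hashes
    from cryptography.hazmat.primitives.kdf.pbkdf2 import PBKDF2HMAC

    def derive_key(passphrase: str, salt: bytes) -> bytes:
        kdf = PBKDF2HMAC(algorithm=hashes.SHA256(), length=32,
                         salt=salt, iterations=480_000)
        return base64.urlsafe_b64encode(kdf.derive(passphrase.encode()))

    # Device A encrypts its history before uploading.
    salt = os.urandom(16)  # stored alongside the ciphertext on the server
    key = derive_key("correct horse battery staple", salt)
    history = [{"url": "https://news.ycombinator.com", "visits": 42}]
    ciphertext = Fernet(key).encrypt(json.dumps(history).encode())

    # Device B downloads salt + ciphertext and decrypts with the same passphrase.
    key_b = derive_key("correct horse battery staple", salt)
    print(json.loads(Fernet(key_b).decrypt(ciphertext)))

The trade-off, of course, is that a forgotten passphrase makes the history unrecoverable, which is presumably part of why vendors prefer holding the keys themselves.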
Also, FYI, the privacy policy is irrelevant. It only states what a company might or might not do. The Terms of Service is the real deal - legally speaking - which of course Google can and does modify at will, anytime they want to collect more data.
----------
"When you upload, submit, store, send or receive content to or through our Servicesǂ, you give Google (and those we work with) a worldwide license to use, host, store, reproduce, modify, create derivative works (such as those resulting from translations, adaptations or other changes we make so that your content works better with our Services), communicate, publish, publicly perform, publicly display and distribute such content. "
"This license continues even if you stop using our Services (for example, for a business listing you have added to Google Maps). "
http://www.google.com/intl/en/policies/terms/
https://www.google.com/chrome/browser/privacy/eula_text.html
ǂYour use of Google’s products, software, services and web sites (referred to collectively as the “Services”
Also note that for Google services there's no point in collecting that data because Google already has it.
For example, after learning I like the results from certain journals, their personalization engine offered me those in related searches, and usually I chose content from them.
But somehow, after some time, Google's personalization engine forgot that I like them and stopped offering me content from them, so I'm back to drowning in shitty results. Why? No idea.
This article makes a number of bold claims about the contents of data and code which its author hasn't seen, and is written by a company that is receiving a large amount of money from Yahoo. I would encourage people not to forget these details.
In any event I'm not a person who can decide to release that information. All I can do here is to ask people to think about what evidence has been offered and the motives behind this article.