edit: I forgot all about google_maps.zip / waze.gz and all the juicy traffic data coming from android.. which probably already relies heavily on AI
To be clear, Google does use AI. They use it so heavily that they've designed four generations of training accelerators. All the fancy knowledge graph features used to keep you from clicking anything on the SERP are powered by large language models. The only thing they didn't do is turn Google Search into a chatbot, at least not until Microsoft and OpenAI one-upped them and Google felt competitive pressure to build what they thought was garbage.
And yes, Google's customers share that belief. Remember that when Google Bard gets a fact about exoplanets wrong, it's a scandal. When Bing tries to gaslight its users into thinking that time stopped at the same time GPT-4's training did, it's funny. Bing can afford to make mistakes that Google can't, because nobody uses Bing if they want good search results. They use Bing if they can't be arsed to change the defaults[1].
[0] Or at least they did, then they fired the woman who wrote it
[1] And yes, that is why Microsoft really pushes Bing and Edge hard in Windows.
Ethics is a false excuse, because rushing that out shows they never cared either. It was just PR, and their bluff was called.
Also, I skimmed over the Stochastic Parrots paper and I’m unimpressed. I’m unfamiliar with the subject, but many points seem unproven or political rather than scientific, with a fixation on training data instead of studying the emergent properties, and many opinions, notably regarding social activism. Maybe it was already discussed here on HN. Edit: found here: https://news.ycombinator.com/item?id=34382901
You're exactly the kind of person Stochastic Parrots was trying to warn us about - you bought into the AI hype.
AI models are extremely sensitive to the initial statistical conditions of their dataset. A good example of this is image regurgitation in diffusion models: if you include the same image n times in the data set, it effectively gets n times the number of training epochs, and is far more likely to be memorized. Stable Diffusion's propensity to draw bad copies of the Getty Images logo is another example; there are so many watermarks and signatures in the training data that learning how to draw them measurably reduces loss. In my own AI training adventures[0], the image generator I trained loves to draw maps all the time, no matter what the prompt is, because Wikimedia Commons hosts an absolutely unconscionable number of them.
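To make the duplication point concrete, here is a toy sketch. It is not taken from any real training pipeline: the byte strings stand in for image files, and `exposure_per_epoch` and `deduplicate` are made-up helper names. It only shows that an item included n times is seen n times per pass over the data, and that exact-match deduplication evens that out.

```python
import hashlib
from collections import Counter


def exposure_per_epoch(dataset):
    """Count how often each distinct item is seen in one pass over the data."""
    return Counter(hashlib.sha256(item).hexdigest() for item in dataset)


def deduplicate(dataset):
    """Keep the first copy of each byte-identical item, preserving order."""
    seen = set()
    unique = []
    for item in dataset:
        digest = hashlib.sha256(item).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(item)
    return unique


# Hypothetical toy dataset: one "image" is included five times.
dataset = [b"cat"] + [b"map"] * 5 + [b"dog"]

counts = exposure_per_epoch(dataset)
map_digest = hashlib.sha256(b"map").hexdigest()
print(counts[map_digest])    # exposures of the duplicated item in one epoch
print(deduplicate(dataset))  # one copy of each item remains
```

Exact hashing only catches byte-identical copies. Resized or re-encoded versions of the same image would slip through, so a real pipeline would likely need some kind of near-duplicate detection on top.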
Stochastic Parrots is arguing that we can't effectively filter five terabytes[1] of training set text for every statistical bias. Since HN is allergic to social justice language, I'll put it in terms that are more politically correct here: gradient descent is vulnerable to Sybil attacks. Because you can only scrape content written by people who are online, the terminally online will decide what the model thinks, filtered through the underpaid moderators who are censoring your political opinions on TwitBook.
Of course, OpenAI will try anyway[2]. The best they've come up with is to use RLHF to deliberately encode a center-left bias into a language model that otherwise would be about as far-right as your average /pol/ user. This has helped ChatGPT avoid the fate of, say, Microsoft's Tay; but it is just sweeping the problem under the rug.
The other main prong of Stochastic Parrots is energy usage. The reason OpenAI hasn't been outcompeted by actual open AI models is that it takes shittons of electricity and hardware to train these things. Stable Diffusion and BLOOM are the biggest open competitors to OpenAI, but they're being funded purely through burning venture capital. FOSS is sustainable because software development is cheap enough that people can do it as volunteer work. AI training is almost the opposite: extremely large capital costs that can only be recouped by the worst abuses of proprietary software.
[0] I am specifically trying to build a diffusion model trained purely on public domain images, called PD-Diffusion.
[1] No problem. We are Google. Five terabytes is so little that I've forgotten how to count that low.
[2] When filtering the dataset for DALL-E 2, OpenAI found that removing porn from the training set made the image generator's biases far worse. i.e. if you asked for a stock photo of a CEO, pre-filter DALL-E would give about 60% male, 40% female examples; post-filter DALL-E would only ever draw male CEOs.
This. +100. Somehow there is a perception that chatbots are the only example of AI research or product that matters, and that every AI organisation's ability will be judged by whether it can create chatbots.
Sadly, I think I'd argue that nobody has good search results anymore. Google's results have been SEO'd to the hilt and most of the results are blog spam garbage nowadays.
No, they turned Google Search into what it is now.
For me, trying Google Bard was an instant reminder of the change in behavior in Google Search from 15 years ago to today.
We used to have a search that you could give obscure flags to Linux commands and find their documentation or source code. Today we have a Google Search that often only tells you about how some Kardashian or recent political drama is a sound-alike for the technical term that you were searching for.
GPT-4 has some of the same "excessively smart" failure modes, but it (and GPT-3.5 for that matter) is so much more useful than Bard (which hits the user with "I can't do that, Dave" 100x more often than ChatGPT's already excessive behavior) that both are a useful addition to the toolbox. Too bad the toolbox hardly includes plain search anymore.
Despite what people often write and believe here, the access controls on PII data at Google are incredibly strict. You can't just arbitrarily train on people's personal data. I know, because when I was there, working on search backend data mining, in order to get access to anonymized search and web logs, I had to sign paperwork that essentially said I'd be taken to the cleaners if I abused the access.
> What gives, Google? Get on it
It's a very difficult decision to intentionally destabilize the space you are the leader in, for all the reasons you can imagine. In a sense, Google needed someone else with nothing to lose to shake up the space. How they execute in the new reality is yet to be seen. The biggest challenge they may have right now isn't technological, but that "ChatGPT" has become a sort of brand, like Kleenex and, well, Google.
Many markets had early leaders who got stomped by later entrants.
I'd prioritize their problems like this:
1. LLMs don't have the lucrative business model that Google needs.
2. The quality of their language model is really lacking as of now.
You fix 1 and 2, ChatGPT's branding is nothing. Google is the biggest advertisement machine in the world and they can market the hell out of their product. Just see how Chrome gained ground on Firefox for example.
Google is still used several times more than ChatGPT, and if you resolve 1 and 2, Google will make their money and their users will have no incentive to go to ChatGPT.
However, whatever's going on inside I still strongly believe in that company! Sometimes though it just feels like they don't themselves.
And yet Google is the largest online advertiser in the world. And yet Gmail used to (I don't know if it still does) push ads into people's inboxes.
I have as much belief in their PII controls as in their "Don't be evil" motto.
OpenAI subverted this by riding on the “open” part of their name at first—before doing a 180-degree turn and selling out to Microsoft.
Receiving traffic to sites is nice, especially for already highly-ranked results, but these are not the people buying the ads.