This is high enough that there should be a market to compensate the end users who created this content
I'm astonished that a picture turns out to be worth a thousand words.
Users of these sites have had license agreements and privacy policies for a long time, and freely gave away their content just because free web hosting was worth it. Why would they be entitled to anything more now that this content has found new value?
What does the long game look like for raw training data? How will AIs maintain the quality of their diet?
To compare, web search started — in the early days of Google — as a huge win because so much valuable information that was scattered around became findable. But over time it has become whack-a-mole with spam and AI copypasta, and now it's a struggle to keep returning good results, for any search engine.
These AI systems are being built on top of all the collective effort and resulting knowledge of humanity. We can pretend they are just another private enterprise, or we can acknowledge that they are something more than that.
And it's not just the productivity we could achieve by democratizing these systems. There's another danger. When big companies buy up all this intellectual property, what better choice do they have than to lock it up? Until recently you could at least argue that IP rights holders were entities incentivized to proliferate this knowledge; now the opposite is happening.
Like, if you prevent access to research in order to protect the moat around your AI product, you'll harm the research community that would otherwise be your users. So now they're looking for other jobs and you have no users.
Or, another twist: pay people to submit ten years of emails (upload the backup file), or just pay small amounts for works they've made. College essays, journals, etc.
I don't think college essays, etc. would contain anything novel. Future techniques could interpolate ever more smoothly, creating ever-new wordmud.
1) Based on how pre-verbal children learn, one nitpick is that I strongly suspect we need to give AI touch and a sense of space in order to truly understand quantity, causality, object permanence, etc.
2) Something that is not a nitpick: even a superhuman multimodal AI wouldn't have direct access to human emotions, sexuality, ideas of natural beauty, etc. I don't think humans have run out of interesting things to say about these ideas.
(In particular, I don't think a superhuman AI is capable of understanding music unless it is directly emulating the biological processes by which humans understand music. The issue is not "logical" - melodies don't actually make sense analytically.)
That's quite a proposition.
They're very clear it's going into an AI-generated article on the topic, but you'd better believe that is also now core training data.
Gödel probably consumed a minuscule fraction of what these systems have seen. And look what he came up with!
We already see that if you want to focus on a narrow skillset you can use a much smaller model and training set. But right now it is a race because everyone wants to be the one true generalized intelligence model.
E.g. how often does a baby go out and experience something novel? The majority of its time is spent getting the same stimulus over and over again, as anyone listening to children's television can attest.
Humans learn in fundamentally different ways from our current systems, and information poverty is not a problem for us.
In general, AI researchers have done a very bad job of exploring how a system might be "near-human" according to some fancy linguistic benchmark, yet dramatically dumber than a pigeon in terms of general reasoning ability.
However, the first complex nervous systems came about in the Cambrian explosion, only about half a billion years ago. And we also don’t train LLMs by random mutation and selection, it’s a much more teleological process.
But to extend the analogy, we should be able to train a model continuously, and not have to start training from scratch for each new model. Although, maybe, that would require random mutations, and thus much more time?
NewsGPT won't just regurgitate the work of journalists. First it'll consider the paid "partners" of NewsGPT to make sure to downplay anything that might hurt them, then it'll do the same for their advertisers while inserting some ads into the text, then it'll tweak the article according to NewsGPT's own ideology, and finally spit out something very different at its users. Maybe they can argue that NewsGPT is too transformative to count as copyright infringement.
I go out of my way not to consume news beyond what crosses my path because of the financial markets.
What exactly do you think I am missing that is so important? Journalists by and large produce complete nonsense in 2024. Journalists in 2024 are a massive net negative and would be much better served doing something productive, like selling apples on the street.
GenAI is just doing the same thing on a larger scale.
The distinction does matter in copyright too, since a transformative work needs some non-trivial amount of human input.
same thing with emulators and roms. somebody dumped the cartridges (copyrighted software) into ROM files to be played on emulators (with copyrighted BIOSes), but they were "archiving" and if you owned the original copy you could download them. I still vividly remember seeing a disclaimer on a warez website: "DMCA SAFE HARBOUR NOTICE: YOU MUST OWN THE ORIGINAL GAME OTHERWISE ITS ILLEGAL BUT YES, YOU CAN DOWNLOAD EVERY SINGLE GAME MADE ON THAT CONSOLE FOR FREE"
I feel like the outcome will be the same for LLMs trained on copyrighted material. It will be "training". The net benefit is too great to fret over "training".
tldr: "indexing" ---> "archiving" ---> "training"
Disclaimer: Long on $DGX
>Photobucket declined to identify its prospective buyers, citing commercial confidentiality.
>tech companies are also quietly paying for content locked behind paywalls and login screens, giving rise to a hidden trade in everything from chat logs to long forgotten personal photos from faded social media apps
In this market, ethics seem to exist when it comes to corporate clients, but not when it comes to end-users.
It's immediately and self-evidently obvious that no end-user in 2007 consented to photos of their 2007 era teenage self being used to train an AI how to identify an emo kid.
What's morally bankrupt about that? It costs money to host your photos and they're a business that can decide to charge their customers any rate they think the market will accept.
I can think of worse things than that which might be hidden away for public scraping.
A lot of people just were not paying attention to the game being played, and so now they're getting played themselves.
Rather, the conversation should focus on how to improve parsing of ToS (I personally believe we should use symbolic labeling like we do with food), as well as regulation around what terms can change for content which was generated under the premise of an older ToS.
OP's statement, "It's immediately and self-evidently obvious that no end-user in 2007 consented to photos of their 2007 era teenage self being used to train an AI how to identify an emo kid," is simply false. Many, if not most users, understood that they gave permission for their UGC to be used to improve the services. This is what I am rebuking.
Unfortunately, they did actually. It's more accurate to say that they were presented a EULA and Terms of Service that no reasonable teenager would have had any hope of understanding. But since they're over 13, they're held to the terms of those agreements in any case.
These companies are slimy. Make no mistake, this will get worse in the future.
I counter that the reality is the vast majority of people do not meaningfully understand the exchange they are making. I'm not saying they're stupid or blaming them whatsoever; it's a similar phenomenon to playing the lottery. Our brains aren't equipped to understand such unintuitive phenomena.
Would it be attractive for a company like Twilio or Aircall to offer free phone calls and sell anonymized recordings?
Remember a decade or so ago, you could call a 1-800 number and look up phone numbers using your voice? It was backed by Google and once Google was done collecting the data, they shut it down.
But if it was part of the terms of the new free service, and all the parties involved got a reminder message on the call… you might still not like it, but it doesn’t seem like it would be a violation of privacy
Would this violate other laws outside my jurisdiction? Probably, but that just means I won’t travel there.
I actually hope I’m wrong.
The only people who would be willing to use such a service are people who have likely already been systematically disenfranchised by our global economic system. Poor people.
Privacy should not be incentivized and treated as a luxury. Especially when the end result of all this training data is models which further discriminate against vulnerable third parties and automate maximum value extraction from the average user via the unprecedented amounts of emotional manipulation afforded by user-facing generative AI, whether through highly targeted, ad-hoc advertisements or discriminatory insurance policies.
While true, it's META who won that arms race long ago in my view; hell, they just disclosed in a lawsuit that they gave Netflix private access to DMs [0].
If you don't think they are training their own models on this data across all their platforms (Facebook, Instagram, WhatsApp), you have to be a complete idiot.
That is a much larger treasure trove given the sheer number of people on those platforms. Google is limited mainly to Android users and those who use its suite on PC (relatively small compared to social media users), which excludes most Mac users.
The thing they don't tell you about this dark underbelly of AI is that, just like the (meta)data for sale to third parties, it has a tiered price structure wherein Mac users are often the premium tier due to their more 'affluent' status and likelihood of impulsive in-app purchases.
This is why I think META already won the AI race: they open-source Llama and have a massive treasure trove of data to refine and train on once they see what the OSS community creates that is of actual value. ChatGPT/DALL-E runs at a loss for MS/OpenAI, but if anyone can monetize this gold rush it will be META.
And perhaps more critically from an infrastructure POV, Llama now runs better on CPU [1] than on GPU, which means they won't be constrained or price-pinched on GPUs like Microsoft, Google, and Amazon likely will be due to demand constraints from Nvidia (see the ETH mining craze during COVID). They can focus on optimizing their data centers with more free cash flow, which means they can have a bigger footprint for when they finally figure out how to properly monetize this AI bubble (because it is a bubble) from now until then.
I think Zuck learned from Libra that staying out of the limelight during a bubble is critical if he wants to undo the Metaverse money-pit/losses.
0: https://www.movieguide.org/news-articles/facebook-allowed-ne...
https://www.appmysite.com/blog/android-vs-ios-mobile-operati...
Random link. Can't vouch for it. But US and RoW have quite different patterns.
Seems about right to me, Android dominates the mobile World by sheer numbers.
But what value can they derive from user data? A million Bangladeshis' food-delivery texts are probably a lot less valuable than, say, a Singaporean using Numbers on macOS to lay out the next lucrative investment, and the data they'd get from the correspondence of, say, 100 high-net-worth individuals hidden behind iOS (Pegasus MITM attacks notwithstanding).
Again, the name of the game is deriving signal from noise; bulk collection is primitive when training models and often incredibly difficult to work around once the data is in. I seriously think Gemini had this problem, along with QA/QC issues, which is how it went from so-so Bard to total 'woke' Gemini. I may be wrong, but I think this is what happens when you go down the bulk-collection, unfiltered/un-curated data route.
While they claim E2E encryption, I seriously doubt they would offer this service entirely for free without some backdoor or potential MITM breach likely tucked away in the ToS, given its wide use in most of the world by people who would otherwise pay for SMS/text messages: it just seems incredibly unlikely to be entirely encrypted, coming from a company that willingly gave DMs to Netflix, used Cambridge Analytica, etc. But even if it is encrypted, the metadata generated can tell you a lot too (as was the case with Pokemon GO). That may not directly benefit LLMs, but it could help with creating dark patterns that make your AI companion (under the guise of an LLM) the 'must own' when deciding whose tokens/compute to buy.
Speculative for sure, but just look at the Twitter file leaks revealing how social media platforms willingly work alongside intelligence agencies.
Billions of comments and private messages; billions of data points on user behavior and (more importantly) how they respond to manipulative UI/UX/content... Nothing useful there??
https://www.usnews.com/news/top-news/articles/2024-04-05/ins...
Because no one will sell them an exclusive license to the data.
The companies selling this data are slimy. They're borderline crimelords. Picture a pirate captain with a hostage that he is ransoming. Now imagine he gets his ransom, but before he releases the hostage he makes a copy of her. Then ransoms the copy to another interested party. But before he releases the copy, he makes another copy and... you get the idea.
It's pirate thinking.
"If one hostage is good? Then two are better! And three? Well, that's just good business!!!" -Hondo Ohnaka
The future is having your own personal AI assistant, completely free of charge, which is suspiciously eager to recommend shopping at Temu and eating at McDonalds.
Even if a future service doesn't have an obvious charge or subscription, just because you don't recognize how you're being exploited doesn't mean it's truly "free."
There's a reason advertising exists as an industry at all, let alone a global trillion-dollar one. Today's "free" is actually paid for by exploiting user attention and attempting to hack your brain--sometimes in ways that are culturally accepted due to a long tradition of use, sometimes in disturbing new ones.
When you solve something tricky, you just basically released that as open source and trained ChatGPT 2030 how to do it without you.
That said, AI tech is or is quickly becoming freely accessible; unless they have a USP, free / homemade versions will end up competing with the paid services, and it's hard to compete with free.