This is high enough that there should be a market to compensate the end users who created this content
I'm astonished that a picture turns out to be worth a thousand words.
Users of these sites have had license agreements and privacy policies for a long time, and freely gave away their content just because free web hosting was worth it. Why would they be entitled to anything more now that this content has found new value?
What does the long game look like for raw training data? How will AIs maintain the quality of their diet?
To compare, web search started — in the early days of Google — as a huge win because so much valuable information that was scattered around became findable. But over time it has become whack-a-mole with spam and AI copypasta, and now it's a struggle to keep returning good results, for any search engine.
These AI systems are being built on top of all the collective effort and resulting knowledge of humanity. We can pretend they are just another private enterprise, or we can acknowledge that they are something more than that.
And it's not just the productivity we could achieve by democratizing these systems. There's another danger. When big companies buy up all this intellectual property, what better choice do they have than to lock it up? Until recently you could at least argue that IP rights holders were entities incentivized to proliferate this knowledge; now the opposite is happening.
Like, if you prevent access to research in order to protect the moat around your AI product, you'll harm the research community that would otherwise be your users. So now they're looking for other jobs and you have no users.
Or, another twist: pay people to submit ten years of emails (upload the backup file), or just pay small amounts for works they've made. College essays, journals, etc.
I don't think college essays, etc. would contain anything novel. Future techniques could interpolate ever more smoothly, creating ever-new wordmud.
1) Based on how pre-verbal children learn, one nitpick is that I strongly suspect we need to give AI touch and a sense of space in order to truly understand quantity, causality, object permanence, etc.
2) Something that is not a nitpick: even a superhuman multimodal AI wouldn't have direct access to human emotions, sexuality, ideas of natural beauty, etc. I don't think humans have run out of interesting things to say about these ideas.
(In particular, I don't think a superhuman AI is capable of understanding music unless it is directly emulating the biological processes by which humans understand music. The issue is not "logical" - melodies don't actually make sense analytically.)
That's quite a proposition.
They're very clear it's going into an AI-generated article on the topic, but you'd better believe that is also now core training data.
Gödel probably consumed a minuscule fraction of what these systems have seen. And look what he came up with!
We already see that if you want to focus on a narrow skillset you can use a much smaller model and training set. But right now it is a race because everyone wants to be the one true generalized intelligence model.
E.g. how often does a baby go out and experience something novel? The majority of its time is spent getting the same stimulus over and over again, as anyone listening to children's television can attest.
Humans learn in fundamentally different ways from our current systems, and information poverty is not a problem for us.
In general, AI researchers have done a very bad job of exploring how a system might be "near-human" according to some fancy linguistic benchmark, yet dramatically dumber than a pigeon in terms of general reasoning ability.
However, the first complex nervous systems came about in the Cambrian explosion, only about half a billion years ago. And we also don’t train LLMs by random mutation and selection, it’s a much more teleological process.
But to extend the analogy, we should be able to train a model continuously, and not have to start training from scratch for each new model. Although, maybe, that would require random mutations, and thus much more time?
NewsGPT won't just regurgitate the work of journalists. First it'll consider the paid "partners" of NewsGPT to make sure to downplay anything that might hurt them, then it'll do the same for their advertisers while inserting some ads into the text, then it'll tweak the article according to NewsGPT's own ideology, and finally spit out something very different at its users. Maybe they can argue that NewsGPT is too transformative to count as copyright infringement.
I go out of my way not to consume news beyond what crosses my path because of the financial markets.
What exactly do you think I am missing that is so important? Journalists by and large produce complete nonsense in 2024. Journalists in 2024 are a massive net negative and would be much better served doing something productive, like selling apples on the street.
GenAI is just doing the same thing on a larger scale.
The distinction does matter in copyright too, since a transformative work needs some non-trivial amount of human input.
same thing with emulators and roms. somebody dumped the cartridges (copyrighted software) into ROM files to be played on emulators (with copyrighted BIOSes), but they were "archiving" and if you owned the original copy you could download them. I still vividly remember seeing a disclaimer on a warez website: "DMCA SAFE HARBOUR NOTICE: YOU MUST OWN THE ORIGINAL GAME OTHERWISE ITS ILLEGAL BUT YES, YOU CAN DOWNLOAD EVERY SINGLE GAME MADE ON THAT CONSOLE FOR FREE"
I feel like the outcome will be the same for LLMs trained on copyrighted material. It will be "training". The net benefit is too great to fret over "training".
tldr: "indexing" ---> "archiving" ---> "training"
Disclaimer: Long on $DGX
>Photobucket declined to identify its prospective buyers, citing commercial confidentiality.
>tech companies are also quietly paying for content locked behind paywalls and login screens, giving rise to a hidden trade in everything from chat logs to long forgotten personal photos from faded social media apps
In this market, ethics seem to exist when it comes to corporate clients, but not when it comes to end-users.
It's immediately and self-evidently obvious that no end-user in 2007 consented to photos of their 2007 era teenage self being used to train an AI how to identify an emo kid.
What's morally bankrupt about that? It costs money to host your photos and they're a business that can decide to charge their customers any rate they think the market will accept.
I can think of worse things than that which might be hidden away for public scraping.
A lot of people just were not paying attention to the game being played, and so now they're getting played themselves.
Rather, the conversation should focus on how to improve parsing of ToS (I personally believe we should use symbolic labeling like we do with food), as well as regulation around what terms can change for content which was generated under the premise of an older ToS.
OP's statement, "It's immediately and self-evidently obvious that no end-user in 2007 consented to photos of their 2007 era teenage self being used to train an AI how to identify an emo kid," is simply false. Many, if not most users, understood that they gave permission for their UGC to be used to improve the services. This is what I am rebuking.
Unfortunately, they did actually. It's more accurate to say that they were presented a EULA and Terms of Service that no reasonable teenager would have had any hope of understanding. But since they're over 13, they're held to the terms of those agreements in any case.
These companies are slimy. Make no mistake, this will get worse in the future.
I counter that the reality is the vast majority of people do not meaningfully understand the exchange they are making. I'm not saying they're stupid or blaming them whatsoever; it's a similar phenomenon to playing the lottery. Our brains aren't equipped to understand such unintuitive phenomena.
Would it be attractive for a company like Twilio or Aircall to offer free phone calls and sell anonymized recordings?
Remember a decade or so ago, you could call a 1-800 number and look up phone numbers using your voice? It was backed by Google and once Google was done collecting the data, they shut it down.
But if it was part of the terms of the new free service, and all the parties involved got a reminder message on the call… you might still not like it, but it doesn’t seem like it would be a violation of privacy
Would this violate other laws outside my jurisdiction? Probably, but that just means I won’t travel there.
I actually hope I’m wrong.
The only people who would be willing to use such a service are people who have likely already been systematically disenfranchised by our global economic system. Poor people.
Privacy should not be incentivized and treated as a luxury. Especially when the end result of all this training data is models which further discriminate against vulnerable third parties and automate maximum value extraction from the average user via the unprecedented amounts of emotional manipulation afforded by user-facing generative AI, whether through highly targeted, ad-hoc advertisements or discriminatory insurance policies.
While true, it's META who won that arms race long ago in my view; hell, they just disclosed in a lawsuit that they gave Netflix private access to DMs [0].
If you don't think they are training their own models on this data across all their platforms (Facebook, Instagram, WhatsApp), you have to be a complete idiot.
That is a much larger treasure trove given the sheer number of people on those platforms. Google is limited mainly to Android users and those who use its suite on PC (relatively small compared to social media users), which excludes most Mac users.
The thing they don't tell you about this dark underbelly of AI is that, just like the (meta)data for sale to third parties, it has a tiered price structure wherein Mac users are often the premium tier due to their more 'affluent' status and likelihood of impulsive in-app purchases.
This is why I think META already won the AI race: they open-source Llama and have a massive treasure trove of data to refine and train on once they see what the OSS community creates that is of actual value. ChatGPT/DALL-E runs at a loss for MS/OpenAI, but if anyone can monetize this gold rush it will be META.
And perhaps more critically from an infrastructure POV, Llama now runs better on CPU [1] than on GPU, which means they won't be constrained or price-pinched on GPUs like Microsoft, Google, and Amazon likely will be due to demand constraints from Nvidia (see the ETH mining craze during COVID). They can focus on optimizing their data centers with more free cash flow, which means they can have a bigger footprint for when they finally figure out how to properly monetize this AI bubble (because it is a bubble) from now until then.
I think Zuck learned from Libra that staying out of the limelight during a bubble is critical if he wants to undo the Metaverse money-pit/losses.
0: https://www.movieguide.org/news-articles/facebook-allowed-ne...
https://www.appmysite.com/blog/android-vs-ios-mobile-operati...
Random link. Can't vouch for it. But US and RoW have quite different patterns.
Seems about right to me, Android dominates the mobile World by sheer numbers.
But what value can they derive from user data? A million Bangladeshis' food-delivery texts are probably a lot less valuable than, say, a Singaporean using Numbers on macOS to lay out the next lucrative investment, and the data they'd get from the correspondence of, say, 100 high-net-worth individuals hidden behind iOS (Pegasus MITM attacks notwithstanding).
Again, the name of the game is deriving signal from noise; bulk collection is primitive when training models and often incredibly difficult to work around once the data is in. I seriously think Gemini had this problem, along with QA/QC issues, which is how it went from so-so Bard to total 'woke' Gemini. I may be wrong, but I think this is what happens when you go down the bulk-collection, unfiltered/un-curated data route.
While they claim E2E encryption, I seriously doubt they would offer this service entirely for free without some backdoor or potential MITM breach likely tucked away in the ToS, given its wide use in most of the world by people who would otherwise pay for SMS/text messages: it just seems incredibly unlikely to be entirely encrypted, coming from a company that willingly gave DMs to Netflix, used Cambridge Analytica, etc. But even if it is encrypted, the metadata generated can tell you a lot too (as was the case with Pokemon GO). That may not directly benefit LLMs, but it could help with creating dark patterns that make your AI companion (under the guise of an LLM) the 'must own' when deciding whose tokens/compute to buy.
Speculative for sure, but just look at the Twitter file leaks revealing how social media platforms willingly work alongside intelligence agencies.
Billions of comments and private messages; billions of data points on user behavior and (more importantly) how they respond to manipulative UI/UX/content... Nothing useful there??
https://www.usnews.com/news/top-news/articles/2024-04-05/ins...
Because no one will sell them an exclusive license to the data.
The companies selling this data are slimy. They're borderline crimelords. Picture a pirate captain with a hostage that he is ransoming. Now imagine he gets his ransom, but before he releases the hostage he makes a copy of her. Then ransoms the copy to another interested party. But before he releases the copy, he makes another copy and... you get the idea.
It's pirate thinking.
"If one hostage is good? Then two are better! And three? Well, that's just good business!!!" -Hondo Ohnaka
The future is having your own personal AI assistant, completely free of charge, which is suspiciously eager to recommend shopping at Temu and eating at McDonalds.
Even if a future service doesn't have an obvious charge or subscription, just because you don't recognize how you're being exploited doesn't mean it's truly "free."
There's a reason advertising exists as an industry at all, let alone a global trillion-dollar one. Today's "free" is actually paid for by exploiting user attention and attempting to hack your brain--sometimes in ways that are culturally accepted due to a long tradition of use, sometimes in disturbing new ones.
When you solve something tricky, you just basically released that as open source and trained ChatGPT 2030 how to do it without you.
That said, AI tech is or is quickly becoming freely accessible; unless they have a USP, free / homemade versions will end up competing with the paid services, and it's hard to compete with free.