It's also a for-profit company and you're not the customer, as you're not paying them money.
I'd be way more worried about how they're using the data they're collecting on you vs. Google or MS.
Mullvad
Brave
Opera
Vivaldi
Microsoft
Heck, Zoho is in on a browser now.
What net gain does each of these companies provide over skinning chromium that isn't in Firefox?
Last time I asked Brave fanboys why they don't reskin Firefox, the response was "Firefox is a PITA to build", all the while we have projects like Pale Moon and Waterfox that are hobby projects. If they can work with Firefox, so could someone else, but no.
Mullvad is the Tor Browser with the Mullvad VPN included, released in 2023. However, the Tor Browser, which it effectively is, is from 2002.
Brave, the one in this article, is from 2019.
Opera is from 1994.
Vivaldi is from 2015, and is developed by Opera's previous dev-team after a bad sale to a Chinese company.
Microsoft's first browser, Internet Explorer, is from 1995.
I cannot comment on Zoho's browser, as I know little about it.
I did. When we folded less than two years later, one of the CTO's biggest stated regrets was that he went with Firefox instead of Chromium. The extension story in Firefox was easily 10x harder. Interfacing with the OS as well. Getting dbus services to work was a fool's errand.
I would use it daily if the UI/UX was better, or more similar to Firefox.
https://brave.com/firewall-vpn/ https://account.brave.com/?intent=checkout&product=search https://brave.com/search/api/
They block the in-page ads and instead provide their own ads through popup notifications.
So they are replacing advertisements on websites.
Why? They don't even have access to my emails and texts like those other companies do. I also don't see the names of their top executives and founders showing up in articles about connections to Jeffrey Epstein every few months.
> 1) The purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes
> 2) The nature of the copyrighted work
> 3) The amount and substantiality of the portion used in relation to the copyrighted work as a whole
> 4) The effect of the use upon the potential market for or value of the copyrighted work
[emphasis from TFA]
HN always talks about derivative work and transformativeness, but never about these. The fourth one especially seems clear in its implications for models.
Regardless, it makes it seem much less clear cut than people here often say.
If you look at the core argument in favour of fair use, it's that "LLMs do not copy the training data", yet this is obviously false.
For GitHub Copilot and ChatGPT, examples of them reciting large sections of training data are well known. Plenty can be found on HN. It doesn't generate a new valid Windows serial key on the fly, it's memorized them.
If one wants to be cynical, it's not hard to see OpenAI etc. patching in filters to remove copyrighted content from the output precisely because it's legally catastrophic for their "fair use" claim to have the model spit out copyrighted content: this is both copyright infringement by itself and evidence that, no matter how the internals of these models work, they store some of the training data anyway.
The Supreme Court hasn’t ruled on a software case like this, as far as I know. But given the recent 7-2 decision against Andy Warhol’s estate for his copying of photographs of Prince, this doesn’t seem like a Court that’s ready to say copying terabytes of unlicensed material for a commercial purpose is OK.
I’m going to guess this ends with Congress setting up some kind of clearinghouse for copyrighted training material: You opt in to be included, you get fees from OpenAI when they use what you added. This isn’t unprecedented: Congress set up special rules and processes for things like music recordings repeatedly over the years.
https://scholarship.law.edu/cgi/viewcontent.cgi?referer=&htt...
The problem is that filtering the training set is naively O(n^2) and n is already extremely large for DALL-E. For LLMs, it's comically huge, plus now you have to do substring search. I've yet to hear OpenAI talk about training set deduplication in the context of LLMs.
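To make the O(n^2) point concrete, here is a minimal Python sketch. The function names are made up for illustration and this is not how OpenAI or anyone else is known to do it. The pairwise version compares every document against every other one; the hashed version gets exact duplicates down to one pass, but it does nothing for near-duplicates or substring matches, which is the hard part described above.

```python
import hashlib
from itertools import combinations


def naive_duplicate_pairs(docs):
    """Compare every document with every other one: n*(n-1)/2 comparisons."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(docs), 2)
        if a == b
    ]


def hashed_exact_dedup(docs):
    """Keep the first copy of each document, using a hash set (one pass).

    Only catches byte-identical duplicates; near-duplicates and
    substring overlaps slip through.
    """
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


if __name__ == "__main__":
    corpus = ["a cat", "a dog", "a cat", "a cat sat"]
    print(naive_duplicate_pairs(corpus))
    print(hashed_exact_dedup(corpus))
```

Note that "a cat" and "a cat sat" are treated as unrelated by both functions, which is exactly the gap a substring search would have to close.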
As for the legal basis... nobody's ruled on AI training sets in the US. Even the Google Books case that I've heard cited in the past (even by myself) really only talks about searching a large corpus of text. If OpenAI's GPT models were really just a powerful search engine and not intelligent at all, they'd actually be more legally protected.
My money's still on "training is fair use", but that actually doesn't help OpenAI all that much either, because fair use is not transitive. Right now, such a ruling would mean that using AI art is Russian roulette: if your model regurgitates, the outputs are still infringing, even if the model is fair use. Novel outputs aren't entirely safe, though. A judge willing to commit the Butlerian Jihad[0] might even say that regurgitation does not matter and that all AI outputs are derivative works of the entire training set[1].
This logic would also apply in the EU. Last I checked the TDM exception only said training is legal, not that you could sell the outputs. They don't really respect jurisprudence the way the Anglosphere obsesses over "precedent", so copyright exceptions are almost always decided by legislatures and not judges over there, and the likelihood of a judge saying that all outputs are derivative works of the training set regardless of regurgitation is higher.
[0] In the sci-fi novel Dune, the Butlerian Jihad is a galaxy-wide purge of all computer technology for reasons that are surprisingly pertinent to the AI art debate.
Yes, this is also why /r/Dune banned AI art. No, I have not read Dune.
[1] If the opinion was worded poorly this would mean that even human artists taking inspiration to produce legally distinct works would be violating copyright. The idea-expression divide would be entirely overthrown in favor of a dictatorship of the creative proletariat.
[2] "Music and Film Industry Association of America" - an abbreviation coined for an April Fools joke article about the MPAA and RIAA merging together.
An ML model is clearly a derivative work of its input.
Here's what I think would be fair:
Anyone who holds copyright in something used as part of a training corpus is owed a proportional share of the cash flow resulting from use of the resulting models. (Cash flow, not profits, because it's too easy to use accounting tricks to make profits disappear).
In the case of intermediaries (e.g., social media like reddit & twitter) those intermediaries could take a cut before passing it on to the original authors.
Obviously hellishly difficult to administer so it's unlikely to happen but I don't see a better answer.
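For what it's worth, the arithmetic itself is the easy part. A rough Python sketch, where the weighting (say, number of works or tokens contributed) and the single flat intermediary cut are my own assumptions and not part of the proposal:

```python
def split_cash_flow(cash_flow, contributions, intermediary_cut=0.0):
    """Split cash flow among copyright holders in proportion to their weight.

    contributions: mapping of holder -> weight (hypothetical measure,
    e.g. tokens contributed to the training corpus).
    intermediary_cut: fraction kept by an intermediary before the split
    (hypothetical; a flat rate is assumed here).
    """
    if not 0.0 <= intermediary_cut < 1.0:
        raise ValueError("intermediary_cut must be in [0, 1)")
    total = sum(contributions.values())
    if total <= 0:
        raise ValueError("contributions must sum to a positive weight")
    payable = cash_flow * (1.0 - intermediary_cut)
    return {holder: payable * weight / total
            for holder, weight in contributions.items()}


if __name__ == "__main__":
    print(split_cash_flow(1000.0, {"alice": 3, "bob": 1}))
```

The hard part is everything this skips: deciding the weights, tracking who holds what, and auditing the cash flow figure.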
Do you mean this in a copying sense or a mathematical sense?
What if it's only storing 1 byte per input document?
Second, “use” here could mean one of two things: training or inference. It’s publishing the results of inference that can lead to actual effects on the market, not the training.
At the end of the day, someone has to prove tangible harm.
Wait. The Brave browser sends data about your browsing back to the Brave Search engine? Not just your usage of other search engines, but it also crawls pages on your computer to help build their search index?
Ref: https://github.com/brave/web-discovery-project/blob/main/mod...
"Brave doesn’t follow the sneaky practices of other big tech search engines. The Web Discovery Project is opt-in, and the data collected under the Web Discovery Project has specific protections to ensure anonymity." per https://support.brave.com/hc/en-us/articles/4409406835469-Wh...
Editing to add that I don't mean to imply ill will on your part, but that I think being affiliated with Brave might have you taking this type of practice a little more lightly than it probably should be taken.
That said, stuff like Jedi Blue and Project Bernanke suggest Brave could just disclose they support competitive markets.
If you don’t trust that they’re doing what they say they are, then the document doesn’t mean anything. Although that would also mean the quote is kind of meaningless…
Brave is perfectly OK with having oopsies too
Doesn’t matter when the content is reproduced verbatim, as Brave is doing. If I memorise your content and then repeat it as my own, I’m not somehow off the hook for copyright violation and plagiarism.
Excerpt from The Age of Surveillance Capitalism, by Shoshana Zuboff
That's genius!
Cliqz's entire history was based on this kind of thing: milking other search engines by just deducing their ranking methods. It's parasitic. There's no cleverness about it.
I feel like I'm missing something. What the article claims they're doing is:
1. Misrepresenting what rights they have, and selling access to those rights.
2. Stealth-crawling the web, hiding from the webmasters just how much Brave is crawling their site, and making it impossible to block just their crawler.
How is either of these the right thing? I mean, for somebody besides Brave. What "attempt" are they making that other companies aren't?
The second doesn't seem like a problem to me as long as they respect robots.txt
Facebook recently got told by the CJEU that, no, they can't use people's posts to target advertisements. Even if those ads are what's paying for the platform. That you can't claim such processing as "part of the contract" unless it is absolutely necessary in the same way the post office needs an address to send a parcel.
If Facebook can't even do that, there is no way LLMs will be allowed. (And remember. The GDPR does not care if your system doesn't distribute personal data. Any kind of processing at all falls under the GDPR's requirements)
OpenAI is already being chased by the EU's privacy agencies. Right now they're in the process of asking pointed questions, things will heat up after that.
And if it's your cup of tea, they let you straight up pay money for the search engine.
Articles 3 and 4 of the EU 'Copyright in the Digital Single Market' directive give data miners quite extensive rights.
Move operation to the EU, train a foundational model, then train a constitutional model based on that.
As much as I hate the upcoming AI regulation, the CDSM is solid.
https://academic.oup.com/grurint/article/71/8/685/6650009 https://eur-lex.europa.eu/eli/dir/2019/790/oj
Update: Fixed wrong link
There are some things that would make for good faith displays by the players in the space. For example, Microsoft has been investing a lot and yet their code offering is not trained on their internal code base. Same for Google. Start by doing that and I'll entertain the argument that your tools are fair use or data mining.
Regarding the copyright of returned material here is a good discussion:
https://copyrightblog.kluweriplaw.com/2023/05/09/generative-...
> without any worry for copyright infringement because Brave acts as a middleman.
This isn’t how law works. Unless Brave is explicitly indemnifying all their customers (which their lawyers would have to be insane to let them do), any trouble you could get in is going to be 100% your problem. Pointing the finger at Brave could theoretically get them in trouble too, but would in no way let you off the hook.
> They don't mention their crawler anywhere in their docs, either. So, if you wanted to block Brave from crawling and indexing and ultimately selling your content to third parties, your only option for the time being would be to block all crawlers, which is how Brave would be able to "respect robots.txt".
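The robots.txt point in that quote can be checked with Python's standard `urllib.robotparser`. This is only a sketch of how the matching works: `ExampleBot` and `UnnamedCrawler` are hypothetical user-agent tokens, and nothing here reflects what Brave's crawler actually sends.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, user_agent, url="https://example.com/page"):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A rule aimed at one named crawler only works if that crawler
# announces a matching user-agent token.
TARGETED = """\
User-agent: ExampleBot
Disallow: /
"""

# Without a known token, the only rule that catches it is the
# blanket one, which also blocks every other crawler.
BLANKET = """\
User-agent: *
Disallow: /
"""

if __name__ == "__main__":
    print(allowed(TARGETED, "ExampleBot"))      # named crawler is blocked
    print(allowed(TARGETED, "UnnamedCrawler"))  # unnamed crawler is not
    print(allowed(BLANKET, "UnnamedCrawler"))   # blocked, like everyone else
```

So a crawler with no documented token can technically "respect robots.txt" while leaving site owners no way to single it out.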