The key here is that a binary model is just a bag of floats with primitively typed inputs and outputs.
It's ~impossible to write up more than what's here because either:
A) you understand reverse engineering and model basics, in which case the current content makes it clear that you'd use Frida to figure out how the arguments are passed to TensorFlow,
or
B) you don't understand that this is a binary reverse engineering problem, even when shown Frida. If more content were provided, you'd see it as specific to a particular problem, which it has to be. You'd also need a hand-held walkthrough of batching, tokenization, and so on: too much for a write-up, and it'd be too confusing to follow for another model.
TL;DR: a request for more content is asking a reverse engineering article to give you a full education in model inference.
An analogous situation is seeing a blog post that purports to "show you code", where the code returns an object, and commenting "This is cool, but it doesn't show you how to turn a function return value into a human-readable format." More noise than signal.
The techniques in the article trivially extend to discovering the input tokenization format, and Netron shows you the types of the inputs and outputs.
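The "bag of floats" framing is easy to demonstrate with the standard library alone: once Netron has told you the tensor shapes and dtypes, a weights buffer is just packed primitives. A toy sketch (the flat little-endian float32 layout here is an assumption for illustration; real formats like TFLite wrap the floats in FlatBuffer metadata):

```python
import struct

# Hypothetical "model file": nothing but packed little-endian float32s.
weights = [0.5, -1.25, 3.0]
blob = struct.pack("<3f", *weights)            # 12 bytes on disk

# Recovering the floats needs only the layout, no ML framework at all.
recovered = list(struct.unpack("<3f", blob))
```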
Thanks for the article OP, really fascinating.
E.g. there are two common spellings of "bizarre" that are popular on Gboard: "bizzare" and "bizarre".
Can something similar help in model encryption?
they also claim unsupervised users typing away is better than tagged training data, which explains the wild grammar suggestions on the top comment. guess the age of quantity over quality is finally peaking.
in the end it's the same as grammarly, but without any verification of the ingested data, and calling the collection of user data "federation"
So, e.g.: Apple CoreML issues a public key, the model is encrypted with that public key, and somewhere in a trusted computing environment the model is decrypted using the private key and then inferred.
They should of course use multiple keypairs etc. but in the end this is just another obstacle in your way. When you own the device, root it or even gain JTAG access to it, you can access and control everything.
And matrix multiplication is a computationally expensive process; I doubt they'd add some sort of encryption technique to each and every cycle.
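For scale, a rough FLOP count for a single dense multiply shows why per-operation encryption is a non-starter:

```python
def matmul_flops(m, k, n):
    # A dense (m x k) @ (k x n) multiply does one multiply-add per
    # (i, j, l) triple: roughly 2*m*k*n floating-point operations.
    return 2 * m * k * n

# One 4096x4096 layer multiply is ~1.4e11 FLOPs; wrapping each of them
# in even a cheap cipher round would dwarf the useful work many times over.
per_layer = matmul_flops(4096, 4096, 4096)
```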
You usually run inference for those on GPUs anyway, and GPUs usually have some kind of hardware DRM support for video already.
The way hardware DRM works is that you pass some encrypted content to the GPU and get a blob containing the content key from somewhere, encrypted in a way that only this GPU can decrypt. This way, even if the OS is fully compromised, it never sees the decrypted content.
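The key flow described above is envelope encryption. A toy stdlib-only sketch of the structure (the SHA-256 keystream stands in for AES and is not secure; in real DRM the device key is fused into the hardware, not a Python variable):

```python
import hashlib, secrets

def keystream_xor(key: bytes, data: bytes) -> bytes:
    # Toy stream cipher: XOR the data with a SHA-256-derived keystream.
    # Illustrative only -- do not use for real secrecy.
    out, counter = bytearray(), 0
    while len(out) < len(data):
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, out))

device_key = secrets.token_bytes(32)    # known only to the GPU/enclave
content_key = secrets.token_bytes(32)   # per-title key

# The OS only ever handles these two ciphertexts:
wrapped_key = keystream_xor(device_key, content_key)
ciphertext = keystream_xor(content_key, b"model weights or video frames")

# Inside the trusted hardware: unwrap the content key, then decrypt.
plaintext = keystream_xor(keystream_xor(device_key, wrapped_key), ciphertext)
```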
My general writing style is directed mainly towards my non-technical colleagues, whom I wish to inspire to learn about computers.
This is no novelty by far; it's a pretty standard use case of Frida. But I think many people, even software developers, don't grasp the concept of "what runs on your device is yours, you just don't have it yet".
Especially in mobile apps, many devs get sloppy with their mobile APIs because you can't just open the developer tools.
(And a bunch of people seem to be interested in the "IP" note, but I took it as just trying not to run into legal trouble for advertising "here's how you can 'steal' models!")
It's supposed to be "free-IDA" and the work put in by the developers and maintainers is truly phenomenal.
EDIT: This isn't really an attack, imo. If you are going to take "secrets" and shove them into a mobile app, they can't really be considered secret. I suppose it's a tradeoff: if you want to do this kind of thing client-side, the secret sauce isn't so secret.
Is it ironic or missing a /s? I can't really tell here.
If you use one of those AI models as a basis for your AI model, the real danger is that the owners of the originating data could come after you at some point as well.
If you actually expected anything to be open about OpenAI's products, please get in touch, I have an incredible business opportunity for you in the form of a bridge in New York.
So if a model is copyrighted, you should still be able to use it if you generate a different one based on it. I.e., copyright laundering. I assume this would be similar to how fonts work: you can copyright a font file, but not the actual shapes. So if you re-encode the shapes with different points, that's legal.
But, I don't think a model can be copyrighted. Isn't it the case that something created mechanically can't be copyrighted? It has to be authored by a person.
I find it weird that so many hackers go out of their way to approve of the legal claims of Big AI before it's even settled, instead of undermining Big AI. Isn't the hacker ethos all about decentralization?
I understand that it's not very clear whether a neural net and its weights & biases are considered IP. I personally think that if some OpenAI employee just leaks GPT-4o, it isn't magically public domain and everyone can just use it. I think lawyers would start suing AWS if it just re-hosted ChatGPT. Not that I endorse it, but especially in IP, and in law in general, "judge law" ("Richterrecht" in German) is prevalent, and laws are not a DSL with a few ifs and whiles.
But it is also a "cover my ass" notice as others said, I live in Germany and our law regarding "hacking" is quite ancient.
The simple fact that models are released under licenses, which may or may not be free, implies that they are intellectual property. You can't license something that is not intellectual property.
It is a standard disclaimer, if you disagree, talk to your lawyer. The legal situation of AI models is such a mess that I am not even sure that a non-specialist professional will be of great help, let alone random people on the internet.
1. the current, unproven-in-court legal understanding;
2. a standard disclaimer to cover OP's ass;
3. a tongue-in-cheek reference to the prevalent argument that training AI on data, and then offering it via AI, is being a parasite on that original data.
Prevalent or not, phrased this way it's clear how nonsensical it is. The data isn't hurt or destroyed in the process of being trained on, nor does the process deprive the data owners of their data, or of the opportunity to monetize it the way they ordinarily would.
The right terms here are "learning from", "taking inspiration from", not "being a parasite".
(Now, feeling entitled to rent because someone invented something useful and your work accidentally turned out to be useful, infinitesimally, in making it happen - now that is wanting to be a parasite on society.)
(1) That they are actually protected by copyright in the first place, or
(2) That the particular act described does not fall into an exception to copyright like fair use, exactly as many model creators assert that the exact same act done with the materials models are trained on does, rendering the restrictions of the license offered moot for that purpose.
An example for legal reference might be convolution reverb. Basically, it's a way to record what a fancy reverb machine does (using copyrighted, complex math algorithms) and cheaply recreate the reverb on my computer. It seems companies can do this as long as they distribute the protected reverbs separately from the commercial application. So Liquidsonics (https://www.liquidsonics.com/software/) sells reverb software but offers for free download the 'protected' convolution reverbs, specifically the Bricasti ones in dispute (https://www.liquidsonics.com/fusion-ir/reverberate-3/).
Also, while a SQL server can be copyright protected, copyright in a SQL database is not, by extension, granted to the SQL server software's creators.
The author only mentions APK for Android, but what about iOS IPA? Is there an alternative method for handling that archive?
But Objective-C code is actually compiled, and decompilation is a lot harder than with the JVM languages on Android.
My next article will be about CoreML on iOS, doing the same exact thing :)
Can't wait - thanks for writing it up!
As for the copyright treatment? As far as I know it’s a bit up in the air at the moment. I suspect that the major frontier vendors would mostly contend that training data is fair use but weights are copyrighted. But that’s because they’re bad people.
Basically it's just a synthetic loop: using a previously-SOTA model like GPT-4 to train your model.
This can produce models with seemingly similar performance at a smaller size, but to some extent, fewer bits will be less good.
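A related way to see the "fewer bits, less good" trade-off (quantization rather than distillation, but the same size/fidelity tension) is squeezing float weights into int8. A toy sketch, assuming the largest weight magnitude is nonzero:

```python
def quantize(ws, bits=8):
    # Symmetric quantization: map floats onto signed (bits)-bit integers.
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in ws) / qmax   # assumes max(|w|) > 0
    return [round(w / scale) for w in ws], scale

def dequantize(q, scale):
    # Reconstruction is off by up to ~scale/2 per weight: the lost bits.
    return [v * scale for v in q]
```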
I'm not quite sure I understand the Firebase feature, btw. From the docs, it's pretty much file storage with a dedicated API? I suppose you can use those models for inference in the cloud, but still, the storage API seems redundant.
This works, just like ChatGPT works, but has downsides: 1. You have to pay for the compute on every inference. 2. Your users can't access it offline. 3. Your users will burn through a lot of data from their mobile network operator. 4. Your inference will be slower.
And since SeeingAI runs inference on the model every second, your and your customers' bills will be huge.
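A back-of-envelope for point 1, using a purely hypothetical price per thousand cloud predictions (real rates vary by provider and model):

```python
price_per_1000_calls = 0.10          # USD -- assumed, not a real quote
calls_per_month = 60 * 60 * 24 * 30  # one inference per second, all month

# ~2.6M calls -> ~$259 per always-on user per month at this made-up rate.
monthly_cost_per_user = calls_per_month / 1000 * price_per_1000_calls
```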
The file is only 7.7 KB, so it couldn't contain many weights anyway.
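That capacity claim checks out with simple arithmetic: at four bytes per float32 parameter (one byte if int8-quantized), 7.7 KB tops out at a couple of thousand weights:

```python
file_bytes = 7.7 * 1024            # 7.7 KB, taking 1 KB = 1024 bytes
max_f32_weights = file_bytes / 4   # float32 is 4 bytes/weight -> ~1971
max_int8_weights = file_bytes / 1  # int8 best case -> ~7885
```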
And before you knee-jerk "it's a compression algo!", I invite you to archive all your data with an LLM's "compression algo".
Neither am I, yet, I am still capable of reproducing copyrighted works to a level that most would describe as illegal.
> And before you knee-jerk "it's a compression algo!"
It's literally a fundamental part of the technology so I can't see how you call it a "knee jerk." It's lossy compression, the same way a JPEG might be, and simply recompressing your picture to a lower resolution does not at all obviate your copyright.
> I invite you to archive all your data with an LLMs "compression algo".
As long as we agree it is _my data_ and not yours.
And is technically copyright infringement outside fair use exceptions.
At the start of the tape, there was a copyright notice forbidding the VHS tape from being played at, amongst other places, schools.
Copyright rules are a strange thing.
as an example: saying “i really like james holden’s inheritors album for the rough and dissonant sounds” isn’t covered by copyright.
if i reproduced it verbatim using my mouth, or created a derived work which is noticeably similar to the original, that’s a different question though.
in your example, a derivative work could be akin to only quoting from the book for the audience and modifying a word of each quote.
“derived” works are always a grey area, especially around generative machine learning right now.
most other places use fair dealing which is more restrictive https://en.m.wikipedia.org/wiki/Fair_dealing
Unless all LLM are a ruthless parody of human intelligence, which they may be, the legal issues will continue.
- Addenda -
For the interested parties, the law states the following [0].
Notwithstanding the provisions of sections 17 U.S.C. § 106 and 17 U.S.C. § 106A, the fair use of a copyrighted work, including such use by reproduction in copies or phonorecords or by any other means specified by that section, for purposes such as criticism, comment, news reporting, teaching (including multiple copies for classroom use), scholarship, or research, is not an infringement of copyright. In determining whether the use made of a work in any particular case is a fair use the factors to be considered shall include:
1. the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;
2. the nature of the copyrighted work;
3. the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and
4. the effect of the use upon the potential market for or value of the copyrighted work.
The fact that a work is unpublished shall not itself bar a finding of fair use if such finding is made upon consideration of all the above factors.

So, if you say that these factors can be flexed depending on the defendant, and can just be waved away to protect the wealthy, then it becomes something else. But given these factors, and how damaging this "fair use" is, I can certainly say that training AI models on a copyrighted corpus is not fair use in any way.
Of course, at the end of the day, IANAL & IANAJ. However, my moral compass directly bars the use of a copyrighted corpus in publicly accessible, for-profit models that deprive many people of their livelihoods.
From my perspective, people can whitewash AI training as they see fit to sleep sound at night, but this doesn't change anything from my PoV.
[0]: https://en.wikipedia.org/wiki/Fair_use#U.S._fair_use_factors
LLMs are not humans. They're essentially a probabilistic compression algorithm (encode data into model weights, decode with a prompt to retrieve data).
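A toy version of that encode/decode framing: even a trivial character-bigram table "trained" on a string regenerates it verbatim from a one-character prompt. (Real LLMs are vastly larger and their recall far less reliable; this illustrates the mechanism, not an equivalence.)

```python
def train(text):
    # The "weights": each character maps to its first observed successor.
    table = {}
    for a, b in zip(text, text[1:]):
        table.setdefault(a, b)
    return table

def decode(table, prompt, length):
    # The "prompt" greedily retrieves the memorized continuation.
    out = prompt
    while len(out) < length and out[-1] in table:
        out += table[out[-1]]
    return out
```

Here `decode(train("copyright"), "c", 9)` returns "copyright" exactly: the training data survives, reshaped, inside the table.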
Although with the 'good news everyone, we built the torment nexus' trajectory of AI, my guess is that at this point AI companies would just incorporate actual human brains instead of digital storage if that were the requirement.
Is that really true? Is the law settled in this area? Is it the same everywhere, or does it vary from jurisdiction to jurisdiction?
Paper: https://arxiv.org/pdf/2204.03738
Code: https://github.com/microsoft/banknote-net Training data: https://raw.githubusercontent.com/microsoft/banknote-net/ref...
model: https://github.com/microsoft/banknote-net/blob/main/models/b...
Kinda easier to download it straight from GitHub.
It's licensed under the MIT and CDLA-Permissive-2.0 licenses.
But let's not let that get in the way of hating on AI, shall we?
Can you please edit this kind of thing out of your HN comments? (This is in the site guidelines: https://news.ycombinator.com/newsguidelines.html.)
It leads to a downward spiral, as one can see in the progression to https://news.ycombinator.com/item?id=42604422 and https://news.ycombinator.com/item?id=42604728. That's what we're trying to avoid here.
Your post is informative and would be just fine without the last sentence (well, plus the snarky first two words).
If the last sentence were explicit rather than implied, for instance
This article seems to be serving the growing prejudice against AI
Is that better? It is still likely to be controversial and the accuracy debatable, but it is at least sincere and could be the start of a reasonable conversation, provided the responders behave accordingly.
I would like people to talk about controversial things here if they do so in a considerate manner.
I'd also like to personally acknowledge how much work you do to defuse situations on HN. You represent an excellent example of how to behave. Even when the people you are talking to assume bad faith you hold your composure.
I would also like to point out that this is a fine-tuned classifier vision model based on MobileNetV2, and not an LLM.
I was rather interested in the process of instrumenting TF to make this "attack" scalable to other apps.
The same method/strategy applies to closed tools and models too, although you should probably be careful if you've handed over a credit card for a decryption key to a service and try this ;)
That’s not true, is it? It would be a copyright violation to distribute an extracted model, but you can do what you want with it yourself.
Can you copyright a set of coefficients for a formula? With a JPEG, it's considered that the image being reproduced is the thing that holds the copyright. Being the first to run the calculations that produce a compressed version of that data should not grant you any special rights to that compressed form.
An AI model is just a form of that writ large. When the models generalize and create new content, it seems hard to see how either the output or the model that generated it could be considered someone's property.
People possess models, I'm not sure if they own them.
There are however billions of dollars at play here and enough money can buy you whichever legal opinion you want.
If companies train on data they don't own and expect to own their model weights, that's hypocritical.
Model weights shouldn't be copyrightable if the training data was pilfered.
But this hasn't been tested because models are locked away in data centers as trade secrets. There's no opportunity to observe or copy them outside of using their outputs as synthetic data.
On that subject, training on model outputs should be fair use, and an area we should use legislation to defend access to (similar to web scraping provisions).
IANAL, but I have serious doubts about the applicability of current copyright law to existing AI models. I imagine the courts will decide the same.
- Models are processes/concepts, thus not copyrightable, but are subject to trade secret law, contract and license restrictions, patents, etc.
- Concrete implementations may be copyrighted like any code.
- Parameters are "facts", thus not copyrightable, but are similarly subject to trade secret and contract law.
IANAL, not legal advice, yadda yadda yadda.
https://en.m.wikipedia.org/wiki/Chamberlain_Group,_Inc._v._S...
If that is the law, it is a defect that we need to fix. Laws do not come down from heaven in the form of commandments. We, humans, write laws. If there is a defect in the laws, we should fix it.
If this is the law, time shifting and format shifting is unlawful as well which to me is unacceptable.
Disclaimer: As usual, I anal.
> Circumventing a copy-prevention system without a valid exemption is a crime, even if you don't make unlawful copies.
Yep, this is the DMCA section 1201. Late '90s law in the US.
> Copyright covers the right to make copies, not the right to distribute
This is where I got confused. Copyright covers four rights: copying, distribution, creation of derivative works, and public performance. So I'm not sure what you were getting at with the copy/distribute dichotomy.
But here's a question I'm curious about: Can DMCA apply to a copy-protection mechanism that's being applied to non-copyrightable work? Based on my reading of https://www.copyright.gov/dmca/:
> First, it prohibits circumventing technological protection measures (or TPMs) used by copyright owners to control access to their works.
That's not the letter of the law, but an overview, but it does seem to suggest you can't bring a DMCA 1201 claim against someone circumventing copy-protection for uncopyrightable works.
> Whether or not model weights are copyrightable remains an open question.
And this is where the interaction with the wording of 1201 gets interesting, in my (non-professional) opinion!
so this is where it'd all go in several years, if i were the gov.
(OTOH, whether models, as the output of a mechanical process, are subject to copyright is a matter of some debate. The firms training models tend to treat the models as if they were protected by copyright but also tend to treat the source works as if copying for the purpose of training AI were within a copyright exception; why each of those positions is in their interest is obvious, but neither is well-established.)
Additionally, the debate around the sources companies use to train their models remains unresolved, raising ethical and legal questions about data ownership and consent.
AI models are generally regarded as a company’s asset (like a customer database would also be), and rightly so given the cost required to generate one. But that’s a different matter entirely to copyright.
Laundering IP. FTFY.
If the weights and biases contained in "AI models" are proprietary, then for one model owner to detect infringement by another model owner, it may be necessary to download and extract them.